Reviewing Code Review #2240
Replies: 2 comments 1 reply
-
@jdangerx and I decided to revive this discussion because of the large dagster PR I created. This PR converts the ETL portion of our codebase to use dagster and initializes the IO Managers we'll use for PUDL. The most recent commit put the PR at 35 files. My general plan for the dagster integration is to split it into two phases. Phase 1 is converting the ETL to use dagster: everything from the ferc to sqlite conversion to the normalized tables being loaded into the database. Separating the work into these two phases is convenient because the ETL and the PudlTabl class are two halves of pudl that are loosely coupled. The PudlTabl class receives a SqlAlchemy engine that points to the outputs of the ETL.
Phase 1
Phase 1 felt hard to break into smaller pieces. The entire ETL should be converted to dagster before merging the changes into dev; it wouldn't make sense to have only part of our ETL converted to dagster. The downside of this is that it leads to a massive PR! If I were to recreate the PR, I could have created a primary branch off of dev and then created new branches for each portion of the ETL I converted. I think this would have made the development process a bit smoother. Instead, I had one PR and requested "partial" reviews from folks. The reviews from @zschira and @katie-lamb back in December were helpful, but it's been challenging to rope other people in since then. If I had smaller PRs for each portion of the ETL, people would have been more willing to review the work. How would other folks have structured this work? @jdangerx There is still some work left for Phase 1; see #1570. We could create separate branches for the integration tests and documentation.
Phase 2
Phase 2 feels much easier to structure and parallelize once Phase 1 is merged into dev.
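The loose coupling between the two phases can be sketched in a few lines. This is a hypothetical illustration, not PUDL's actual schema or API: stdlib `sqlite3` stands in for a SQLAlchemy engine, `run_etl` stands in for the dagster-orchestrated ETL, and `OutputTables` stands in for `PudlTabl`. The point is that the output layer only ever sees a connection to the ETL's outputs, never the ETL internals:

```python
import sqlite3

# Phase 1 stand-in: the ETL writes normalized tables into a database.
def run_etl(conn):
    conn.execute("CREATE TABLE plants (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO plants VALUES (1, 'Comanche')")
    conn.commit()

# Phase 2 stand-in: an output class (like PudlTabl) receives only a
# connection/engine pointing at the ETL outputs. It never touches the
# ETL internals, which is what makes the two phases loosely coupled.
class OutputTables:
    def __init__(self, conn):
        self.conn = conn

    def plants(self):
        return self.conn.execute("SELECT id, name FROM plants").fetchall()

conn = sqlite3.connect(":memory:")
run_etl(conn)
out = OutputTables(conn)
print(out.plants())  # -> [(1, 'Comanche')]
```

Because the only contract between the halves is the database itself, Phase 1 can be reworked internally (swapping orchestration to dagster) without any changes on the output side.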
-
I'm going to start writing a "quick comment" and it's about to turn into a novel. I think the overall goal of code review is context transfer: we want at least two people to understand what's going into the codebase, how it interacts with the rest of the system, and why we need this code. Other things matter too, but context transfer is the most important goal, the one we want to keep in mind at all times:
Logical correctness, I think, is implicit in the context transfer piece - if two people both believe they understand the why/what/how, then I assume they wouldn't let the code in unless those were all in good shape... Review speed is super important! I think it's more properly addressed in task breakdown / project management land, though - a lot of what causes reviews to be slow is:
Anyways - some proposed guidelines:
A lot of times this can be split across the main PR comment and a self-review. For reviewers:
As for long lived branches, my official take is "long lived branches are not worth the heartache, so screw 'em." But that's a conversation for elsewhere. As for the guideline about "be nice, not mean" I think that lives in a code of conduct - that should be table stakes for engaging at all. I'm happy to whip up a PR template - our existing one in PUDL is a good start, but I think it's a bit verbose and has a bit of a "did you do your homework" adversarial framing tbh. Some excellent resources that informed a lot of my strident opinions about code review:
-
Originally opened by @TrentonBush in the business repo. I'm moving this issue into a PUDL discussion so it can be more visible and conversational.
Description
Review our existing process for code reviews, identify areas that could be improved, and draft requirements to fix them.
Code review is a means of quality assurance wherein a second person checks each piece of contributed code for various quality requirements (documentation, test coverage, etc.). There is a tradeoff between the increased quality due to review and the costs of doing the review, and we want to make sure we choose the right tradeoff for us. Note that reviewing costs are not limited to the literal hours spent reviewing; they also include problems (extra merge conflicts, late delivery, etc.) caused by any and all delays between opening a PR and the conclusion of the review.
Motivation
We want to minimize the maintenance burden of our code so that we can spend our resources on building new stuff to serve more users. We also want to avoid the long delays we have incurred in the past while waiting for someone to review new PRs.
This is an "important but not urgent" background problem for our organization.
Scope
How do we know when we are done?
This will be an ongoing process, but the first pass will be complete when we draft requirements for both submitting PRs and reviewing PRs. In particular we want to address:
For submitting:
For reviewing:
For both:
What is out of scope?
Anything else?
Some pieces are currently in various google docs: