-
It seems like there are 3 contexts where speeding up the tests would be helpful (locally during development, in the CI on GitHub Actions, and in the nightly builds), but these contexts have different resources to work with. There are several different opportunities for parallelism, which would help in different contexts:
To me it looks like trying to parallelize in GitHub actions is the most complex, since it's not just trying to run the tests in parallel on one machine, but distributing tasks across several different nodes, which means sending data to other places and recollecting the coverage or other results back together at the end, and the maybe difficult task of creating independent work-units to hand out. It seems like this optimization also won't speed things up in the other contexts.

Making the ETL run faster through parallelism will help anywhere we've got multiple cores and enough memory, as would making pytest run in parallel (for the stuff that happens in tests specifically anyway). However, the fast ETL only takes ~5 minutes to run on 1 CPU, so speeding that up isn't going to help much with the hour-long CI. You can run it in isolation with `pytest test/integration/etl_test.py::test_pudl_engine`.

Most of the time in CI is taken up by the computation of output dataframes for things like the EIA Plant Parts, the MCOE, the utility service territory compilations, and a bunch of probably slow pandas-playing-SQL operations to generate the output tables. Many of these operations are done using

To me it sounds like it would be hard to split out the various pytest cases onto separate runners in such a way that they weren't often duplicating the same calculations. I think a lot of work we are already planning to do in the immediate future may help with the slowness.
With some basic parallelism in pytest we could run lots of data validations in parallel, and when they're just reading tables / views directly out of the DB instead of 🐼 they'll also have less work to do. E.g. `pytest --live-dbs -n auto --dist worksteal test/validate`

By running a couple of Tox environments in parallel, we could run all the linters and the unit tests at the same time, and the docs build + integration tests at the same time (we can't run the docs build in with the linters b/c the docs build creates some temporary files the linters don't like). Not sure how to make this work with coverage generation though.

```
$ tox -e rstcheck,flake8,pre_commit,bandit,unit -p auto
pre_commit: OK ✔ in 1 minute 8.27 seconds
bandit: OK ✔ in 1 minute 9.36 seconds
flake8: OK ✔ in 1 minute 10.94 seconds
rstcheck: OK ✔ in 1 minute 14.93 seconds
rstcheck: OK (74.93=setup[65.18]+cmd[9.75] seconds)
flake8: OK (70.94=setup[65.56]+cmd[5.38] seconds)
pre_commit: OK (68.27=setup[65.52]+cmd[1.10,0.25,0.22,0.25,0.23,0.26,0.28,0.16] seconds)
bandit: OK (69.36=setup[65.56]+cmd[3.80] seconds)
unit: OK (123.51=setup[65.38]+cmd[58.13] seconds)
congratulations :) (123.72 seconds)
```
```
$ tox -e integration,docs -p auto
```

With Tox there's a bunch of overhead in setting up the virtual environments with

In the nightly builds, we could try running the

I tried running `pytest -n auto --dist worksteal test/integration` and it didn't really have any speed advantage over 1 CPU (16 min vs. 18 min), so that's probably not helpful.

I think if there's some easy test parallelism to turn on, we should do it! But probably not worry about trying to speed up the CI on GitHub until after we've finished the Dagster transition, since what makes the CI slow will be in a different place, and have different behaviors then. And if it's going to be complex or costly, but we can make the tests run fast locally and in the nightly builds, maybe it's not worth focusing on the GitHub CI speedup.
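For concreteness, here's a minimal sketch (not the real validation suite) of the kind of check that parallelizes nicely with `-n auto --dist worksteal`: each parametrized case just reads one table out of the live DB and asserts something about it. The `pudl_engine` fixture name, the table names, and the row-count thresholds are stand-ins for illustration.

```python
# Hypothetical --live-dbs style validation: each case only needs a read-only
# connection to the already-built database, so pytest-xdist can hand the
# cases out to whichever worker is free.
import pytest
import sqlalchemy as sa

# Made-up (table, minimum expected rows) pairs, not real PUDL expectations.
ROW_COUNT_EXPECTATIONS = [
    ("plants_entity_eia", 10_000),
    ("generation_fuel_eia923", 100_000),
]


@pytest.mark.parametrize("table_name,min_rows", ROW_COUNT_EXPECTATIONS)
def test_minimum_row_counts(pudl_engine: sa.engine.Engine, table_name, min_rows):
    """Check that a table in the live DB has at least the expected number of rows."""
    with pudl_engine.connect() as conn:
        # Table names come from our own list above, so f-string interpolation is safe here.
        n_rows = conn.execute(sa.text(f"SELECT COUNT(*) FROM {table_name}")).scalar()
    assert n_rows >= min_rows, f"{table_name}: {n_rows} rows, expected at least {min_rows}"
```

Because every case is independent and read-only, there's no shared state for the workers to fight over.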
-
Splitting this into run-etl and validate-outputs stages is in line with our overall efforts to overhaul the CI/CD. It seems to me that one of the suggestions would be to use GCP Batch instead of GitHub Actions runners. This is indeed a more costly option (given that GHA runners are effectively free), but it might allow us to speed things up. While there are probably ways to speed things up even on GHA runners, my suspicion is that this comes with non-trivial overhead (maintenance, debugging, and taking care of this additional runtime environment), so if the cost is acceptable I would vote for uniformity (run everything on Batch). Perhaps there are ways to cut down on the costs there as well by picking smaller VM sizes.
-
I have been working on running the ETL on big runners before the integration tests, which does seem to speed things up somewhat (scaling with the number of cores, or a bit less than that). In the meantime, more complex processing landed in

Given this new reality, I think that parallelization of the ETL prior to the integration tests might not fully address the current situation. I would like to float some of the following:
I'm curious, in particular, about exploring the idea of probabilistic sampling of input data. That could be a fairly generic and easily tunable mechanism (how much data to drop, and where), which we could use to fine-tune the speed/accuracy/cost trade-off for early integration tests.
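To make that a bit more concrete, here's a minimal sketch of what a sampling hook could look like, assuming raw inputs show up as pandas DataFrames somewhere in the extract step. The function, the `sample_frac` knob, and the idea of keeping whole groups (e.g. whole plants) together are illustrative assumptions, not existing PUDL machinery.

```python
# Hypothetical sketch: probabilistically drop input data before transformation,
# keeping whole groups together so cross-table joins still line up.
import pandas as pd


def sample_raw_table(
    df: pd.DataFrame,
    sample_frac: float = 0.1,
    group_col: str | None = None,
    seed: int = 42,
) -> pd.DataFrame:
    """Return a reproducible subset of a raw input table.

    If ``group_col`` is given (e.g. a plant or utility ID column), whole groups
    are kept or dropped together, so related records stay internally consistent.
    """
    if sample_frac >= 1.0:
        return df
    if group_col is None:
        return df.sample(frac=sample_frac, random_state=seed)
    groups = df[group_col].drop_duplicates()
    keep = groups.sample(frac=sample_frac, random_state=seed)
    return df[df[group_col].isin(keep)]


# Usage sketch: keep ~10% of plants from a raw EIA page before transforming it.
# raw_df = sample_raw_table(raw_df, sample_frac=0.1, group_col="plant_id_eia")
```

The single `sample_frac` dial is what makes this tunable: close to 1.0 for the nightly builds, much smaller for quick early-feedback CI runs.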
-
I've been noodling around trying to figure out how to make the CI tests run faster on the `daz/parallelize-tests` branch. An hour is an excruciating amount of time to wait for tests to pass. Some thoughts:

- Using `pytest-shard` to split the integration tests across multiple GH actions runners. This Just Works, but also doesn't help us much. A representative run is here - each individual integration test run takes about 45 minutes. Which is better than an hour! But the main problem is that the test fixtures all require running the fast ETL.
- Running the fast ETL once up front and then running the integration tests with `--live-dbs`, which should cut a lot of the time. Then when we split the integration tests into shards we will actually see gains. I haven't gotten the hookup between the fast ETL upload & the integration tests downloading that completely sorted, but here is the latest run. The problem here is that it takes 27 minutes to run the ETL and another 10 to upload the outputs.
- Instead of uploading all of `pudl-work`, I could just upload the outputs as artifacts. Uploading to GH actions cache seems to take much less time than uploading build artifacts - this would cut down the upload time to probably 4-5 min.
- Using `execute_job` instead of `execute_in_process`, which gets us a little bit of parallelism (see the sketch below).

I think the immediate next steps on the parallelize-tests branch are:
Which should get us to something closer to 45 minute CI runs and not take that long to implement.
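As an aside, here's a rough sketch of the `execute_in_process` vs. `execute_job` distinction referenced above, assuming recent Dagster APIs. The toy ops and job are made up for illustration; this is not the actual PUDL ETL job.

```python
# Rough sketch of the difference between the two Dagster execution APIs.
from dagster import DagsterInstance, execute_job, job, op, reconstructable


@op
def extract_a():
    return [1, 2, 3]


@op
def extract_b():
    return [4, 5, 6]


@op
def combine(a, b):
    return a + b


@job
def toy_etl():
    combine(extract_a(), extract_b())


if __name__ == "__main__":
    # Runs every op serially inside this single process. Simple, and handy
    # for test fixtures, but no parallelism.
    toy_etl.execute_in_process()

    # Runs through the default executor, which launches worker processes, so
    # independent ops like extract_a and extract_b can run concurrently.
    # Needs a reconstructable job and a persistent DagsterInstance
    # (this assumes DAGSTER_HOME is configured).
    result = execute_job(reconstructable(toy_etl), instance=DagsterInstance.get())
    assert result.success
```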
If we built a single Docker image on push and used that everywhere it might save us a few minutes too - that will matter more if we have more jobs running in series, but until we get to the <15 minute mark I'm not too worried about it.
I think promising future work after that is:
Pie-in-the-sky:
cc @zaneselvans who I know has been frustrated by this in the past.