-
It seems like there are 3 contexts where speeding up the tests would be helpful (locally during development, in the CI on GitHub Actions, and in the nightly builds), but these contexts have different resources to work with. There are several different opportunities for parallelism, which would help in different contexts:
To me it looks like trying to parallelize in GitHub actions is the most complex, since it's not just trying to run the tests in parallel on one machine, but distributing tasks across several different nodes, which means sending data to other places and recollecting the coverage or other results back together at the end, and the maybe difficult task of creating independent work-units to hand out. It seems like this optimization also won't speed things up in the other contexts.

Making the ETL run faster through parallelism will help anywhere we've got multiple cores and enough memory, as would making pytest run in parallel (for the stuff that happens in tests specifically anyway). However, the fast ETL only takes ~5 minutes to run on 1 CPU, so speeding that up isn't going to help much with the hour-long CI. You can run it in isolation with `pytest test/integration/etl_test.py::test_pudl_engine`.

Most of the time in CI is taken up by the computation of output dataframes for things like the EIA Plant Parts, the MCOE, the utility service territory compilations, and a bunch of probably slow pandas-playing-SQL operations to generate the output tables. Many of these operations are done using

To me it sounds like it would be hard to split out the various pytest cases onto separate runners in such a way that they weren't often duplicating the same calculations. I think a lot of work we are already planning to do in the immediate future may help with the slowness.
With some basic parallelism in pytest we could run lots of data validations in parallel, and when they're just reading tables / views directly out of the DB instead of 🐼 they'll also have less work to do. E.g. `pytest --live-dbs -n auto --dist worksteal test/validate`

By running a couple of Tox environments in parallel, we could run all the linters and the unit tests at the same time, and the docs build + integration tests at the same time (we can't run the docs build in with the linters b/c the docs build creates some temporary files the linters don't like). Not sure how to make this work with coverage generation though.

```
$ tox -e rstcheck,flake8,pre_commit,bandit,unit -p auto
pre_commit: OK ✔ in 1 minute 8.27 seconds
bandit: OK ✔ in 1 minute 9.36 seconds
flake8: OK ✔ in 1 minute 10.94 seconds
rstcheck: OK ✔ in 1 minute 14.93 seconds
rstcheck: OK (74.93=setup[65.18]+cmd[9.75] seconds)
flake8: OK (70.94=setup[65.56]+cmd[5.38] seconds)
pre_commit: OK (68.27=setup[65.52]+cmd[1.10,0.25,0.22,0.25,0.23,0.26,0.28,0.16] seconds)
bandit: OK (69.36=setup[65.56]+cmd[3.80] seconds)
unit: OK (123.51=setup[65.38]+cmd[58.13] seconds)
congratulations :) (123.72 seconds)
```
```
$ tox -e integration,docs -p auto
```

With Tox there's a bunch of overhead in setting up the virtual environments with

In the nightly builds, we could try running the

I tried running `pytest -n auto --dist worksteal test/integration` and it didn't really have any speed advantage over 1 CPU (16 min vs. 18 min), so that's probably not helpful.

I think if there's some easy test parallelism to turn on, we should do it! But probably not worry about trying to speed up the CI on GitHub until after we've finished the Dagster transition, since what makes the CI slow will be in a different place, and have different behaviors then. And if it's going to be complex or costly, but we can make the tests run fast locally and in the nightly builds, maybe it's not worth focusing on the GitHub CI speedup.
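For concreteness, here's a minimal sketch (not the real validation suite) of the kind of check that parallelizes nicely with `-n auto --dist worksteal`: each parametrized case just reads one table out of the live DB and asserts something about it. The `pudl_engine` fixture name, the table names, and the row-count thresholds are stand-ins for illustration.

```python
# Hypothetical --live-dbs style validation: each case only needs a read-only
# connection to the already-built database, so pytest-xdist can hand the
# cases out to whichever worker is free.
import pytest
import sqlalchemy as sa

# Made-up (table, minimum expected rows) pairs, not real PUDL expectations.
ROW_COUNT_EXPECTATIONS = [
    ("plants_entity_eia", 10_000),
    ("generation_fuel_eia923", 100_000),
]


@pytest.mark.parametrize("table_name,min_rows", ROW_COUNT_EXPECTATIONS)
def test_minimum_row_counts(pudl_engine: sa.engine.Engine, table_name, min_rows):
    """Check that a table in the live DB has at least the expected number of rows."""
    with pudl_engine.connect() as conn:
        # Table names come from our own list above, so f-string interpolation is safe here.
        n_rows = conn.execute(sa.text(f"SELECT COUNT(*) FROM {table_name}")).scalar()
    assert n_rows >= min_rows, f"{table_name}: {n_rows} rows, expected at least {min_rows}"
```

Because every case is independent and read-only, there's no shared state for the workers to fight over.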
-
Splitting this into run-etl and validate-outputs stages is in line with our overall efforts to overhaul the CI/CD. It seems to me that one of the suggestions would be to use GCP Batch instead of GitHub Actions runners. This is indeed a more costly option (given that GHA runners are effectively free), but it might allow us to speed things up. While there are probably ways to speed things up even on GHA runners, my suspicion is that this comes with non-trivial overhead (maintenance, debugging, and taking care of this additional runtime environment), so if the cost is acceptable I would vote for uniformity (run everything on Batch). Perhaps there are ways to cut down on the costs there as well by picking smaller VM sizes.
-
I have been working on running the ETL on big runners before the integration tests, which does seem to speed things up somewhat (scaling with the number of cores, or a bit less than that). In the meantime, more complex processing landed in

Given this new reality, I think that parallelization of the ETL prior to the integration tests might not fully address the current situation. I would like to float some of the following:
I'm curious, in particular, about exploring the idea of probabilistic sampling of input data. That could be a fairly generic and easily tunable mechanism (how much data to drop, and where), which we could use to fine-tune the speed/accuracy/cost trade-off for early integration tests.
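To make that a bit more concrete, here's a minimal sketch of what a sampling hook could look like, assuming raw inputs show up as pandas DataFrames somewhere in the extract step. The function, the `sample_frac` knob, and the idea of keeping whole groups (e.g. whole plants) together are illustrative assumptions, not existing PUDL machinery.

```python
# Hypothetical sketch: probabilistically drop input data before transformation,
# keeping whole groups together so cross-table joins still line up.
import pandas as pd


def sample_raw_table(
    df: pd.DataFrame,
    sample_frac: float = 0.1,
    group_col: str | None = None,
    seed: int = 42,
) -> pd.DataFrame:
    """Return a reproducible subset of a raw input table.

    If ``group_col`` is given (e.g. a plant or utility ID column), whole groups
    are kept or dropped together, so related records stay internally consistent.
    """
    if sample_frac >= 1.0:
        return df
    if group_col is None:
        return df.sample(frac=sample_frac, random_state=seed)
    groups = df[group_col].drop_duplicates()
    keep = groups.sample(frac=sample_frac, random_state=seed)
    return df[df[group_col].isin(keep)]


# Usage sketch: keep ~10% of plants from a raw EIA page before transforming it.
# raw_df = sample_raw_table(raw_df, sample_frac=0.1, group_col="plant_id_eia")
```

The single `sample_frac` dial is what makes this tunable: close to 1.0 for the nightly builds, much smaller for quick early-feedback CI runs.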
-
I've been noodling around trying to figure out how to make the CI tests run faster on the `daz/parallelize-tests` branch. An hour is an excruciating amount of time to wait for tests to pass. Some thoughts:

- Using `pytest-shard` to split the integration tests across multiple GH actions runners. This Just Works, but also doesn't help us much. A representative run is here - each individual integration test run takes about 45 minutes. Which is better than an hour! But the main problem is that the test fixtures all require running the fast ETL.
- Running the fast ETL once up front and then running the integration tests with `--live-dbs`, which should cut a lot of the time. Then when we split the integration tests into shards we will actually see gains. I haven't gotten the hookup between the fast ETL upload & the integration tests downloading that completely sorted, but here is the latest run. The problem here is that it takes 27 minutes to run the ETL and another 10 to upload the outputs.
- Instead of uploading all of `pudl-work`, I could just upload the outputs as artifacts. Uploading to GH actions cache seems to take much less time than uploading build artifacts - this would cut down the upload time to probably 4-5 min.
- Using `execute_job` instead of `execute_in_process`, which gets us a little bit of parallelism (see the sketch below).

I think the immediate next steps on the parallelize-tests branch are:
Which should get us to something closer to 45 minute CI runs and not take that long to implement.
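As an aside, here's a rough sketch of the `execute_in_process` vs. `execute_job` distinction referenced above, assuming recent Dagster APIs. The toy ops and job are made up for illustration; this is not the actual PUDL ETL job.

```python
# Rough sketch of the difference between the two Dagster execution APIs.
from dagster import DagsterInstance, execute_job, job, op, reconstructable


@op
def extract_a():
    return [1, 2, 3]


@op
def extract_b():
    return [4, 5, 6]


@op
def combine(a, b):
    return a + b


@job
def toy_etl():
    combine(extract_a(), extract_b())


if __name__ == "__main__":
    # Runs every op serially inside this single process. Simple, and handy
    # for test fixtures, but no parallelism.
    toy_etl.execute_in_process()

    # Runs through the default executor, which launches worker processes, so
    # independent ops like extract_a and extract_b can run concurrently.
    # Needs a reconstructable job and a persistent DagsterInstance
    # (this assumes DAGSTER_HOME is configured).
    result = execute_job(reconstructable(toy_etl), instance=DagsterInstance.get())
    assert result.success
```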
If we built a single Docker image on push and used that everywhere it might save us a few minutes too - that will matter more if we have more jobs running in series, but until we get to the <15 minute mark I'm not too worried about it.
I think promising future work after that is:
Pie-in-the-sky:
cc @zaneselvans who I know has been frustrated by this in the past.