-
-
Notifications
You must be signed in to change notification settings - Fork 702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
array facet: don't materialize unnecessary columns #2008
base: main
Are you sure you want to change the base?
Conversation
The presence of `inner.*` causes SQLite to materialize a row with all the columns. Those columns will be discarded later. Instead, we can select only the column we'll use. This lets SQLite's optimizer realize that the other columns in the CTE definition aren't needed. On a test table with 278K rows, 98K of which had an array, this speeds up the facet calculation from 4 sec to 1 sec.
I don't know how/if you do automated tests for performance, so I haven't changed any of the tests. |
Codecov ReportBase: 92.11% // Head: 91.78% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #2008 +/- ##
==========================================
- Coverage 92.11% 91.78% -0.34%
==========================================
Files 38 39 +1
Lines 5555 5599 +44
==========================================
+ Hits 5117 5139 +22
- Misses 438 460 +22
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Ack, this PR is broken. I see now that the That fixes the overcounting, but I think can undercount when the rows have the same data, eg a view like: SELECT '["bar"]' tags UNION ALL SELECT '["bar"]' will produce a count of If I guess a general solution would be to compute a window function so we have a distinct ID for each row. Will fiddle to see if I can get that working. |
Use a window function to generate unique IDs for each row to avoid having to do a DISTINCT on arbitrarily wide rows. Add a test to capture the case where multiple rows are the same.
I pushed a revised version which ends up being faster -- the example which currently takes 4 seconds now runs in 500ms. |
and more correct. progress on #34, see also simonw/datasette#2008
I absolutely love this performance boost - really nice find. One concern: this will be the first time Datasette ships a core feature that uses window functions. Window functions were added to SQLite in version 3.25.0 on 2018-09-15 - which means it's still very common for Datasette to run on versions that don't yet support them. So I see two options:
I like the first option a bit better. This also leads to a tricky CI challenge: Datasette needs to be able to run its test suite against more than one SQLite version to confidently test this feature going forward. I don't yet have a good GitHub Actions recipe for this, but I really need one - for Might be able to use this trick for that: https://til.simonwillison.net/sqlite/ld-preload |
We don't have any performance tests yet - would be a useful thing to add, I've not built anything like that before (at least not in CI, I've always done as-hoc performance testing using something like Locust) so I don't have a great feel for how it could work. I see not having to change the tests at all for this change as a really positive sign. If you find any behaviour differences between this and the previous that's a sign we should add a mother test or two specifying the behaviour we want. |
Performance tests - I think most places don't have them as a formal gate enforced by CI. TypeScript and scalac seem to have tests that run to capture timings. The timings are included by a bot as a comment or build check, and also stored in a database so you can graph changes over time to spot regressions. Probably overkill for Datasette! Window functions - oh, good point. Looks like Ubuntu shipped JSON1 support as far back as sqlite 3.11. I'll let this PR linger until there's a way to run against different SQLite versions. For now, I'm shipping this with Tests - there actually did end up being test changes to capture the undercount bug of the current implementation, so the current implementation would fail against the new tests. Perhaps a non-window function version could be written that uses |
Had an interesting conversation about this just now: https://fedi.simonwillison.net/@simon/109773800944614366 There's a risk that different runs will return different results due to the shared resource nature of GitHub Actions runners, but a good fix for that is to run comparative tests where you run the benchmark against e.g. both |
The presence of
inner.*
causes SQLite to materialize a row with all the columns. Those columns will be discarded later.Instead, we can select only the column we'll use. This lets SQLite's optimizer realize that the other columns in the CTE definition aren't needed.
On a test table with 278K rows, 98K of which had an array, this speeds up the facet calculation from 4 sec to 1 sec.
📚 Documentation preview 📚: https://datasette--2008.org.readthedocs.build/en/2008/