Wrong information about SQLite #120
Yes, the statement "In general, it's impossible to read from a SQLite database without fetching the entire file's content." is wrong. https://github.com/sozip/sozip-spec?tab=readme-ov-file#datasets-available-as-sozip points to an 18.8 GB SOZip-compressed GeoPackage that QGIS can use remotely in a rather smooth way. The SQLite B-tree is not super cloud-optimized friendly, but it is not hostile either.
How do you feel about changing "impossible" to "inefficient"? I had remembered reading https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/ and https://github.com/phiresky/sql.js-httpvfs, which seemed to say that some amount of fine-tuning was useful before trying to run queries on a SQLite database over HTTP. It's not clear to me how much of the benefit in your examples comes from SOZip, on top of the SQLite database itself.
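The reason range-based clients like sql.js-httpvfs can work at all is that a SQLite file is a sequence of fixed-size pages, and the page size is declared in the 100-byte header at the start of the file. A minimal sketch of that, using a throwaway local database as a stand-in for a remote file (reading the first 100 bytes locally is equivalent to an HTTP `Range: bytes=0-99` request):

```python
import os
import sqlite3
import struct
import tempfile

# Create a tiny SQLite database to stand in for a remote GeoPackage.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
con.commit()
con.close()

# The first 100 bytes form the database header; a range-request client
# starts by fetching exactly this to learn the page size.
with open(path, "rb") as f:
    header = f.read(100)

magic = header[:16]
# Page size is a big-endian 16-bit value at offset 16 (the value 1
# means 65536 by convention).
page_size = struct.unpack(">H", header[16:18])[0]
print(magic)  # b'SQLite format 3\x00'
print(page_size)
```

Once the page size is known, every B-tree page lives at a predictable byte offset, which is what makes targeted range reads possible.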
I can unzip the GeoPackage (or you can do it yourself and place the file on any HTTP server), but I can imagine that SOZip makes queries a bit slower. Anyway, I do not yet know any other file format that supports small attribute queries with SQL over HTTP any better than SQLite seems to. I refer to this example.
Here are more strings to test with. I wouldn't call it inefficient to find the right parcels out of more than 3 million within a second, or faster. 72940400030049
Probably modest when you extract a very small subset. If you extract more, then SOZip reduces the amount of transferred data by 2x or 3x. I am speaking here about vector GeoPackages. Raster GeoPackage or MBTiles store tiles in compressed image formats, so you should not use SOZip for such datasets.
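The 2x-3x figure is plausible because vector attribute data tends to be highly repetitive, and SOZip stores members with DEFLATE, the same codec as zlib. A rough sketch with synthetic parcel-like rows (the row format here is invented for illustration, not taken from the actual dataset):

```python
import zlib

# Synthetic, repetitive attribute rows resembling parcel records;
# real GeoPackage pages also contain binary geometry, so the real
# ratio will differ.
rows = "\n".join(
    f"72940400030{str(i).zfill(3)};parcel;Helsinki" for i in range(10000)
)
raw = rows.encode()
packed = zlib.compress(raw, 6)  # DEFLATE, the codec SOZip uses
ratio = len(raw) / len(packed)
print(round(ratio, 1))
```

Highly repetitive text compresses far better than 3x; mixed attribute-plus-geometry pages land closer to the 2x-3x the comment reports.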
In https://guide.cloudnativegeo.org/pmtiles/intro.html#mbtiles it is written:
The same information also appears in https://guide.cloudnativegeo.org/glossary.html#geopackage:
I don't know much about MBTiles, but the following ogrinfo command proves that the claim is not true, at least if the SQLite database is a GeoPackage with vector data. The size of the test file https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip is 6.1 GB. It is a SOZip-compressed (https://github.com/sozip/sozip-spec) GeoPackage database, thus SQLite. It is not important that the GeoPackage is also sozipped, but due to the smaller size it is more cloud-optimized.
A command that reads one feature from the 6.1 GB file and prints some debug info:
```
ogrinfo /vsizip/vsicurl/https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip -sql "select * from PalstanSijaintitiedot limit 1" --debug on --config cpl_debug=on --config cpl_timestamp=on
```
By the logs, GDAL makes 6 HTTP requests and the total amount of downloaded data is 108,288 bytes. That is much less than 6.1 GB.
The GeoPackage on the server has an index on the column "kiinteistotunnus", and therefore this query also works pretty well, quickly finding 2 features out of 3,741,133:
```
ogrinfo -ro /vsizip/vsicurl/https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip -sql "select * from PalstanSijaintitiedot where kiinteistotunnus='43441500010225'" --debug on --config cpl_debug=on --config cpl_timestamp=on
```
With this command GDAL makes 15 HTTP range requests and reads a total of 221,971 bytes; on my computer it took 0.657 seconds.
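The index is what keeps the number of range requests small: SQLite walks the index B-tree instead of scanning all 3.7 million rows. A minimal sketch of the same pattern with the stdlib `sqlite3` module (the table is empty and only mirrors the thread's names; the real GeoPackage schema has more columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE PalstanSijaintitiedot "
    "(id INTEGER PRIMARY KEY, kiinteistotunnus TEXT)"
)
# The equivalent of the index the server-side GeoPackage carries.
con.execute(
    "CREATE INDEX idx_tunnus ON PalstanSijaintitiedot (kiinteistotunnus)"
)
# EXPLAIN QUERY PLAN shows SQLite choosing an index search over a scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM PalstanSijaintitiedot "
    "WHERE kiinteistotunnus = '43441500010225'"
).fetchall()
print(plan)
```

The plan's detail column reports a SEARCH using `idx_tunnus` rather than a full-table SCAN, which is why only a handful of B-tree pages, and hence a handful of byte ranges, need to be fetched.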
I do not claim that SQLite is a cloud-optimized format, but it certainly can be queried over HTTP without downloading the whole database. The spatial R-tree index can also be utilized for bounding box queries. Filtering with a 10x10 km bbox that finds 1308 features took 24 seconds:
```
ogr2ogr -f gpkg testi2.gpkg /vsizip/vsicurl/https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip -spat 389000 6960000 399000 6970000 PalstanSijaintitiedot --debug on --config cpl_debug=on --config cpl_timestamp=on
```
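GeoPackage stores its spatial index as a standard SQLite R-tree virtual table, which is why `-spat` can prune by bounding box at all. A minimal stand-in (the table name and boxes are invented; the real index tables are named `rtree_<table>_<geom>` and are maintained by triggers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A GeoPackage spatial index is an SQLite R-tree like this one.
con.execute(
    "CREATE VIRTUAL TABLE rtree_demo USING rtree(id, minx, maxx, miny, maxy)"
)
con.executemany(
    "INSERT INTO rtree_demo VALUES (?, ?, ?, ?, ?)",
    [
        (1, 389500, 389600, 6960500, 6960600),  # inside the 10x10 km bbox
        (2, 500000, 500100, 7100000, 7100100),  # far outside it
    ],
)
# The same overlap test ogr2ogr's -spat filter reduces to.
hits = con.execute(
    "SELECT id FROM rtree_demo WHERE maxx >= 389000 AND minx <= 399000 "
    "AND maxy >= 6960000 AND miny <= 6970000"
).fetchall()
print(hits)  # → [(1,)]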
With FlatGeobuf the timing was 1.73 seconds (the number of features differs because the datasets are not identical):
```
ogr2ogr -f gpkg testi3.gpkg /vsicurl/https://latuviitta.kapsi.fi/data/kiinteisto/palstansijaintitiedot.fgb -spat 389000 6960000 399000 6970000 PalstanSijaintitiedot --debug on --config cpl_debug=on --config cpl_timestamp=on
```
And with Parquet, created with the GDAL creation options `-lco sort_by_bbox=yes -lco row_group_size=3000`, the timing was 12.54 seconds:

```
ogr2ogr -f gpkg testi4.gpkg /vsicurl/https://latuviitta.kapsi.fi/data/kiinteisto/palstansijaintitiedot2.parquet -spat 389000 6960000 399000 6970000 PalstanSijaintitiedot2 --debug on --config cpl_debug=on --config cpl_timestamp=on
```
In summary, by utilizing the GDAL virtual file system (VSI), it is possible to query data from SQLite databases without downloading the whole file. A regular GeoPackage is not hopelessly slow for bbox queries, and attribute queries are pretty fast if the GeoPackage has a corresponding index. FlatGeobuf is very fast with bbox queries, but it does not support attribute queries at all.
FYI @rouault