Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong information about SQLite #120

Open
jratike80 opened this issue Nov 8, 2024 · 4 comments
Open

Wrong information about SQLite #120

jratike80 opened this issue Nov 8, 2024 · 4 comments

Comments

@jratike80
Copy link

jratike80 commented Nov 8, 2024

In https://guide.cloudnativegeo.org/pmtiles/intro.html#mbtiles it is written:

In general, it’s impossible to read from a SQLite database without fetching the entire file’s content.

Same information appears also in https://guide.cloudnativegeo.org/glossary.html#geopackage:

... the entire file must be downloaded in order to read any part of the file.

I don't know much about MBTiles, but the following gdalinfo command proves that the claim is not true at least if the SQLite database is a GeoPackage with vector data. The size of the test file https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip is 6.1 GB. It is a sozipped https://github.com/sozip/sozip-spec GeoPackage database, thus SQLite. It is not important that GeoPackage is also sozipped, but due to smaller size it is more cloud-optimized.

A command that reads one feature from the 6.1 GB file and prints some debug info:

ogrinfo /vsizip/vsicurl/https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip -sql "select * from PalstanSijaintitiedot limit 1" --debug on --config cpl_debug=on --config cpl_timestamp=on

By the logs GDAL makes 6 HTTP requests and the total amount of downloaded data is 108288 bytes. That is less than 6.1 GB

Downloading 6553010176-6553020165 
Downloading 0-16383 
Downloading 6549667840-6549684223 
Downloading 16384-32767 
Downloading 32768-65535 
Downloading 6549651456-6549667839 

The GeoPackage on the server has an index on column "kiinteistotunnus" and therefore also this query works pretty well and finds 2 features out of 3741133 fast.

ogrinfo -ro /vsizip/vsicurl/https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip -sql "select * from PalstanSijaintitiedot where kiinteistotunnus='43441500010225'" --debug on --config cpl_debug=on --config cpl_timestamp=on
With this command GDAL makes 15 http range requests and reads totally 221971 bytes and with my computer it took 0.6570 seconds.

I do not claim that SQLite is a cloud-optimized format, but for sure it can be queried with http without downloading the whole database. Also spatial rtree index can be utilized for bounding box queries. Filtering with a 10x10 km bbox that finds 1308 features took 24 seconds.

ogr2ogr -f gpkg testi2.gpkg /vsizip/vsicurl/https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip -spat 389000 6960000 399000 6970000 PalstanSijaintitiedot --debug on --config cpl_debug=on --config cpl_timestamp=on

From FlatGeobuf timing shows 1.73 seconds (number of features differs because the datasets are not identical)

ogr2ogr -f gpkg testi3.gpkg /vsicurl/https://latuviitta.kapsi.fi/data/kiinteisto/palstansijaintitiedot.fgb -spat 389000 6960000 399000 6970000 PalstanSijaintitiedot --debug on --config cpl_debug=on --config cpl_timestamp=on

And from parquet, that was created with GDAL creation options -lco sort_by_bbox=yes -lco row_group_size=3000 timing was 12,54 seconds.

ogr2ogr -f gpkg testi4.gpkg /vsicurl/https://latuviitta.kapsi.fi/data/kiinteisto/palstansijaintitiedot2.parquet -spat 389000 6960000 399000 6970000 PalstanSijaintitiedot2 --debug on --config cpl_debug=on --config cpl_timestamp=on

In summary, by utilizing the GDAL virtual file system (VSI) it is possible to query data from SQLite databases without downloading the whole file. A regular GeoPackage is not hopelessly slow for bbox queries, and attribute queries are pretty fast if GeoPackage has a corresponding index. FlatGeobuf is very fast with bbox queries, but it does not support attribute queries at all.

FYI @rouault

@rouault
Copy link

rouault commented Nov 10, 2024

I do not claim that SQLite is a cloud-optimized format, but for sure it can be queried with http without downloading the whole database.

yes, the statement "In general, it’s impossible to read from a SQLite database without fetching the entire file’s content." is wrong. https://github.com/sozip/sozip-spec?tab=readme-ov-file#datasets-available-as-sozip points to a 18.8 GB SOZip-ed compressed GeoPackage that can be used remotely by QGIS in a rather decent smooth way. The SQlite BTree is not super cloud-optimized friendly, but it is not hostile either.

@kylebarron
Copy link
Member

How do you feel about changing "impossible" to "inefficient"?

I had remembered reading https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/ and https://github.com/phiresky/sql.js-httpvfs, which seemed to say that there was some amount of fine-tuning that was useful before trying to do queries on a SQLite database over HTTP.

It's not clear to me how much of the benefit in your examples is coming from SOZip on top of the SQLite database itself.

@jratike80
Copy link
Author

I can unzip the GeoPackage (or you can do it yourself and place the file into any http server), but I can imagine that SOZIP makes queries a bit slower. Anyway, I do not know any other file format yet that can support small attribute queries with SQL over http any better than SQLite seems to support. I refer to this example

ogrinfo -ro /vsizip/vsicurl/https://latuviitta.kapsi.fi/data/sozip/kiinteistojaotus.gpkg.zip -sql "select * from PalstanSijaintitiedot where kiinteistotunnus='43441500010225'" --debug on --config cpl_debug=on --config cpl_timestamp=on

Here are more strings to test with. I wouldn't call it inefficient to find right parcels out of more that 3 million ones within a second, or faster.

72940400030049
72940800100121
72940400030045
72940400030050
72940400030052
22640700050007
72940200060105
72940200060106
72989500000636

@rouault
Copy link

rouault commented Nov 11, 2024

It's not clear to me how much of the benefit in your examples is coming from SOZip on top of the SQLite database itself.

probably modest when you extract a very small subset. If you extract more, then it reduces the amount of transferred data from 2x or 3x. Speaking here about vector GeoPackage. Raster GeoPackage or MBTiles store tiles in compressed image formats, so you should not use SOZip for such datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants