Address performance issues #142

Open
surchs opened this issue Apr 14, 2024 · 1 comment
Labels:
- flag:discuss: Flag issue that needs to be discussed before it can be implemented.
- Milestone: Used to track other issues that are required to complete the milestone.

Comments

@surchs (Contributor) commented Apr 14, 2024

With the changes in #83, query performance has tanked to unacceptable levels, so much so that we had to remove timeout limits from the APIs (e.g. neurobagel/api#304).

I ran a simple timing benchmark of these steps locally:

| Step | Code | Current (s) | Optimized (s) |
| --- | --- | --- | --- |
| API sends query to graph | | | |
| Graph runs query (internal) | code | ~79 | ~3 |
| Graph responds | | ~2 | ~2 |
| API unpacks response into pandas df | code | ~0.3 | ~0.3 |
| API assembles its own response | code | ~10 | ~9 |
| User retrieves API response | | ~12 | ~8 |
| **Total** | | **104** | **22** |

Notes:

- "Optimized" refers to a query optimization that I think makes sense and that I will submit (it needs more testing).
- User retrieval time is the total time for a `curl` run minus the internal time logged by the API itself, i.e. the gap between when the API finishes and when the user has the response (see the timing sketch after these notes).
- Everything runs locally on one machine, so anything going over the network is likely a best-case scenario!
- This uses the current (new) OpenNeuro data.
- The response bundle that the API sends back is ~15-18 MB!
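For reference, a minimal sketch of how such a timing can be taken. The endpoint URL, the query parameters, and the `X-Process-Time` header (a common FastAPI middleware convention, not necessarily what our API emits) are all assumptions:

```python
import time
import requests

URL = "http://localhost:8000/query"          # placeholder API URL
PARAMS = {"min_age": 20, "max_age": 40}      # illustrative criteria

start = time.perf_counter()
response = requests.get(URL, params=PARAMS)
total_s = time.perf_counter() - start

# Assumes a middleware that reports the API's internal processing time
# in an X-Process-Time header; swap in whatever the API actually logs.
api_internal_s = float(response.headers.get("X-Process-Time", "nan"))

print(f"total: {total_s:.1f} s, size: {len(response.content) / 1e6:.1f} MB")
print(f"user retrieval (total - API internal): {total_s - api_internal_s:.1f} s")
```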

Conclusions

- The current query time is unacceptable and we have to optimize the SPARQL template. The good news: that's very doable.
- Even then, we are still pretty slow.
- Even if we accept a total query time of ~20 s, sending a response blob of ~18 MB is super inefficient and will make the query tool struggle.
- In theory, caching can remove the API-internal time (i.e. ~14 s -> 0 s), but it will not address the response download time or the size of the response.
- In practice, caching will be largely irrelevant: it only helps identical queries, and unless we keep the cache around forever, very few queries can be expected to overlap before the cache invalidates (see the sketch below).
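To illustrate that last point, a minimal sketch of a naive in-memory cache keyed on the exact query parameters (the function and criteria here are illustrative, not the actual API code):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def run_query(criteria: frozenset) -> str:
    # Hypothetical stand-in for the real graph query + response assembly.
    print(f"cache miss, running query for {sorted(criteria)}")
    return f"results for {sorted(criteria)}"

# Only an identical key hits the cache: the first two calls miss each
# other even though the cohorts they describe overlap almost entirely.
run_query(frozenset({("min_age", 20), ("max_age", 40)}))
run_query(frozenset({("min_age", 21), ("max_age", 40)}))  # different key -> miss
run_query(frozenset({("min_age", 20), ("max_age", 40)}))  # identical key -> hit
```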

All that is to say: I think we should take this moment to split the query flow into two parts:

1. A cohort query that only asks:
   - Which datasets have subjects that fit my criteria?
   - How many subjects in each dataset fit my criteria?
2. For these specific datasets, give me the subjects that fit my criteria.

The first phase can be highly optimized and will have a very small response.
The second phase will also be faster (because we don't have to search the entire graph),
and its response will be smaller too, unless someone really wants to get all the data at once.
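A rough sketch of what that two-step flow could look like from a client's perspective; the endpoint names, parameters, and response shapes below are hypothetical, not an agreed-upon API:

```python
import requests

BASE = "http://localhost:8000"               # placeholder API URL
criteria = {"min_age": 20, "max_age": 40}    # illustrative criteria

# Step 1: cheap cohort query -> matching datasets and per-dataset counts.
cohort = requests.get(f"{BASE}/query/cohort", params=criteria).json()
# hypothetical shape: [{"dataset": "ds000001", "n_matching_subjects": 12}, ...]

# Step 2: fetch subject-level results only for the datasets the user selects.
selected = [d["dataset"] for d in cohort[:3]]
subjects = requests.get(
    f"{BASE}/query/subjects",
    params={**criteria, "datasets": ",".join(selected)},
).json()
```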

@surchs added the Milestone and flag:discuss labels on Apr 14, 2024
@surchs (Contributor, Author) commented May 22, 2024