Address performance issues #142

Open
surchs opened this issue Apr 14, 2024 · 1 comment
Labels:
- flag:discuss: Flag issue that needs to be discussed before it can be implemented.
- Milestone: Used to track other issues that are required to complete the milestone.

Comments

@surchs (Contributor) commented Apr 14, 2024

With the changes in #83, query performance has tanked to unacceptable levels, so much so that we had to remove timeout limits from the APIs (e.g. neurobagel/api#304).

I ran a simple timing benchmark of these steps locally:

| Step | Code | Current (s) | Optimized (s) |
| --- | --- | --- | --- |
| API sends query to graph | | | |
| Graph runs query (internal) | code | ~79 | ~3 |
| Graph responds | | ~2 | ~2 |
| API unpacks response into pandas df | code | ~0.3 | ~0.3 |
| API assembles its own response | code | ~10 | ~9 |
| User retrieves API response | | ~12 | ~8 |
| **Total** | | **104** | **22** |

Notes:

- "Optimized" refers to a query optimization that I think makes sense and that I will submit (it needs more testing).
- User retrieval time is the total time for a `curl` run minus the internal time logged by the API itself, i.e. the gap between when the API finishes and when the user has the response (see the timing sketch after these notes).
- Everything runs locally on one machine, so anything going over the network is likely a best-case scenario!
- This uses the current (new) OpenNeuro data.
- The response bundle that the API sends back is ~15-18 MB!
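For reference, a minimal sketch of how such a timing can be taken. The endpoint URL, the query parameters, and the `X-Process-Time` header (a common FastAPI middleware convention, not necessarily what our API emits) are all assumptions:

```python
import time
import requests

URL = "http://localhost:8000/query"          # placeholder API URL
PARAMS = {"min_age": 20, "max_age": 40}      # illustrative criteria

start = time.perf_counter()
response = requests.get(URL, params=PARAMS)
total_s = time.perf_counter() - start

# Assumes a middleware that reports the API's internal processing time
# in an X-Process-Time header; swap in whatever the API actually logs.
api_internal_s = float(response.headers.get("X-Process-Time", "nan"))

print(f"total: {total_s:.1f} s, size: {len(response.content) / 1e6:.1f} MB")
print(f"user retrieval (total - API internal): {total_s - api_internal_s:.1f} s")
```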

Conclusions

- The current query time is unacceptable and we have to optimize the SPARQL template. The good news: that's very doable.
- Even then, we are still pretty slow.
- Even if we accept a total query time of ~20 s, sending a response blob of ~18 MB is super inefficient and will make the query tool struggle.
- In theory, caching can remove the API-internal time (i.e. ~14 s -> 0 s), but it will not address the response download time or the size of the response.
- In practice, caching will be largely irrelevant: it only helps identical queries, and unless we keep the cache around forever, very few queries can be expected to overlap before the cache invalidates (see the sketch below).
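To illustrate that last point, a minimal sketch of a naive in-memory cache keyed on the exact query parameters (the function and criteria here are illustrative, not the actual API code):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def run_query(criteria: frozenset) -> str:
    # Hypothetical stand-in for the real graph query + response assembly.
    print(f"cache miss, running query for {sorted(criteria)}")
    return f"results for {sorted(criteria)}"

# Only an identical key hits the cache: the first two calls miss each
# other even though the cohorts they describe overlap almost entirely.
run_query(frozenset({("min_age", 20), ("max_age", 40)}))
run_query(frozenset({("min_age", 21), ("max_age", 40)}))  # different key -> miss
run_query(frozenset({("min_age", 20), ("max_age", 40)}))  # identical key -> hit
```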

All that is to say: I think we should take this moment to split the query flow into two parts:

1. A cohort query that only asks:
   - Which datasets have subjects that fit my criteria?
   - How many subjects in each dataset fit my criteria?
2. For these specific datasets, give me the subjects that fit my criteria.

The first phase can be highly optimized and will have a very small response.
The second phase will also be faster (because we don't have to search the entire graph),
and its response will be smaller too, unless someone really wants to get all the data at once.
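A rough sketch of what that two-step flow could look like from a client's perspective; the endpoint names, parameters, and response shapes below are hypothetical, not an agreed-upon API:

```python
import requests

BASE = "http://localhost:8000"               # placeholder API URL
criteria = {"min_age": 20, "max_age": 40}    # illustrative criteria

# Step 1: cheap cohort query -> matching datasets and per-dataset counts.
cohort = requests.get(f"{BASE}/query/cohort", params=criteria).json()
# hypothetical shape: [{"dataset": "ds000001", "n_matching_subjects": 12}, ...]

# Step 2: fetch subject-level results only for the datasets the user selects.
selected = [d["dataset"] for d in cohort[:3]]
subjects = requests.get(
    f"{BASE}/query/subjects",
    params={**criteria, "datasets": ",".join(selected)},
).json()
```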

@surchs added the Milestone and flag:discuss labels on Apr 14, 2024
@surchs (Contributor, Author) commented May 22, 2024