Can the Neurobagel data structure and query interface be customized (or how complicated would it be to do so)? #307
Comments
Hey @jsheunis, thanks for the questions! They're all very good ones :). Let me first say something about the high-level idea. Neurobagel's goal is to make the 80% use case for cohort discovery (and the corresponding "how do I get this cohort onto my HDD now" use case) easy to address across datasets, and to demonstrate that it works. The way we have built this initial demonstration is by starting from a small, fixed set of common query variables.
Now that we can demonstrate that the cross-dataset query (and, for some datasets, the "download this cohort" bit) works, we want to expand the list of query variables based on other use cases. Very likely, what constitutes a good use case will depend a bit on the scientific community you are in (e.g. a neurodegenerative disorders group looks for different specialized query parameters than someone focused on visual attention, and so on). But even across such sub-communities, we want to encourage everyone to share the same "common" core bits, e.g. basic demographics, maybe diagnosis, imaging modalities, etc., so that sub-communities can discover each other's data as well, even if just at the superficial level where they overlap. Said another way: our goal is to grow a common data model, and to extend that common model with sub-community-specific extensions that cover specific things (e.g. specific clinical stages). But even within sub-communities we'd consider an extension to be about a use case that's shared by several sites.

All that is to say: on the technical side, we didn't start this out as something that's fully configurable, where you can just take the tools, swap in a different data model, and deploy it on your own. To support the sub-community extensions I mentioned, we definitely want to make the tools more configurable (so communities can build these extensions without opening issues for the core team), but we aren't currently considering a use case where you take the tools and swap out everything about the data model internally to deploy in a one-off, fully custom way. For some tools, like the annotation tool, that'll be quite easy to do. For others, like the query tool and the federation API, it'll require a bit more work or thought (e.g. because we currently have an internal SPARQL query template that gets populated when running a query, and the data model is implicitly encoded in this template). But this is something we're planning to do, and depending on how much customization your use case needs, it might not be very tricky.
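To make the "implicitly encoded data model" point concrete, here is a minimal, purely illustrative sketch of such a templated query. This is not the actual Neurobagel template: the nb: prefix URL, the property names, and the term URL are all invented for the example.

```python
# Illustrative only: a hard-coded SPARQL template in which the data model
# (Subject -> Session -> Diagnosis) is baked into the query structure itself.
QUERY_TEMPLATE = """
PREFIX nb: <https://example.org/neurobagel-vocab/>

SELECT DISTINCT ?subject
WHERE {{
    ?subject a nb:Subject ;
             nb:hasSession ?session .
    ?session nb:hasDiagnosis <{diagnosis_term}> .
}}
"""


def build_query(diagnosis_term: str) -> str:
    """Populate the template with a user-supplied controlled term."""
    return QUERY_TEMPLATE.format(diagnosis_term=diagnosis_term)


print(build_query("https://example.org/terms/some-diagnosis"))
```

The point is that the Subject -> Session -> Diagnosis shape lives inside the query string itself, so supporting a different data model means editing the template rather than flipping a configuration switch.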
The schema might be a little confusing because it's designed to be valid both for a BIDS-only data dictionary AND for a BIDS+annotations data dictionary.
Could you say more about what you're trying to do? At the moment, you can have either just a "participant-identifier" column or "participant-identifier AND session-identifier" columns. We know that for some datasets a single participant identifier is not sufficient, e.g. because that dataset has multiple ID systems, but I'm not sure if that's what you are asking about here. Generally speaking: the data model is participant-centric, so we always need to know what a participant is, i.e. what their unique identifier is.
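As a purely hypothetical illustration of that participant-centric shape (column names and values are made up), the phenotypic TSV boils down to one row per participant-session:

```python
# Hypothetical example data, not taken from any real dataset: one row per
# participant-session, with columns identifying the participant and the session.
MINIMAL_PHENOTYPIC_TSV = """\
participant_id\tsession_id\tage\tgroup
sub-01\tses-01\t34\tpatient
sub-01\tses-02\t35\tpatient
sub-02\tses-01\t28\tcontrol
"""

print(MINIMAL_PHENOTYPIC_TSV)
```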
We follow the BIDS spec for data dictionaries to maximize compatibility. There is a second model for the graph that we create from the data dictionaries, and that one is independent of BIDS.
Yes
That depends. If you care about raw imaging data, the only way we currently learn about its availability is by asking pyBIDS (in fact, for imaging data the bagel-cli is mostly a pyBIDS wrapper). For derivatives/processed data, because there is to our knowledge no standard for that, we rely on a tabular input format (currently in development) with an existing schema.

If you skip the whole "annotate -> data dictionary -> bagel-cli -> JSON-LD/graph file" workflow and just create the graph file directly according to our schema, it's essentially up to you how you want to decide and encode whether a subject has imaging data available. You would only need to make sure that you use the same controlled terms in the graph to refer to imaging modalities etc., otherwise the queries / APIs / query tool will not work with your graph files. The reason we rely on pyBIDS is that it makes it easier for us to then tell DataLad which files to provision when someone finds the subject in a cohort query; but that's the only direct link to BIDS. I feel like we'd need to chat about this a little more so I understand what your constraints / goals are.
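For illustration, this is roughly the kind of pyBIDS lookup implied above. It's a minimal sketch, not the bagel-cli's actual code, and the dataset path is a placeholder.

```python
# Minimal sketch (not the bagel-cli's actual implementation): ask pyBIDS which
# imaging files each subject has, and derive "this subject has T1w/bold/... data"
# from the suffixes of those files.
from bids import BIDSLayout

layout = BIDSLayout("/path/to/bids/dataset")  # placeholder path

for subject in layout.get_subjects():
    files = layout.get(subject=subject, extension=".nii.gz", return_type="filename")
    suffixes = {layout.parse_file_entities(f).get("suffix") for f in files}
    print(subject, sorted(s for s in suffixes if s))
```

Anything that can answer the same question ("which subjects have which imaging data?") could in principle stand in for this step if your data aren't in BIDS.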
Yeah, that would be rough to map to tsv+dictionary. The short answer is: you cannot search DNA sequence info / variant info with Neurobagel (yet). We are chatting quite a bit with the GA4GH folks, who have thought about how to discover such things in a lot of detail, and I would start from the standards and protocols they have in mind when we do add DNA info.
At the moment, you can't customize the tools to use different data models. We want to add that capability so we can support sub-community use cases (e.g. with our friends who do PD research ...). Generally, such extensions of the data model range from trivial to easy from the graph's perspective, and from very easy to moderately involved for the other tools (e.g. we're just adding the ability to model "has this subject been preprocessed by FreeSurfer 7.3.2?", and that's pretty simple). Since you seem to have a couple of use cases that fall outside of what you can do right now, I think it'd be good to chat about how we can make these extensions easier.
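To give a feel for why that kind of extension is cheap at the graph level, here is a minimal, hypothetical rdflib sketch. The namespace, property names, and subject ID are invented for the example and are not Neurobagel's actual vocabulary.

```python
# Hypothetical sketch: at the graph level, recording "this subject was processed
# with FreeSurfer 7.3.2" is just a couple of extra triples on an existing subject.
from rdflib import Graph, Literal, Namespace, RDF

NB = Namespace("https://example.org/neurobagel-vocab/")  # invented namespace

g = Graph()
subject = NB["sub-01"]  # invented subject ID

g.add((subject, RDF.type, NB.Subject))
g.add((subject, NB.processedBy, NB.FreeSurfer))
g.add((subject, NB.pipelineVersion, Literal("7.3.2")))

print(g.serialize(format="turtle"))
```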
Happy to continue this conversation on https://hub.datalad.org/datalink/org/issues/2#issue-21 if you tell me how to sign up for that server :)
Thanks for the detailed response!
Understandable, and good to hear that some level of configurability is the eventual goal.
Ok, good to know this explicitly.
I gave a stupid example. The point I was trying to get at is whether the schema is, like you say, participant-centric, and whether other data entities need to map onto that in order to yield valid graph-ready data. In the consortia that we work with, this will often be the case, but not always. And users might not want to query a node with a focus on participants, but rather on e.g. samples. I think a good example of a consortium we might work with is one we have actually worked with: https://www.crc1451.uni-koeln.de/. We have a catalog of data contributed by different groups in the consortium, https://data.sfb1451.de/, and it could eventually be good to be able to query metadata from this catalog via Neurobagel. If you browse through the catalog you'll see that there are imaging datasets from patients/participants, but also data collected from individual neurons or groups of neurons, where participants aren't even mentioned. Or e.g. microarray analysis on dissections of the CNS of mice (if that even makes sense, I'm no expert), or spike times from stimulation of cockroach brains. Users might want to find all datasets containing spike time measurements, irrespective of the type of animal or cell they were taken from. That's why I focused on the question of identifier columns: we might model a measurement, a sample, or a "data entry" that each receive their own IDs.
Agreed, let's do that. Will contact you. Thanks again!
We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 75 days.
@surchs Could you please take a look and close this issue if it has been addressed?
Hey folks! I've been reading more about Neurobagel and testing a local instance. Firstly, nice work! The node was pretty straightforward to set up.
Now we've started exploring use cases, and I am uncertain whether or how Neurobagel can deal with what we have in mind. Basically, can Neurobagel tooling practically take any data dictionary we cook up? And can the query interface respond to this?
PS: I wasn't sure where exactly to create this issue, since it relates to multiple components, so I just selected the query-tool; please feel free to move the issue wherever is appropriate.
The data dictionary defines the semantic annotations of the columns in the Neurobagel TSV file, so my understanding is that we could technically include any arbitrary columns and annotations as long as we stick to the data dictionary specification (i.e. only categorical columns, continuous columns, or identifier columns). What I am not sure about is the nature of the identifier columns. From my understanding of the docs about the Neurobagel TSV file, rows are equivalent to "participant-sessions", i.e. there are only two identifier columns ("Identifies: participant" and "Identifies: session"). Is this a hard requirement for bagel-cli and the query tool? Or can we include an arbitrary number of identifier columns (a single one, or many)? If possible, how will the query interface deal with this? Automatically, or will it need development to deal with the changes? I assume that e.g. "Identifies: participant" has some internal mapping used in the process of generating graph-ready data, so if we e.g. say "Identifies: sample" or "Identifies: cuteLittlePuppy" the process will fail?
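For concreteness, here is roughly how I picture such a data dictionary entry for an identifier column. This is a hypothetical sketch based on my reading of the docs, so the exact keys and the term URL may well be wrong.

```python
# Hypothetical sketch of a data dictionary entry for an identifier column;
# key names and the term URL are guesses, not a verified Neurobagel example.
import json

data_dictionary = {
    "participant_id": {
        "Description": "Unique participant identifier",
        "Annotations": {
            "IsAbout": {
                "TermURL": "https://example.org/terms/ParticipantID",
                "Label": "Participant ID",
            },
            "Identifies": "participant",
        },
    }
}

print(json.dumps(data_dictionary, indent=2))
```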
As noted at the end of this comment datalink/org#2 (comment), my understanding is that Neurobagel has its own internal schema for subjects, sessions, images, etc., which I assume follows BIDS to a major extent. I understand that the bagel-cli can be used to generate phenotypic-only graph-ready data, i.e. a BIDS dataset does not have to accompany the process. But what happens if we still have an accompanying scientific dataset that does not conform to BIDS, but we want to make some or all of its aspects/content findable in a Neurobagel node via the query interface, e.g. DNA sequencing or flow cytometry data? Some aspects might be mappable onto the "TSV-file/data-dictionary" paradigm as new columns, but others not.

So in summary, will Neurobagel components be able to deal with this? If not out of the box, how complicated would it be to customize them? Or would they not be customizable at all?