Semantic Motif Searching in Knetminer
Knetminer uses a combination of graph patterns and traditional search ranking techniques to estimate how relevant genes are to the search words. This estimate is then used to rank and select the genes shown as search results.
Details are available here. We define a semantic motif as a graph path (or a pattern matching a path) from a gene to another entity in a Knetminer knowledge graph. An example (in an informal syntax):
Gene - encodes -> Protein - interacts-with (1-2 links) -> Protein <- mentions <- Publication
which links protein-mentioning publications to other interacting proteins and genes that encode the latter.
Knetminer can link genes to other entities by means of multiple motifs like the above. Every dataset/species that makes up an instance can be configured with a set of motifs, which are applied to the genes in the dataset to find relevant gene-related entities.
That matching is performed by what we call a graph traverser. Currently, there are two ways to perform semantic motif searches in Knetminer, each with its own language to define the motifs and its own set of configuration options. Each has its own graph traverser, which means you can choose the type of semantic motif search you want to use, and thus the corresponding graph pattern language, by defining the right traverser in a configuration file. Details are given in this document.
# The Data Model for the Knetminer Knowledge Graphs
Both the graph traversers used in Knetminer (or any other traverser, for that matter) allow for the definition of graph patterns by referring to the node type names and link names used in the underlying Knetminer dataset. This is essentially a knowledge graph, namely a property graph, and those names are based on a predefined schema. The reference for this schema is a metadata file included in Ondex. Examples of it are given in our paper about the Knetminer backend. The same metadata are automatically translated into our BioKNO ontology, and sample queries in SPARQL are presented in our SPARQL endpoint.
All the examples in this document are based on the same metadata.
Historically, the so-called state machine traverser (SM) was the first to be developed within the Ondex project. It allows you to define semantic motifs as a graph of transitions between node types (concept classes in Ondex terms) and the relation types that you want to hold between nodes.
For instance, this is what we use for the Arabidopsis dataset:
Here we're saying, for example, that we want to match a gene with any trait that co-occurs (`cooc_wi`) with the gene (in the sense of text mining occurrence), and both relations `Gene - cooc_wi -> Trait` and `Trait - cooc_wi -> Gene` will be matched (non-directional link). As another example, look again at the figure and find the chain `Gene - enc - Protein - genetic|physical -> Protein`, which includes self-loops on the first protein, mixed directed and undirected links, and multiple relation types that are valid to link one protein to the next.
A state machine can be defined in a simple flat file format. The file defining the SM in the figure is here. Let's look at an example:
#Finite States *=start state ^=end state
1* Gene
2^ Publication
3^ MolFunc
...
7^ Protein
8^ Gene
9 Gene
10 Protein
...
10-10 ortho 4
10-10 xref 4
10-10 genetic 6 d
...
1-10 enc
10-7 physical 6 d
10-7 genetic 6 d
...
The format is very simple:
- Every file row defines either a node or a transition, and each row has fields separated by tab characters.
- Node types are numbered in the first section of the file (type names must match Ondex concept classes).
- Node numbers are used to define transitions between nodes (as with nodes, transition names must match Ondex relation types).
- A transition has the format `<node1>-<node2> <name> [limit] [d]`, where `limit` is the maximum "distance" of a path found between the gene and `<node2>`. In calculating this distance, the first gene counts 1, and every following link or node counts 1. For instance, the distance of `protein1` in the path `gene0 -> encodes -> protein0 -> ortho -> protein1` is 5. Based on that, every matched self-loop adds 2 to the count (the link plus the target node).
- The optional flag `d` is used to define directional transitions. Note that these are usually faster to match. So, even if a directional transition isn't defined for `enc` (we don't expect data where proteins encode genes), adding the flag would speed things up in case of problems.
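To make the format rules above concrete, here is a minimal Python sketch of a reader for such a file. It is not the parser used by Ondex or Knetminer: the names `Transition`, `parse_state_machine` and `path_distance` are hypothetical, and treating `#` lines as comments and accepting any whitespace (not only tabs) as a field separator are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    source: int
    target: int
    name: str
    limit: Optional[int]  # max path "distance", None when the field is omitted
    directed: bool        # True when the optional 'd' flag is present


def parse_state_machine(text):
    """Return (nodes, start, ends, transitions) read from the flat file text."""
    nodes, start, ends, transitions = {}, None, set(), []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' rows are comments
            continue
        fields = line.split()  # assumption: any whitespace separates fields
        head = fields[0]
        if "-" in head:
            # Transition row: <node1>-<node2> <name> [limit] [d]
            src, dst = (int(n) for n in head.split("-"))
            extras = fields[2:]
            limit = next((int(f) for f in extras if f.isdigit()), None)
            transitions.append(Transition(src, dst, fields[1], limit, "d" in extras))
        else:
            # Node row: <number>[*|^] <type>, '*' = start state, '^' = end state
            node_id = int(head.rstrip("*^"))
            nodes[node_id] = fields[1]
            if head.endswith("*"):
                start = node_id
            elif head.endswith("^"):
                ends.add(node_id)
    return nodes, start, ends, transitions


def path_distance(path):
    """Distance as described above: every node and every link counts 1."""
    return len(path)


SAMPLE = """\
#Finite States *=start state ^=end state
1*\tGene
7^\tProtein
10\tProtein
10-10\tortho\t4
1-10\tenc
10-7\tphysical\t6\td
"""

nodes, start, ends, transitions = parse_state_machine(SAMPLE)
distance = path_distance(["gene0", "encodes", "protein0", "ortho", "protein1"])
```

The last line reproduces the distance example from the list: five elements (three nodes and two links) give a distance of 5.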
The SM configuration is part of the configuration required to set up a Knetminer instance, which is described in our wiki. Our pre-configured datasets are examples of it.
Details are:
- In `maven-settings.xml`, leave the Maven property `knetminer.api.graphTraverserClass` empty, ie, don't add it to your dataset-specific settings, which will inherit the default empty value. This corresponds to picking the default traverser class, `net.sourceforge.ondex.algorithm.graphquery.GraphTraverser`, which is the SM traverser (so, defining it explicitly would achieve the same result). Note that the Maven property is injected into the Knetminer configuration file that is generated from a template.
- Define the state machine for your dataset in the file `<dataset>/ws/SemanticMotifs.txt`, using the format explained above. This is the path set in the Knetminer configuration file. An alternative would be to place your own `data-source.xml` file in the dataset directory and define a different path/name. Unless you have special needs, we don't recommend it.
- In `maven-settings.xml`, define the right `knetminer.specieTaxId` (comma-separated list of NCBITax codes), which is used to pick the genes of your species of interest. Semantic motifs are applied to these genes during the Knetminer initialisation, in order to have pre-computed data to start searches from. These are named 'seed genes'. As an alternative to using the species ID to select them, you can define a list of seed genes explicitly, by setting the property `knetminer.backend.seedGenesFile` in `maven-settings.xml`, see this example.
The SM traverser is usually rather efficient, without having much to configure/tune. However, there are a few factors that affect its performance:
- Obviously, the more seed genes you have, the slower the Knetminer initialisation is.
- Similarly, bigger state machines (in terms of total number of nodes + transitions) take more time, but the traversal is parallel and usually scales well.
- The biggest impact on performance comes from what you match. For instance, if you have self-loops like `protein->xref->protein`, the traverser can easily hang while trying to match long chains of cross-references, especially if there are loops in the graph. In a case like this, you should always define a low-enough limit constraint (see above).
- Both the Knetminer traversers save their initialisation results in memory (to be reused during the application lifetime). If you have many paths to match, you might need more RAM (see the Docker documentation). Increasing memory can also make the initialisation stage faster, since this limits the frequency of intermediate result cleaning operations (ie, the garbage collector overhead).
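The effect of the limit on self-loops can be seen with a toy model. The Python sketch below is not the traverser's algorithm: `count_paths` and the three-protein `xrefs` graph are hypothetical, and the distance bookkeeping simply follows the counting convention described earlier (gene, `enc` link and first protein count 3, each further hop adds 2). It counts how many cross-reference chains fit within a given limit when the graph contains loops.

```python
def count_paths(xrefs, start, limit):
    """Count xref chains starting at `start` whose distance stays within `limit`."""

    def walk(protein, distance):
        total = 1  # the chain that ends at `protein`
        for nxt in xrefs.get(protein, ()):
            # A self-loop hop adds 2: the link plus the target node
            if distance + 2 <= limit:
                total += walk(nxt, distance + 2)
        return total

    # gene (1) + enc link (2) + first protein (3)
    return walk(start, 3)


# Three proteins that all cross-reference each other (a graph with loops)
xrefs = {
    "p0": ["p1", "p2"],
    "p1": ["p0", "p2"],
    "p2": ["p0", "p1"],
}

counts = {limit: count_paths(xrefs, "p0", limit) for limit in (5, 9, 13)}
print(counts)  # {5: 3, 9: 15, 13: 63}
```

Even on this tiny graph the number of matching chains roughly doubles with every extra hop the limit allows, which is why a low limit matters on real data.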
Flat files can be visualised like the figure above, see the section about the state machine converter below.
The Cypher/Neo4j traverser (shortened as Cypher traverser) is part of our efforts to publish Knetminer data as machine-readable, standardised data, which can be accessed by third-party applications, including, for instance, your scripts. Details about this general perspective are in our above-mentioned backend paper.
With this traverser, you can define semantic motif paths by means of graph queries based on the Cypher query language. The idea is that a Knetminer dataset (available as an OXL file produced by Ondex) is converted to a Neo4j database, which is then made available to applications like the Cypher traverser.
As a language to define semantic motifs, Cypher is more expressive than the state machine format and offers more advanced constructs.
The queries can initially be tried straight in the Neo4j browser, either from your own Neo4j instance (see below), or using the endpoints we provide for some datasets.
Note that the Neo4j database used by the Cypher traverser doesn't replace the OXL file that Knetminer uses for most of its operations. The two have to be aligned: the Neo4j database has to be generated by converting the OXL, as explained below.
The Cypher traverser is implemented in the backend project.
Before looking at the details of the Cypher traverser configuration, let's talk about the format required for its queries. The traverser supports the Neo4j flavour of the Cypher language (but we could add support for other Open Cypher databases in future). However, some restrictions are required for a query to make sense in the context of semantic motifs. This is an example, based on the pattern discussed in the [State Machine section][TODO] above:
MATCH path = (gene_1:Gene)
- [enc:enc] - (protein_10:Protein)
- [rel_10_10:h_s_s|ortho|xref*0..1] - (protein_10b:Protein)
- [rel_10_7:genetic|physical*1..2] -> (protein_7:Protein)
WHERE gene_1.iri IN $startGeneIris
RETURN path
We can take this example to talk about several rules:
- The traverser is configured with a list of queries. Every query represents a graph pattern that matches a path from a gene to some other entity, listing all the intermediate links in the chain.
- The query must return matching paths, not single nodes or relations. These are used in combination with in-memory OXL data to reconstruct the resulting paths in a form that is compatible with what Knetminer components expect (ie, Ondex interface instances).
- The query must contain a clause like `WHERE gene_1.iri IN $startGeneIris`. At run time, the traverser instantiates `$startGeneIris` with a list of seed gene URIs. Usually, the seed genes are sliced into batches of a few thousand genes and this list contains one batch, not all the seed genes (this is done for performance reasons).
- IRIs are unique identifiers that can be created during the OXL/RDF/Neo4j conversion (see below). In our context, IRI and URI can be considered synonyms. As a result of this conversion, OXL and Neo4j data have these IRIs aligned, so that they can be used to map results from the Cypher traverser to in-memory data from the OXL.
- It doesn't make sense to use operators like `ORDER BY`: paths are searched in parallel and the ordering just wastes time.
- A query must still be valid when `SKIP` and `LIMIT` clauses are added to it. In fact, the traverser performs paginated queries by automatically adding these to your query, so you should never use these keywords yourself.
- All the queries used in a configuration should always return distinct paths, including paths from different queries. If that requirement isn't fulfilled, you'll occupy RAM with redundant data and, worse, you'll spoil search results in Knetminer.
- Nodes are named and their type is constrained with the syntax `(name:Type)`. Relations are named and bound to types with the syntax `[name:type]`. Links are given in the format `(node) - rel -> (node)`; undirected links can be specified by simply omitting the '>'. 'Backward' relations can also be defined, eg, `(prot:Protein) <- [m:mentions] - (pub:Publication)`. An entire end-to-end chain can be specified this way.
- This is nothing but the Cypher syntax, so other more complex constructs could be used (eg, node or relation attribute filters, defined in the WHERE clause).
- Cypher supports multiple relation types in the relation constraints, with the pipe `|` operator.
- In Cypher, you can restrict the number of links a relation should match by means of the syntax `prot1 - [xr:xref*min..max]`. This is different from the distance criterion used in the State Machine traverser, explained above.
- Self-loops can be described in Cypher by splitting the node involved in the loop, as has been done in this example for the Protein 10 node coming from the previous state machine example, which was split into the matching nodes `protein_10` and `protein_10b`. This works because there is no need for them to match different nodes. Again, this is nothing more than Cypher applied to Knetminer knowledge graphs.
- As [explained above][TODO], the names to be used for node and relation types come from our Ondex metadata vocabulary. This also lists the attributes you might find attached to nodes or relations. For instance, this is an example of a pattern against text mining results, filtering by significance score:
MATCH path = (gene:Gene) - [occ:occ_in] -> (pub:Publication)
WHERE toFloat ( occ.TFIDF ) > 20 AND gene.iri IN $startGeneIris
RETURN path
Note that all attributes are stored as strings, hence the need for functions like `toFloat()`. This is a current limit of Ondex data in Neo4j format, which we will address in future.
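The batching and pagination rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Knetminer implementation: `batches`, `check_motif_query`, `fetch_all` and the `run` callback (which stands in for whatever actually sends a query to Neo4j) are hypothetical names, the tiny batch and page sizes are chosen for readability, and the keyword check is a rough textual test.

```python
import re


def batches(items, size):
    """Slice the seed gene IRIs into consecutive batches of at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def check_motif_query(query):
    """Rough textual check of the rules listed above."""
    if "$startGeneIris" not in query:
        raise ValueError("the query must filter on $startGeneIris")
    for keyword in ("ORDER BY", "SKIP", "LIMIT"):
        pattern = r"\b" + keyword.replace(" ", r"\s+") + r"\b"
        if re.search(pattern, query, re.IGNORECASE):
            raise ValueError(f"the query must not use {keyword}")


def fetch_all(query, seed_iris, run, batch_size=2, page_size=2):
    """Run `query` once per gene batch and per page, collecting all rows."""
    check_motif_query(query)
    results = []
    for batch in batches(seed_iris, batch_size):
        page = 0
        while True:
            # Pagination clauses are appended automatically, never by the user
            paged = f"{query.rstrip()}\nSKIP {page * page_size} LIMIT {page_size}"
            rows = run(paged, {"startGeneIris": batch})
            results.extend(rows)
            if len(rows) < page_size:  # a short page means the batch is done
                break
            page += 1
    return results


def fake_run(cypher, params):
    """Stand-in for a database call: three fake paths per seed gene."""
    match = re.search(r"SKIP (\d+) LIMIT (\d+)", cypher)
    skip, limit = int(match.group(1)), int(match.group(2))
    paths = [f"{iri}/path{i}" for iri in params["startGeneIris"] for i in range(3)]
    return paths[skip:skip + limit]


QUERY = """MATCH path = (gene_1:Gene) - [enc:enc] - (protein_10:Protein)
WHERE gene_1.iri IN $startGeneIris
RETURN path"""

all_paths = fetch_all(QUERY, ["g1", "g2", "g3"], fake_run)
```

With three seed genes, a batch size of 2 and a page size of 2, the sketch issues six paged queries and collects nine distinct paths. A query that already contains `ORDER BY`, `SKIP` or `LIMIT` is rejected before anything is run.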
As with the state machine traverser, the configuration of the Cypher traverser is part of a Knetminer dataset configuration. The basics work like this:
- In `maven-settings.xml`, set the Maven property `knetminer.api.graphTraverserClass` to the value `uk.ac.rothamsted.knetminer.backend.cypher.genesearch.CypherGraphTraverser`, which, of course, is the class implementing the Cypher traverser.
- The Cypher traverser uses its own configuration file. The default location for this is `<dataset>/ws/neo4j/config.xml`. While we don't recommend changing that, if you need to, the Maven property defining it is `knetminer.neo4j.configFile`, which is used to inject its value into the Knetminer configuration file.
- Similarly, Knetminer usually builds the traverser configuration file starting from a template, where Maven properties (from defaults or dataset-specific settings) are injected. Some values can be injected from other sources too (eg, command line parameters). Details are explained below. This file can be copied into your dataset directory and customised as needed. If you do so, you might break the injection mechanism (and rely on your values instead).
- The traverser configuration file is a Spring Beans configuration file, so the Java entities it names are defined in the Knetminer code.
- The first thing this file defines is the location of the Cypher queries to be used as semantic motifs (which have to follow the format explained above). Look for `semanticMotifsQueries` in the traverser config file. This ultimately defines a list of query strings, which can come from multiple sources, like a list of files (one query per file), a file containing all the queries, or Cypher strings written straight in the config file (we don't recommend this, since it becomes very difficult to read). Look at the template mentioned above for examples.
- Another important thing that the config file defines is the bean named `neoDriver`, ie, the Neo4j client, which, among other things, contains the coordinates and credentials to reach the Neo4j server where the Neo4j version of the Knetminer OXL is expected to be found. When working with Docker, most of these parameters can be set via command line, see the documentation.
If you want to migrate from semantic motifs defined through a state machine flat file (described above) to the corresponding Cypher queries, we have a state machine to Cypher converter utility. As you can see, for the time being this is a prototype, only available through the Maven Exec plug-in (ie, it requires Maven and a download of the backend codebase).
As part of its output, the converter produces files in .dot and .svg formats, which encode a representation of the SM. The sample figure presented above, in the section about the SM traverser, was produced this way.
For the developers, the tool uses this class, which can be invoked programmatically.
If you use one of our data dumps, we will provide you with both an OXL file and a Neo4j data dump that was generated from the OXL (and the RDF dump too, which, in addition to being useful in itself, is an intermediate product of the conversion). If you want to work with your own dataset, you'll have to convert your OXL.
This is how it works:
- Starting from the OXL, use the OXL-to-Neo4j converter. This is included in the Ondex workflow tool, and embeds both our rdf2neo tool, a generic tool for mapping RDF to Neo4j, and the OXL/RDF converter, which converts an OXL to RDF and is based on our Ondex ontology, as explained above. rdf2neo requires the RDF library Apache Jena, which you need to download separately and set up via the `JENA_HOME` environment variable.
- Both the Ondex Mini tool and the OXL/RDF converter contain a tool to add URIs to the entities of an OXL. This can be run either during the workflow that builds the OXL (search for `text_mining_wf.xml`), or as a command line tool over an existing OXL (see `add-uris.sh`). Either way, you need to run Knetminer with an OXL that has the same URIs that are used in the RDF and Neo4j data (the Neo4j conversion always adds the node/relation URIs found in the RDF).
- Once you have a URI-equipped OXL, run the OXL/Neo4j tool against it. This requires that you set up the latter with the connection parameters to reach a target Neo4j server (similarly to Knetminer).
Several of the considerations made for the state machine traverser apply to the Cypher traverser as well. Namely, the more paths you match, the more time you need, and the same applies to the number of seed genes and the number of semantic motifs (ie, Cypher queries).
Similarly, you should keep the length of paths under control, in particular when defining self-loops.
In addition to that, you have to be careful when using Cypher. As with most expressive query languages, it's easy to write badly performing queries. Have a look at literature like this or this. Regarding this point, the OXL converter indexes the `iri` attributes and those Ondex attributes that have the flag 'index' set.
There are a number of other Cypher traverser parameters that can be fine-tuned and which affect its performance by changing aspects like the degree of parallelism, the gene batch and page size used for Cypher queries, and a timeout that trades a percentage of failed queries (missed semantic motifs) for speed. Such parameters are described in the configuration file used with Cypher tests. Beware that the defaults we already provide are based on extensive tests and we expect they will be good in most cases where you use the community edition of Neo4j. If you have the enterprise edition instead, which exploits all its available cores without commercial restrictions, you might want to try bigger values for `queryThreadPoolSize`.
By default, the Cypher traverser writes a performance report when it finishes its job. This is a CSV table, listing things like how much time each query took to run, the number of paths it retrieved, the average path length, and how many times a query timed out. This can be very useful to check if all queries complete correctly and to optimise performance.
`timeoutReportPathTemplate` is a similar option, which can be used to obtain detailed reports for each query that times out.
In addition to working with configuration and log files, you can debug/troubleshoot the queries passed to the Cypher traverser by means of a debugger tool. This can be enabled by setting the `knetminer.backend.cypherDebugger.enabled` Maven property in your dataset Maven settings (which, as usual, is injected into `data-source.xml`). When this is active, point your browser to `http://<server-prefix>/client/cydebug/` and you'll see an interface where you can define a new set of queries and re-initialise Knetminer with them, to obtain a performance report at the end.
The Knetminer running on the same server will now base its searches on the semantic motifs computed from the queries you passed it via the Cypher Debugger.
WARNING: if not already clear, this tool destroys the server-configured queries and regenerates the semantic motifs based on the new ones. If the latter are just a bunch of tests, the resulting Knetminer will likely not yield the results you expect. This also means that enabling the Cypher debugger on a production server is a security threat, since the corresponding web interface isn't protected at all (it's supposed to be used in a trusted intranet) and hence anyone could use it to empty the semantic motif queries and the results that your Knetminer relies on, which of course will disrupt the application badly.