forked from egonw/rrdf-paper
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy patharticle.tex
541 lines (465 loc) · 21.3 KB
/
article.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
\documentclass[12pt]{article}
\usepackage{graphicx, subfigure, float, amsmath, amssymb, color}
\usepackage[margin = 1.0 in]{geometry}
\usepackage[backend=biber,firstinits=true,sorting=none]{biblatex}
\addbibresource{article.bib}
\title{Accessing biological data in R with semantic web technologies}
\author{Egon L.\ Willighagen$^{1,2}$}
\begin{document}
\date{\today}
\maketitle
\noindent
$^1$ Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Maastricht, The Netherlands \\
$^2$ Division of Molecular Toxicology, IMM, Karolinska Institutet, Stockholm, Sweden
\bigskip
\noindent
$^*$Corresponding author\\
$\phantom{^*}$Email: [email protected]\\
$\phantom{^*}$Twitter: @egonwillighagen\\
$\phantom{^*}$ORCID: http://orcid.org/0000-0001-7542-0286
\bigskip
\noindent
Manuscript type: research article
\bigskip
\noindent Keywords: semantic web, workflow
\newpage
\begin{abstract}
\noindent
\textbf{Background}. Semantic Web technologies are increasingly used in biological database systems.
The improved expressiveness show advantages in tracking provenance and allowing knowledge to be
more explicitly annotated. The list of semantic web standards needs a complementary set of tools to
handle data in those formats to use them in bioinformatics workflows. \\
\textbf{Methods}. The approach proposed in this paper uses the Apache Jena library to create an
environment where semantic web technologies can be use in the statistical environment R. The
code is exposed as two R packages available from the Comprehensive R Archive Network (CRAN).
The RJava library and a custom convenience class is used to bridge between R and the Jena library.\\
\textbf{Results}. We here present three examples showing how the Resource Description Framework
(RDF) and SPARQL query standards can be employed in R. The first example takes input on BRCA1 SNPs
from a BioMart and converts this into a RDF data set. The second example runs a query on an
experimental remote SPARQL end point provided by Uniprot, and searches textual annotations of
proteins encoded by the BRCA1 gene. The third example shows how the package can be used to handle
RDF returned by OpenTox web services. \\
\textbf{Discussion}. The two provided library bring basic semantic web technologies to R.
While only a subset of Apache Jena is currently exposed, it provides key methods to deal with
RDF data and resources. The libraries are freely available from the CRAN under the
Affero GNU Public License version 3: \url{http://cran.r-project.org/web/packages/rrdf/}.
\end{abstract}
\section{Introduction}
Semantic Web technologies are finding their way into biological
databases~\cite{Belleau2008,Chen2010,Samwald2011,Williams2012,Willighagen2013}. The recent launch of the
Resource Description Framework (RDF) services at the European Bioinformatics Institute is a good example
of the impact it has on the field, despite the adoption taking place for longer already~\cite{EBIRDFPlatform}.
At the same time, most new databases do not yet use semantic web technologies, which may be partly
caused by the lack of availability of tools to handle RDF data.
RDF allows the combination of a machine readable framework with the use of ontologies that allow
sharing data and the meaning in the some data exchange format. This is enabled by a set of open
specifications, such as the underlying RDF, RDF Schema, the Web Ontology Language, and serialization formats
such as RDF/XML, N-TRIPLES, Notation3 (N3), and Turtle. Crucial, however, seems the specification of the SPARQL
query language that allow extracting data from local or remote RDF triple stores~\cite{Seaborne2009}. In fact,
federated SPARQL queries allow extracting data from multiple triple stores at the same time.
At the same time, R has established itself as a key tool for statistics, particularly in the biological
sciences. Many packages have been developed for this domain and are available from CRAN and Bioconductor,
allowing to combine various bioinformatics and data analysis approaches. These packages include,
for example, biomaRt, which have the purpose of extracting information from a remote database.
We here introduce two packages, rrdflibs and rrdf, for the statistics environment R, based on the Apache
Jena library and the RJava interface to Java~\cite{RMan,Jena,RJava}. Compared to the existing SPARQL
package, this package adds general RDF data handling~\cite{VanHage2013}. Using this package we show how
data aggregation can be performed from remote SPARQL end points, and how local data in other formats
can be converted into RDF to querying with SPARQL.
\section{Materials \& Methods}
The rrdf packages provides RDF and SPARQL functionality by exposing that functionality from the
Apache Jena library~\cite{Jena}. The rrdflibs package contains the Apache Jena libraries and
nothing more. This package is close to 7~MiB large but does not change frequently. The rrdf
package contains the R functions that wrap around the Jena functionality and convert data structures
where needed. This package is about 112~kiB but changes more frequently. Using this approach,
the Comprehensive R Archive Network (CRAN) server has the least amount of download throughput.
The Jena functionality is accessible using the rJava library~\cite{RJava} which handles loading
of the Java archives and provides methods to instantiate classes and call methods.
The source code of the package is available from GitHub at \url{http://github.com/egonw/rrdf/} and
binary packages are available from CRAN at \url{http://cran.r-project.org/web/packages/rrdf/}.
The first patch is from March 2011 while the most recent patches are from late 2013 when the
rrdflibs package was updated for Apache Jena 2.11.
\section{Results and Discussion}
The rrdf package provides basic and less basic functionality for aggregation and analysis of RDF data.
The package can be install and loaded with these commands:
\begin{footnotesize}
\begin{verbatim}
install.packages(c("rrdf", "rrdflibs"))
library(rrdf)
\end{verbatim}
\end{footnotesize}
\subsection{Triple stores}
To handle triples, we first need a triple store. At this moment only in-memory stores are
supported, though Jena's TDB also provides on-disk triple stores; this puts limitations to
the amount of data you can analyze in one study.
The package supports two kinds of stores, one that has minimal ontology support, and one
basically just handles triples. Both can be created with the \textit{new.rdf} command:
\begin{footnotesize}
\begin{verbatim}
ontStore = new.rdf()
store = new.rdf(ontology=FALSE)
\end{verbatim}
\end{footnotesize}
When data is not read from a file or downloaded, but created from data in another format,
such as an R matrix, individual triples can be added to a store. This works for both
object properties and data properties. The former option links a subject to an object
resource, while the latter links a subject to a literal value:
\begin{footnotesize}
\begin{verbatim}
add.triple(store,
subject="http://example.org/Subject",
predicate="http://example.org/Predicate",
object="http://example.org/Object"
)
add.data.triple(store,
subject="http://example.org/Subject",
predicate="http://example.org/Predicate",
data="Literal value"
)
\end{verbatim}
\end{footnotesize}
We can also take advantage from the fact that R assigns values to method parameters based on
the order in which they are given. Therefore, we can simplify the above code to:
\begin{footnotesize}
\begin{verbatim}
add.triple(store,
"http://example.org/Subject",
"http://example.org/Predicate",
"http://example.org/Object"
)
add.data.triple(store,
"http://example.org/Subject",
"http://example.org/Predicate",
"Literal value"
)
\end{verbatim}
\end{footnotesize}
Data can be loaded from file by providing a file name and a format ("RDF/XML", "TURTLE",
"N-TRIPLES", and "N3"):
\begin{footnotesize}
\begin{verbatim}
store = load.rdf("file.n3", format="N3")
\end{verbatim}
\end{footnotesize}
\subsubsection{Example}
As an example, we will here create RDF for a small data set with information on five single nucleotide
polymorphisms (SNPs) in the BRCA1 gene, extracted with biomaRt~\cite{Durinck2005}:
\begin{footnotesize}
\begin{verbatim}
library(biomaRt)
mart = biomaNoy2009Rt::useMart(biomart="snp", dataset="hsapiens_snp")
brca1 = c("rs16940","rs16941", "rs16942", "rs799916", "rs799917")
attribs = c(
"refsnp_id", "chr_name", "chrom_start",
"validated", "pmid_20137", "allele"
)
data = biomaRt::getBM(
attributes=attribs, filters=c("snp_filter"), values=brca1, mart=mart
)
\end{verbatim}
\end{footnotesize}
Following the guidelines for generating RDF outlined by Open PHACTS~\cite{Haupt2013,Williams2012}, we identify concepts
and matching ontology terms, using BioPortal~\cite{Noy2009}: SNP, gene, chromosome,
article, and allele. Additionally, there are the following properties: a SNP identifier, a PubMed identifier,
a chromosome number, a chromosomal sequence position, and a list of validations (e.g. HapMap, 1000Genomes).
The latter will not be further semantically specified, but just be converted into a RDF literal as it is
provided by the BioMart.
\begin{table}
\caption{A list of ontology terms for the concepts in the BRCA1 SNP example. Ontology acronyms and the prefixes
used are explained in the text.}
\begin{center}
\begin{footnotesize}
\begin{tabular}{l|l|l|l}
\textbf{concept} & \textbf{ontology} & \textbf{term} & \textbf{URI} \\
\hline
SNP & SIO & snp & sio:SIO\_010027 \\
chromosome & GO & chromosome & obo:GO\_0005694 \\
article & BIBO & article & bibo:Article \\
allele & SO & allele & obo:SO\_0001023 \\
SNP identifier & TMO & snp identifier & tmo:TMO\_0161 \\
PubMed identifier & CHEMINF & PubMed identifier & sio:CHEMINF\_000302\\
chromosome number & TMO & chromosome number & tmo:TMO\_0157 \\
chromosomal sequence position & TMO & chromosomal sequence position & tmo:TMO\_0122 \\
\end{tabular}
\end{footnotesize}
\end{center}
\end{table}
Taking advantage of existing ontologies. The choice is somewhat arbitrary and no ontological
analysis has been done, e.g. on specified domain and ranges of predicates and implied
properties of classes. The following ontologies are used:
SIO is the Semanticscience Integrated Ontology;
TMO is the Translational Medicine Ontology~\cite{Luciano2011};
%SO is the Sequence Types and Features Ontology;
GO is the Gene Ontology;
BIBO is the Bibliographic Ontology~\cite{d2009bibliographic}; and,
CiTO is the Citation Typing Ontology~\cite{Shotton2010}.
Prefixes used in the term Uniform Resource Identifiers (URIs) include
\textit{sio} for \url{http://semanticscience.org/resource/},
\textit{tmo} for \url{http://www.w3.org/2001/sw/hcls/ns/transmed/},
\textit{obo} for \url{http://purl.obolibrary.org/obo/},
\textit{bibo} for \url{http://purl.org/ontology/bibo/}, and
\textit{cito} for \url{http://purl.org/spar/cito/}.
We can add these prefixes to the store. The following code examples
shows the full syntax in the first addition, but leaves out the
parameters names in the following additions:
\begin{footnotesize}
\begin{verbatim}
snpStore = new.rdf(ontology=FALSE)
add.prefix(snpStore,
prefix="sio", namespace="http://semanticscience.org/resource/"
)
add.prefix(snpStore, "tmo", "http://www.w3.org/2001/sw/hcls/ns/transmed/")
add.prefix(snpStore, "obo", "http://purl.obolibrary.org/obo/")
add.prefix(snpStore, "bibo", "http://purl.org/ontology/bibo/")
add.prefix(snpStore, "cito", "http://purl.org/spar/cito/")
add.prefix(snpStore, "snp", "http://example.org/snp/")
add.prefix(snpStore, "art", "http://example.org/article/")
add.prefix(snpStore, "loc", "http://example.org/location/")
add.prefix(snpStore, "all", "http://example.org/allele/")
add.prefix(snpStore, "ex", "http://example.org/onto/")
add.prefix(snpStore, "pubmedid", "http://example.org/pubmed/")
add.prefix(snpStore, "snpid", "http://example.org/snpid/")
\end{verbatim}
\end{footnotesize}
Furthermore, we can define a number of classes and predicate, for reduced code:
\begin{footnotesize}
\begin{verbatim}
snpClass = "http://semanticscience.org/resource/SIO_010027"
chromosomeClass = "http://purl.obolibrary.org/obo/GO_0005694"
pubmedIdClass = "http://semanticscience.org/resource/CHEMINF_000302"
snpIdClass = "http://www.w3.org/2001/sw/hcls/ns/transmed/TMO_0161"
articleClass = "http://purl.org/ontology/bibo/Article"
alleleClass = "http://purl.obolibrary.org/obo/SO_0001023"
rdfTypePred = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
isDescribedBy = "http://purl.org/spar/cito/isDescribedBy"
onChromosome = "http://example.org/onto/onChromosome"
hasAttribute = "http://semanticscience.org/resource/CHEMINF_000200"
hasStart = "http://example.org/onto/hasStart"
hasValue = "http://semanticscience.org/resource/SIO_000300"
hasValidation = "http://example.org/onto/hasValidation"
hasAllele = "http://example.org/onto/hasAllele"
hasLocation = "http://example.org/onto/hasLocation"
\end{verbatim}
\end{footnotesize}
Using these ontology terms we can define the translations of the BioMart-extracted
data into RDF:
\begin{footnotesize}
\begin{verbatim}
createEntry <- function(row) {
snpID = row[1]
chr = row[2]
chrStart = row[3]
validation = row[4]
pubmedID = row[5]
allele = row[6]
snpSubject = paste("http://example.org/snp/", snpID, sep="")
pubmedObject = paste("http://example.org/article/a", pubmedID, sep="")
alleleObject = paste("http://example.org/allele/", snpID, sep="")
chrObject = paste("http://example.org/location/", snpID, sep="")
snpIDObject = paste("http://example.org/snpid/", snpID, sep="")
pubmedIDObject = paste("http://example.org/pubmed/a", pubmedID, sep="")
add.triple(snpStore, snpSubject, rdfTypePred, snpClass)
add.triple(snpStore, snpSubject, hasAttribute, snpIDObject)
add.triple(snpStore, snpIDObject, rdfTypePred, snpIdClass)
add.data.triple(snpStore, snpIDObject, hasValue, snpID)
# validation information
sapply(unlist(strsplit(validation, split=",")),
function(validationItem) {
if (!is.null(validationItem)) {
add.data.triple(snpStore,
snpSubject, hasValidation, validationItem
)
}
}
)
# allele information
add.triple(snpStore, snpSubject, hasAllele, alleleObject)
add.triple(snpStore, alleleObject, rdfTypePred, alleleClass)
add.data.triple(snpStore, alleleObject, hasValue, allele)
# chromosome location information
add.triple(snpStore, snpSubject, hasLocation, chrObject)
add.triple(snpStore, chrObject, rdfTypePred, chromosomeClass)
add.data.triple(snpStore, chrObject, onChromosome, chr)
add.data.triple(snpStore, chrObject, hasStart, chrStart)
# pubmed information
add.triple(snpStore, snpSubject, isDescribedBy, pubmedObject)
add.triple(snpStore, pubmedObject, rdfTypePred, articleClass)
add.triple(snpStore, pubmedObject, hasAttribute, pubmedIDObject)
add.triple(snpStore, pubmedIDObject, rdfTypePred, pubmedIdClass)
add.data.triple(snpStore, pubmedIDObject, hasValue, pubmedID)
}
\end{verbatim}
\end{footnotesize}
We can then just iterate over the actual data with and safe it to a file:
\begin{footnotesize}
\begin{verbatim}
apply(data, MARGIN=1, FUN=createEntry)
save.rdf(snpStore, filename="test.n3", format="N3")
\end{verbatim}
\end{footnotesize}
Or output it as a Notation3 string with:
\begin{footnotesize}
\begin{verbatim}
cat(asString.rdf(snpStore))
\end{verbatim}
\end{footnotesize}
\subsection{SPARQL}
Local (in-memory) triple stores can be queried with the \emph{sparql.rdf} method by providing
it with the store to query and a string with the SPARQL query. For example, to get all resource
types, we can use this:
\begin{footnotesize}
\begin{verbatim}
sparql.rdf(store,
paste(
"SELECT DISTINCT ?type WHERE {",
" [] a ?type ",
"}"
)
)
\end{verbatim}
\end{footnotesize}
Remote SPARQL end point can be queried in a similar fashion: we replace the triple store with the
URL pointing to the SPARQL end point. For example, if we want to
extract all properties of the CHEMBL615603 assay from ChEMBL~\cite{Gaulton2012,Willighagen2013}
we can use the following remote query:
\begin{footnotesize}
\begin{verbatim}
results = sparql.remote(
"http://rdf.farmbio.uu.se/chembl/sparql",
paste(
"SELECT DISTINCT ?predicate ?object WHERE {",
" ?assay <http://www.w3.org/2000/01/rdf-schema#label> \"CHEMBL615603\" ;",
" ?predicate ?object .",
"}"
)
)
\end{verbatim}
\end{footnotesize}
Jena parses the SPARQL query too, which requires the query to be valid
against both Jena and the SPARQL end point. Because the SPARQL specification has tool specif
extensions and that not all tools support the full of SPARQL 1.1, it can sometimes be tricky
to find the mutually support SPARQL command subset. The rrdf package also allows to bypass
Jena and query the end point directly by using \emph{sparql.remote(..., jena=FALSE)},
removing the problem and allowing you to use extensions of the end point.
The results object is a matrix with the variable names from the SPARQL query as column names.
This allows us to get a predicates with:
\begin{footnotesize}
\begin{verbatim}
results[,"predicate"]
\end{verbatim}
\end{footnotesize}
\subsubsection{Example}
Remote SPARQLing can be used to get more information about biological entities of interest. The following
example retrieves information from the Uniprot database using the experimental SPARQL end point. In
particular, it retrieves annotations for proteins encoded by the BRCA1 gene:
\begin{footnotesize}
\begin{verbatim}
findAnnotations = paste(
"prefix up: <http://purl.uniprot.org/core/>",
"prefix ens: <http://purl.uniprot.org/ensembl/>",
"prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
"SELECT DISTINCT ?protein ?mnemonic ?comment WHERE {",
" ?transcript up:transcribedFrom ens:ENSG00000012048 .",
" ?protein a up:Protein ;",
" rdfs:seeAlso ?transcript ;",
" up:mnemonic ?mnemonic ;",
" up:annotation ?annot .",
" ?annot rdfs:comment ?comment .",
"}"
)
annotations = sparql.remote(
endpoint="http://beta.sparql.uniprot.org/",
sparql=findAnnotations, jena=FALSE
)
\end{verbatim}
\end{footnotesize}
\subsubsection{Example}
A second example combines regular REST or -like services that return RDF with a local SPARQL query.
For example, this is the approach used in Bioclipse to interact with many of the OpenTox
services~\cite{Willighagen2013}, but can be used for the Open PHACTS Linked Data API
too~\cite{Williams2012}. For example, we can get data from an OpenTox data set using
RCurl~\cite{Hardy2010,Lang2013}, in this example data set 112 with 320 compounds,
which originates from the ToxCast project~\cite{Dix2007}:
\begin{footnotesize}
\begin{verbatim}
library(RCurl)
rdfContent = getURL(
paste(
"http://apps.ideaconsult.net:8080/ambit2/dataset/112/compounds",
"media=text/n3",
sep="?"
),
write=basicTextGatherer()
)
store = fromString.rdf(rdfContent, format="N3")
compounds = sparql.rdf(store,
paste(
"PREFIX ot: <http://www.opentox.org/api/1.1#>",
"SELECT DISTINCT ?compound WHERE {",
" ?compound a ot:Compound",
"}"
)
)
\end{verbatim}
\end{footnotesize}
In fact, we can continue getting further detail about these compound by repeating the same
approach on the compound URIs. Let's assume we are further interested in the SMILES string.
This feature has the URI \url{http://apps.ideaconsult.net:8080/ambit2/feature/21753}. We
add this as the list of features to retrieve:
\begin{footnotesize}
\begin{verbatim}
feature = "http://apps.ideaconsult.net:8080/ambit2/feature/21753"
getOptions = paste(
paste("feature_uris[]", curlEscape(feature), sep="="),
paste("media", curlEscape("application/rdf+xml"),sep="="),
sep="&"
)
\end{verbatim}
\end{footnotesize}
We can then use the R apply() method to retrieve the SMILES for each structure with:
\begin{footnotesize}
\begin{verbatim}
result = apply(compounds, MARGIN=1, function(x) {
compound = x["compound"];
compoundURI = paste(compound, getOptions, sep="?")
print(paste("Downloading",compoundURI))
cmpdContent = getURL(
compoundURI, write=basicTextGatherer()
)
fromString.rdf(cmpdContent, format="RDF/XML", appendTo=store)
})
\end{verbatim}
\end{footnotesize}
Finally, we use SPARQL again to list all the SMILES strings for the compounds:
\begin{footnotesize}
\begin{verbatim}
query = paste(
"PREFIX ot: <http://www.opentox.org/api/1.1#> ",
"SELECT DISTINCT ?compound ?value WHERE {",
" ?s ?o [ ot:feature <", feature, "> ;",
" ot:value ?value ",
" ] ;",
" ot:compound ?compound . ",
"}",
sep=""
)
compoundSMILESes = sparql.rdf(store, query)
\end{verbatim}
\end{footnotesize}
This results in a matrix with compound URIs in one column and SMILES strings in another, ready for
use by other R packages.
\section{Conclusions}
The rrdflibs and rrdf packages bring basic semantic web technologies to R statistical environment.
While only a subset of Apache Jena is currently exposed, it provides key methods to deal with
RDF data and resources. The examples shed some light on how it can find a place on more advanced
analyses workflows.
\section{Acknowledgements}
The people who have provided feedback on the rrdf package are kindly thanked for their efforts in
getting into contact with me.
The research leading to these results has received funding from the European Community’s Seventh
Framework Programme (FP7/2007–2013) under Grant Agreement n° [267042], from Cosmetics Europe,
and from the Innovative Medicines Initiative Joint Undertaking under grant agreement no. 115191,
resources of which are composed of financial contribution from the European Union's Seventh Framework
Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution.
\printbibliography
\end{document}