Frequently Asked Questions
What is a binning method?
A binning method assigns an identifier to every sequence in a sequence sample, where the total number of identifiers is ideally less than the total number of sequences. Binning thus places the sequences into broader categories. A bin includes all the sequences with the same identifier. If these identifiers identify taxa from a taxonomy, the method is a taxonomic binning method.
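As a minimal sketch of this definition, binning can be viewed as grouping sequences by their assigned identifier. The sequence names and bin identifiers below are made up for illustration and do not follow the CAMI binning file format.

```python
from collections import defaultdict


def group_into_bins(assignments):
    """Group sequence names by the identifier a binning method assigned to them."""
    bins = defaultdict(list)
    for sequence_id, bin_id in assignments:
        bins[bin_id].append(sequence_id)
    return dict(bins)


# Hypothetical assignments: (sequence name, assigned identifier).
example_assignments = [
    ("read_1", "bin_A"),
    ("read_2", "bin_B"),
    ("read_3", "bin_A"),
]

# Three sequences end up in two bins, so there are fewer identifiers than sequences.
example_bins = group_into_bins(example_assignments)
```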
What is a profiling method?
A profiling method returns an estimate for the frequencies of different taxa in a sequenced microbial community based on analysis of the sequence sample. The main output is a vector with relative abundances for the different sample taxa. The relative abundances of taxa from the same 'rank' of the taxonomy (e.g. superkingdom, including archaea, bacteria and eukaryotes) cannot sum up to more than 1.
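The constraint on relative abundances can be checked with a few lines of code. This is only a sketch: the in-memory representation of a profile as (rank, taxon, abundance) tuples and the function names are assumptions made for illustration, not the CAMI profiling file format.

```python
from collections import defaultdict


def rank_totals(profile):
    """Sum the relative abundances of all taxa that share a taxonomic rank."""
    totals = defaultdict(float)
    for rank, _taxon, abundance in profile:
        totals[rank] += abundance
    return dict(totals)


def ranks_exceeding_one(profile, tolerance=1e-9):
    """Return the ranks whose abundances sum to more than 1 (plus a small tolerance)."""
    return [rank for rank, total in rank_totals(profile).items()
            if total > 1.0 + tolerance]


# Hypothetical profile: (rank, taxon, relative abundance).
example_profile = [
    ("superkingdom", "Bacteria", 0.75),
    ("superkingdom", "Archaea", 0.25),
    ("phylum", "Firmicutes", 0.5),
]

# An empty list means no rank violates the constraint.
violations = ranks_exceeding_one(example_profile)
```

A rank may sum to less than 1, for example when part of the sample could not be assigned at that rank.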
What is an assembly method?
An assembly method returns longer nucleotide sequences derived by puzzling together individual sequencing reads. These sequences are assumed to represent contiguous stretches from one genome included in the microbiome sample that was sequenced.
Is CAMI open and transparent?
Yes! Development takes place in an open community, and everybody is invited to participate. People who intend to participate in the contest will not be involved in any part of the development process that provides information or an advantage for the actual competition.
What about reproducibility?
CAMI encourages participants to submit reproducible results by providing their executable software in docker containers, along with submission of their predictions (see details below). CAMI has worked together with bioboxes to define formats for a standardized setup and execution of profiling, binning and assembly tools in docker containers. This will make the results of these tools reproducible, and also facilitate a continuous monitoring of their performance on future test data sets.
Where do I find the relevant information when I want to participate in the CAMI contest?
You can find all information on the CAMI webpage, on the page entitled <a href="http://cami-challenge.org/participate">Contest participation</a>.
Why should I participate in the contest?
There are various reasons why you might want to participate.
- First, you will receive feedback on your method's performance on a larger number of data sets.
- Facilitated future benchmarking. By submitting your tool in a docker container, you will be able to continuously monitor its performance against other tools that have been submitted this way. It will be possible to automatically run all tools on new data sets and include new performance metrics easily. If you update your tool, you simply have to update your docker container that you submitted.
- Coauthorship. Contest participants who would like to disclose their identities and deliver reproducible results have the option to be authors on a joint CAMI evaluation publication.
- We will distribute three participation awards of one thousand euros each among participants who submit reproducible results and perform better than a baseline method across all categories.
Why do I have to register?
Registration allows us to contact the contestant if there is a problem with the submitted files. At a later stage of the competition we want to show evaluation metrics with an anonymized name for the results (assembly, profiling, binning) provided by the contestant. Participants can choose to have their results displayed anonymously only, or to reveal their identities for participation in a joint publication.
When will the contest start?
The first CAMI challenge started on March 27th.
How long will the CAMI contest last?
CAMI will be open for 3 months overall. The first six weeks are for assembly methods and for methods that process read data sets. After six weeks, a gold standard assembly will be provided for methods that work on assembled data sets. At this point, the assembly competition will end.
How large are the data sets?
There are three data sets, consisting of 15 to 75 Gbp of sequence each, and between 1 and 5 samples.
What are the data access policies for the CAMI challenge?
CAMI data sets have been generated from unpublished genome data provided by multiple laboratories who want to control the data release date. Therefore, the CAMI data sets will become fully available, together with the gold standard, after the competition has ended and once the data contributors have released their data. Until then, a data recipient is not allowed to publish the data, deposit the sequences in any database or use them for purposes other than participating in the CAMI challenge. The data recipient agrees to delete the data in all forms once the CAMI challenge has ended, and not to keep it during the time until it has been officially released.
If I do not have access to sufficient compute time on my own, what are my options?
The Pittsburgh Supercomputing Center (PSC) is making compute time and storage available on their SGI UV1000 system, Blacklight, for genome assembly in the CAMI challenge. Blacklight consists of three nodes with 8-16 TB of shared memory and up to 2,048 cores per node. If you request these resources during registration, PSC will contact you to determine your software and computing needs and to set up your account on Blacklight.
What kind of data sets will be provided?
CAMI is providing simulated metagenome data sets created from more than 700 predominantly unpublished (contributed) isolate genome sequences, with as much realism built in as we could manage. For instance, they will include multiple strains from the same species.
Why are we providing new simulated data sets and not using existing public ones?
For real metagenome samples, we do not know which read comes from which genome, and the same should hold in our simulations. For all public simulated samples, the correct solution is also available, which would make the contest less realistic. Furthermore, we want to simulate different relevant scenarios that are used in metagenomics and for which no simulated data sets exist. The gold standards for these data sets will be provided after the contest has ended.
How can I download data?
You have to do the following steps.
1. <a href="https://data.cami-challenge.org/accountLoginView">Login</a> or <a href="https://data.cami-challenge.org/accountRegisterView">Register</a>
2. Go to the <a href="https://data.cami-challenge.org/participate">Up- and Download</a> section.
3. Find the dataset you want to download.
4. If you want to download datasets from the restricted download section, you will have to read and accept the <a href="https://data.cami-challenge.org/participantTermsOfCondition">terms and conditions</a>. After accepting the terms you are allowed to download the samples. If you want to download a dataset from the public download section, you can click on the "public download section" button and download any sample you want.
I have problems finding out which output format I should use, what should I do?
You can contact us to get advice.
How do I submit my results?
The best way is to provide your tool along with the prediction files. The tool should be installed in a docker container following our instructions on https://data.cami-challenge.org/#dockerInfo, and should write its output to a file with a specific name in a specific directory. You can upload this docker container to the contest website and we will rerun your tool to see if the results are entirely reproducible. If your tool requires standard reference databases to run, many of them will be provided on the website. How you link to these is explained in the instructions.
Should I use the "all samples" or the "specific sample" option?
It is recommended that you submit a coassembly for all samples of a data set. If you cannot coassemble due to technical or methodological reasons, please submit individual assemblies for all samples using the "specific sample" option.
Can I submit the results without providing my tool?
If you want to submit your results without them being reproducible, you can do so and receive the feedback for your own information. In this case, your tool will only be considered for inclusion in a joint result publication if it is a publicly accessible webtool.
How can I submit my results without providing my tool?
You have to do the following steps after the download of a dataset and the execution of your pipeline on it.
1. <a href="https://data.cami-challenge.org/accountLoginView">Login</a> or <a href="https://data.cami-challenge.org/accountRegisterView">Register</a>
2. Go to the <a href="https://data.cami-challenge.org/participate">Up- and Download</a> section.
3. Find the dataset you want to submit for.
4. Click the Add Assembly, Add Binning or Add Profiling button.
5. Fill in the form. For the fingerprint field you have to download our <a href="https://data.cami-challenge.org/camiClient.jar">cami client jar</a> (see next step).
6. Run the cami client: java -jar camiClient.jar with the parameter -af for assembly files, -bf for binning files, or -pf for profiling files. Now enter the returned fingerprint in the input field.
<br>
<b>Note!</b> You have to execute the cami client jar and submit the form every time you change the file.
<br>
7. After submitting the form you will get a credentials file that allows you to upload the file within the next 36 hours. If you want to submit your file at a later time, just submit the form again and you will get new credentials.
Upload your file with: java -jar camiClient.jar -u credentials_file file_to_upload. You can upload up to 36 hours after the competition ends.
8. (Optional) Register a docker container for reproducibility. <a href="https://data.cami-challenge.org/#dockerInfo">How to build/submit a docker container?</a>
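The fingerprint and upload steps above can be scripted. The sketch below only builds the command lines; it relies on the flags documented in this FAQ (-af, -bf, -pf, -u), while the helper names and the example file names are hypothetical.

```python
def fingerprint_command(kind, result_file, taxonomy_db_path=None, jar="camiClient.jar"):
    """Build the camiClient call that computes the fingerprint of a result file."""
    flags = {"assembly": "-af", "binning": "-bf", "profiling": "-pf"}
    command = ["java", "-jar", jar, flags[kind], result_file]
    if kind in ("binning", "profiling"):
        # -bf and -pf additionally take the path of the extracted taxonomy database.
        if taxonomy_db_path is None:
            raise ValueError("binning and profiling files need a taxonomy database path")
        command.append(taxonomy_db_path)
    return command


def upload_command(credentials_file, file_to_upload, jar="camiClient.jar"):
    """Build the camiClient call that uploads a file with the credentials from the form."""
    return ["java", "-jar", jar, "-u", credentials_file, file_to_upload]


# Hypothetical file names, for illustration only.
assembly_call = fingerprint_command("assembly", "contigs.fna")
upload_call = upload_command("credentials_file", "contigs.fna")
```

Each returned list can be passed to subprocess.run(command, check=True) from the Python standard library. The fingerprint has to be recomputed and the form resubmitted whenever the file changes.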
How do I use the CAMI client?
The Java-based CAMI client jar validates and uploads assembly, binning and profiling files.
Requirements are a Unix-like operating system (e.g. Linux, OSX, Solaris, *BSD) and Java 7 (e.g. Oracle JDK 7, OpenJDK 7).
usage: java -jar camiClient.jar [-af < assembly_file >] [-bf < binning_file extracted_taxnomy_db_path >] [-d < url destination >] [-h] [-pf < profiling_file extracted_taxnomy_db_path >] [-u < credentials_file file_to_upload >] [-version]
Validates and uploads binning, profiling and assembly files.
<table border="1">
<thead>
<tr>
<th>Command</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>-af,--assemblyFingerprint assembly_file</td>
<td>Computes fingerprint of an assembly file.</td>
</tr>
<tr>
<td>-bf,--binningFingerprint binning_file extracted_taxnomy_db_path</td>
<td>Validates binning file and computes fingerprint. (Download the taxonomy_db from the databases section of https://data.cami-challenge.org/participate.)</td>
</tr>
<tr>
<td>-pf,--profilingFingerprint profiling_file extracted_taxnomy_db_path</td>
<td>Validates profiling file and computes fingerprint. (Download the taxonomy_db from the databases section of https://data.cami-challenge.org/participate.)</td>
</tr>
<tr>
<td>-u,--upload credentials_file file_to_upload </td>
<td>You can get the credentials file from the cami website. File to upload is the assembly, binning or profiling file you want to upload.</td>
</tr>
<tr>
<td>-d,--download url destination</td>
<td> Downloads data from S3. </td>
</tr>
<tr>
<td>-h,--help</td>
<td>Print the help of the application.</td>
</tr>
<tr>
<td>-version,--v</td>
<td>Print the version of the application.</td>
</tr>
</tbody>
</table>
Where do I get the fingerprint?
You can get the fingerprint with our <a href="https://data.cami-challenge.org/camiClient.jar">cami client jar</a>. Run the cami client: java -jar camiClient.jar with the parameter -af for assembly files, -bf for binning files, or -pf for profiling files.
Your files must be in a specific format, which is explained in the following questions:
<b>What is the format of the CAMI assembly file?</b>
<b>What is the format of the CAMI binning file?</b>
<b>What is the format of the CAMI profiling file?</b>
<br>
<b>Note!</b> You have to execute the cami client jar and submit the form every time you change the file.
What is the format of the CAMI assembly file?
The CAMI assembly format is a FASTA-formatted contig and scaffold file.
<u>1. The header section</u>
The header section of the FASTA file must follow the standard FASTA guidelines: it must start with a ">" character, followed by any combination of alphanumeric characters "[A-Za-z0-9]", whitespace (spaces and tabs), brackets "[]", underscores "_", colons and semicolons "[:;]", commas ",", periods ".", pipes "|", or hyphens "-", and must end with a newline character.
<br>
Examples:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
ATCTG
>scaffold-1
ATCTG
>Contig 2, Node 1
ATCTG
<br>
<u>2. The sequence section</u>
Sequences must contain only capitalized nucleotide characters, including the ambiguous character 'N': /[ATCGN]+/.
There is no requirement on the number of nucleotides per line, but community best practice advises limiting each line to 120 characters.
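The header and sequence rules can be checked before a file is submitted. The following is a minimal sketch and not the validation performed by the CAMI client; it assumes that digits are allowed in headers (as in the examples above) and that every non-header line, including an empty one, must be a sequence line.

```python
import re

# ">" followed by the characters allowed in a header (digits are an assumption).
HEADER_PATTERN = re.compile(r">[A-Za-z0-9 \t\[\]_:;,.|-]+")
# Capitalized nucleotides plus the ambiguous character N.
SEQUENCE_PATTERN = re.compile(r"[ATCGN]+")


def find_format_problems(fasta_text):
    """Return (line number, problem) pairs for lines that break the rules above."""
    problems = []
    for line_number, line in enumerate(fasta_text.splitlines(), start=1):
        if line.startswith(">"):
            if not HEADER_PATTERN.fullmatch(line):
                problems.append((line_number, "invalid header"))
        elif not SEQUENCE_PATTERN.fullmatch(line):
            # Empty lines are also reported here.
            problems.append((line_number, "invalid sequence line"))
    return problems


valid_example = ">scaffold-1\nATCTG\n>Contig 2, Node 1\nATCTGN\n"
invalid_example = ">contig#1\natcg\n"

valid_problems = find_format_problems(valid_example)
invalid_problems = find_format_problems(invalid_example)
```

For the invalid example, the "#" in the header and the lowercase nucleotides are both reported.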
<u>3. Contigs vs. scaffolds</u>
Contigs must not contain any stretch of more than 5 consecutive ambiguous bases (Ns). Contigs must be split by the user at these points into smaller contigs. There are no restrictions on the number of consecutive ambiguous bases in scaffolds. Contigs should be provided in a file called contigs.fna. Scaffolds should be provided in a file called scaffolds.fna.
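Splitting contigs at long runs of Ns can be automated. The sketch below is one possible approach, under the assumption that the run of Ns itself is dropped at each split point; the function name is hypothetical.

```python
import re

# A stretch of more than 5 ambiguous bases, i.e. six or more Ns in a row.
AMBIGUOUS_RUN = re.compile(r"N{6,}")


def split_contig(sequence):
    """Split a contig at every run of more than 5 Ns and drop empty pieces."""
    return [piece for piece in AMBIGUOUS_RUN.split(sequence) if piece]


# Six Ns: the contig is split into two smaller contigs.
split_pieces = split_contig("ACGTNNNNNNACGT")
# Five Ns are still allowed: the contig stays in one piece.
unsplit_pieces = split_contig("ACGTNNNNNACGT")
```

Each resulting piece needs its own unique FASTA header when it is written to contigs.fna.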
<u>4. Examples</u>
>NODE_1331_length_1852_cov_20.7969_ID_2661
CTATTTTTAAATATCCGCGTACTTACTGAGTTATCGTGGCAGTTTTTGCTAATTTTGCGATTTTTGCCACGATAACTTGT
TAAAAAATATGACGTCTGCAATTCTTGTTGCTTTTACACTACAAAAAAGATAAAATATTAGTATGAAAGCAACAGAAGTG
>54257
AAAATAGACGCACAAGCTGGCACATCACAACGAATAACCAAGCATAGAAATGAAGAAAAAGAAAATAATCATGGGAGTAG
GAACAGGAATCCTTTTGGCTGCTGTTGCTTTTTGGGGGTGGCATTCCACACAAGCAACTTCGACAGAAATAACAAATGAA
What is the format of the CAMI binning file?
The CAMI format for binning is specified in our CAMI <a href="https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_B_specification.mkd">Github repository</a>.
What is the format of the CAMI profiling file?
The CAMI format for profiling is specified in our CAMI <a href="https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd">Github repository</a>.
I cannot get the docker installation to work, what should I do?
Please contact us, we are happy to help.
Can I use special hardware for a Docker container (such as GPUs)?
Yes, all hardware platforms are supported. If you don't have access to the required hardware, please contact us regarding compute resources that the Pittsburgh Supercomputing Center could provide to you. This Stack Overflow thread describes how to run CUDA code in a Docker container: http://stackoverflow.com/questions/25185405/using-gpu-from-a-docker-container
How does the speed of a method contribute to the result? Is the contest mainly focused on precision and recall?
CAMI will use multiple evaluation metrics for assessing performance. The most suitable ones to represent relevant performance aspects to both users and developers will be determined in a meeting after the contest.
I don't want to upload my unpublished tool to Docker Hub, what should I do?
We can provide a private Docker Hub, so that just the CAMI team has access to the unpublished tool.
Please contact [email protected], we will tell you the next steps.
Do I have to submit results for binning, assembly and profiling?
You are free to submit for one or multiple tasks; there is no need to submit results for all of them.
Do I have to submit my codes in any particular language?
You can submit your code / software in any language you like. It should be installed and ready to execute in a docker container. We will make the installation instructions available soon. We will also offer help with this via Skype, if wanted.
What is the purpose of the Google+ group?
We use it for video hangouts when planning the CAMI contest, because we are working with people from around the world. We want to get your feedback and invite tool developers to interact with the CAMI team here in defining the CAMI contest during the development phase. You do not have to join the group to participate in the contest.