-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.txt
295 lines (238 loc) · 16.3 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
GENe BLAST Automation and Gene Cataloger - June 2014
Nathan Owen, [email protected]
Github and Source Files - https://github.com/newOnahtaN/Genie-GENe-Cataloger
========================================
This program was originally created for use by the Biology Department of the
College of William and Mary. The GENe program was developed knowing the
difficulty and tediousness of having to Blast many thousands of sequences and
then having to sort through the hits of these Blasts by hand. The program
confronts that challenge and is fully capable of handling as many sequences as
necessary. GENe serves as a middleman between the user and NCBI's Blast servers
or NCBI's local Blast tool, BLAST+, and accomplishes both of these tasks with
the help of the open source Biopython module. This program works first and
foremost as a recipient of excel files that contain a list of sequences that
need to be BLASTed. Users may choose to Blast these sequences either locally
or on NCBI's servers one by one, but both options have drawbacks that will be
discussed in more detail in this README.
Version Compatibility
=====================
Software for use:
-----------------
NCBI-BLAST+-2.2.29
Software for development:
-------------------------
NCBI-BLAST+-2.2.29
Biopython, numPy
wxPython
Python Excel : wlwt , wlrd
Table of Contents
=================
1. Common Use and Basic Instructions
2. NCBI Server Blast
3. BLAST+ Local Blast
4. Xenopus Laevis Algorithm / Top three Blast hits
5. Backup Data and Error Checking
1) Common Use and Basic Instructions
====================================
Installation:
Windows: For windows, download the GENe Setup.exe file from github that resides in
the GENe_Windows folder. You can click directly on this file and then click on
‘view raw data’ in order to download it directly. You do not need to worry about
any other files in this folder including the installer script unless you plan on
writing any more to my code. If you cannot ‘view raw data’, you will need to download
the entire repository, and then only use the .exe in the specified directory.
Mac: For Mac users, download the GENe.zip file from github that resides in the
GENe_Mac folder. You can click directly on this file and then click on ‘view raw
data’ in order to download it directly. You do not need to worry about any other files
in this folder unless you plan on writing any more to my code. If you cannot
‘view raw data’, you will need to download the entire repository, and then only use
the zip file in the specified directory.
The first operations that are necessary for use of this program are found in the
top left corner of the window. The 'Open Excel File' button has the user choose
an excel workbook to read sequences out of. It is possible that the user may
choose a CSV file or variant that can be read by Excel and therefore has an
Excel icon, but in order for GENe to read the file, it has to be saved as an
Excel workbook. After the user selects an excel file to read from, the scroller
to the right labeled 'Column Containing Sequences' should be adjusted so that
the number showing is representative of the column in the excel spreadsheet that
contains all of the sequences that are to be BLASTed. It should be kept in mind
that the leftmost column is considered 'Column 0' in this arrangement.
The user must then choose a directory that they wish to save the results to. It
is imperative that this file end with .xls - if the user does not type it at the
end of the filename they choose, it will be added for them. They must not,
however, choose a filename that ends in some other extension than .xls or the
program will error. If the user has mistakenly misnamed their save directory and
did not realize until the end of a long calculation, a backup of their data can
be retrieved. More information about this is in section 5 of this readme.
At this point, the user must decide whether to run a local or NCBI server Blast,
whether to Use the Xenopus laevis algorithm or not, a maximum e-value, the type
of Blast they would like to perform, and over which database they would like to
do it. The defaults for these settings are server Blast because it requires no
installation other than this program, and the rest are set to accomplish the
task of querying the specific set of sequences that this program was originally
created for.
The details of each of these settings besides the e-value maximum will be
discussed in sections 2,3, and 4.
When all the settings are set to the user's satisfaction, the Run GENe button
should be pushed. In most cases, the program should run appropriately until
finished, but if the program errors or terminates early, the reasons why will
be written into an error log that exists in a folder called 'GENe' that should
be in your 'User' directory, also known as your Home Directory.
The Excel file that is the result of this program is organized to show to the
user three 'hits' for each sequence that were the results of the blast. How
these hits are chosen is discussed in section 4. The leftmost column is the
sequence that was blasted, and each hit has it's name (or title) listed, the
-value it scored, the accession number it has been assigned by NCBI, and,
hopefully, its shortened gene name. Collecting the shortened gene name as
metadata proved unsuccessful so I created a simple heuristic to scrape it from
the full name of the gene. It is almost certain that only about 3/4 of hits will
have a short gene name recorded, and among these it is likely that some are not
entirely correct. The two rightmost columns are updated for each gene if a
server Blast is being performed to give the user an idea of how long each query
is taking, and if a local blast is being performed, then only the very last row
will have this information and it will describe the length of the entire
operation.
If the user would like to check on their results before GENe has completed, they
should copy the file that they choose to write the results to an open up the
copy. They can also view the file directly, but if they do so, GENe will pause
until the file is closed again.
2) NCBI Server Blast
====================
This option is the default ticked because once this program is installed on a
computer, this is the only option that is ready to run. Using this option is
only advised for lists of sequences that are short or that did not return any
results when Blasted over a local database. This is because using the server
blast is incredibly slow and processes each sequence one by one. The process
time of a sequence is entirely dependent on the status of NCBI's servers and
whether or not other users are querying it at the same time. During the evening
and on weekends, an individual sequence query time can be as short as 5-15
seconds, but during peak hours it can reach as high as 1000 seconds or higher.
That being said, this option is definitely more reliable than running blast
locally, and likely will crash less often. On top of that, the results from this
option are almost always more complete than the local blast option because the
queries are run over a much greater portion of NCBI's databases by default,
whereas the local blast's capability to yield a result for each sequence is
dependent on how much of NCBI's database the user downloads to their machine.
Even so, I highly advise that if a list of sequences is any longer than 500
entries, a local Blast be used.
The Blast search types for both local and server queries are the same
- 'blastn, blastp, blastx, tblastn, tblastx'. Information about these types of
searches can be found here: http://blast.ncbi.nlm.nih.gov/Blast.cgi. The
databases that these searches run over, however, differ greatly between server
and local queries. For this reason, only the database that is specified for the
type of search that GENe is performing (server or local) will be considered when
it runs. The server databases are described in detail here:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide#db .
While many of the databases that appear on that page are represented in the GENe
program, I cannot guarantee that all will work. The only server database that I
can be sure works is the 'nr' database. The ability to search over the rest of
the databases relies upon whether or not BioPython supports them or not. Most
should, but there are likely a few that do not work.
3) BLAST+ Local Blast
=====================
Blasting sequences locally, while requiring some set up, pays off exponentially
(quite literally) as the list of sequences that need to be Blasted grows. Local
blast, unlike the server blast, process all sequences in a batch request, and is
much faster than the server option for this reason. The main drawback is that
while the server query option can quietly run in the background of a system
sending of requests slowly but surely, BLAST+ is very much a brute force method
and will consume all of the system's memory and processing power until finished.
This means that while this program much faster, it leaves the system unusable
for other purposes until it has finished. That cost is well worth the payoff
however as a sample of 32,000 sequences can be BLASTed locally in just over 24
hours whereas 32,000 sequences with the server option may take weeks, even months.
Depending on the databases selected for local blast, it is likely that percentage
of sequences will return no results. It is in this case that I advise that those
sequences that did not return anything from a local blast then be blasted on the
server, in order to obtain complete data.
VERY IMPORTANT: If the user wishes to close out of GENe while a local Blast is
executing in order to use their system, closing GENe WILL NOT close out the
local blast process. If GENe terminates before the local blast does, the user needs
to open their task manager and local the blast process manually in order to end it.
Much of the information about the BLAST+ Local Command Line Applications can be
found at this web address: http://www.ncbi.nlm.nih.gov/books/NBK1763/ , but I wi
ll describe here only what needs to be set up in order to use GENe. The BLAST+
suite is a set of tools that allow users to perform regular blast queries using
their operating system's command line. For users that are uncomfortable using
the command line, GENe is a great GUI alternative.
In order to use the BLAST+ option, the user must install it from the webpage in
the paragraph above. There are instructions that are specific to each operating
system, and they must be followed carefully. When BLAST+ is installed, settings
known as environment variables must be appropriately adjusted so that the
operating system's command line can access the BLAST+ executables and also the
databases that it is supposed to run over. Specific instructions on how to do
this are again found in the link above in the installation section.
In addition to having BLAST+ installed, a set of databases must be downloaded an
kept up to date, or custom databases may be created. This webpage
- ftp://ftp.ncbi.nlm.nih.gov/blast/db/ - contains all of NCBI's most current
databases. Descriptions of each database on that page are found here:
ftp://ftp.ncbi.nlm.nih.gov/blast/db/README. Each databases is split up into many
.##.tar.gz files that are each about a gigabyte large. The prefix before the
.##.tar.gz extension is the name of the database, and in order for that database
to be used, all of the files from 0 to the topmost ## must be downloaded and
extracted using an extracting tool like WinZip. The folder that contains the
extracted portions of the database(s) you have chosen needs to have a path
variable set to it so that the command line can directly access it. On windows,
a new system variable needs to be created named BLASTDB that directs to this
folder's directory. In MacOS and Unix systems, instructions on how to do this
can be found in the installation instructions referenced earlier.
When a database has either been downloaded or a custom database has been set up
and BLAST+ is fully configured, GENe is ready to run. Before running, the user
must type their database(s) into the 'Local Database:' box. If the user is only
using one database, then it should be the name of that database as it appears
before the .##.tar.gz extension. For example, if you want to use the nt database,
just write nt in the box. Same thing for any other database. If the user would
like to use multiple databases, then it is as simple as writing each one out with
a single space between each, no commas necessary. For example, in order
to blast the nr, sprot, and trembl databases at the same time, one would
write - nr sprot trembl - in the box (no hyphens). Keep in mind that when blasting
multiple databases, all databases must be of the same molecule type or BLAST+ will error.
If a user creates a custom database, instructions on how to blast it can be
found in the manual above. It should be as simple as just writing out the entire path
directory to wherever you have saved the database, ie. C:\Local Disk\.......\Database
4) Xenopus Laevis Algorithm / Top Three Blast Hits
==================================================
As stated before, this program was originally written for a team of researchers
at the College of William and Mary's Biology Department whose interest was in
cataloging several thousands sequences from the organism Xenopus laevis, the
African Clawed Frog. The original intention for this program that did not get
implemented to due to time constraints was to allow the user to customize their
own algorithm for sorting through the hits of a BLAST. A specialized algorithm
was instead developed for this team of researchers, and it works as follows:
Always record three hits:
If there exists a hit that is from the organism Xenopus laevis, let it be
recorded first. Do not record any more of this type.
If there exists a hit that is from the organism Xenopus tropicalis, let it be
recorded second. Do not record any more of this type.
After searching to find hits that are either from Xenopus laevis or tropicalis,
fill any blank spaces with the rest of the hits ordered by lowest e-value first.
This algorithm will always record at least three hits unless there are less than
three hits in the first place. It gives priority to hits from the expected
organisms, but also accounts for off-the-wall hits that have low e-values.
A user or developer has two alternatives to using this algorithm: they may
either simply uncheck the box that is labeled "Use Xenopus laevis algorithm" to
just record the top three hits, or they may access this programs source code and
construct a new algorithm. I have left the GENe code wide open for this specific
development - all that one would need to do is to local the .filterNames()
method in the GENe class located in GENe.py, and using the documentation from
Biopython that describes their BlastRecord class (found here in chapter 7.3
http://biopython.org/DIST/docs/tutorial/Tutorial.html) create a new sorting
algorithm. If you would like me to personally create a new sorting algorithm for
you, please feel free to contact me. I can't guarantee that I'll have the time
to help, but I will be eager to help if I do.
5) Backup Data and Error Checking
=================================
If an excel file is lost, there will always be a backup of the most recent GENe
operation performed saved as an excel file that will be present in the GENe folder
which is located in your 'User' directory, also known as your Home Directory.
This is only the case if you are on a Windows machine unfortunately because the
Mac will not allow programs like this one to create directories. Instead, the Mac
version saves a copy to this program’s working directory, which is very difficult
to access.
If the program terminates early, or errors for any reason, the error message is
written into a file called 'GENe Error Log.txt' that can also be found in the
same directory as the backup file if you are on Windows, and if you are on MacOS
then you can open it directly from the GUI. If GENe errors, the error log will
not update until you close GENe, so when it does error on Mac, close it down,
open it up again, and then check the error log button If an error persists and
cannot be resolved, please feel free to contact me at [email protected]