README.txt

GENe BLAST Automation and Gene Cataloger - June 2014
Nathan Owen, ncowen@email.wm.edu
Github and Source Files - https://github.com/newOnahtaN/Genie-GENe-Cataloger
========================================

This program was originally created for use by the Biology Department of the 
College of William and Mary. The GENe program was developed knowing the 
difficulty and tediousness of having to Blast many thousands of sequences and 
then having to sort through the hits of these Blasts by hand. The program 
confronts that challenge and is fully capable of handling as many sequences as 
necessary. GENe serves as a middleman between the user and NCBI's Blast servers 
or NCBI's local Blast tool, BLAST+, and accomplishes both of these tasks with 
the help of the open source Biopython module. This program works first and 
foremost as a recipient of excel files that contain a list of sequences that 
need to be BLASTed. Users may choose to Blast these sequences either locally 
or on NCBI's servers one by one, but both options have drawbacks that will be 
discussed in more detail in this README.  


Version Compatibility
=====================
Software for use:
-----------------
NCBI-BLAST+-2.2.29 

Software for development:
-------------------------
NCBI-BLAST+-2.2.29
Biopython, NumPy
wxPython
Python Excel: xlwt, xlrd


Table of Contents
=================

1. Common Use and Basic Instructions
2. NCBI Server Blast
3. BLAST+ Local Blast
4. Xenopus Laevis Algorithm / Top Three Blast Hits
5. Backup Data and Error Checking


1) Common Use and Basic Instructions
====================================

Installation: 

Windows: Download the GENe Setup.exe file from GitHub; it resides in the
GENe_Windows folder.  You can click directly on this file and then click on
‘view raw data’ in order to download it directly.  You do not need to worry about
any other files in this folder, including the installer script, unless you plan on 
modifying the code. If you cannot ‘view raw data’, you will need to download
the entire repository, and then use only the .exe in the specified directory.

Mac: Download the GENe.zip file from GitHub; it resides in the 
GENe_Mac folder. You can click directly on this file and then click on ‘view raw
data’ in order to download it directly. You do not need to worry about any other files
in this folder unless you plan on modifying the code.  If you cannot 
‘view raw data’, you will need to download the entire repository, and then use only
the zip file in the specified directory.

The first operations that are necessary for use of this program are found in the 
top left corner of the window. The 'Open Excel File' button has the user choose 
an Excel workbook to read sequences from.  It is possible that the user may 
choose a CSV file or variant that can be read by Excel and therefore has an 
Excel icon, but in order for GENe to read the file, it has to be saved as an 
Excel workbook. After the user selects an Excel file to read from, the scroller 
to the right labeled 'Column Containing Sequences' should be adjusted so that 
the number showing matches the column in the Excel spreadsheet that 
contains all of the sequences that are to be BLASTed.  Keep in mind 
that the leftmost column is considered 'Column 0' in this arrangement.

The user must then choose a file that they wish to save the results to.  It 
is imperative that this filename end with .xls - if the user does not type it at the 
end of the filename they choose, it will be added for them.  They must not, 
however, choose a filename that ends in some extension other than .xls or the 
program will raise an error. If the user has mistakenly misnamed their save file and 
did not realize until the end of a long calculation, a backup of their data can 
be retrieved. More information about this is in section 5 of this README.
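
The filename rule above can be sketched in a few lines of Python. This is only 
an illustration and is not taken from GENe's source: the function name 
normalize_save_name is hypothetical, and treating '.XLS' the same as '.xls' is 
an assumption.

```python
import os


def normalize_save_name(filename):
    """Return a results filename that ends in .xls.

    Hypothetical helper for illustration; not GENe's actual code.
    """
    root, ext = os.path.splitext(filename)
    if ext == "":
        # No extension was typed, so .xls is added for the user.
        return filename + ".xls"
    if ext.lower() == ".xls":
        # Already correct (case-insensitive match is an assumption).
        return filename
    # Any other extension is rejected up front.
    raise ValueError("Results file must end in .xls, got %r" % ext)
```

Checking the name before a long run starts avoids discovering the problem only 
after the calculation has finished.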

At this point, the user must decide whether to run a local or NCBI server Blast, 
whether to use the Xenopus laevis algorithm or not, a maximum e-value, the type 
of Blast they would like to perform, and over which database they would like to 
do it. The defaults for these settings are server Blast because it requires no 
installation other than this program, and the rest are set to accomplish the 
task of querying the specific set of sequences that this program was originally 
created for. 

The details of each of these settings besides the e-value maximum will be 
discussed in sections 2, 3, and 4. 

When all the settings are set to the user's satisfaction, the Run GENe button 
should be pushed.  In most cases, the program should run appropriately until 
finished, but if the program errors or terminates early, the reasons why will 
be written into an error log that exists in a folder called 'GENe' that should 
be in your 'User' directory, also known as your Home Directory. 


The Excel file that is the result of this program is organized to show to the 
user three 'hits' for each sequence that were the results of the blast. How 
these hits are chosen is discussed in section 4. The leftmost column is the 
sequence that was blasted, and each hit has its name (or title) listed, the 
e-value it scored, the accession number it has been assigned by NCBI, and, 
hopefully, its shortened gene name. Collecting the shortened gene name as 
metadata proved unsuccessful so I created a simple heuristic to scrape it from 
the full name of the gene. It is almost certain that only about 3/4 of hits will 
have a short gene name recorded, and among these it is likely that some are not 
entirely correct. The two rightmost columns are updated for each gene if a 
server Blast is being performed to give the user an idea of how long each query 
is taking, and if a local blast is being performed, then only the very last row 
will have this information and it will describe the length of the entire 
operation. 

If the user would like to check on their results before GENe has completed, they 
should copy the file that they chose to write the results to and open up the 
copy.  They can also view the file directly, but if they do so, GENe will pause 
until the file is closed again. 


2) NCBI Server Blast
====================

This option is ticked by default because once this program is installed on a 
computer, this is the only option that is ready to run. Using this option is 
only advised for lists of sequences that are short or that did not return any 
results when Blasted over a local database.  This is because the server 
blast is very slow and processes each sequence one by one.  The processing 
time of a sequence depends entirely on the status of NCBI's servers and 
whether or not other users are querying them at the same time.  During the evening 
and on weekends, an individual sequence query can take as little as 5-15 
seconds, but during peak hours it can reach 1000 seconds or more.  
That being said, this option is definitely more reliable than running blast 
locally, and likely will crash less often. On top of that, the results from this 
option are almost always more complete than the local blast option because the 
queries are run over a much greater portion of NCBI's databases by default, 
whereas the local blast's capability to yield a result for each sequence is 
dependent on how much of NCBI's database the user downloads to their machine. 
Even so, I highly advise that if a list of sequences is any longer than 500 
entries, a local Blast be used.

The Blast search types for both local and server queries are the same 
- 'blastn, blastp, blastx, tblastn, tblastx'.  Information about these types of 
searches can be found here: http://blast.ncbi.nlm.nih.gov/Blast.cgi.  The 
databases that these searches run over, however, differ greatly between server 
and local queries.  For this reason, only the database that is specified for the 
type of search that GENe is performing (server or local) will be considered when 
it runs. The server databases are described in detail here: 
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide#db . 
While many of the databases that appear on that page are represented in the GENe 
program, I cannot guarantee that all will work.  The only server database that I 
can be sure works is the 'nr' database. The ability to search over the rest of 
the databases depends on whether Biopython supports them. Most 
should, but there are likely a few that do not work.


3) BLAST+ Local Blast
=====================

Blasting sequences locally, while requiring some setup, pays off more and more 
as the list of sequences that need to be Blasted grows. Local 
blast, unlike the server blast, processes all sequences in a batch request, and is 
much faster than the server option for this reason. The main drawback is that 
while the server query option can quietly run in the background of a system, 
sending off requests slowly but surely, BLAST+ is very much a brute force method 
and will consume all of the system's memory and processing power until finished. 
This means that while this option is much faster, it leaves the system unusable 
for other purposes until it has finished. That cost is well worth the payoff, 
however, as a sample of 32,000 sequences can be BLASTed locally in just over 24 
hours whereas 32,000 sequences with the server option may take weeks, even months.
Depending on the databases selected for local blast, it is likely that a percentage
of sequences will return no results.  In this case, I advise that those
sequences that did not return anything from a local blast then be blasted on the 
server, in order to obtain complete data. 
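
To make the server-side cost concrete, the short calculation below converts a 
per-query time into a total run time for queries that are sent one at a time. 
It uses only the figures quoted in this README (32,000 sequences; 5, 15, and 
1000 seconds per query); the function name total_days is hypothetical, and a 
fixed per-query time is a simplifying assumption, since real server times vary 
throughout the day.

```python
def total_days(num_sequences, seconds_per_query):
    """Days needed when queries run one at a time at a fixed rate.

    Simplified model: assumes every query takes the same time.
    """
    seconds_per_day = 86400.0
    return num_sequences * seconds_per_query / seconds_per_day


# Per-query times quoted in section 2 of this README.
for rate in (5, 15, 1000):
    print("%4d s/query -> %.1f days" % (rate, total_days(32000, rate)))
```

At a steady 5-15 seconds per query, 32,000 sequences take roughly two to six 
days; at a sustained 1000 seconds per query the same list would take about 370 
days. Real runs fall somewhere in between.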

VERY IMPORTANT: If the user wishes to close out of GENe while a local Blast is 
executing in order to use their system, closing GENe WILL NOT close out the 
local blast process. If GENe terminates before the local blast does, the user needs
to open their task manager and locate the blast process manually in order to end it. 

Much of the information about the BLAST+ Local Command Line Applications can be 
found at this web address: http://www.ncbi.nlm.nih.gov/books/NBK1763/ , but I will 
describe here only what needs to be set up in order to use GENe. The BLAST+ 
suite is a set of tools that allow users to perform regular blast queries using 
their operating system's command line.  For users that are uncomfortable using 
the command line, GENe is a great GUI alternative.  

In order to use the BLAST+ option, the user must install it from the webpage in
the paragraph above. There are instructions that are specific to each operating 
system, and they must be followed carefully. When BLAST+ is installed, settings 
known as environment variables must be appropriately adjusted so that the 
operating system's command line can access the BLAST+ executables and also the 
databases that it is supposed to run over. Specific instructions on how to do 
this are again found in the link above in the installation section.

In addition to having BLAST+ installed, a set of databases must be downloaded and 
kept up to date, or custom databases may be created. This FTP directory 
- ftp://ftp.ncbi.nlm.nih.gov/blast/db/ - contains all of NCBI's most current 
databases. Descriptions of each database on that page are found here: 
ftp://ftp.ncbi.nlm.nih.gov/blast/db/README. Each database is split up into many 
.##.tar.gz files that are each about a gigabyte in size. The prefix before the 
.##.tar.gz extension is the name of the database, and in order for that database 
to be used, all of the files from 0 to the topmost ## must be downloaded and 
extracted using an extracting tool like WinZip. The folder that contains the 
extracted portions of the database(s) you have chosen needs to have an environment 
variable set to it so that the command line can directly access it. On Windows, 
a new system variable needs to be created named BLASTDB that points to this 
folder. On MacOS and Unix systems, instructions on how to do this 
can be found in the installation instructions referenced earlier.
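
As a rough illustration of the "all files from 0 to the topmost ##" rule, the 
sketch below lists the archive names that make up one database and unpacks a 
downloaded volume with Python's standard tarfile module instead of WinZip. The 
function names are hypothetical, and the two-digit, zero-padded volume number 
is an assumption based on the .##.tar.gz pattern described above; check the 
actual file listing before relying on it.

```python
import tarfile


def volume_names(database, last_volume, width=2):
    """List archive names for volumes 0 through last_volume, inclusive.

    width=2 assumes two-digit numbering such as nt.00.tar.gz.
    """
    return ["%s.%s.tar.gz" % (database, str(i).zfill(width))
            for i in range(last_volume + 1)]


def extract_volume(archive_path, dest_dir):
    """Unpack one downloaded .tar.gz volume into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


# Example: a database whose topmost volume is number 2.
print(volume_names("nt", 2))
```

Extracting every volume into the same folder keeps the whole database in one 
place for the BLASTDB variable to point at.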

When a database has either been downloaded or a custom database has been set up 
and BLAST+ is fully configured, GENe is ready to run.  Before running, the user 
must type their database(s) into the 'Local Database:' box. If the user is only 
using one database, then it should be the name of that database as it appears
before the .##.tar.gz extension. For example, if you want to use the nt database,
just write nt in the box.  The same applies to any other database. If the user would 
like to use multiple databases, then it is as simple as writing each one out with 
a single space between each, no commas necessary. For example, in order 
to blast the nr, sprot, and trembl databases at the same time, one would 
write - nr sprot trembl - in the box (no hyphens). Keep in mind that when blasting
multiple databases, all databases must be of the same molecule type or BLAST+ will error. 
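
For readers curious how the contents of the 'Local Database:' box might map onto 
a BLAST+ command line, here is a minimal sketch. It is not GENe's actual code 
(GENe may invoke BLAST+ differently, for example through Biopython), and the 
function name and file names are hypothetical. The flags used (-query, -db, 
-evalue, -outfmt, -out) are standard BLAST+ options, and -outfmt 5 requests XML 
output.

```python
def build_blast_command(program, query_fasta, databases_box, max_evalue, out_xml):
    """Build a BLAST+ argument list from the text typed into the database box.

    Hypothetical helper for illustration; not taken from GENe's source.
    """
    # Collapse any extra whitespace so names are separated by single spaces.
    databases = " ".join(databases_box.split())
    if not databases:
        raise ValueError("At least one database name is required")
    return [
        program,
        "-query", query_fasta,
        "-db", databases,            # several databases go in ONE argument
        "-evalue", str(max_evalue),
        "-outfmt", "5",              # XML output
        "-out", out_xml,
    ]


# Example with placeholder file names.
print(build_blast_command("blastp", "queries.fasta", "nr sprot  trembl",
                          1e-5, "results.xml"))
```

Passing the command as a list (for example to subprocess) keeps the 
space-separated database names together as a single -db value, which is why no 
commas are needed in the box.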

If a user creates a custom database, instructions on how to blast it can be 
found in the manual above. It should be as simple as writing out the full path
to wherever you have saved the database, i.e. C:\Local Disk\.......\Database

4) Xenopus Laevis Algorithm / Top Three Blast Hits
==================================================

As stated before, this program was originally written for a team of researchers 
at the College of William and Mary's Biology Department whose interest was in 
cataloging several thousand sequences from the organism Xenopus laevis, the 
African Clawed Frog. The original intention for this program, which did not get 
implemented due to time constraints, was to allow the user to customize their 
own algorithm for sorting through the hits of a BLAST.  A specialized algorithm 
was instead developed for this team of researchers, and it works as follows:


Always record three hits:

If there exists a hit that is from the organism Xenopus laevis, let it be 
recorded first. Do not record any more of this type.

If there exists a hit that is from the organism Xenopus tropicalis, let it be 
recorded second. Do not record any more of this type.

After searching to find hits that are either from Xenopus laevis or tropicalis, 
fill any blank spaces with the rest of the hits ordered by lowest e-value first.


This algorithm will always record three hits unless there are fewer than
three hits in the first place.  It gives priority to hits from the expected 
organisms, but also accounts for off-the-wall hits that have low e-values. 
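
The rules above can be sketched in plain Python as follows. This is an 
illustration only, not the code in GENe.py: the Hit type and the pick_three 
function are hypothetical, detecting the organism by searching the hit title is 
an assumption, and so is the reading that leftover Xenopus hits are skipped 
when filling blank slots ("Do not record any more of this type").

```python
from collections import namedtuple

# title: full hit description; evalue: expect value (lower is better).
Hit = namedtuple("Hit", ["title", "evalue"])


def pick_three(hits):
    """Choose up to three hits, giving X. laevis and X. tropicalis priority."""
    ranked = sorted(hits, key=lambda h: h.evalue)  # best e-value first

    def is_laevis(hit):
        return "xenopus laevis" in hit.title.lower()

    def is_tropicalis(hit):
        return "xenopus tropicalis" in hit.title.lower()

    # Slot 1: best X. laevis hit; slot 2: best X. tropicalis hit.
    laevis = next((h for h in ranked if is_laevis(h)), None)
    tropicalis = next((h for h in ranked if is_tropicalis(h)), None)
    slots = [laevis, tropicalis, None]

    # Fill blank slots with the remaining non-Xenopus hits, lowest e-value first.
    others = iter(h for h in ranked
                  if not is_laevis(h) and not is_tropicalis(h))
    for i, slot in enumerate(slots):
        if slot is None:
            slots[i] = next(others, None)

    # Fewer than three usable hits simply yields a shorter list.
    return [h for h in slots if h is not None]
```

For example, given hits for Homo sapiens (1e-50), Mus musculus (1e-30), Xenopus 
tropicalis (1e-20), and Xenopus laevis (1e-10), this sketch records the laevis 
hit first, the tropicalis hit second, and the Homo sapiens hit third.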


A user or developer has two alternatives to using this algorithm: they may 
either simply uncheck the box that is labeled "Use Xenopus laevis algorithm" to 
just record the top three hits, or they may access this program's source code and 
construct a new algorithm.  I have left the GENe code wide open for this specific 
development - all that one would need to do is to locate the .filterNames() 
method in the GENe class located in GENe.py, and using the documentation from 
Biopython that describes their BlastRecord class (found here in chapter 7.3 
http://biopython.org/DIST/docs/tutorial/Tutorial.html) create a new sorting 
algorithm. If you would like me to personally create a new sorting algorithm for 
you, please feel free to contact me.  I can't guarantee that I'll have the time 
to help, but I will be eager to help if I do.


5) Backup Data and Error Checking
=================================

If an Excel file is lost, there will always be a backup of the most recent GENe 
operation performed, saved as an Excel file in the GENe folder, 
which is located in your 'User' directory, also known as your Home Directory. 
Unfortunately, this is only the case on a Windows machine, because the
Mac will not allow programs like this one to create directories. Instead, the Mac
version saves a copy to this program’s working directory, which is very difficult 
to access.

If the program terminates early, or errors for any reason, the error message is 
written into a file called 'GENe Error Log.txt' that can also be found in the 
same directory as the backup file if you are on Windows, and if you are on MacOS 
then you can open it directly from the GUI.  If GENe errors, the error log will 
not update until you close GENe, so when it does error on Mac, close it down, 
open it up again, and then click the error log button. If an error persists and 
cannot be resolved, please feel free to contact me at ncowen@email.wm.edu.
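
On Windows, the locations described in this section can be found with a few 
lines of Python. This is a convenience sketch rather than part of GENe: the 
function names are hypothetical, and only the folder name 'GENe' and the file 
name 'GENe Error Log.txt' come from this README.

```python
import os


def gene_folder():
    """Path of the 'GENe' folder inside the user's home directory."""
    return os.path.join(os.path.expanduser("~"), "GENe")


def error_log_path():
    """Path of the error log described in this section."""
    return os.path.join(gene_folder(), "GENe Error Log.txt")


print(error_log_path())
```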