Progress Log.txt

GENe RNA Catalogger
===================

The Biology Department of the College of William and Mary is conducting research using many samples of RNA from the species Xenopus laevis, otherwise known as the African clawed frog. The research group is headed by Dr. Saha, and they previously used a software tool known as BLAST, provided by the National Center for Biotechnology Information, to catalog each of their unidentified RNA sequences. Each RNA sequence had to be submitted to this tool individually, and one of the researchers then had to identify the best candidate among the many that BLAST returns. After initially attempting to do this by hand with their 32,000 samples, Dr. Saha decided she would like the process automated. This will be my first extracurricular software development project as a rising sophomore at the College of William and Mary, and I hope to post frequent updates.

Required Software for Use and Development
=========================================
Required for Development:
Biopython
NumPy
wxPython
Python Excel: xlwt, xlrd, xlutils

Required for Use:
To run locally, including the ability to update: Perl, BLAST+, the nr and nt databases, and miscellaneous NCBI databases, depending on how this program's local BLAST abilities are used.


Outline
=======
The program will need to perform three tasks:

1) Implement the decision-making process that the biologists themselves would use for candidate selection. This will be achieved using a decision tree. All information relevant to the chosen candidates will be recorded in a CSV/Excel file. This README should eventually contain a flow chart of the decision tree for the user's reference.

2) Remotely send query requests to the BLAST servers. This will be achieved using the Python library Biopython. This library will also aid in parsing the returned candidates for data.

3) Present itself to the user as a comprehensive GUI, complete with a loading bar for the inevitably long run times, a file selection dialog, and column selection options. The program only needs a list of sequences to run, so if the user submits a large Excel file with many columns where only one is a list of sequences, the program should be able to handle that. The implementation of this portion of the project should be relatively simple, but it is listed here as a major feature because it will be the first time that I have created a GUI. I will use wxPython for this.

*4) Potential features depending on time: toggle switches for certain settings such as the statistical significance e-value and the number of candidates recorded, an entry bar to replace the default search words 'xenopus laevus' and 'xenopus tropicalus' with other search words, and options for the types of data to be recorded.  An ideal but difficult feature to implement would be some way for the users to define and change the existing data structure, in case their research interests change or another research team would like to use this program.
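
Task 2 above, sending a sequence to the BLAST servers and parsing what comes back, might look roughly like the following. This is a minimal sketch and not the GENe source: it assumes Biopython's `NCBIWWW.qblast` and `NCBIXML.read`, and the function names, the `blastn` program, and the `nt` database are illustrative choices. The stand-in record at the bottom only mimics the attribute names of a parsed BLAST record so the helper can be tried without network access.

```python
from types import SimpleNamespace


def fetch_blast_record(sequence, program="blastn", database="nt"):
    """Send one sequence to the NCBI BLAST servers and parse the XML reply.

    Requires Biopython and network access; a single request may take minutes.
    """
    # Imported lazily so the pure helper below works without Biopython.
    from Bio.Blast import NCBIWWW, NCBIXML

    handle = NCBIWWW.qblast(program, database, sequence)
    try:
        return NCBIXML.read(handle)
    finally:
        handle.close()


def summarize_hits(record, limit=3):
    """Return (title, best e-value) for the first `limit` alignments."""
    rows = []
    for alignment in record.alignments[:limit]:
        best_evalue = min(hsp.expect for hsp in alignment.hsps)
        rows.append((alignment.title, best_evalue))
    return rows


# Offline stand-in with the same attribute names as a parsed BLAST record.
fake_record = SimpleNamespace(
    alignments=[
        SimpleNamespace(
            title="hit A",
            hsps=[SimpleNamespace(expect=1e-30), SimpleNamespace(expect=1e-5)],
        )
    ]
)
print(summarize_hits(fake_record))
```

With a real connection, `summarize_hits(fetch_blast_record(sequence))` would produce the same kind of rows for one sequence at a time.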

Version One
-----------
Version one will only have tasks one and two above implemented.

Update 1:  The implementation of this process proved to be much simpler than expected.  However, one large problem remains: queries are incredibly slow. Each query takes 17-100 seconds, which would put the run time over all 32,000 sequences anywhere from about six days to over a month.  Since the main portion of the program was easier to implement than expected, I should have time to add a feature that lets the queries run either online or against a local database.  Running locally would require a large amount of storage, so I will leave it up to Dr. Saha to decide which is more valuable to her: time or storage.  Progress so far is encouraging.
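
The run-time estimate can be checked with a few lines of arithmetic. This sketch only assumes that the queries run strictly one after another, using the 17 and 100 second figures and the 32,000 sequences mentioned above; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_days(num_sequences, seconds_per_query):
    """Days needed to run the queries one after another."""
    return num_sequences * seconds_per_query / SECONDS_PER_DAY


fastest = total_days(32_000, 17)
slowest = total_days(32_000, 100)
print(f"{fastest:.1f} to {slowest:.1f} days")  # prints "6.3 to 37.0 days"
```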

Update 2:  Today's work added the ability to read from and write to Excel files.  Right now, the exact directory has to be supplied in the code for it to read from the right place, and the input has to be an Excel file whose entire first column is sequences.  This will be incredibly easy to change when development of the GUI begins. I have uploaded an example of what the returned Excel file should look like when all is said and done.
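
The idea of pulling the sequences out of one column and writing results back out can be sketched as follows. To stay self-contained, this example swaps the Excel libraries for Python's standard `csv` module and works on in-memory text; the function names, the column layout, and the sample rows are all made up for illustration and are not the GENe code.

```python
import csv
import io


def read_sequence_column(csv_text, column=0):
    """Return the non-empty cells of one column as a list of sequences."""
    reader = csv.reader(io.StringIO(csv_text))
    sequences = []
    for row in reader:
        if column < len(row) and row[column].strip():
            sequences.append(row[column].strip())
    return sequences


def write_results(rows):
    """Serialize (sequence, best hit) rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["sequence", "best_hit"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample: the sequences live in the second column, one cell is empty.
sample = "id1,ACGUACGU\nid2,GGCAUUCG\nid3,\n"
sequences = read_sequence_column(sample, column=1)
print(sequences)
```

Passing the column index as a parameter is what would later let a GUI offer column selection instead of assuming the first column.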

Update 3:  Here I will document a portion of the correspondence between myself and the biologists:

	Good afternoon Sam,

	All day yesterday, I was wrangling with some really tough problems with this program. Unfortunately, the program takes a VERY long time, and no matter what I seem to do, or how I approach the problem, it seems to be this way.  I'm going to need you to pick the better of two inconveniences, and feel free to run it by Dr. Saha if you need to.

	After much research, it seems entirely impossible to send batch queries to the NCBI database, not because it is too difficult to set up, but because NCBI has specifically set up provisions to keep people from doing so.  As a result, anyone who wishes to automate this process as we are has to send the sequences one by one, and wait in line in the portal with everyone else who wishes to do so.  As far as I can tell, this portal is entirely different from the one hosted on the NCBI website, which is why a query there takes but a few seconds, but a query through the provided server portal takes hundreds. To give you an idea of exactly how long each query is taking, I have attached an Excel spreadsheet where the two rightmost columns are load times for each sequence.  You'll see that these daytime queries are minutes long.  When this program was first written, I frequently saw times below 20 seconds, but recently I have been seeing times like these.  What this essentially means above anything else is that the server is entirely unpredictable.  It will get the job done, but the time it takes is extremely variable.

	The alternative is something you already know.  In a vain attempt to combat load times, I set up the database locally on my computer, which, while difficult, was not as impossible as we first thought. Some thoughts on this alternative: a query on a computer such as mine (I have a very high end computer) takes about a minute. That seems to be a bit faster than the online query at most times, but occasionally, as I expect to see on weekends, the online query process is faster than this.  If you wanted to set it up locally on your Mac, it would take a bit more time than it would to set up on a Windows PC, because my program is written for Windows and the software that allows one to run BLAST locally is executed from the command line.  Here's the big one: unlike the server queries, which just send off an HTTP request in the background and use no more than 30 MB of RAM, this process would essentially need a dedicated processor because it eats up 100% of all resources on the computer.  Like I said, my computer is rather high end and it locked up almost immediately while running the BLAST algorithm. Essentially, this alternative may speed up the process, but only marginally, and not without some effort.

	I am very sorry to say that while this program will definitely work, regardless of the route taken, it seems like it is going to take a very very long time, and on top of that, that amount of time cannot be easily predicted.  I hope that you will be able to instruct me on which route you would like me to take, server queries or local queries.  I will do my best to make whatever you choose perform the best it can.

As was stated in this correspondence, the program works well, but I am having difficulty speeding it up. Almost two days went into setting up the NCBI databases locally to run a local version of the BLAST algorithm, but in the end all of the effort seemed to be wasted because it really didn't speed up the process very much at all.
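
For reference, a local query through the BLAST+ command line tools can be driven from Python roughly like this. It is a sketch under assumptions, not the GENe code: it assumes the `blastn` executable from BLAST+ is on the PATH and that a local `nt` database is configured, and the file names and function names are hypothetical. The `-num_threads` option is included because capping the thread count is one possible way to keep a local search from taking over the whole machine.

```python
import subprocess


def build_blastn_command(query_fasta, database, out_xml,
                         evalue=1e-3, num_threads=1):
    """Build the argument list for a local BLAST+ nucleotide search."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-out", out_xml,
        "-outfmt", "5",  # 5 = XML output
        "-evalue", str(evalue),
        "-num_threads", str(num_threads),
    ]


def run_local_blast(command):
    """Run BLAST+ and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)


# Hypothetical file names; nothing is executed until run_local_blast is called.
command = build_blastn_command("query.fasta", "nt", "result.xml")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.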

Version Two
-----------
This version implements task 3 from the outline above and is object-oriented.

Update 1: After yesterday's frustration came a lot of progress today.  Among the improvements: I rearranged GENe.py to be object-oriented, created the beginning of the GUI file, and enabled both local and server queries instead of a choice between the two. The program is also now able to find the short gene names of ~50% of the sequences. I am satisfied with today's accomplishments, as the program is now much more robust and is nearing completion.

Update 2: The GUI now accepts an open file and a save file, and can run the program using that information.  At this point, the program essentially works to satisfaction, but needs to be tidied up quite a bit. Because of unexpected circumstances, I will need to finish this project early, so most of the features mentioned in point 4 of the outline will go unimplemented, as will the load bar, which is not feasible with the local BLAST anyway. What will be implemented is the e-value cap, and though there will be no way for the user to change the sorting algorithm for the BLAST hits, they will have the option to choose between Dr. Saha's algorithm and simply returning the top three hits. That portion of the code has also been left wide open for other sorting algorithms to be developed in the future. It has also turned out that the local BLAST queries take considerably less time than previously expected. The next update will include more work on the GUI side of things, and hopefully the beginnings of an actual user-friendly README, rather than this progress log.
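
One way to keep the hit-sorting step open for other algorithms is to pass the strategy in as a function. The sketch below is an assumption-laden illustration, not the GENe code: Dr. Saha's algorithm is not reproduced here, `keyword_first` is a hypothetical alternative, "top three" is interpreted as the three smallest e-values, and the sample hits and the default cap are made up.

```python
from collections import namedtuple

Hit = namedtuple("Hit", ["title", "evalue"])


def top_three(hits):
    """Return the three hits with the smallest e-values."""
    return sorted(hits, key=lambda hit: hit.evalue)[:3]


def keyword_first(hits, keywords=("xenopus",)):
    """Hypothetical alternative: keyword matches first, then by e-value."""
    def rank(hit):
        matches = any(word in hit.title.lower() for word in keywords)
        return (not matches, hit.evalue)
    return sorted(hits, key=rank)[:3]


def select_candidates(hits, strategy=top_three, evalue_cap=1e-3):
    """Drop hits above the e-value cap, then apply the chosen strategy."""
    kept = [hit for hit in hits if hit.evalue <= evalue_cap]
    return strategy(kept)


# Made-up sample hits for illustration only.
hits = [
    Hit("Homo sapiens gene A", 1e-50),
    Hit("Xenopus laevis gene B", 1e-20),
    Hit("Mus musculus gene C", 1e-40),
    Hit("Danio rerio gene D", 1e-30),
    Hit("Gallus gallus gene E", 0.5),
]
print([hit.title for hit in select_candidates(hits)])
print([hit.title for hit in select_candidates(hits, strategy=keyword_first)])
```

Adding another sorting algorithm then only means writing one more function with the same signature and passing it as `strategy`.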

Update 3: I couldn't be happier!  The program has a fully usable and polished GUI.  All the features that I had hoped the program would have at a base level are present, plus a few extras.  The GUI is pleasant looking, and I managed to create it in a standalone Python script, which will allow future developers (if there are any) to poke around in the main GENe Python script without having to worry about the GUI getting in the way. In this update I also added a help button that opens the README.  At this stage, the program is all but complete.  The last steps are to turn the Python scripts into an executable and bundle it with a sufficient README, which has not been written yet. Once those two tasks are completed, the project will have entered its third and final version for this summer.

Version Three
-------------

Update 1: At this point, GENe is complete.  I have presented the program to the biologists, and they seemed to be very happy with the product, as am I.  There are still more than a few things that need to be done, but at this point I am comfortable calling them bugs rather than portions of an uncompleted project.  The remaining items are: write better error messages (popups); make the program write to an unprotected directory instead of its own directory, so that it does not need administrative privileges to run; remove the need to put a certain type of quote around the local database selection; make the README easier to read; and create an installer for macOS and make sure that everything works on that operating system. I will fix these bugs over the weekend, as I need to start packing to go home.

Update 2: I couldn't wait. I had to get the bugs out of the way.  Apart from making the program work on a Mac, which I can't do right now anyway, I cleaned up all of those bugs and everything should be working just fine now.  I'm going to write my final blog post for this project tonight, and at this point I think it is safe to say that this project is done.