A simple Python program that downloads a genome assembly and extracts genes of interest
The program will read in a CSV file containing information on the organism and genes of interest.
The genome will be downloaded and saved, then parsed so that the genes of interest can be extracted into a new CSV file.
- Upon running the program, the user will be prompted to enter the name of their CSV file (including the .csv extension) containing information about the organism, the genome accession number, and the names of the genes of interest.
- The user will then be prompted to enter an email address, as is standard for the Entrez module in Biopython.
- The program will use information in the CSV file to check that the genome file does not already exist, and will then download the genome assembly.
- The genome assembly will be saved as a GenBank file in the same directory.
- The assembly will be read in, parsed, and the genes of interest will be extracted and saved to a new CSV file.
- A follow-up report will be printed on screen if any genes were not found.
- The program should then terminate successfully.
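The steps above can be sketched roughly as follows. This is a minimal illustration and not the program's actual code: the function names, the `<accession>.gb` filename, the `gbwithparts` return type, and matching on the `gene` qualifier of CDS features are all assumptions made for the example. Biopython is imported lazily so that the helpers that do not touch the network can be used on their own.

```python
from pathlib import Path


def genbank_path(accession, directory="."):
    """Return where the assembly would be saved (hypothetical naming scheme)."""
    return Path(directory) / f"{accession}.gb"


def download_if_missing(accession, email, directory="."):
    """Download the GenBank record unless a file for it is already present."""
    path = genbank_path(accession, directory)
    if path.exists():
        return path  # assumption: an existing file is reused as-is

    from Bio import Entrez  # lazy import: only needed for the network call

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="gbwithparts", retmode="text"
    )
    try:
        path.write_text(handle.read())
    finally:
        handle.close()
    return path


def extract_genes(records, wanted):
    """Return ({gene: sequence}, [genes not found]) for the wanted gene names.

    `records` is any iterable of SeqRecord-like objects, for example the
    result of Bio.SeqIO.parse(path, "genbank").
    """
    found = {}
    for record in records:
        for feature in record.features:
            if feature.type != "CDS":
                continue
            for name in feature.qualifiers.get("gene", []):
                if name in wanted and name not in found:
                    found[name] = str(feature.extract(record.seq))
    missing = [name for name in wanted if name not in found]
    return found, missing


def report_missing(missing):
    """Print a follow-up report only when some genes were not found."""
    for name in missing:
        print(f"NOT FOUND: {name}")
```

With Biopython installed, the pieces could be combined as `path = download_if_missing("CP009273.1", email)`, then `found, missing = extract_genes(SeqIO.parse(str(path), "genbank"), ["flgA"])`, followed by `report_missing(missing)`. Matching is exact and case-sensitive here, so a misspelled gene name ends up in the report.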
Everything will happen within a single directory.
Your input CSV file should be saved in the same directory that you are running the program from.
The only input required from the user is the CSV filename and an email address.
Note - Ensure you include the file extension (.csv) when you input the name of your file. The input is case- and space-sensitive, so enter the name exactly.
The CSV file should be structured as follows:
- Firstly, ensure that the CSV file is UTF-8 formatted.
- row 1, column 1: Genus species (and any substrain details if necessary), e.g. Escherichia coli K-12 strain BW25113
- row 2, column 1: The accession number of the genome, e.g. CP009273.1
- row 3, column 1: Should contain the header "gene" (not including the quotations).
- row 3, column 2: Should contain the header "description" (not including the quotations).
- row 4 to row n: Within the "gene" column, there should be gene names for the genes of interest, e.g. "flgA". Within the "description" column, there should be a description of the corresponding gene, e.g. Assembly protein for flagellar basal-body periplasmic P ring.
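As an illustration of the layout above, the following sketch reads such a file with Python's standard `csv` module. The function name `read_input_csv` is hypothetical, and skipping blank rows and trimming surrounding whitespace are assumptions of this example, not documented behaviour of the program.

```python
import csv


def read_input_csv(path):
    """Parse the input CSV into (organism, accession, {gene: description})."""
    # utf-8-sig reads plain UTF-8 and also tolerates a leading byte-order mark
    with open(path, newline="", encoding="utf-8-sig") as fh:
        # assumption: completely empty rows are ignored
        rows = [row for row in csv.reader(fh) if any(cell.strip() for cell in row)]

    if len(rows) < 4:
        raise ValueError("expected organism, accession, header and at least one gene row")

    organism = rows[0][0].strip()   # row 1, column 1
    accession = rows[1][0].strip()  # row 2, column 1

    header = [cell.strip() for cell in rows[2][:2]]  # row 3, columns 1 and 2
    if header != ["gene", "description"]:
        raise ValueError(f"row 3 must contain 'gene' and 'description', got {rows[2]!r}")

    genes = {}
    for row in rows[3:]:  # row 4 to row n
        name = row[0].strip()
        if name:
            genes[name] = row[1].strip() if len(row) > 1 else ""
    return organism, accession, genes
```

A file whose four lines are `Escherichia coli K-12 strain BW25113`, `CP009273.1`, `gene,description` and `flgA,Assembly protein for flagellar basal-body periplasmic P ring` would be read as the organism name, the accession number, and a one-entry dictionary keyed by `flgA`.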
Planned improvements:
- Proper error handling
- Save the extracted gene ID
- Multi-input file handling (work with more than one organism/genome and their genes of interest)
- Ability to search by locus tag and/or gene name and/or gene ID
- Second pass of the input CSV after correcting potential mistakes indicated by the "NOT FOUND" report at the end of the program