gene-extractor

A simple Python program that downloads a genome assembly and extracts genes of interest

1 - How it works

1.1 - The gist

The program will read in a CSV file containing information on the organism and genes of interest.
The genome will then be downloaded and saved, and the program will parse it to extract the genes of interest into a new CSV file.

1.2 - In depth

Upon running the program, the user will be prompted to enter the name of their CSV file (including the .csv file extension). This file contains the organism name, the genome accession number, and the names of the genes of interest.
The user will then be prompted to enter an email address, as is standard for the Entrez module in Biopython.
The program will use the information in the CSV file to check that the genome file does not already exist, and will then download the genome assembly.
The genome assembly will be saved in the same directory as a GenBank file.
The assembly will be read in, parsed, and the genes of interest will be extracted and saved to a new CSV file.
A follow-up report will be printed on screen if any genes were not found.
The program should then terminate successfully.
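
The steps above could be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the function names, the `<accession>.gb` file naming, the choice to reuse an existing file rather than download it again, and the choice to match on the `gene` qualifier of CDS features are all assumptions. The Biopython calls shown (`Entrez.efetch`, `feature.extract`) exist, but the real script may use them differently.

```python
from pathlib import Path


def genbank_path(accession, directory="."):
    # Assumed naming scheme: one GenBank file per accession number.
    return Path(directory) / f"{accession}.gb"


def download_genome(accession, email, directory="."):
    """Return the local GenBank file, downloading it only if it is missing."""
    target = genbank_path(accession, directory)
    if target.exists():
        return target  # assumption: an existing file is reused, not overwritten

    # Imported here so the rest of the sketch runs without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks Entrez users to identify themselves
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gbwithparts", retmode="text")
    try:
        target.write_text(handle.read())
    finally:
        handle.close()
    return target


def extract_genes(record, wanted):
    """Split the wanted gene names into found sequences and missing names."""
    found = {}
    for feature in record.features:
        if feature.type != "CDS":
            continue
        for name in feature.qualifiers.get("gene", []):
            if name in wanted and name not in found:
                found[name] = str(feature.extract(record.seq))
    missing = [name for name in wanted if name not in found]
    return found, missing
```

With Biopython installed, the pieces would be combined along the lines of `record = SeqIO.read(str(download_genome(accession, email)), "genbank")` followed by `found, missing = extract_genes(record, gene_names)`, with `missing` feeding the on-screen report.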

1.3 - How to use

Everything will happen within a single directory.
Your input CSV file should be saved in the same directory that you are running the program from.
The only input required from the user is the CSV filename and an email address.

Note - Ensure you include the file extension (.csv) when you enter the name of your file. The input is case and space sensitive, so enter the name exactly.
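
As a rough illustration of why the exact name matters, a check along these lines would reject a missing extension or a stray space before any download starts. The helper name `resolve_input_csv` and its error messages are hypothetical; this README does not describe how the script itself validates input.

```python
from pathlib import Path


def resolve_input_csv(name, directory="."):
    """Return the path to the input CSV, or raise if the name is unusable."""
    # No stripping or case folding: the name must match exactly as typed.
    if not name.endswith(".csv"):
        raise ValueError(f"{name!r} does not end with the .csv extension")
    path = Path(directory) / name
    if not path.is_file():
        raise FileNotFoundError(f"{name!r} was not found in {directory!r}")
    return path
```

A caller could pass the result of `input()` straight to this helper, e.g. `resolve_input_csv(input("CSV filename: "))`.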

2 - How to format your CSV file

The CSV file should be structured as follows:

First, ensure that the CSV file is UTF-8 encoded.
row 1, column 1: Genus and species (and any substrain details if necessary), e.g. Escherichia coli K-12 strain BW25113
row 2, column 1: The accession number of the genome, e.g. CP009273.1
row 3, column 1: Should contain the header "gene" (not including the quotation marks).
row 3, column 2: Should contain the header "description" (not including the quotation marks).
row 4 to row n, column 1: The "gene" column should contain the names of the genes of interest, e.g. flgA
row 4 to row n, column 2: The "description" column should contain a description of the corresponding gene, e.g. Assembly protein for flagellar basal-body periplasmic P ring
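
To make the layout concrete, here is a minimal sketch that reads a file in this format using only the standard library. The names `GeneRequest`, `parse_gene_request`, and `read_gene_request` are hypothetical and are not taken from the program itself; the sample content reuses the examples given above.

```python
import csv
import io
from dataclasses import dataclass

SAMPLE_CSV = (
    "Escherichia coli K-12 strain BW25113\n"
    "CP009273.1\n"
    "gene,description\n"
    "flgA,Assembly protein for flagellar basal-body periplasmic P ring\n"
)


@dataclass
class GeneRequest:
    organism: str
    accession: str
    genes: dict  # gene name -> description


def parse_gene_request(lines):
    """Parse the layout described above from any iterable of text lines."""
    rows = list(csv.reader(lines))
    if len(rows) < 3 or not rows[0] or not rows[1]:
        raise ValueError("expected organism, accession, and header rows")
    header = [cell.strip() for cell in rows[2][:2]]
    if header != ["gene", "description"]:
        raise ValueError(f"unexpected header row: {rows[2]!r}")
    genes = {}
    for row in rows[3:]:
        if not row or not row[0].strip():
            continue  # skip blank lines
        description = row[1].strip() if len(row) > 1 else ""
        genes[row[0].strip()] = description
    return GeneRequest(rows[0][0].strip(), rows[1][0].strip(), genes)


def read_gene_request(path):
    # utf-8-sig also accepts files saved with a byte-order mark.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return parse_gene_request(handle)


request = parse_gene_request(io.StringIO(SAMPLE_CSV))
print(request.accession, list(request.genes))
```

Parsing from an iterable of lines keeps the format check separate from file handling, so the same function works on a file handle or an in-memory string.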

3 - Possible updates to come

Proper error handling
Save the extracted gene ID
Multi-input file handling (work with more than one organism/genome and their genes of interest)
Ability to search by locus tag and/or gene name and/or gene ID
Second pass of input CSV after correcting potential mistakes indicated by "NOT FOUND" report at end of program
