This repository has been archived by the owner on Aug 28, 2024. It is now read-only.

dimmerz92/gene-extractor

gene-extractor

A simple Python program that downloads a genome assembly and extracts genes of interest.

1 - How it works

1.1 - The gist

The program reads a CSV file containing information on the organism and the genes of interest.
It downloads and saves the genome, then parses it and extracts the genes of interest into a new CSV file.

1.2 - In depth

  1. When the program starts, the user is prompted for the name of their CSV file (including the .csv file extension). This file contains the organism details, the genome accession number, and the names of the genes of interest.
  2. The user is then prompted for an email address, as is standard for the Entrez module in Biopython.
  3. The program checks that a file based on the information in the CSV file does not already exist, then downloads the genome assembly.
  4. The genome assembly is saved as a GenBank file in the same directory.
  5. The assembly is read in and parsed, and the genes of interest are extracted and saved to a new CSV file.
  6. If any genes were not found, a follow-up report is printed on screen.
  7. The program then terminates.
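
Steps 3 to 6 can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the function names, the `<accession>.gb` file name, the `gbwithparts` return type, and the `gene`/`sequence` output columns are all assumptions. Only `download_genome` needs Biopython; `extract_genes` works on any records that look like Biopython `SeqRecord` objects, for example those yielded by `SeqIO.parse(path, "genbank")`.

```python
import csv
from pathlib import Path


def download_genome(accession, email, directory="."):
    """Fetch a GenBank record from NCBI unless it is already on disk."""
    path = Path(directory) / f"{accession}.gb"  # assumed naming scheme
    if path.exists():
        return path  # skip the download when the file is already present
    # Imported here so the helpers below can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks Entrez users to identify themselves
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="gbwithparts", retmode="text"
    )
    try:
        path.write_text(handle.read())
    finally:
        handle.close()
    return path


def extract_genes(records, wanted):
    """Return ({gene: sequence}, [genes not found]) for the wanted names."""
    found = {}
    for record in records:
        for feature in record.features:
            if feature.type != "gene":
                continue
            # The "gene" qualifier holds a list of names for the feature.
            for name in feature.qualifiers.get("gene", []):
                if name in wanted and name not in found:
                    found[name] = str(feature.extract(record.seq))
    missing = [name for name in wanted if name not in found]
    return found, missing


def write_genes(found, out_path):
    """Save the extracted genes to a CSV file (assumed column layout)."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["gene", "sequence"])
        writer.writerows(found.items())


def report_missing(missing):
    """Print a follow-up report for genes that were not found."""
    for name in missing:
        print(f"NOT FOUND: {name}")
```

Matching here is by exact gene name, so a name with a typo or different capitalisation ends up in the `missing` list and in the printed report.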

1.3 - How to use

Everything will happen within a single directory.
Your input CSV file should be saved in the same directory that you are running the program from.
The only input required from the user is the CSV filename and an email address.

Note - Ensure you include the file extension (.csv) when you input the name of your file. The input is case and space sensitive, so enter the name exactly.

2 - How to format your CSV file

The CSV file should be structured as follows:

  • Firstly, ensure that the CSV file is UTF-8 encoded.
  • row 1, column 1: Genus and species (and any substrain details if necessary), e.g. Escherichia coli K-12 strain BW25113
  • row 2, column 1: The accession number of the genome, e.g. CP009273.1
  • row 3, column 1: Should contain the header "gene" (not including the quotation marks).
  • row 3, column 2: Should contain the header "description" (not including the quotation marks).
  • row 4 to row n: The "gene" column should contain the names of the genes of interest, e.g. "flgA". The "description" column should contain a description of the corresponding gene, e.g. Assembly protein for flagellar basal-body periplasmic P ring.
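
As an illustration of this layout, the sketch below parses a small sample file with Python's standard `csv` module. The function name `read_input_csv` and the choice to raise `ValueError` on a malformed file are assumptions made for the example, not the program's actual behaviour.

```python
import csv
import io


def read_input_csv(handle):
    """Parse the layout above: organism, accession, header row, then genes."""
    # Drop rows that are entirely empty so stray blank lines are tolerated.
    rows = [row for row in csv.reader(handle) if any(c.strip() for c in row)]
    if len(rows) < 3:
        raise ValueError("expected organism, accession, and header rows")
    organism = rows[0][0].strip()
    accession = rows[1][0].strip()
    # Row 3 must hold the two column headers (compared case-insensitively).
    header = [cell.strip().lower() for cell in rows[2][:2]]
    if header != ["gene", "description"]:
        raise ValueError(f"unexpected header row: {rows[2]!r}")
    genes = {}
    for row in rows[3:]:
        name = row[0].strip()
        description = row[1].strip() if len(row) > 1 else ""
        if name:
            genes[name] = description
    return organism, accession, genes


SAMPLE = """\
Escherichia coli K-12 strain BW25113
CP009273.1
gene,description
flgA,Assembly protein for flagellar basal-body periplasmic P ring
"""

organism, accession, genes = read_input_csv(io.StringIO(SAMPLE))
print(organism)
print(accession)
print(genes)
```

For a file on disk, opening it with `open(name, newline="", encoding="utf-8-sig")` reads UTF-8 and also tolerates the byte order mark that some spreadsheet programs add.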

3 - Possible updates to come

  1. Proper error handling
  2. Save extracted gene ID
  3. Multi-input file handling (work with more than one organism/genome and their genes of interest)
  4. Ability to search by locus tag and/or gene name and/or gene id
  5. Second pass of input CSV after correcting potential mistakes indicated by "NOT FOUND" report at end of program
