gturco/find_cns

Author: Gina Turco (gturco), Brent Pedersen (brentp)
Email: [email protected]
License: MIT

A Python application that automates large-scale identification of conserved noncoding sequences (CNSs) between two usefully diverged plant species. The application first attempts to correct annotation errors between the two species using co-anno. It then condenses local duplicates and finds syntenic regions based on ploidy relationships using quota-alignment. BLAST is then applied to the syntenic regions between the two species to find CNSs: a blastn hit is kept as a CNS if its e-value is less than or equal to that of a 15/15 exact base pair match (Kaplinsky et al.). Nonsyntenic CNSs are removed, along with CNSs that hit known RNA or exons. Created in the Freeling Lab at UC Berkeley.

http://upload.wikimedia.org/wikipedia/commons/0/08/Pipeline_git.png

Turco, G., Schnable, J. C., Pedersen, B., & Freeling, M. Automated conserved noncoding sequence (CNS) discovery reveals differences in gene content and promoter evolution among grasses. Frontiers in Plant Science, 4, 170. Link to Plant CNS Paper

CNS Datasets Available Here for:

  • Thaliana_v10 Thaliana_v10 CNS
  • Rice Maize CNS
  • Rice Sorghum CNS
  • Sorghum Sorghum CNS
  • Rice Setaria CNS
  • Setaria Setaria CNS

Please cite the paper above when using these datasets.

  • In the iPlant Atmosphere image find_cns_pipeline (emi-8EF728EB), everything is preinstalled except scip

  • Download the most recent code here:

    git clone https://github.com/gturco/find_cns.git
    
  • Run the bootstrap code and add scip to cns_pipeline/bin/:

    python bootstrap.py
    

Required Dependencies

Obtaining Input Files

  • Download the Fasta and gff files from CoGe OrganismView for each organism
  • When downloading the gff, make sure to UNCHECK "Do not generate features for ncRNA genes (CDS genes only)"

Convert gff to Bed format:

python scripts/gff_to_bed.py --re "^Os\d\dg\d{5}" --gff rice_v6.gff  --fasta 9109.faa --out rice_v6
  • The --re regular expression is not required, but in this case it makes the script prefer readable names like Os01g101010 over names like m103430.
  • The --out value is the root name for the Fasta and Bed output files, which must share the same name (in this case rice_v6.fasta and rice_v6.bed)
  • Fasta file: running the 50x repeat mask on it is recommended
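To see what the --re pattern in the command above selects, here is a minimal sketch in plain Python. It is not the code inside gff_to_bed.py; the helper name pick_name and the "fall back to the first name" rule are assumptions made for illustration only.

```python
import re

# Pattern copied from the gff_to_bed.py command above.
NAME_RE = re.compile(r"^Os\d\dg\d{5}")

def pick_name(candidates, pattern=NAME_RE):
    """Return the first candidate matching the pattern, else the first candidate.

    Hypothetical helper: the real script may choose names differently.
    """
    for name in candidates:
        if pattern.match(name):
            return name
    return candidates[0]

# The readable locus name matches the pattern; the m-style name does not.
print(pick_name(["m103430", "Os01g101010"]))  # prints Os01g101010
```

Note that the pattern is anchored only at the start, so it matches any name that begins with Os, two digits, g, and at least five digits.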

Running Pipeline

  • Create a new directory (under data) named after the organisms being compared, ORGA_ORGB, e.g. mkdir data/rice_v6_sorghum_v1
  • Add the Fasta and Bed files to the directory
  • Edit the run.sh file (touch run.sh): change ORGA, ORGB, QUOTA, SDGID, QDGID #DGID found in CoGe
  • Start a screen session: screen
  • Activate the virtualenv: source ../cns_pipeline/bin/activate
  • Run the command: sh run.sh #this calls quota.sh (it will take a long time, as it does a full blast (lastz), then all of quota align, then the cns pipeline).
  • This creates PNGs for the dotplots. Check them to make sure the quota-blocks look correct.
  • When finished, deactivate the virtualenv: deactivate
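Because the run takes a long time, it can help to confirm the inputs are in place before starting it. The sketch below is not part of the pipeline; the function missing_inputs is hypothetical, and it assumes the comparison directory should hold one .fasta and one .bed file per organism, named after the --out root used with gff_to_bed.py.

```python
from pathlib import Path

def missing_inputs(data_dir, org_a, org_b):
    """List the expected Fasta/Bed files that are absent from data_dir.

    Assumes files are named <ORG>.fasta and <ORG>.bed (an assumption based
    on the --out naming described above).
    """
    expected = [f"{org}{ext}" for org in (org_a, org_b) for ext in (".fasta", ".bed")]
    return [name for name in expected if not (Path(data_dir) / name).is_file()]
```

For example, missing_inputs("data/rice_v6_sorghum_v1", "rice_v6", "sorghum_v1") returns an empty list when all four files are present.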

Output Files

  • CNSlist (contains start, stop, chr, sequences, and 5 prime/3 prime information for the gene)
  • Genelist (one for each ORG; contains the gene start, stop, local dup info, orthos, and number of CNSs)
  • New genes are named ORG_chr_start_stop in the Genelist
  • CNSs with hits to RNA or protein are also renamed in the Genelist
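When post-processing the Genelist, the ORG_chr_start_stop names of new genes can be split back into their parts. This is a sketch under assumptions: the example name is made up, and the parser assumes the chr field contains no underscores (ORG itself may, as in rice_v6).

```python
from typing import NamedTuple

class GeneName(NamedTuple):
    org: str
    chrom: str
    start: int
    stop: int

def parse_new_gene_name(name: str) -> GeneName:
    """Parse a name of the form ORG_chr_start_stop (hypothetical helper)."""
    # Split from the right so underscores inside ORG (e.g. "rice_v6") survive.
    org, chrom, start, stop = name.rsplit("_", 3)
    return GeneName(org, chrom, int(start), int(stop))

# Made-up name for illustration only.
print(parse_new_gene_name("rice_v6_1_1000_2000"))
```

Names with fewer than four underscore-separated fields raise a ValueError, which makes malformed entries easy to spot.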
