Adding an updated glossary for the FAQ
skchronicles authored May 17, 2024
1 parent 825c39f commit f1c116f
Showing 1 changed file with 27 additions and 29 deletions.
56 changes: 27 additions & 29 deletions docs/faq/glossary.md
@@ -2,59 +2,57 @@

### Important file name extensions

The pipeline generate a lot of output files! Many intermediate output files have a unique name or extension to denote a special meaning. In a pipeline output directory, you may see one or many files contain the same base prefix but a different set of extensions, e.g.: `sorted`, `Q5`, or `Q5DD`.
The pipeline generates a lot of output files! Many intermediate output files have a unique name or extension to denote a special meaning. In a pipeline output directory, you may see one or more files that share the same base prefix but have different extensions, e.g.: `sorted`, `Q5`, or `Q5DD`.

_What does each of these file extensions mean?_

- **`sorted`**: The data has been sorted, but has not been filtered in any way. Note: blacklisted reads are filtered before alignment.
- **`Q5`**: Reads with a mapping quality below 5 have been filtered.
- **`Q5DD`**: Low mapping filter and deduplication (removal of PCR duplicates). For paired-end data, this means no fragments with the same exact start and end position occur more than once. For single-end data, we use a negative binomial distribution cutoff filter created by MACS.
- **`RPGC`**: Reads per genomic content, a method for normalizing to library size.
- **`FRiP`**: Fraction of reads in peaks. A calculation of the proportion of aligned reads that fall within the peaks called by a particular tool for a given sample.

- **`sorted`**: Indicates that the data has been sorted, but no further filtering has been applied. Note: blacklisted reads are filtered before alignment.
- **`Q5`**: Denotes that reads with a mapping quality below 5 have been filtered out.
- **`Q5DD`**: Indicates both low mapping quality filtering and deduplication (removal of PCR duplicates). For paired-end data, this means that no fragments with the same exact start and end position occur more than once. For single-end data, a negative binomial distribution cutoff filter created by MACS is used.
- **`RPGC`**: Stands for "reads per genomic content," a normalization method based on library size.
- **`FRiP`**: Represents the "fraction of reads in peaks," calculated as the proportion of aligned reads falling within the peaks called by a specific tool for a given sample.
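
The FRiP entry above can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function names `merge_intervals` and `frip` are hypothetical, reads and peaks are assumed to be `(chrom, start, end)` tuples in BED-style half-open coordinates, and a read is assumed to count as "in a peak" if it overlaps a peak by at least one base (tools differ on this rule).

```python
from bisect import bisect_left


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def frip(reads, peaks):
    """Fraction of reads overlapping any peak by at least 1 bp (assumed rule)."""
    # Group peaks by chromosome and merge them so each read is counted once.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}

    in_peaks = 0
    for chrom, start, end in reads:
        intervals = merged.get(chrom)
        if not intervals:
            continue
        # Index of the last merged peak that starts before the read ends.
        i = bisect_left(starts[chrom], end) - 1
        if i >= 0 and intervals[i][1] > start:
            in_peaks += 1
    return in_peaks / len(reads) if reads else 0.0


# Toy data: two of the four reads overlap the (merged) chr1 peak.
reads = [("chr1", 100, 150), ("chr1", 500, 550), ("chr2", 10, 60), ("chr1", 190, 240)]
peaks = [("chr1", 120, 200), ("chr1", 180, 300)]
print(frip(reads, peaks))
```

Merging the peaks first avoids counting a read twice when two called peaks overlap each other.
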
### Annotation options:

The specific parameters chosen are listed in the associated json and pdf files.

_Here is a list of the different annotation options:_

- **`genes`**: Analyze all genes in the gtf with the most lax Uropa parameters
- **`prot`**: Analyze only protein-coding genes in the gtf with the most lax Uropa parameters
- **`protSEC`**: Only protein-coding genes. Using Uropa's multi-step annotation approach, annotate based upon: 1) gene start, 2) gene end, 3) gene center, and 4) anywhere
- **`protTSS`**: Most popular option, ideal for most projects! Only protein-coding genes, focus around TSS sites.
- **`genes`**: Analyze all genes in the GTF file using the most lenient parameters from Uropa.
- **`prot`**: Analyze only protein-coding genes in the GTF file using the most lenient parameters from Uropa.
- **`protSEC`**: Focuses on protein-coding genes, utilizing Uropa's multi-step annotation approach, and annotates sequentially based on gene start, end, center, and anywhere within the gene.
- **`protTSS`**: This is the most popular option and is ideal for most projects. It focuses exclusively on protein-coding genes and centers around transcription start sites (TSS).

### File format types

Details about many of these file format can be found on this [UCSC page](https://genome.ucsc.edu/FAQ/FAQformat.html).
Details about many of these file formats can be found on this [UCSC page](https://genome.ucsc.edu/FAQ/FAQformat.html).

_Here is a short description of important file types/formats created by the pipeline:_

- **`bw`**: Short for bigwig. This is a binary file containing the pile-up patterns of the data along the chromosomes. The data is typically normalized and when viewed can be averaged across different window sizes depending on the size of the region being mapped to screen.
- **`wig`**: Short for wiggle. This is a non-binary form of a bigwig. There are two flavors of this file type, fixed-step and variable-step, each wth different formatting requirements.
- **`bed`**: A minimum 3 column file of chromosome, start, and end. The files can be up to 12 columns. Since there is no header to these files, the exact order of columns is standardized across the field. A very multi-purpose file format.
- **`tagAlign`**: Another name for a simple bed file, but typically each row is a read. Rarely used for peak information.
- **`bedgraph`**: A 3 column bed file with a fourth column of score.
- **`narrowPeak`**: The first 6 columns are the same as the bed file, but the remaining 4 columns contain information about the quality of the peak and the location of the peak summit.
- **`broadPeak`**: The same as a narrowPeak file, but with the 10th column (referring to the summit), missing.
- **`bw`**: Short for bigWig. Binary file containing normalized pile-up patterns of data along chromosomes, viewable and adjustable across different window sizes.
- **`wig`**: Short for wiggle. Non-binary form of a bigWig, with fixed-step and variable-step variants, each with specific formatting requirements.
- **`bed`**: Minimum 3-column file (chromosome, start, end), extendable up to 12 columns, used for various purposes due to standardized column order. Lacks a header.
- **`tagAlign`**: Simple bed file format, typically with each row representing a read, seldom used for peak information.
- **`bedgraph`**: 3-column bed file with an additional score column.
- **`narrowPeak`**: Similar to bed file with additional columns for peak quality and summit location.
- **`broadPeak`**: Similar to narrowPeak but lacks the summit location column.
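
As a rough illustration of the `narrowPeak` and `broadPeak` layouts described above, the sketch below parses one tab-separated line into named fields. The names `Peak` and `parse_peak_line` are hypothetical and not part of the pipeline; the column order follows the format description on the UCSC page linked earlier, and treating a summit value of `-1` as "no summit" is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class Peak(NamedTuple):
    chrom: str
    start: int            # 0-based start, as in bed files
    end: int              # end coordinate (exclusive)
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float         # -log10 p-value, or -1 if not assigned
    q_value: float         # -log10 q-value, or -1 if not assigned
    summit_offset: Optional[int]  # offset from start; None for broadPeak


def parse_peak_line(line: str) -> Peak:
    """Parse one narrowPeak (10 columns) or broadPeak (9 columns) line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (9, 10):
        raise ValueError(f"expected 9 or 10 columns, got {len(fields)}")
    summit = int(fields[9]) if len(fields) == 10 else None
    if summit == -1:  # assumed convention for "no summit called"
        summit = None
    return Peak(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        score=int(fields[4]),
        strand=fields[5],
        signal_value=float(fields[6]),
        p_value=float(fields[7]),
        q_value=float(fields[8]),
        summit_offset=summit,
    )


narrow = parse_peak_line("chr1\t100\t400\tpeak_1\t250\t.\t12.5\t8.1\t6.3\t150")
broad = parse_peak_line("chr1\t100\t400\tpeak_1\t250\t.\t12.5\t8.1\t6.3")
# The absolute summit position is the peak start plus the summit offset.
print(narrow.start + narrow.summit_offset, broad.summit_offset)
```
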

### Peak callers

Peak callers are used to distinguish biological signal from noise within your dataset.

_Here is a list of peak callers the pipeline uses:_

- **`macsNarrow`**: The macs2 caller for narrow peaks. The most popular peak calling algorithm; this is typically used in most of the large databases. Can only call peaks between 150bp-10kb. Originally designed to handle peaks with only a single maxima/summit. FDR greatly improved with the addition of an "input" control, but generally still more accurate than most other peak callers out there. https://github.com/macs3-project/MACS/
- **`macsBroad`**: The macs2 caller for slightly broader peaks. Very similar algorithm to macsNarrow, but sometimes works better than macsNarrow when peaks have more than one maxima/summit. https://github.com/macs3-project/MACS/
- **`SICER`**: A broad peak caller. Can be really useful for some histone marks. Doesn't work well for extra broad domains like lamins, DNA damage markers, or some repressive marks. Allows for a small amount of dips/gaps between peaks. Window and gap parameters may need to be adjusted to improve calls. https://zanglab.github.io/SICER2/
- **`Genrich`**: Designed with ATAC-seq data in mind. Can work really well, but not all collaborators like it as it hasn't been published or reviewed. https://github.com/jsh58/Genrich
- **`macsNarrow`**: The macs2 caller optimized for narrow peaks, widely recognized as the most popular peak calling algorithm. Typically used in large databases, it identifies peaks within the range of 150bp to 10kb. Originally designed to handle peaks with a single maximum/summit, its false discovery rate (FDR) has been greatly improved with the addition of an "input" control. It is generally more accurate than most other peak callers, even without controls. https://github.com/macs3-project/MACS/
- **`macsBroad`**: The macs2 caller for slightly broader peaks, sharing a similar algorithm with macsNarrow. It is particularly useful when peaks exhibit more than one maximum/summit. https://github.com/macs3-project/MACS/
- **`SICER`**: SICER is a broad peak caller that can be highly effective for certain histone marks. However, it may not perform well for extra broad domains such as lamins or some repressive marks. It allows for a small amount of gaps between peaks, and users may need to adjust window and gap parameters for optimal results. https://zanglab.github.io/SICER2/
- **`Genrich`**: Designed with ATAC-seq data in mind, Genrich can yield excellent results. However, it may not be universally favored by all collaborators due to its lack of formal publication or review. https://github.com/jsh58/Genrich

### Other important tools we use

_Here is a list of other important tools the pipeline uses:_

- **`Deeptools`**: Visualizations and QC. Here is a [link](https://deeptools.readthedocs.io/en/develop/index.html) to Deeptools documentation.
- **`DiffBind version2`**: Differential peak calling. Analysis run with Deseq2 and EdgeR. Here is a [link](https://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf) to DiffBind's documentation.
- **`Uropa`**: Peak annotations. Here is a [link](https://uropa-manual.readthedocs.io/introduction.html) to Uropa's documentation.
- **`MEME suite`**: Motif analysis. We use MEME-ChIP for *de novo* motif calling and AME for known motif calling. Note: the Centrimo subportion of MEME-ChIP will give false results for broad peak calling tools. Here is a [link](https://meme-suite.org/meme/index.html) to MEME suite's documentation.


- **`Deeptools`**: This tool is employed for visualizations and quality control (QC) purposes. You can find the documentation for Deeptools at [link](https://deeptools.readthedocs.io/en/develop/index.html).
- **`DiffBind version2`**: Used for conducting differential peak calling analyses, this tool integrates with DESeq2 and edgeR for analysis. Here is a [link](https://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf) to DiffBind's documentation.
- **`Uropa`**: Uropa is utilized for peak annotations, providing comprehensive annotation features. Here is a [link](https://uropa-manual.readthedocs.io/introduction.html) to Uropa's documentation.
- **`MEME suite`**: Employed for motif analysis, the MEME suite includes MEME-ChIP for *de novo* motif discovery and AME for known motif analysis. Note that the Centrimo subcomponent of MEME-ChIP may produce inaccurate results for broad peak calling tools. Here is a [link](https://meme-suite.org/meme/index.html) to MEME suite's documentation.
