The basic idea of the cleaning process is to build a set of reference sequences and to align every template sequence from Melasto689 to this set of references, correcting the frame, reverse-complementing if needed, and removing stop codons and poorly aligned regions. The cleaned template sequences are then associated with the reference sequences they matched, into a final sequence set. Finally, the final sequence set goes through different checks and fine-tuned cleaning.
The process is divided into 3 main steps:
- Step 1: build the set of reference sequences
- Step 2: align and associate the Melasto689 template sequences to the set of references into a final sequence set
- Step 3: final checks and fine-tuned cleaning: extra steps to make sure the sequences are clean
Directory organization. For simplicity, all the following assumes analyses are run in a single directory $path_to_wd. This directory is divided into 6 sub-directories:
- prepare1KP (Step 1: 1. - Prepare the 1KP transcriptomes reference sequence set)
- prepare_mega353 (Step 1: 2. - Prepare the mega353 reference sequence set)
- full_reference_DB (Step 1: 3. - Merge the two previous reference sequence sets)
- original_template_sequences (Step 2: 1. - Prepare the original Melasto689 template sequence set)
- melasto689_on_full_ref (Step 2: 2. - Align and associate the Melasto689 template sequences to the set of references)
- CLEAN_PROBE_SET (Step 3 - Contains the final and cleaned probe set)
cd $path_to_wd
mkdir prepare1KP
mkdir prepare_mega353
mkdir full_reference_DB
mkdir original_template_sequences
mkdir melasto689_on_full_ref
mkdir CLEAN_PROBE_SET
Naming convention. Sequence names follow the convention used in many probe sets for targeted sequencing and associated programs (e.g. Angio353, HybPiper, Captus, SECAPR). The name of a sequence is composed of the template sequence ID (generally identifying the taxon from which the sequence initially comes, sometimes with additional info) and the locus name, separated by a hyphen, e.g. >template_sequence_ID-locusID
Scripts. Every step of the cleaning process is presented and explained in detail below. The commands are presented as if run in interactive mode. For some steps, the running time is long and the step was actually not run in interactive mode but through .sh scripts (scripts were run on a computing cluster that uses the SLURM manager, with the sbatch command). The .sh scripts are made available in this repository (see the script folder), and a statement at the beginning of each step specifies which script to refer to.
Files. Note that not all the intermediate files produced by the cleaning process are present in this repository. The missing files are some intermediate .FNA files or heavy (>100 MB) BLAST output tables. The most important files are still included.
Dependencies
- SeqKit 2.0.0
- NCBI BLAST 2.10.1
- csvtk 0.23.0
- bedtools 2.30.0
- R 4.1.3 (needs packages DECIPHER, Biostrings)
- Python 3.10
- HybPiper 2.0.1
- AliView 1.28
- Muscle v3.8.425 (as wrapped in AliView)
Glossary:
- Angio353 (sometimes referred to as full Angio353): the original Angiosperms 353 probe set
- locusCustomID: temporary locus identifier given to the loci during the cleaning process
- mega353: the Angiosperms 353 probe set extended with transcriptome sequences (see: https://github.com/chrisjackson-pellicle/NewTargets)
- Melasto689: the full Melastomataceae probe set. It is composed of
384 loci, each of them being composed of 1 to 4 template sequences,
for a total of 689 template sequences. It is divided into 4 subsets of
loci:
- angio353 (with small a): a set of 266 loci (439 template sequences) that come from the Angiosperms 353 probe set
- markerminer: a set of 104 loci (215 template sequences) that come from putative single-copy loci identified with MarkerMiner
- functional_extra: a set of 6 loci (12 template sequences) that come from functional genes
- RM2016: a set of 8 loci (23 template sequences) that come from Reginato & Michelangeli (2016)
- reference sequence (or RS): a sequence from the reference set; in the reference set, a same locus can have several reference sequences
- template sequence (or TS): a sequence from the Melasto689 set or final sequence set; in the sequence set, a same locus can have several template sequences
The set of references is composed of mega353 (an extended version of Angio353) and the transcriptomes of two Melastomataceae species from the 1KP project.
The set of reference sequences will contain different sequences. The sequences will be grouped under different “loci” based on their similarity. During the process of building the set of references, a temporary locus name (sometimes referred to as *locusCustomID*) will be assigned to the loci in the form LOC##### (where # represents a digit, e.g. LOC12345). This *locusCustomID* will be used as the locus name for most of the sequences during the whole cleaning process, but the loci names will be reverted to their “old” names (i.e. the names used in Melasto689 and Angio353) in Step 3: 3.
The sbatch script corresponding to this section is: prepare_1KP_DB.sh
We will use the transcriptomes of the two Melastomataceae samples present in the 1KP project dataset: SWGX and WWQZ (Tetrazygia bicolor and Medinilla magnifica, respectively).
Download SWGX and WWQZ data from the 1KP project (updated release), and decompress (unzip) the files into $path_to_wd/prepare1KP. We will use the XXXX-translated-protein.fa and XXXX-translated-nucleotides.fa fasta files.
Sequences in the SWGX and WWQZ data have a unique identifier (referred to as transcriptSequencesID), in the form >scaffold-XXXX-2#######-Genus_species, where XXXX is either SWGX or WWQZ, # represents a digit, and Genus_species is either Tetrazygia_bicolor (SWGX) or Medinilla_magnifica (WWQZ), e.g. >scaffold-WWQZ-2000004-Medinilla_magnifica.
The sequences need to be grouped according to their ortholog group. To do so, we’ll use the 2nd column of the XXXX-translated-reference-names.tsv files, which contains the gene identifiers according to OrthoFinder (see One Thousand Plant Transcriptomes Initiative (2019)), referred to as locusOrthologID. The identifier is in the form “gi|255582030|ref|XP_002531812.1|”.
- get the list of the unique orthogroup identifiers from the SWGX-translated-reference-names.tsv and WWQZ-translated-reference-names.tsv files into the file orthoIDs_list.txt
cd $path_to_wd/prepare1KP
cat SWGX-translated-reference-names.tsv WWQZ-translated-reference-names.tsv | awk '{print $2}' | sort | uniq > orthoIDs_list.txt # get the list of the unique gene identifiers according to OrthoFinder
- append a unique custom locus name (locusCustomID) in the form LOC##### to every orthogroup (locusOrthologID) and store them in the variable $loc_list. The list contains locus names in the form locusOrthologID__locusCustomID.
loc_list=$(cat orthoIDs_list.txt | awk '{ printf $1"__LOC" "%05d\n", ++a }') # append a unique custom locus name in the form LOC##### (locusCustomID) to every orthogroup (locusOrthologID) and store them in the array loc_list
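As a quick check, the first entries of $loc_list can be inspected to confirm they follow the expected locusOrthologID__locusCustomID format:
echo "$loc_list" | head -n 3 # each entry should look like <locusOrthologID>__LOC#####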
- create a folder called orthogroups and initialize the files:
  - ortho_loci_table.txt: will contain the correspondence between locusCustomID, locusOrthologID and transcriptSequencesID
  - SWGX-WWQZ-translated-nucleotides-grouped.FNA: will contain the transcriptome nucleotide sequences, grouped by custom locus ID
  - SWGX-WWQZ-translated-prot-grouped.FAA: will contain the transcriptome protein sequences, grouped by custom locus ID
mkdir orthogroups
echo locusCustomID locusOrthologID transcriptSequencesID > orthogroups/ortho_loci_table.txt # initialize file
:> orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA # initialize file
:> orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA # initialize file
- for every locus name in the list of loci $loc_list:
  - define $locID as the custom locus ID (locusCustomID)
  - define $orthoID as the orthogroup ID (locusOrthologID)
  - retrieve the transcriptSequencesID(s) corresponding to the locusOrthologID in the SWGX and WWQZ databases
  - if transcriptSequencesID(s) were retrieved: extract the sequence(s) from the databases, append the custom locus ID (locusCustomID) to their names, and append the renamed sequences to SWGX-WWQZ-translated-nucleotides-grouped.FNA and SWGX-WWQZ-translated-prot-grouped.FAA
  - populate ortho_loci_table.txt with the corresponding locusCustomID, locusOrthologID and transcriptSequencesID
for name in $loc_list;
do
locID=$(echo $name | sed 's/.*__//g')
orthoID=$(echo $name | sed 's/__.*//g')
# retrieve the sequence ID(s) corresponding to the locus in SWGX and WWQZ databases
grep -e $orthoID SWGX-translated-reference-names.tsv | awk '{print $1}' > SWGX_seqs
grep -e $orthoID WWQZ-translated-reference-names.tsv | awk '{print $1}' > WWQZ_seqs
if [ -s SWGX_seqs ]; then
seqkit grep -w0 -n -f SWGX_seqs SWGX-translated-nucleotides.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA
seqkit grep -w0 -n -f SWGX_seqs SWGX-translated-protein.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA
fi
if [ -s WWQZ_seqs ]; then
seqkit grep -w0 -n -f WWQZ_seqs WWQZ-translated-nucleotides.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA
seqkit grep -w0 -n -f WWQZ_seqs WWQZ-translated-protein.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA
fi
echo $locID $orthoID $(cat SWGX_seqs) $(cat WWQZ_seqs) >> orthogroups/ortho_loci_table.txt
rm SWGX_seqs WWQZ_seqs
done
- remove protein sequences shorter than 30 amino acids (AA) and nucleotide sequences shorter than 90 nucleotides (NT)
cd orthogroups
seqkit seq -w0 -m 90 SWGX-WWQZ-translated-nucleotides-grouped.FNA > SWGX-WWQZ-translated-nucleotides-grouped.FNA.temp
seqkit seq -w0 -m 30 SWGX-WWQZ-translated-prot-grouped.FAA > SWGX-WWQZ-translated-prot-grouped.FAA.temp
rm -r SWGX-WWQZ-translated-prot-grouped.FAA SWGX-WWQZ-translated-nucleotides-grouped.FNA
rename -v '.FAA.temp' '.FAA' ./*
rename -v '.FNA.temp' '.FNA' ./*
cd $path_to_wd/prepare1KP
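Optionally, seqkit stats can be used to verify that the length filtering worked as expected:
seqkit stats orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA # min_len should now be >= 30 (AA) and >= 90 (NT), respectively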
Sequences with high similarity need to be grouped into a same “locus”; otherwise, the template sequences from Melasto689 would align to several loci, which is not desirable since we want to assign each template sequence to a single locus.
To group sequences with high similarity, we will run a “self-Blast”
using blastp-fast
(protein to protein alignment).
Make a custom BLAST database with the 2 transcriptomes (SWGX and WWQZ) concatenated and grouped by orthogroup. Build a database for both amino-acids and nucleotides sequences.
cd $path_to_wd/prepare1KP
mkdir BLAST_custom_DB
makeblastdb -in orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA -out BLAST_custom_DB/SWGX-WWQZ_nucleotides_grouped_BLAST_DB -dbtype nucl -parse_seqids # actually not necessary
makeblastdb -in orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA -out BLAST_custom_DB/SWGX-WWQZ_protein_grouped_BLAST_DB -dbtype prot -parse_seqids
cd $path_to_wd/prepare1KP
blastp -task blastp-fast -db BLAST_custom_DB/SWGX-WWQZ_protein_grouped_BLAST_DB -query orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > selfblast_SWGX-WWQZ_protein_grouped_blastp_out_SB1.csv
The raw result of a self-BLAST needs to be filtered since every sequence will inevitably have a 100% match with itself. So, we filter out all the matches between two sequences belonging to the same locus. Also, we only keep the matches that have either 1) a percentage of identity greater than 60%, an alignment length greater than 50 AA, a bitscore greater than 100 and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence), or 2) a percentage of identity greater than 80%, an alignment length greater than 50 AA and a bitscore greater than 50.
cat selfblast_SWGX-WWQZ_protein_grouped_blastp_out_SB1.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > different_query_and_subject_loci_SB1.csv
cat different_query_and_subject_loci_SB1.csv | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 || $3 > 80 && $4 > 50 && $12 > 50' > filtered_different_query_and_subject_loci_SB1.csv
Some of the sequences with high similarity were assigned to different orthogroups. We will thus group these sequences, which means grouping some supposedly orthologous loci.
Retrieve the pairs of loci that need to be grouped.
cat filtered_different_query_and_subject_loci_SB1.csv | awk -F ',' '{
query = $1
subject = $2
sub(".+-","", query)
sub(".+-","", subject)
print query"," subject
}' > pairs_SB1.csv
A locus can have high sequence similarity with more than one other locus, so we need to retrieve all the loci associations and print them in a similarity table (selfblast_table.txt).
For every locus in the list of loci from the pairs of loci:
- retrieve the other locus (loci) it matches to, and arrange them into a list
- search whether any of the loci in the list is already in the table selfblast_table.txt
- if none of the loci is in the similarity table, then append the list at the end of the table
- if any of the loci is found in the similarity table, then extract the matching list of loci from the similarity table, append the list to the extracted list, remove duplicate loci and replace the original list with the newly generated list
query_list=$(cat pairs_SB1.csv | awk -F ',' '{print $1}' | sort | uniq)
:> selfblast_table.txt
for query in $query_list
do
echo $query
subject=$(grep -e $query, pairs_SB1.csv | awk -F ',' 'ORS = "\n" {print $2}' | sort -n | uniq | tr '\n' ',' | sed 's/,$/\n/')
echo $subject
to_search=$(echo $query,$subject | sed 's/,/|/g')
#egrep -qs -e $to_search selfblast_table.txt
if ! egrep -qs -e $to_search selfblast_table.txt; then
echo Query or Subject not in the table
echo $query,$subject >> selfblast_table.txt
else
echo Query or Subject already in the table
replacement=$(echo $(egrep -e $to_search selfblast_table.txt | sed -z 's/\n/,/g' | sed 's/.$/\n/'),$query,$subject | sed -e $'s/,/\\\n/g' | sort -n | uniq | tr '\n' ',' | sed 's/.$/\n/')
line_numbers=$(egrep -n -e $to_search selfblast_table.txt | sed 's/:.*//g')
#replacement=$(echo $query','$subject | sed -e $'s/,/\\\n/g' | sort -n | uniq | tr '\n' ',' | sed 's/.$/\n/')
for N in $line_numbers;
do
sed -i $N's/.*/'$replacement'/g' selfblast_table.txt
done
cat selfblast_table.txt | sort | uniq > selfblast_table.txt.temp
rm selfblast_table.txt
mv selfblast_table.txt.temp selfblast_table.txt
fi
done
cat selfblast_table.txt > selfblast_table_fromSB1.txt #intermediate backup
#check for no duplicates:
echo Total number of IDs: $(cat selfblast_table.txt | sed 's/,/\n/g' | wc -l)
echo "Number of IDs (dups removed):" $(cat selfblast_table.txt | sed 's/,/\n/g' | sort | uniq | wc -l)
Each line of the table selfblast_table.txt contains a list of comma-separated matching loci in the form: locusCustomID,correspondingLocus1,correspondingLocus2,... The name retained as locusCustomID is the one having the lowest suffix number among the list of corresponding loci on a same line.
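For illustration, a line of the table can be inspected (the locus IDs in the comment below are hypothetical):
head -n 1 selfblast_table.txt # e.g. LOC00123,LOC00456,LOC01789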
Now that the list of corresponding loci based on similarity has been established, we need to associate them in the .FNA and .FAA files.
Transform the table into a list of pairs of locusCustomID,correspondingLocus (selfblast_table_paired.txt).
:> selfblast_table_paired.txt
for line in $(cat selfblast_table.txt)
do
# echo $line
locus=$(echo $line | cut -d ',' -f 1)
echo $locus
others=$(echo $line | sed 's/'$locus',//g' | sed -e $'s/,/\\\n/g')
for other in $others
do
echo $locus,$other >> selfblast_table_paired.txt
done
done
cat selfblast_table_paired.txt > selfblast_table_paired_fromSB1.txt #intermediate backup
Transform the comma-separated list into a tab-separated list (replacement.txt) that can be read by the seqkit replace command.
cat selfblast_table_paired.txt | awk -F ',' '{print $2"\t"$1}' > replacement.txt
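Each line of replacement.txt is a tab-separated key-value pair (the correspondingLocus, then the locusCustomID it will be replaced with); a quick look confirms the format (hypothetical IDs shown in the comment):
head -n 2 replacement.txt # e.g. LOC00456<TAB>LOC00123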
Use seqkit replace to replace every correspondingLocus with its matching locusCustomID.
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA > orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA > orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA
Finally, sort the fasta files by locusCustomID.
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA.temp
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA.temp
rm orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA
rename -v '.FAA.temp' '.FAA' ./orthogroups/*
rename -v '.FNA.temp' '.FNA' ./orthogroups/*
cd $path_to_wd/prepare1KP
makeblastdb -in orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA -out BLAST_custom_DB/SWGX-WWQZ_nucleotides_grouped_SB1_BLAST_DB -dbtype nucl -parse_seqids
makeblastdb -in orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA -out BLAST_custom_DB/SWGX-WWQZ_protein_grouped_SB1_BLAST_DB -dbtype prot -parse_seqids
blastp -task blastp-fast -db BLAST_custom_DB/SWGX-WWQZ_protein_grouped_SB1_BLAST_DB -query orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" > check_selfblast_SWGX-WWQZ_protein_grouped_blastp_out.csv
cat check_selfblast_SWGX-WWQZ_protein_grouped_blastp_out.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > check_different_query_and_subject_loci.csv
cat check_different_query_and_subject_loci.csv | csvtk filter -H -f'3>60' | csvtk filter -H -f'4>50' | csvtk filter -H -f'14>50' | csvtk filter2 -H -f '$12 > 100' > check_filtered_different_query_and_subject_loci.csv
The file check_filtered_different_query_and_subject_loci.csv
resulting
from the check self-BLAST is empty, meaning no sequence from a locus has
similarity with a sequence of any other locus.
The mega353 is an expansion of the Angio353 probe set, incorporating sequences from more than 500 transcriptomes from the 1KP project. Along with the dataset, a script is provided in order to filter the sequences according to the group of interest. Here we will filter the mega353 dataset to keep the sequences from the samples in the Myrtales order.
Download the NewTargets directory from https://github.com/chrisjackson-pellicle/NewTargets and unzip the files. Store the NewTargets folder into $path_to_wd/prepare_mega353.
A filtering text file (select_file_Myrtales.txt) is created in order to filter on the Myrtales order.
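The exact content of select_file_Myrtales.txt is not reproduced here; as a minimal sketch, assuming the select file simply lists the group(s) to keep (one entry per line, as accepted by filter_mega353.py), it could be created as follows:
cd $path_to_wd/prepare_mega353/NewTargets
echo "Myrtales" > select_file_Myrtales.txt # assumption: one taxonomic group or taxon name per line; check the NewTargets documentation for the exact expected format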
The filter_mega353.py script is then run.
cd $path_to_wd/prepare_mega353/NewTargets
python filter_mega353.py mega353.fasta select_file_Myrtales.txt -filtered_target_file ../mega353_Myrtales.fa
The mega353 reference sequence set used is the mega353_Myrtales.fa file.
The sbatch script corresponding to this section is: prepare_full_reference_DB.sh
We will now associate the transcriptome and mega353 reference sets into a single set of references. The association will be done based on sequence similarity.
Define the paths to previously prepared databases.
cd $path_to_wd/full_reference_DB
path_to_onekp=$path_to_wd/prepare1KP;
Angio353_FNA=$path_to_wd/prepare_mega353/mega353_Myrtales.fa;
Concatenate the transcriptome reference databases in the current
directory into the FULL_REFERENCE_nucl.FNA
and
FULL_REFERENCE_prot.FAA
files. Sequences from the mega353 database
will be appended to these files.
cat $path_to_wd/prepare1KP/orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA > FULL_REFERENCE_nucl.FNA
cat $path_to_wd/prepare1KP/orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA > FULL_REFERENCE_prot.FAA
nuc_FNA="FULL_REFERENCE_nucl.FNA"
prot_FAA="FULL_REFERENCE_prot.FAA"
Copy the mega353 database in the current directory and rename the sequences to remove hyphens that do not separate sequence ID from locus name.
dos2unix $Angio353_FNA
cat $Angio353_FNA | sed 's/ .*//g' | sed 's/-[[:digit:]]\{4\}_exonerate_grafted/_EG/g' | sed 's/grafted_with/GW/g' | sed 's/-v/_v/g' | sed 's/\./_/g' | sed 's/-ASGPB/_ASGPB/g' | sed 's/-iTAG/_iTAG/g' | sed 's/-Genoscope/_Genoscope/g' > Angio353.FNA
Angio353_FNA=$path_to_wd/full_reference_DB/Angio353.FNA # redefine the variable to point to the renamed copy in the current directory
Define the path to the transcriptome protein BLAST database from Step 1: 1.4.1.
prot_DB=$path_to_wd/prepare1KP/BLAST_custom_DB/SWGX-WWQZ_protein_grouped_SB1_BLAST_DB
Extract the list of mega353 loci in a text file. Initialize the matching table between mega353 loci and custom locus IDs (locusCustomID).
cd $path_to_wd/full_reference_DB
seqkit seq -n $Angio353_FNA | sed 's/.*-//g' | sort | uniq > Angio353_list.txt
:> Angio353_to_custom_loci_table.txt
For each locus in the list of mega353 loci:
- create a sub-directory and set it as the current directory
- extract the locus sequences from the mega353 database into a .FNA file and translate them (.FAA)
- search for similar protein sequences in the transcriptome protein database using blastp and store the results in a .csv file
- if the .csv file exists and is not empty (i.e. BLAST found matching sequences), then: filter the .csv to keep only query sequences with a percentage identity greater than 60%, an alignment length greater than 50 amino acids, a bitscore greater than 100, a frame greater than 0 (i.e. sequences not in reverse complement) and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence)
- if the filtered .csv file exists and is not empty (i.e. matching sequences remain after filtering), then:
  - extract the locusCustomID(s) from the matching sequence(s)
  - if more than 1 locusCustomID was found, then:
    - print a warning
    - append the locus sequences to the full reference database as they are (no renaming)
    - print the mega353 locus name and the mention “multiplematches” to the matching table
  - else (i.e. only 1 locusCustomID was found), then:
    - rename the locus sequences so that the mega353 locus name (4-digit number) is enclosed in 2 __ (double underscores) and moved into the sequence ID, and so that the locusCustomID is appended at the end of the sequence name
    - append the renamed locus sequences to the full reference database
    - print the mega353 locus name and its locusCustomID to the matching table
- else (i.e. no matching sequences remain after filtering), then:
  - append the locus sequences to the full reference database as they are (no renaming)
  - print the mega353 locus name and the mention “nomatch” to the matching table
- set the upper-level directory as the current directory
for locus in $(cat Angio353_list.txt)
do
echo STARTING FOR $locus
mkdir $locus
cd $locus
seqkit grep -w0 -nrp $locus $Angio353_FNA > $locus.FNA
query=$locus".FNA"
cat $query | seqkit translate -w0 > $locus".FAA"
blastp -task blastp-fast -matrix BLOSUM62 -db $prot_DB -query $locus".FAA" -evalue 1e-06 -outfmt "10 std qframe qcovs qseq sframe sseq" -num_threads $SLURM_CPUS_PER_TASK > $locus"_blastp_out_unfiltered.csv"
#blastx -task blastx-fast -matrix BLOSUM62 -db $prot_DB -query $query -culling_limit 1 -evalue 1e-06 -outfmt "10 std qframe qcovs qseq sframe sseq" -num_threads $SLURM_CPUS_PER_TASK > $locus"_blastx_out_unfiltered.csv"
if [ -s $locus"_blastp_out_unfiltered.csv" ]; then
#cat $locus"_blastx_out_unfiltered.csv" | csvtk filter -H -f"3>60" > $locus"_blastx_out.csv"
cat $locus"_blastp_out_unfiltered.csv" | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 && $13 > 0' > $locus"_blastp_out.csv"
fi
if [ -s $locus"_blastp_out.csv" ]; then
cat $locus"_blastp_out.csv" | awk -F ',' '{print $2}' | sed 's/.*-LOC/LOC/g' | sort | uniq > DB_locus
nb_DB_loci=$(cat DB_locus | wc -l)
if [ $nb_DB_loci -gt 1 ]; then
echo WARNING: several DB loci found for locus $locus. No transcriptome appended. You might want to check manually for $locus.
cat $locus".FNA" >> ../$nuc_FNA
cat $locus".FAA" >> ../$prot_FAA
echo $locus multiplematches >> ../Angio353_to_custom_loci_table.txt
else
DB_locus=$(cat DB_locus)
cat $locus".FNA" | sed '/^>/s/-/__/g' | sed '/^>/s/$/__-'$DB_locus'/g' >> ../$nuc_FNA
cat $locus".FAA" | sed '/^>/s/-/__/g' | sed '/^>/s/$/__-'$DB_locus'/g' >> ../$prot_FAA
echo $locus $(cat DB_locus) >> ../Angio353_to_custom_loci_table.txt
fi
else
echo No match found for $locus.
cat $locus".FNA" >> ../$nuc_FNA
cat $locus".FAA" >> ../$prot_FAA
echo $locus nomatch >> ../Angio353_to_custom_loci_table.txt
fi
cd ..
done
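As an optional check, the matching table can be tallied to see how the mega353 loci were dispatched:
cd $path_to_wd/full_reference_DB
echo "Loci matched to a single locusCustomID:" $(grep -c ' LOC' Angio353_to_custom_loci_table.txt)
echo "Loci with multiple matches:" $(grep -c 'multiplematches' Angio353_to_custom_loci_table.txt)
echo "Loci with no match:" $(grep -c 'nomatch' Angio353_to_custom_loci_table.txt)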
This step checks sequence similarity for sequences from different loci.
If highly similar sequences from different loci are found, the loci have
to be grouped under the same name. To check for sequence similarity, we
run a “self-Blast” using blastp-fast
(protein to protein alignment).
mkdir FULL_REFERENCE_BLAST_DB
makeblastdb -in FULL_REFERENCE_nucl.FNA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_nucl_BLAST_DB -dbtype nucl -parse_seqids
makeblastdb -in FULL_REFERENCE_prot.FAA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB -dbtype prot -parse_seqids
blastp -task blastp-fast -db FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB -query FULL_REFERENCE_prot.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > full_ref_blastp_out.csv
The raw result of a self-BLAST needs to be filtered since every sequence will inevitably have a 100% match with itself. So, we filter out all the matches between two sequences belonging to the same locus. Also, we only keep the query sequences with percentage identity greater than 60% to the matching subject, alignment length greater than 50 amino acids, bitscore greater than 100, frame greater than 0 (i.e. sequences not in reverse complement) and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence).
cat full_ref_blastp_out.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > different_query_and_subject_loci.csv
cat different_query_and_subject_loci.csv | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 && $13 > 0' > filtered_different_query_and_subject_loci.csv
Highly similar sequences from different loci were found, these loci have to be grouped under the same name.
Retrieve the pairs of loci that need to be grouped. The file contains a list of pairs of query-subject.
cat filtered_different_query_and_subject_loci.csv | awk -F ',' '{
query = $1
subject = $2
sub(".+-","", query)
sub(".+-","", subject)
print query"," subject
}' | sort | uniq > pairs.csv
A same locus may have to be associated with more than one other locus. We thus need to retrieve all the loci associations and print them in a similarity table (selfblast_table.txt).
For every query in the list of pairs of loci:
- retrieve the other locus (loci) it matches to (the subject), and arrange them into a list
- search whether any of the loci in the list is already in the table selfblast_table.txt
- if none of the loci is in the similarity table, then append the list at the end of the table
- if any of the loci is found in the similarity table, then extract the matching list of loci from the similarity table, append the list to the extracted list, remove duplicate loci and replace the original list with the newly generated list
query_list=$(cat pairs.csv | awk -F ',' '{print $1}' | sort | uniq)
:> selfblast_table.txt
#query="5899"
for query in $query_list
do
echo $query
subject=$(grep -e $query, pairs.csv | awk -F ',' 'ORS = "\n" {print $2}' | sort -n | uniq | tr '\n' ',' | sed 's/,$/\n/')
echo $subject
to_search=$(echo $query,$subject | sed 's/,/|/g')
if ! egrep -qs -e $to_search selfblast_table.txt; then
echo Query or Subject not in the table
echo $query,$subject >> selfblast_table.txt
else
echo Query or Subject already in the table
replacement=$(echo $(egrep -e $to_search selfblast_table.txt | sed -z 's/\n/,/g' | sed 's/.$/\n/'),$query,$subject | sed -e $'s/,/\\\n/g' | sort -n | uniq | tr '\n' ',' | sed 's/.$/\n/')
line_numbers=$(egrep -n -e $to_search selfblast_table.txt | sed 's/:.*//g')
for N in $line_numbers;
do
sed -i $N's/.*/'$replacement'/g' selfblast_table.txt
done
cat selfblast_table.txt | sort | uniq > selfblast_table.txt.temp
rm selfblast_table.txt
mv selfblast_table.txt.temp selfblast_table.txt
fi
done
Now that the list of corresponding loci based on similarity has been established (selfblast_table.txt), we need to associate them in the .FNA and .FAA files.
Transform the table into a list of pairs of locusCustomID,correspondingLocus (selfblast_table_paired.txt).
:> selfblast_table_paired.txt
for line in $(cat selfblast_table.txt)
do
# echo $line
locus=$(echo $line | cut -d ',' -f 1)
echo $locus
others=$(echo $line | sed 's/'$locus',//g' | sed -e $'s/,/\\\n/g')
for other in $others
do
echo $locus,$other >> selfblast_table_paired.txt
done
done
Transform the comma-separated list into a tab-separated list (replacement.txt) that can be read by the seqkit replace command.
:> replacement.txt
cat selfblast_table_paired.txt | grep ',[[:digit:]]\{4\}$' | awk -F ',' '{print $2"\t__"$2"__-"$1}' >> replacement.txt
cat selfblast_table_paired.txt | grep -v ',[[:digit:]]\{4\}$' | awk -F ',' '{print $2"\t"$1}' >> replacement.txt
Use seqkit replace to replace every correspondingLocus with its matching locusCustomID.
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key FULL_REFERENCE_prot.FAA | seqkit replace -w0 -p '-_' -r '_' > FULL_REFERENCE_prot_SB1.FAA
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key FULL_REFERENCE_nucl.FNA | seqkit replace -w0 -p '-_' -r '_' > FULL_REFERENCE_nucl_SB1.FNA
Finally, sort the fasta files by locusCustomID.
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' FULL_REFERENCE_prot_SB1.FAA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > FULL_REFERENCE_prot_SB1.FAA.temp
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' FULL_REFERENCE_nucl_SB1.FNA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > FULL_REFERENCE_nucl_SB1.FNA.temp
rm FULL_REFERENCE_prot_SB1.FAA FULL_REFERENCE_nucl_SB1.FNA
rename -v '.FAA.temp' '.FAA' ./*
rename -v '.FNA.temp' '.FNA' ./*
cd $path_to_wd/full_reference_DB
makeblastdb -in FULL_REFERENCE_nucl_SB1.FNA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_nucl_BLAST_DB_FINAL -dbtype nucl -parse_seqids
makeblastdb -in FULL_REFERENCE_prot_SB1.FAA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB_FINAL -dbtype prot -parse_seqids
blastp -task blastp-fast -db FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB_FINAL -query FULL_REFERENCE_prot_SB1.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > check_selfblast_full_ref_blastp_out.csv
cat check_selfblast_full_ref_blastp_out.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > check_different_query_and_subject_loci.csv
cat check_different_query_and_subject_loci.csv | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 && $13 > 0' > check_filtered_different_query_and_subject_loci.csv
The file check_filtered_different_query_and_subject_loci.csv
resulting
from the check self-BLAST is empty, meaning no sequence from a locus has
similarity with a sequence of any other locus.
The full set of reference sequences is ready in $path_to_wd/full_reference_DB: FULL_REFERENCE_nucl_SB1.FNA and FULL_REFERENCE_prot_SB1.FAA
Download or copy the original Melastomataceae 689 target sequence
set,
and decompress the Template_sequences.zip
folder.
- De-align the sequences
- Remove all the ? characters
cd $path_to_wd/original_template_sequences
cd Template_sequences
sed -i '/^>/!s/-//g' *.fasta
sed -i '/^>/!s/\?//g' *.fasta
KT377070.1 and KT377086.1 are each split into 2 different files (KT377070.1_alignment.fasta and KT377070_alignment.fasta, and KT377086.1_alignment.fasta and KT377086-Affzeli.fasta, respectively), supposedly representing 2 different “versions” of the genes with high divergence (Jantzen, pers. comm.). Actually, the alternative “version” is not highly divergent but is the reverse complement of what it should be. Here we simply merge the fasta files in order to have only 1 file per gene. We arbitrarily keep the “.1” suffix in the gene names.
KT377070.1
To avoid duplicates in fasta headers, we rename the sequence
“>KT377070-Tibouchina” (from KT377070_alignment.fasta
) to
“>KT377070.1-Tibouchina2”.
cat KT377070.1_alignment.fasta KT377070_alignment.fasta | sed 's/KT377070-Tibouchina/KT377070-Tibouchina2/g' | sed 's/KT377070-/KT377070.1-/g' > KT377070.1_custom.fasta
rm KT377070.1_alignment.fasta KT377070_alignment.fasta
KT377086.1
cat KT377086.1_alignment.fasta KT377086-Affzeli.fasta | sed 's/KT377086-/KT377086.1-/g' > KT377086.1_custom.fasta
rm KT377086.1_alignment.fasta KT377086-Affzeli.fasta
Modify the fasta headers of the template sequences for the KT######.1 loci so that all the headers have the “.1” suffix.
cat KT377102.1_alignment.fasta | sed 's/KT377102-/KT377102.1-/g' > KT377102.1_custom.fasta
cat KT377110.1_alignment.fasta | sed 's/KT377110-/KT377110.1-/g' > KT377110.1_custom.fasta
rm KT377102.1_alignment.fasta KT377110.1_alignment.fasta
The sequence names need to be rearranged to stick to the naming convention (template_sequence_ID-locusID).
ls -1 *.fasta | \
while read file
do
cat $file| sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > $file.FNA
done
rename '.fasta.FNA' '.FNA' *
rm *.fasta
cat *.FNA > ../Melastomataceae_689_original.FNA
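As a check, the concatenated file should contain the 689 template sequences of Melasto689 (see the glossary):
seqkit stats ../Melastomataceae_689_original.FNA # expect num_seqs = 689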
2. Associate each template sequence (or portion thereof) from Melasto689 to a locus from the full reference set
The sbatch script corresponding to this section is: melasto689_on_full_reference.sh
Define the paths to previously prepared databases.
cd $path_to_wd/melasto689_on_full_ref
path_to_Melasto689=$path_to_wd/original_template_sequences;
path_to_full_ref=$path_to_wd/full_reference_DB;
prot_DB=$path_to_full_ref/FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB_FINAL
path_to_scripts=$path_to_scripts # define the path to the scripts
Copy the necessary files to the current directory.
scp -r $path_to_Melasto689 ./
scp -r $path_to_full_ref/FULL_REFERENCE_nucl.FNA ./
The Melasto689 probe set includes template sequences from the Angio353 dataset (what we call angio353, with a small a, in the glossary). They need to be removed beforehand. They are easily identifiable by their names, in the form XXXX-#### where X is a letter and # a digit.
cd $path_to_wd/melasto689_on_full_ref
seqkit grep -w 0 -n -r -p "[[:upper:]]{4}-" Melastomataceae_689_original.FNA > Angio353_from_Melasto689.FNA;
seqkit grep -w 0 -n -r -v -p "[[:upper:]]{4}-" Melastomataceae_689_original.FNA > Melasto689_trimed.FNA;
Melasto689="Melasto689_trimed.FNA"
seqkit seq -n Melasto689_trimed.FNA > template_sequences_list.txt
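Since the angio353 subset accounts for 439 of the 689 template sequences (see the glossary), 250 TS are expected to remain after this removal:
seqkit stats Angio353_from_Melasto689.FNA Melasto689_trimed.FNA # expect num_seqs = 439 and 250, respectively
wc -l template_sequences_list.txt # expect 250 names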
Each template sequence (TS) from Melasto689 (Melasto689_trimed.FNA) will be aligned to the set of references designed in Step 1. The aligner used here is BLAST (blastx-fast). The regions of the TS that match a reference (the hits) are then extracted, the name of the reference locus is appended to them, and they are placed into the extracted_hits folder. The non-matching regions are separated into the NMR_hits folder, and the TS with no match (i.e. no portion of the TS matches a reference) are placed into the nomatch_TS folder.
Prepare the sub-folders.
cd $path_to_wd/melasto689_on_full_ref
mkdir extracted_hits
mkdir nomatch_TS
mkdir NMR_hits
For every TS in the list of template sequences:
- create a sub-directory and set it as the current directory
- extract the TS from the whole Melasto689_trimed.FNA file
- align the TS to the reference with blastx-fast and store the result into a table
- if the result table exists and is not empty, then:
  - filter the results to keep only the matches with a bitscore greater than 50 and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence), or matches with a percentage identity greater than 60%, a query coverage greater than 10%, and a bitscore greater than 50
  - if the filtered table exists and is not empty, then:
    - for every line in the table (i.e. for every hit):
      - retrieve the reference locus to which the hit aligns (refLocus)
      - retrieve the frame in which the hit aligns to the reference (frame)
      - print the matching subject (i.e. the name of the reference sequence to which the hit matched) into the file matching_subject-$refLocus.txt; this file will be used later during step 2.5
      - if the frame value is greater than 0 (i.e. the hit is not in reverse complement), then: write the hit coordinates to a .bed file; retrieve the sequence of the hit using bedtools getfasta and rename it so that the old locus name is separated with ’__’ (double underscore) and the new locus name (refLocus) is appended at the end of the name; save the hit in .FNA (<hitID>_hit.FNA)
      - if the frame value is lower than 0 (i.e. the hit is in reverse complement), then: write the hit coordinates to a .bed file; retrieve the sequence of the hit using bedtools getfasta, reverse complement it and rename it as above; save the hit in .FNA (<hitID>_hit.FNA)
    - copy the hits (<hitID>_hit.FNA files) into the extracted_hits folder
    - retrieve the query coverage (i.e. the percentage of the query covered by the alignment)
    - if the query coverage is lower than 90 (i.e. at least 10% of the TS did not align to any reference), then:
      - merge all the .bed files to get the region(s) of the TS that did match
      - set a .bed file for the whole TS
      - extract the coordinates of the region(s) of the TS that did not match (non-matching regions, NMR) with bedtools complement
      - retrieve the sequence of the NMR using bedtools getfasta, filter out sequences shorter than 30 nucleotides, append ‘NMR’ as their locus name, and store them into <TS_ID>NMR.FNA
      - copy the NMR (<TS_ID>NMR.FNA files) into the NMR_hits folder
  - else (i.e. the TS has no match): rename the TS to ‘nomatch’ and store it into the nomatch_TS folder
- set the upper-level directory as the current directory
for TS in $(cat template_sequences_list.txt);
do
echo $TS
mkdir $TS
cd $TS
seqkit grep -w0 -nrp $TS ../Melasto689_trimed.FNA > $TS.FNA
blastx -task blastx-fast -matrix BLOSUM62 -db $prot_DB -query $TS.FNA -culling_limit 1 -evalue 1e-06 -ungapped -comp_based_stats F -outfmt "10 std qframe qcovs sframe qseq" > $TS"_blastx_out_unfiltered.csv"
if [ -s $TS"_blastx_out_unfiltered.csv" ]; then
cat $TS"_blastx_out_unfiltered.csv" | csvtk filter2 -H -f '$12 > 50 && $14 > 50 || $3 > 60 && $14 > 10 && $12 > 50' > $TS"_blastx_out.csv"
fi
if [ -s $TS"_blastx_out.csv" ]; then
number_of_seqs_BLASTED=$(cat $TS"_blastx_out.csv" | wc -l)
else
number_of_seqs_BLASTED=0
fi
if [ $number_of_seqs_BLASTED -gt 0 ]; then
for line in $(cat $TS"_blastx_out.csv");
do
# Retrieve the hit
refLocus=$(echo $line | awk -F ',' '{print $2}' | sed 's/.*-//g')
frame=$(echo $line | awk -F ',' '{print $13}')
echo $line | awk -F ',' '{print $2}' >> matching_subject-$refLocus.txt
if [ $frame -gt 0 ]; then # HIT IS REGULAR
hit=$(echo $line | awk -F ',' '{print $1":"$7-1"-"$8}' | sed 's/-/__/g' | sed 's/$/-'$refLocus'/g')
echo Hit is: $hit
echo "...(regular hit)"
echo $line | awk -F ',' '{print($1"\t"$7-1"\t"$8)}' | bedtools sort > $hit".bed"
bedtools getfasta -bed $hit".bed" -fi $TS.FNA | sed 's/-/__/g' | sed '/^>/s/$/-'$refLocus'/g' > $hit"_hit.FNA"
fi
if [ $frame -lt 0 ]; then # HIT IS RC
hit=$(echo $line | awk -F ',' '{print $1":"$8-1"-"$7}' | sed 's/-/__/g' | sed 's/$/-'$refLocus'/g')
echo Hit is: $hit
echo "... (reverse complemented hit)"
echo $line | awk -F ',' '{print($1"\t"$8-1"\t"$7)}' | bedtools sort > $hit".bed"
bedtools getfasta -bed $hit".bed" -fi $TS.FNA | seqkit seq -rp -t dna -v -w 0 | sed 's/-/__/g' | sed '/^>/s/$/-'$refLocus'/g' > $hit"_hit.FNA"
fi
done
cp *_hit.FNA ../extracted_hits
# Retrieve the NMR
qcovs=$(cat $TS"_blastx_out.csv" | csvtk summary -H -n0 -f 14:sum)
if [ $qcovs -lt 90 ]; then
echo ...NMR found for $TS, extract NMR...
cat *.bed | bedtools sort | bedtools merge > $TS"_merged_ALL.bed"
seqkit fx2tab -ln $TS".FNA" | awk '{print($1"\t"$2"\t"$2)}' | bedtools sort > $TS".bed"
bedtools complement -L -i $TS"_merged_ALL.bed" -g $TS".bed" > $TS"_nonmatching_regions.bed"
cat $TS"_nonmatching_regions.bed" >> ../TS_nonmatching_regions.bed
bedtools getfasta -bed $TS"_nonmatching_regions.bed" -fi $TS".FNA" | seqkit seq -g -m 30 -w 0 | sed 's/-/_/g' | sed '/^>/s/$/-NMR/g' > $TS"NMR.FNA"
if [ ! -s $TS"NMR.FNA" ]; then
echo ... NMR less than 30 nucleotides, discard it...
rm $TS"NMR.FNA"
else
cp *NMR.FNA ../NMR_hits
fi
fi
else
echo No matching hit for $TS
cat $TS".FNA" > $TS"_nomatch.FNA"
cp *_nomatch.FNA ../nomatch_TS
fi
cd ../
done
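A quick count of the produced files gives an overview of how the TS were dispatched (optional check):
cd $path_to_wd/melasto689_on_full_ref
echo "Extracted hits:" $(ls -1 extracted_hits/*_hit.FNA 2>/dev/null | wc -l)
echo "TS with non-matching regions:" $(ls -1 NMR_hits/*NMR.FNA 2>/dev/null | wc -l)
echo "TS with no match:" $(ls -1 nomatch_TS/*_nomatch.FNA 2>/dev/null | wc -l)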
The extracted hits might or might not belong to a same TS and a same locus. This step will merge the extracted hits that do. This is important in order to keep coherence in the cleaned TS. For example, a TS that has a hit from position 1 to 30 and another hit from position 60 to 90 has to be considered as a single TS and not two separate TS. Note that for a same TS and same locus, the hits might actually overlap. We will thus re-align them to the reference and draw a consensus sequence of the hits alignment.
In the folder extracted_hits, create a new subfolder cleaned_hits that will contain the merged hits.
cd $path_to_wd/melasto689_on_full_ref/extracted_hits
mkdir cleaned_hits
Initialize a table containing the matches between the locusCustomID
and the corresponding Melasto689 locus name.
echo "locusCustomID, correspondingLoci" > $path_to_wd/loci_matching_table.csv
Draw the list of loci (containing locusCustomID for the TS that matched a reference in 2.3, and the old locus name in case the TS had no match).
loci_list=$(ls -1 *.FNA | sed 's/.*-//g' | sed 's/_hit.FNA//g' | sort | uniq)
For every locus in the list of loci:
- create a subfolder for the locus
- copy the hits that belong to the locus into the subfolder
- set the subfolder as the current directory
- retrieve the list of TS belonging to this locus (TS_list)
- retrieve the list of old loci from the TS belonging to this locus (TS_loci)
- write the locus and the corresponding list of old loci into the table of matching loci
- extract the sequences corresponding to the locus from the reference set
- retrieve the names of these sequences and store them into a text file (later used by the align_translated_to_ref_and_draw_consensus.R script)
- for every TS in the TS_list:
  - retrieve its old locus name (TS_locus)
  - retrieve the number of hits belonging to the TS and locus (nb_hits)
  - if the number of hits is greater than 1, then:
    - concatenate these hit files together and with the reference sequences
    - draw a consensus sequence for these hits using the align_translated_to_ref_and_draw_consensus.R script
    - remove the gaps from the consensus sequence and rename it so that the old locus name (TS_locus) is bordered with ’__’ and so that the -locus name is at the end of the sequence name
  - if the number of hits is equal to 1, then:
    - rename the hit sequence so that the -locus name is at the end of the sequence name
- copy the cleaned or merged hits into the cleaned_hits folder
- set the upper-level directory as the current directory
for locus in $loci_list
do
#echo $locus
mkdir $locus
cp *-$locus"_hit.FNA" $locus
cd $locus
TS_list=$(ls -1 *hit.FNA | sed 's/:.*//g' | sort | uniq)
TS_loci=$(ls -1 *hit.FNA | sed 's/:.*//g' | sed 's/.*__//g' | sort | uniq)
echo $locus, $TS_loci >> $path_to_wd/loci_matching_table.csv
seqkit grep -w0 -nrp $locus $path_to_wd/melasto689_on_full_ref/FULL_REFERENCE_nucl.FNA > $locus"_original.FNA"
seqkit seq -n $locus"_original.FNA" > IDs_to_remove_after_alignment.txt
#TS="Tetrazygia__NM_113708.2"
for TS in $TS_list
do
echo $TS
TS_locus=$(echo $TS | sed 's/.*__//g')
nb_hits=$(ls -1 $TS*_hit.FNA | wc -l)
if [ $nb_hits -gt 1 ]; then
echo "More than 1 hit found for "$TS". Draw consensus sequence."
cat $locus"_original.FNA" $TS*_hit.FNA > $TS"_hits_to_align.FNA"
# Align the hits to the reference (as translated amino acids) and build consensus upon aligned hits
Rscript $path_to_scripts/align_translated_to_ref_and_draw_consensus.R $TS"_hits_to_align.FNA" $TS"_hits_aligned_to_ref.FNA" IDs_to_remove_after_alignment.txt $TS"_hits_consensus" $TS"_hits_consensus.FNA"
# Remove the gaps to clean the consensus sequence
seqkit seq -g -w 0 $TS"_hits_consensus.FNA" | sed '/^>/s/_'$TS_locus'_/_'$TS_locus'__/g' | sed '/^>/s/$/'-$locus'/g' > $TS"_hits_consensus_clean-"$locus".FNA"
elif [ $nb_hits -eq 1 ]; then
echo "Only 1 hit found for "$TS"."
seqkit replace -w 0 -p ':.*' -r '__hit' $TS*_hit.FNA | sed '/^>/s/$/'-$locus'/g' > $TS"_hit_clean-"$locus".FNA"
fi
done
cp *clean-$locus.FNA ../cleaned_hits
cd ..
done
At the end of this step we have the folder cleaned_hits
full of
individual files with cleaned hits.
This step aims to add the relevant reference sequences (RS) to the final
probe set. We will only add the RS from loci that matched the TS in
2.3.
Some loci have a very high number of transcriptome RS. Instead of
appending them all to the final set, we just select the transcriptome RS
that previously aligned to a TS.
Concerning the Angio353 RS, no selection was carried out and all of them
were appended to the final set, if belonging to a locus present in the
final set.
- create the cleaned_sequences_set folder
- concatenate all the cleaned hit .FNA files into the file cleaned_hits_all.FNA in this folder
cd $path_to_wd/melasto689_on_full_ref/extracted_hits/cleaned_hits
mkdir $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
cat *.FNA > $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set/cleaned_hits_all.FNA
- create the matching_subjects subfolder in the cleaned_sequences_set folder
- fetch all the matching_subject-$refLocus.txt files and copy them into cleaned_sequences_set/matching_subjects
- remove the duplicate lines in every text file
cd $path_to_wd/melasto689_on_full_ref
mkdir cleaned_sequences_set/matching_subjects
find . -type f -name "matching_subject*.txt" -exec cp {} ./cleaned_sequences_set/matching_subjects \; # error messages are normal as the destination folder is embedded within the searching directory
cd cleaned_sequences_set/matching_subjects
for file in $(ls -1 *.txt)
do
cat $file | sort | uniq > $file.temp
done
rm *.txt
rename -v '.txt.temp' '.txt' *
- retrieve the list of loci from the concatenated clean hits file
- initialize the .FNA file that will contain the final sequences and concatenate the cleaned hits into it (final_sequences_set.FNA)
- for every locus in the list of loci:
  - retrieve the sequences that belong to this locus in the reference set and get the number of those sequences that come from the transcriptome data (identified by ‘scaffold’ in their name)
  - if the number of these transcriptome sequences is greater than 5, then:
    - retrieve only the reference sequences that matched a TS in step 2.3 (whose IDs are in matching_subject-locus.txt) and append them to final_sequences_set.FNA
    - retrieve all the reference sequences for this locus that come from Angio353 and, if any were retrieved, append them to final_sequences_set.FNA
  - else (i.e. the number of transcriptome sequences is 5 or fewer):
    - retrieve all the reference sequences for this locus and append them to final_sequences_set.FNA
cd $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
seqkit seq -n cleaned_hits_all.FNA | sed 's/.*-//g' | sort | uniq > loci_list.txt
cat cleaned_hits_all.FNA > final_sequences_set.FNA
for locus in $(cat loci_list.txt);
do
#echo $locus
nb_TTS=$(seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA | seqkit seq -n | grep 'scaffold' | wc -l) # nb transcriptomes TS
if [ $nb_TTS -gt 5 ]; then # if the number of transcriptome (1KP) reference sequences is above 5, just select a few (i.e. the matching ones)
# retrieve the matching TS(s) for the locus (SWGX WWQZ sequences transcripts and/or Angio353 seqs)
#cat matching_subjects/*-$locus.txt
seqkit grep -w0 -nrf matching_subjects/*-$locus.txt ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
# retrieve the corresponding Angio353 sequences for the locus
# seqkit grep -w0 -nrp $locus ../FULL_REFERENCE_nucl.FNA | seqkit seq -n | grep 'scaffold'
seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA | seqkit seq -n | grep -v 'scaffold' > $locus"_TS_to_get.txt"
if [ -s $locus"_TS_to_get.txt" ]; then
seqkit grep -w0 -nrf $locus"_TS_to_get.txt" ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
fi
rm $locus"_TS_to_get.txt"
else
seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
fi
done
Get the list of names of Angio353 loci that were included in Melasto689 (i.e. the list of angio353 loci) using the file generated in 2.2.
cd $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
seqkit seq -n ../Angio353_from_Melasto689.FNA | sed 's/.*-//g' | sort | uniq > Angio353_from_Melasto689_list.txt
For every locus in the list of angio353 loci:
- retrieve the sequences from the reference set that belong to the locus (i.e. either having it as locus name or having it in their TS name) and append them to final_sequences_set.FNA.
for locus in $(cat Angio353_from_Melasto689_list.txt);
do
#echo $locus
seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
seqkit grep -w0 -nrp "__"$locus"__" ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
done
Remove the duplicate sequences based on their name and sequence.
seqkit rmdup -w0 -D name_duplicates.txt -n final_sequences_set.FNA > final_sequences_set_rmdups.FNA
seqkit rmdup -w0 -D sequence_duplicates.txt -s final_sequences_set_rmdups.FNA > final_sequences_set.FNA
Sort the sequences by locus name in the final set.
cd $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
cat final_sequences_set.FNA | seqkit rmdup -n | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' | seqkit sort -w0 -n | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > final_sequences_set.FNA.temp
rm final_sequences_set.FNA
Modify the names of some TS to limit the length of the names.
seqkit replace -w0 -p 'hits_consensus' -r 'hc' final_sequences_set.FNA.temp > final_sequences_set.FNA.temp2
seqkit replace -w0 -p 'transcriptome_combo' -r 'tr_combo' final_sequences_set.FNA.temp2 > final_sequences_set.FNA
rm -r *temp*
Retrieve the TS names in the final set.
cat final_sequences_set.FNA | seqkit seq -n > TS_full_list.txt
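The self-BLAST check below uses a protein version of the final set (final_sequences_set.FAA), which is not produced by the commands shown above; a minimal way to obtain it, assuming the cleaned sequences are in frame (which the previous steps aimed to ensure), is a simple translation:
seqkit translate -w0 final_sequences_set.FNA > final_sequences_set.FAA # assumption: frame-1 translation; the corresponding .sh script may generate this file differently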
cd $path_to_wd/melasto689_on_full_ref
mkdir BLAST_DB
makeblastdb -in cleaned_sequences_set/final_sequences_set.FNA -out BLAST_DB/final_sequences_set_nucl -dbtype nucl -parse_seqids
makeblastdb -in cleaned_sequences_set/final_sequences_set.FAA -out BLAST_DB/final_sequences_set_prot -dbtype prot -parse_seqids
blastp -task blastp-fast -db BLAST_DB/final_sequences_set_prot -query cleaned_sequences_set/final_sequences_set.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > selfblast_final_sequences_set_blastp_out_SB.csv
cat selfblast_final_sequences_set_blastp_out_SB.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > different_query_and_subject_loci_SB.csv
cat different_query_and_subject_loci_SB.csv | csvtk filter -H -f'3>60' > filtered_different_query_and_subject_loci_SB.csv
echo "Done Self-Blast."
The file filtered_different_query_and_subject_loci_SB.csv resulting from the check self-BLAST contains a few sequences, from two different pairs of loci, having high similarity.
For the loci LOC12198 and LOC23791, it concerns reference sequences coming from Angio353 references (QSKP__6363 and AEPI__6320), but the sequences are similar over less than 20% of their length.
For the loci LOC20331 and LOC22857, it concerns reference sequences coming from Angio353 and a TS from Melasto689. There is relatively high similarity (ca. 62%) over ca. 40% of the TS from Melasto689, but over less than 10% of the reference sequences.
Given the short regions of similarity, we decided here to keep them separated.
The extension of the Angio353 set to the mega353 set appended some sequences (actually parts of sequences) from the SWGX and WWQZ transcriptomes to the full reference set (Step 1: 2). These same transcriptome sequences were also appended in full to the full reference set (Step 1: 1). Because of that, some sequences from the mega353 are subsets of sequences coming from the transcriptomes. To avoid redundancy, we remove the transcriptome sequences that are supersets of mega353 sequences.
Filter and select these sequences from the self-BLAST output table.
cd $path_to_wd/melasto689_on_full_ref
cat selfblast_final_sequences_set_blastp_out_SB.csv | csvtk filter -H -f'14=100' | csvtk filter -H -f'3=100' | awk -F ',' 'OFS = "," {
if ($1 != $2)
print $0
}' | awk -F, '$2 ~ /scaffold/ {print}' | awk -F, '$1 ~ /^WWQZ|^SWGX/ {print}' > different_query_and_subject_SB.csv
cat different_query_and_subject_SB.csv
Retrieve the names of the transcriptomes sequences and remove them from the final sequence set.
cat different_query_and_subject_SB.csv | awk -F ',' '{print $2}' > TS_to_remove.txt
cd $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
seqkit grep -w0 -nrvf ../TS_to_remove.txt final_sequences_set.FNA > final_sequences_set.FNA.temp
rm final_sequences_set.FNA
mv final_sequences_set.FNA.temp final_sequences_set.FNA
The final sequence set needs extra steps to make sure the sequences are clean.
Stop codons were found remaining in some TS of the final sequence set. We decided to assess them one by one. To do so, we aligned all the nucleotide TS from a same locus. Unless specified, we replaced the stop codon with the consensus sequence at the position of the stop codon.
Note that other stop codons are present in some reference sequences coming from the mega353 dataset. They are treated later (see Step 3: 6.1).
cd $path_to_wd
mkdir CLEAN_PROBE_SET
path_to_sequence_set=$path_to_wd/melasto689_on_full_ref
cp $path_to_sequence_set/cleaned_sequences_set/final_sequences_set.FNA ./CLEAN_PROBE_SET/
cd CLEAN_PROBE_SET
All the following correction commands are stored in the script custom_stop_codon_removal.sh.
TAG -> NAN
sed -i '/>Brachyotum__5699__hc-5699/,+1s/TTGCCCCTCAAATGGTAGTATCCC/TTGCCCCTCAAATGGNANTATCCC/g' final_sequences_set.FNA
TAA -> NAA
sed -i '/>Tibouchina__AT3G59040__hc-LOC00667/,+1s/CGATCTGAAAKCTGGAGGAACATGAGATAA/CGATCTGAAAKCTGGAGGAACATGAGANAA/g' final_sequences_set.FNA
TGA -> NNN
sed -i '/>Affzelli__5449__hit-LOC00776/,+1s/ACTTGGGCTAAATTGTTGTGTTTTTGA/ACTTGGGCTAAATTGTTGTGTTTTNNN/g' final_sequences_set.FNA
Stop codon in a badly aligned portion at the beginning of a hit. Remove the badly aligned portion (36 nucleotides).
sed -i '/>Memecylon_tr_combo__AT1G02020__hc-LOC01351/,+1s/TGCAACAAAAGTAAATCCTGGTTGTATTTTATGTAG//g' final_sequences_set.FNA
TAG -> NAG
sed -i '/>Memecylon_tr_combo__AT4G39520__hc-LOC01520/,+1s/AGGCTAAACAAGTAGCCACCT/AGGCTAAACAAGNAGCCACCT/g' final_sequences_set.FNA
2 stop codons in the middle of a badly aligning region with a lot of ambiguous characters. Remove the whole region (27 nucleotides).
sed -i '/>Tibouchina__TC05G024510__hc-LOC04226/,+1s/GGKTAGTTTATTCATTGAATTKKMSYT//g' final_sequences_set.FNA
TAA -> TAN
sed -i '/>Tibouchina__AT4G00740__hc-LOC04935/,+1s/GCATGGTAACTTCTGATGAAACATTTTGTA/GCATGGTANCTTCTGATGAAACATTTTGTA/g' final_sequences_set.FNA
TAG -> TNG
sed -i '/>Tibouchina__5639__hc-LOC05244/,+1s/TTKSTCTAGGGCAAT/TTKSTCTNGGGCAAT/g' final_sequences_set.FNA
TGA -> TGG
sed -i '/>Affzelli__5426__hit-LOC06188/,+1s/ATGTGAAAAAATGTTGGTAGT/ATGTGGAAAAATGTTGGTAGT/g' final_sequences_set.FNA
TAA -> TNN
sed -i '/>Tibouchina__NM_179236.3__hc-LOC06614/,+1s/TCAAGGTAACTTTTTGTGCTTTTTGCA/TCAAGGTNNCTTTTTGTGCTTTTTGCA/g' final_sequences_set.FNA
The other stop codon is in a badly aligning region with ambiguous characters. Remove the region (21 nucleotides).
sed -i '/>Tibouchina__NM_179236.3__hc-LOC06614/,+1s/YTSCTGCTGCTATGGTGACAG//g' final_sequences_set.FNA
Remove the stop codon and the region with a lot of ambiguous characters just before (48 nucleotides).
sed -i '/>Tibouchina__5339__hc-LOC07010/,+1s/CAKACCAGGKATKKWTYTCMMCTTRAAAAAATCTAAACGAGGCAGTTG//g' final_sequences_set.FNA
TAG -> TNN
sed -i '/>Memecylon_tr_combo__TC04G019470__hc-LOC07251/,+1s/TCTATGTAGAGATGGGTTGCAGTTGTTTTTGTATTC/TCTATGTNNAGATGGGTTGCAGTTGTTTTTGTATTC/g' final_sequences_set.FNA
TAA -> NAN
sed -i '/>Tibouchina__6563__hc-LOC07413/,+1s/TCCAAGCTTCCAAAGTAAAGGACTTCTGTGGTGACAATATTG/TCCAAGCTTCCAAAGNANAGGACTTCTGTGGTGACAATATTG/g' final_sequences_set.FNA
TGA -> NGN
sed -i '/>Torricelli_tr_combo__TC01G021540__hc-LOC07589/,+1s/TCATGACTTTTCTGTGCATCCATCAACTTCCTA/TCANGNCTTTTCTGTGCATCCATCAACTTCCTA/g' final_sequences_set.FNA
Remove the stop codon and the region with a lot of ambiguous characters just before.
sed -i '/>Tibouchina__NM_001198514.1__hc-LOC08266/,+1s/TTYWYSSYSTYYYSTKYKTTCTGA//g' final_sequences_set.FNA
The whole tail of the sequence, containing 2 stop codons, badly aligns to the rest of the TS. Remove the whole tail (51 nucleotides).
sed -i '/>Memecylon_tr_combo__TC09G010970__hc-LOC08580/,+1s/GTAAGTGTTCATTATTTCACACGCCTTCTGAGTTGATTTTGTTGATTCTGT//g' final_sequences_set.FNA
Stop codon in the tail of the sequence that badly aligns to the rest of the TS. Remove the tail (42 nucleotides).
sed -i '/>Tibouchina__TC05G027690__hc-LOC09163/,+1s/ATGATCGGTAGAGCACACGTCTGAACTCCAGTCACGAGATTC//g' final_sequences_set.FNA
The stop codon is at the end of a hit in a badly aligning region. Remove the region (24 nucleotides).
sed -i '/>Tibouchina__7572__hc-LOC09834/,+1s/CAAGGAGTTATAACGTGACTCAAG//g' final_sequences_set.FNA
The other stop codon is in a badly aligning region with ambiguous characters. Remove the region (54 nucleotides).
sed -i '/>Tibouchina__5942__hc-LOC14673/,+1s/RGCMRRRRKCMSYRSSSSYRGMARRARRCMSKWMGMCCAGAAGAAGACCGGTAG//g' final_sequences_set.FNA
Stop codon in a region of the sequence that badly aligns to the rest of the TS. Remove the region (33 nucleotides).
sed -i '/>Tibouchina__AT5G13630__hc-LOC15950/,+1s/GTATAGCCGGCTGMSMKRGYSRTSAARRTKGTK//g' final_sequences_set.FNA
Stop codon in a region of the sequence that badly aligns to the rest of the TS. Remove the region (39 nucleotides).
sed -i '/>Brachyotum__AT3G02760__hc-LOC24270/,+1s/AATTCCTAGAATTCATKSKKKCKYTAWKRWRRCSWKMWA//g' final_sequences_set.FNA
The TS that do not align to any sequence in the reference set can still be included in the final set. We need to make sure they are in the correct frame and that they do not include any stop codon. To do so, we will use a script developed by Chris Jackson that identifies the first forward frame that has no stop codon and recovers the in-frame sequence while removing trailing 3’ nucleotides.
path_to_sequence_set=$path_to_wd/melasto689_on_full_ref
cd $path_to_sequence_set/nomatch_TS
cat *nomatch.FNA > nomatches_all.FNA
python /blue/soltis/dagallierl/PROGRAMS/get_inframe_targetfile.py nomatches_all.FNA
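For reference, here is a minimal sketch of the idea implemented by the script: for each sequence, test the three forward frames and keep the first one whose translation contains no internal stop codon. This is an illustration only (it assumes sequence names contain no spaces, and it does not handle the removal of trailing 3’ nucleotides, which the actual script does).
while read -r name; do
  for frame in 1 2 3; do
    # translate this sequence in the current forward frame
    aa=$(seqkit grep -w0 -np "$name" nomatches_all.FNA | seqkit translate -f $frame -w0 | seqkit seq -s -w0)
    # an internal stop codon shows up as '*' before the last position of the translation
    if ! printf '%s' "${aa%?}" | grep -qF '*'; then
      echo "$name: first stop-free forward frame is $frame"
      break
    fi
  done
done < <(seqkit seq -n nomatches_all.FNA)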
The retrieved in-frame sequences are renamed so that the old locus name is included in the TS name and so that their new locus name is in the form -outlier### (with # being a digit).
cat nomatches_all_inframe.FNA | sed 's/-/_/g' | sed 's/\./_/g' | seqkit replace -w0 -p $ -r "-outlier{nr}" --nr-width 3 > nomatches_recovered.FNA
We still want to check whether these recovered TS could actually align to a sequence in the final set. To do so, we extract their old locus name and for each locus we retrieve all the sequences from the final set that belong to this locus into a separate file.
seqkit seq -n nomatches_recovered.FNA | sed 's/-.*//g' | sed 's/.*_//g' > loci_to_check.txt
for locus in $(cat loci_to_check.txt)
do
seqkit grep -w0 -nrp $locus final_sequences_set.FNA > $locus"_test.FNA"
seqkit grep -w0 -nrp $locus nomatches_recovered.FNA >> $locus"_test.FNA"
done
We then manually check each file by aligning the sequences of every locus and visually checking whether the ‘outlier’ sequence aligns to the rest of the sequences. To do so, we used Muscle v3.8.425 as wrapped in AliView to visually assess the alignments.
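The alignments can be generated in batch before opening them in AliView; a minimal sketch, again assuming a standalone muscle binary (we actually used the copy wrapped in AliView):
for locus in $(cat loci_to_check.txt)
do
# align the final-set TS together with the recovered ‘outlier’ TS of this locus
muscle -in $locus"_test.FNA" -out $locus"_test_aln.FNA"
done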
It turns out that the sequence from the locus ‘outlier015’ aligns to the locus ‘5463’. We thus change the locus name of this sequence to ‘5463’.
seqkit replace -w0 -p '-outlier015' -r '-5463' nomatches_recovered.FNA | sed 's/_5463-5463/-5463/g' > nomatches_recovered_renammed.FNA
rm -r *test.FNA
To simplify the process of building the set of reference sequences (Step 1), we used custom locus names in the form LOC###### (where # represents a digit). To ease the readability of the final probe set, we revert the loci to their original names (the ones used in Melasto689 and Angio353).
Prepare a matching table of the loci names between the original names and the custom loci names.
cd $path_to_wd/CLEAN_PROBE_SET
path_to_sequence_set=$path_to_wd/melasto689_on_full_ref
cp $path_to_sequence_set/cleaned_sequences_set/TS_full_list.txt ./
Retrieve the list of custom loci names.
cat TS_full_list.txt | sed 's/.*-//g' | sort | uniq > custom_loci_list.txt
Initialize the matching table and for every locus in the list of custom loci names:
- retrieve the corresponding locus (or loci) name(s)
- populate the matching table with the custom loci name and its corresponding locus (loci) name(s)
:> locus_matching_table.csv
for customLocus in $(cat custom_loci_list.txt)
do
correspondingLocus=$(grep -e -$customLocus TS_full_list.txt | grep -Po '(?<=__).*(?=__)' | sort | uniq)
echo $customLocus, $correspondingLocus >> locus_matching_table.csv
done
Several custom loci have more than one corresponding locus. This happens because, in the Melasto689 probe set, TS from Angio353 loci actually have high similarity to sequences from MarkerMiner loci. In these cases, we manually check the sequences: for each locus, we align the TS of the locus and check whether they all align correctly together. If they do, we keep them under the same name (we arbitrarily choose to keep the Angio353 name). If they do not align correctly together, we separate them into different names. Here are the details for these loci:
customLocusID: originalLocusID = finalLocusID
LOC00757: KT377110.1 = 5870
LOC07406: AT5G60590 = 6570
LOC13529: AT1G31780 = 6098
LOC14778: TC04G006100 = 4954
LOC15996: AT5G06830 = 6175
LOC16890: AT2G44660 = 6284
LOC18108: AT2G43030 = 5857
LOC19586: TC02G003400 = 6875
LOC22857: AT5G67530 = 5264
In the list above, customLocusID is the custom locus name in the
reference set, originalLocusID is the MarkerMiner locus name as in
Melasto689, and finalLocusID is the final locus name.
For the locus LOC00263, the template sequences should be re-separated
into different loci:
old TS name = new TS name
Tetrazygia__NM_113708.2__hc-LOC00263 = Tetrazygia_hc-NM_113708.2
Tetrazygia__NM_124699.3__hit-LOC00263 = Tetrazygia_hit-NM_124699.3
Tibouchina__NM_113708.2__hit-LOC00263 = Tibouchina_hit-NM_113708.2
Tibouchina__NM_124699.3__hit-LOC00263 = Tibouchina_hit-NM_124699.3
scaffold_WWQZ_2004803_Medinilla_magnifica-LOC00263 = scaffold_WWQZ_2004803_Medinilla_magnifica-NM_113708.2
scaffold_SWGX_2016500_Tetrazygia_bicolor-LOC00263 = scaffold_SWGX_2016500_Tetrazygia_bicolor-NM_113708.2
We end up with the manually modified .csv file locus_matching_table_CUSTOM.csv, containing the string to replace in the first column and the replacement string in the second column.
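For illustration, a few rows of locus_matching_table_CUSTOM.csv might look like the following (reconstructed from the mappings listed above; the actual file is the one provided in the repository):
LOC00757, 5870
LOC07406, 6570
Tetrazygia__NM_113708.2__hc-LOC00263, Tetrazygia_hc-NM_113708.2
scaffold_WWQZ_2004803_Medinilla_magnifica-LOC00263, scaffold_WWQZ_2004803_Medinilla_magnifica-NM_113708.2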
Build a replacement table that can be read by seqkit replace.
cat locus_matching_table_CUSTOM.csv | sed 's/ //g' | awk -F ',' '{print $1"\t"$2}' > replacement.txt
Rename the sequences using seqkit replace and write out the list of all the TS. This is done in 2 steps because for some TS only the locus name changes, while for other TS the full TS name changes.
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key final_sequences_set.FNA > final_sequences_set_renamed.FNA
seqkit seq -n final_sequences_set_renamed.FNA > TS_full_list_renamed.txt
cat locus_matching_table_CUSTOM.csv | grep -e 'NM_113708.2\|NM_124699.3' | sed 's/ //g' | awk -F ',' '{print $1"\t"$2}' > extra_replacement.txt
seqkit replace -w0 -p '(.+)' -r '{kv}' -k extra_replacement.txt --keep-key final_sequences_set_renamed.FNA > final_sequences_set_renamed2.FNA
seqkit seq -n final_sequences_set_renamed2.FNA > TS_full_list_renamed2.txt
Most of the TS still contain their old locus name (which is actually their final locus name, since the custom locus names were removed) bordered by ‘__’ (double underscores). We thus remove all the ‘__locus__’ from the TS names.
loci_list=$(seqkit seq -n final_sequences_set_renamed2.FNA | sed 's/.*-//g' | sort | uniq)
:> final_sequences_set_renamed3.FNA
for locus in $loci_list
do
seqkit grep -w0 -nrp -$locus final_sequences_set_renamed2.FNA | sed '/^>/s/__'$locus'__/_/g' >> final_sequences_set_renamed3.FNA
done
sed -i 's/_-/-/g' final_sequences_set_renamed3.FNA
sed -i 's/__/_/g' final_sequences_set_renamed3.FNA
seqkit seq -n final_sequences_set_renamed3.FNA > TS_full_list_renamed3.txt
cat final_sequences_set_renamed3.FNA nomatches_recovered_renammed.FNA > final_sequences_set_renamed4.FNA
Sort the sequences according to locus name.
cat final_sequences_set_renamed4.FNA | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' | seqkit sort -w0 -n | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > final_sequences_set_TO_CHECK.FNA
Translate the nucleotide sequence set into amino-acids.
seqkit translate -w0 final_sequences_set_TO_CHECK.FNA > final_sequences_set_TO_CHECK_prot.FAA
The sequence set is ready to be checked by HybPiper.
HybPiper v2 has a function that allows checking the target file for low-complexity sequences, for the presence of stop codons, and for sequences whose length is not a multiple of 3 (see the HybPiper wiki for more details).
cd $path_to_wd/CLEAN_PROBE_SET
hybpiper check_targetfile --targetfile_dna final_sequences_set_TO_CHECK.FNA > check_targetfile.log
hybpiper check_targetfile --targetfile_aa final_sequences_set_TO_CHECK_prot.FAA > check_targetfile_aa.log
Some sequences still have stop codons and are not multiples of 3 in length. These sequences are reference sequences coming from the mega353 dataset.
Manually get the list of TS to correct from the log file and store them in the file TS_to_correct.txt. Separate the TS that need to be corrected (TS_to_correct.FNA) from those that are OK (TS_checked_OK.FNA).
cd $path_to_wd/CLEAN_PROBE_SET
cat TS_to_correct.txt | sed 's/$/\n/' | sed 's/, /\n/g' > list_TS_to_correct.txt
seqkit grep -w0 -nrf list_TS_to_correct.txt final_sequences_set_TO_CHECK.FNA > TS_to_correct.FNA
seqkit grep -w0 -nrvf list_TS_to_correct.txt final_sequences_set_TO_CHECK.FNA > TS_checked_OK.FNA
Manually correct TS_to_correct.FNA into TS_to_correct_CORRECTED.FNA:
- remove the stop codon and the trailing 3’ nucleotides
- for sequences whose length is not a multiple of 3, remove the 3’ hanging nucleotide(s)
All the correction commands are stored in the script custom_TS_correction.sh.
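As an illustration of the second kind of correction, the following sketch would trim every sequence of TS_to_correct.FNA down to a multiple of 3 by dropping the 3’ hanging nucleotides (a hedged example only, assuming sequence names contain no spaces; the real corrections were applied case by case in custom_TS_correction.sh, and stop codons were handled individually):
# list the name and length of each sequence, then keep only the first len - (len % 3) bases
seqkit fx2tab -nl TS_to_correct.FNA | while read -r name len
do
keep=$(( len - len % 3 ))
seqkit grep -w0 -np "$name" TS_to_correct.FNA | seqkit subseq -w0 -r 1:$keep
done > TS_to_correct_trimmed_example.FNA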
Then merge back TS_checked_OK.FNA with TS_to_correct_CORRECTED.FNA.
cat TS_checked_OK.FNA TS_to_correct_CORRECTED.FNA > final_sequences_set_corrected.FNA
Manually get the list of TS with low complexity from check_targetfile.log and check_targetfile_aa.log and store it in the file list_of_TS_with_low_complexity.txt. This step removes the TS flagged as presenting low complexity for a given locus, but only if the locus is also represented by other TS (not flagged with low complexity). The procedure is:
- remove duplicates from list_of_TS_with_low_complexity.txt
- retrieve the list of loci for the list of TS
- initialize the file TS_with_low_complexity_to_remove.txt
- for every locus in this list of loci:
  - retrieve the full list of TS for this locus
  - retrieve the list of TS to remove for this locus (i.e. TS with low complexity)
  - retrieve the list of remaining TS (i.e. the full list minus the TS to remove)
  - if there are any remaining TS in the list, then append the list of TS to remove to the file TS_with_low_complexity_to_remove.txt
- remove the low complexity TS from the final sequence set using the file TS_with_low_complexity_to_remove.txt
dos2unix list_of_TS_with_low_complexity.txt
cat list_of_TS_with_low_complexity.txt | sort | uniq > list_of_TS_with_low_complexity.txt.temp
rm list_of_TS_with_low_complexity.txt
mv list_of_TS_with_low_complexity.txt.temp list_of_TS_with_low_complexity.txt
loci_list=$(cat list_of_TS_with_low_complexity.txt | sed 's/.*-//g' | sort | uniq)
:> TS_with_low_complexity_to_remove.txt
for locus in $loci_list;
do
echo $locus
seqkit grep -w0 -nrp -$locus final_sequences_set_corrected.FNA | seqkit seq -n > TS_full_list
grep -e -$locus list_of_TS_with_low_complexity.txt > TS_to_remove
cat TS_full_list TS_to_remove | sort | uniq -u > TS_remaining
if [ -s TS_remaining ]; then
echo Remaining TS exist for $locus, removing the low complexity TS...
cat TS_to_remove >> TS_with_low_complexity_to_remove.txt
else
echo No remaining TS for $locus, keeping it as is...
fi
rm TS_full_list TS_to_remove TS_remaining
done
seqkit grep -w0 -nvf TS_with_low_complexity_to_remove.txt final_sequences_set_corrected.FNA > final_sequences_set_corrected_low_complexity_removed.FNA
All the TS flagged for low complexity were removed, except for locus 6128, for which all the TSs were flagged as presenting low complexity.
Sort the sequences according to locus name.
cat final_sequences_set_corrected_low_complexity_removed.FNA | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' | seqkit sort -w0 -n | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > final_sequences_set_corrected_low_complexity_removed_sorted.FNA
Rename the final file and translate it to amino-acids.
cat final_sequences_set_corrected_low_complexity_removed_sorted.FNA > PROBE_SET_CLEAN.FNA
seqkit translate -w0 PROBE_SET_CLEAN.FNA > PROBE_SET_CLEAN_prot.FAA
This is an extra final check that the final sequence set is ok for HybPiper.
cd $path_to_wd/CLEAN_PROBE_SET
hybpiper check_targetfile --targetfile_dna PROBE_SET_CLEAN.FNA > check_targetfile_final.log
hybpiper check_targetfile --targetfile_aa PROBE_SET_CLEAN_prot.FAA > check_targetfile_aa_final.log
The sequence set is now cleaned and can be used with HybPiper or any other program. The final files are PROBE_SET_CLEAN.FNA (nucleotides) and PROBE_SET_CLEAN_prot.FAA (amino-acids).
- Carpenter EJ, Matasci N, Ayyampalayam S, Wu S, Sun J, Yu J, Jimenez Vieira FR, Bowler C, Dorrell RG, Gitzendanner MA, Li L, Du W, K. Ullrich K, Wickett NJ, Barkmann TJ, Barker MS, Leebens-Mack JH, Wong GK-S (2019) Access to RNA-sequencing data from 1,173 plant species: The 1000 Plant transcriptomes initiative (1KP). GigaScience 8: giz126. https://doi.org/10.1093/gigascience/giz126
- Jantzen JR, Amarasinghe P, Folk RA, Reginato M, Michelangeli FA, Soltis DE, Cellinese N, Soltis PS (2020) A two-tier bioinformatic pipeline to develop probes for target capture of nuclear loci with applications in Melastomataceae. Applications in Plant Sciences 8: e11345. https://doi.org/10.1002/aps3.11345
- Johnson MG, Pokorny L, Dodsworth S, Botigué LR, Cowan RS, Devault A, Eiserhardt WL, Epitawalage N, Forest F, Kim JT, Leebens-Mack JH, Leitch IJ, Maurin O, Soltis DE, Soltis PS, Wong GK, Baker WJ, Wickett NJ (2019) A Universal Probe Set for Targeted Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-Medoids Clustering. Systematic Biology 68: 594–606. https://doi.org/10.1093/sysbio/syy086
- McLay TGB, Birch JL, Gunn BF, Ning W, Tate JA, Nauheimer L, Joyce EM, Simpson L, Schmidt-Lebuhn AN, Baker WJ, Forest F, Jackson CJ (2021) New targets acquired: Improving locus recovery from the Angiosperms353 probe set. Applications in Plant Sciences 9. https://doi.org/10.1002/aps3.11420
- One Thousand Plant Transcriptomes Initiative (2019) One thousand plant transcriptomes and the phylogenomics of green plants. Nature 574: 679–685. https://doi.org/10.1038/s41586-019-1693-2
- Reginato M, Michelangeli FA (2016) Primers for low-copy nuclear genes in the Melastomataceae. Applications in Plant Sciences 4: 1500092. https://doi.org/10.3732/apps.1500092