-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathProbe_set_cleaning_final.Rmd
1427 lines (1202 loc) · 79.6 KB
/
Probe_set_cleaning_final.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Cleaning of the Melastomataceae probe set for target enrichment"
output: github_document
# output:
# html_document:
# theme: united
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
eval = FALSE)
```
The basic idea of the cleaning process is to build a set of reference sequences and to align every template sequence from Melasto689 to this set of references, correcting the frame, reverse-complementing if needed, and removing stop codons and poorly aligned regions. The cleaned template sequences are then associated to the reference sequences they matched into a final sequence set. Finally, the final sequence set go through different checks and fine-tuned cleaning.
The process will be divided in 3 main steps:
- [**Step 1**][Step 1: build the set of reference sequences]: build the set of reference sequences
- [**Step 2**][Step 2: align the Melasto689 template sequences to the set of reference]: align and associate the Melasto689 template sequences to the set of reference into a final sequence set
- [**Step 3**][Step 3: Final checks and fine-tuned cleaning]: final checks and fine-tuned cleaning: extra steps to make sure the sequences are clean
**Directory organization.** For simplicity, all the following assumes analyses are run in a single directory `$path_to_wd`. This directory is divided in 6 sub-directory:
- [`prepare1KP`](prepare1KP) (Step 1: 1. - Prepare the 1KP transcriptomes reference sequences set)
- [`prepare_mega353`](prepare_mega353) (Step 1: 2. - Prepare the mega353 reference sequences set)
- [`full_reference_DB`](full_reference_DB) (Step 1: 3. - Merge the two previous reference sequences set)
- [`original_template_sequences`](original_template_sequences) (Step 2: 1. - Prepare the original Melast689 template sequence set)
- [`melasto689_on_full_ref`](melasto689_on_full_ref) (Step 2: 2. - Align and associate the Melasto689 template sequences to the set of reference)
- [`CLEAN_PROBE_SET`](CLEAN_PROBE_SET) (Step 3 - Contains the final and cleaned probe set)
```{bash}
cd $path_to_wd
mkdir prepare1KP
mkdir prepare_mega353
mkdir full_reference_DB
mkdir original_template_sequences
mkdir melasto689_on_full_ref
mkdir CLEAN_PROBE_SET
```
**Naming convention.** Sequences names follow the convention used in many probe sets for targeted sequencing and associated programs (e.g. Angio353, HybPiper, Captus, SECAPR). The name of a sequence is composed of the template sequence ID (generally identifying the taxon from which the sequence initially come from, sometimes with additional info) and the locus name separated by an hyphen, e.g. `>template_sequence_ID-locusID`
**Scripts.** Every step of the cleaning process is presented and explained in detail below. The command are presented as if run in interactive mode. For some steps, the running time is long and was actually not achieved in interactive mode, but using `.sh` scripts (scripts were run on a computing cluster that uses the SLURM manager, `sbatch` command). The `.sh` scripts are made available in this repository (see the [`script`](scripts) folder) and a statement is present at the beginning of each step specifying which script to refer to.
**Files.** Note that not all the intermediate files produced by the cleaning process are present in this repository. The missing files are some intermediate .FNA files or heavy (>100 Mo) blast output tables. The most important files are still included.
**Dependencies**
- [SeqKit 2.0.0](https://bioinf.shenwei.me/seqkit/)
- [NCBI BLAST 2.10.1](https://blast.ncbi.nlm.nih.gov/Blast.cgi)
- [csvtk 0.23.0](https://bioinf.shenwei.me/csvtk/)
- [bedtools 2.30.0](https://bedtools.readthedocs.io/en/latest/)
- [R 4.1.3](https://www.r-project.org/) (needs packages [DECIPHER](http://www2.decipher.codes/), [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html))
- [Python 3.10](https://www.python.org/downloads/release/python-3100/)
- [HybPiper 2.0.1](https://github.com/mossmatters/HybPiper)
- [AliView 1.28](https://ormbunkar.se/aliview/)
- [Muscle v3.8.425](https://drive5.com/muscle/) (as wrapped in AliView)
**Glossary**:
- **Angio353** (sometimes referred as **full Angio353**): the original Angiosperms 353 probe set
- **locusCustomID**: temporary locus identifier given to the loci during the cleaning process
- **mega353**: the Angiosperms 353 probe set extended with transcriptome sequences (see: https://github.com/chrisjackson-pellicle/NewTargets)
- **Melasto689**: the full Melastomataceae probe set. It is composed of 384 loci, each of them being composed of 1 to 4 template sequences, for a total of 689 template sequences. It is divided into 4 subsets of loci:
- **angio353** (with small **a**): a set of 266 loci (439 template sequences) that come from the Angiosperms 353 probe set
- **markerminer**: a set of 104 loci (215 template sequences) that come from putative single-copy loci identified with MarkerMiner
- **functional_extra**: a set of 6 loci (12 template sequences) that come from functional genes
- **RM2016**: a set of 8 loci (23 template sequences) that come from [Reginato & Michelangeli (2016)](https://doi.org/10.3732/apps.1500092)
- **reference sequence** (or **RS**): a sequence from the reference set; in the sequence set, a same locus can have several template sequences
- **template sequence** (or **TS**): a sequence from the Melasto689 set or final sequence set; in the sequence set, a same locus can have several template sequences
# Step 1: build the set of reference sequences
The set of reference is composed of [mega353](https://doi.org/10.1002/aps3.11420) (that is an extended version of [Angio353](https://doi.org/10.1093/sysbio/syy086)) and the transcriptomes of two Melastomataceae species from the [1KP project](https://sites.google.com/a/ualberta.ca/onekp/?pli=1).
The set of reference sequence will contain different sequences. The sequences will be grouped under different "loci" based on their similarity. During the process of building the set of references, a temporary locus name (sometimes referred as `*locusCustomID*`) will be assigned to the loci in the form `LOC#####` (where `#` represents a digit, e.g. `LOC12345`). This `*locusCustomID*` will be used as locus name for most of the sequences during the whole cleaning process, but the loci names will be reverted to their "old" names (i.e. the names used in Melasto689 and Angio353) in Step 3: 3.
## 1. Prepare the 1KP transcriptomes reference database
> The `sbatch` script corresponding to this section is: [`prepare_1KP_DB.sh`](scripts/prepare_1KP_DB.sh)
We will use the transcriptomes of the two Melastomataceae samples present in the 1KP project dataset: SWGX and WWQZ (*Tetrazygia bicolor* and *Medinilla magnifica*, respectively).
### 1.1 Retrieve the sequences
Download SWGX and WWQZ data from the [1KP project (updated release)](https://dx.doi.org/10.5524/100910), and decompress (unzip) the files into `$path_to_wd/prepare1KP`.
We will use of the `XXXX-translated-protein.fa` and `XXXX-translated-nucleotides.fa` fasta files.
### 1.2 Group the sequences according to their ortholog group
Sequences in the SWGX and WWQZ data have a unique identifier (referred as `transcriptSequencesID`), in the form `>scaffold-XXXX-2#######-Genus_species`, where XXXX is either SWGX or WWQZ, # represents a digit, and Genus_species is either Tetrazygia_bicolor (SWGX) or Medinilla_magnifica (WWQZ), e.g. `>scaffold-WWQZ-2000004-Medinilla_magnifica`.
The sequences need to be grouped according to their ortholog group. To do so, we'll use the 2<sup>nd</sup> column in the `XXXX-translated-reference-names.tsv` files, containing the gene identifiers according to OrthoFinder (see [One Thousand Plant Transcriptomes Initiative (2019)](https://doi.org/10.1038/s41586-019-1693-2), referred as `locusOrthologID`. The identifier is in the form "gi|255582030|ref|XP_002531812.1|".
- get the list of the unique orthogroup identifiers from the `SWGX-translated-reference-names.tsv` `WWQZ-translated-reference-names.tsv` files into the file `orthoIDs_list.txt`
```{bash}
cd $path_to_wd/prepare1KP
cat SWGX-translated-reference-names.tsv WWQZ-translated-reference-names.tsv | awk '{print $2}' | sort | uniq > orthoIDs_list.txt # get the list of the unique gene identifiers according to OrthoFinder
```
- append a unique custom locus name (*`locusCustomID`*) in the form LOC##### to every orthogroup (*`locusOrthologID`*) and store them in the array `$loc_list`. The array contains locus names in the form `locusOrthologID__locusCustomID`.
```{bash}
loc_list=$(cat orthoIDs_list.txt | awk '{ printf $1"__LOC" "%05d\n", ++a }') # append a unique custom locus name in the form LOC##### (locusCustomID) to every orthogroup (locusOrthologID) and store them in the array loc_list
```
- create a folder called `orthogroups` and initialize the files:
- `ortho_loci_table.txt`: will contain the correspondance between *`locusCustomID`*, *`locusOrthologID`* and *`transcriptSequencesID`*
- `SWGX-WWQZ-translated-nucleotides-grouped.FNA`: will contain the transcriptome sequences, grouped by custom locus ID
- `SWGX-WWQZ-translated-prot-grouped.FAA`: will contain the transcriptome sequences, grouped by custom locus ID
```{bash}
mkdir orthogroups
echo locusCustomID locusOrthologID transcriptSequencesID > orthogroups/ortho_loci_table.txt # initialize file
:> orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA # initialize file
:> orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA # initialize file
```
- for every locus name in the list of loci `$loc_list`:
- define $locID as the custom locus ID (*`locusCustomID`*)
- define $orthoID as the orthogroup ID (*`locusOrthologID`*)
- retrieve the `transcriptSequencesID`(s) corresponding to the *`locusOrthologID`* in SWGX and WWQZ databases
- if `transcriptSequencesID`(s) were retrieved:
- extract the sequence(s) from the databases, append the custom locus ID (*`locusCustomID`*) to its name, and append the renamed sequence to `SWGX-WWQZ-translated-nucleotides-grouped.FNA` and `SWGX-WWQZ-translated-prot-grouped.FAA`
- populate the `ortho_loci_table.txt` with the corresponding *`locusCustomID`*, *`locusOrthologID`* and *`transcriptSequencesID`*
```{bash}
for name in $loc_list;
do
locID=$(echo $name | sed 's/.*__//g')
orthoID=$(echo $name | sed 's/__.*//g')
# retrieve the sequence ID(s) corresponding to the locus in SWGX and WWQZ databases
grep -e $orthoID SWGX-translated-reference-names.tsv | awk '{print $1}' > SWGX_seqs
grep -e $orthoID WWQZ-translated-reference-names.tsv | awk '{print $1}' > WWQZ_seqs
if [ -s SWGX_seqs ]; then
seqkit grep -w0 -n -f SWGX_seqs SWGX-translated-nucleotides.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA
seqkit grep -w0 -n -f SWGX_seqs SWGX-translated-protein.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA
fi
if [ -s WWQZ_seqs ]; then
seqkit grep -w0 -n -f WWQZ_seqs WWQZ-translated-nucleotides.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA
seqkit grep -w0 -n -f WWQZ_seqs WWQZ-translated-protein.fa | sed 's/-/_/g' | sed '/^>/s/$/-'$locID'/g' >> orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA
fi
echo $locID $orthoID $(cat SWGX_seqs) $(cat WWQZ_seqs) >> orthogroups/ortho_loci_table.txt
rm SWGX_seqs WWQZ_seqs
done
```
- remove sequences less than 30 amino-acids (AA) and less than 90 nucleotides (NT) long
```{bash}
cd orthogroups
seqkit seq -w0 -m 90 SWGX-WWQZ-translated-nucleotides-grouped.FNA > SWGX-WWQZ-translated-nucleotides-grouped.FNA.temp
seqkit seq -w0 -m 30 SWGX-WWQZ-translated-prot-grouped.FAA > SWGX-WWQZ-translated-prot-grouped.FAA.temp
rm -r SWGX-WWQZ-translated-prot-grouped.FAA SWGX-WWQZ-translated-nucleotides-grouped.FNA
rename -v '.FAA.temp' '.FAA' ./*
rename -v '.FNA.temp' '.FNA' ./*
cd $path_to_wd/prepare1KP
```
### 1.2 Check for sequences with high similarity
Sequences with high similarity need to be grouped into a same "locus", otherwise the template sequences from Melasto689 will align to several loci which is not desirable since we want to distribute these template sequences into a single loci.
To group sequences with high similarity, we will run a "self-Blast" using `blastp-fast` (protein to protein alignment).
#### 1.2.1 Build BLAST databases
Make a custom BLAST database with the 2 transcriptomes (SWGX and WWQZ) concatenated and grouped by orthogroup. Build a database for both amino-acids and nucleotides sequences.
```{bash}
cd $path_to_wd/prepare1KP
mkdir BLAST_custom_DB
makeblastdb -in orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA -out BLAST_custom_DB/SWGX-WWQZ_nucleotides_grouped_BLAST_DB -dbtype nucl -parse_seqids # actually not necessary
makeblastdb -in orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA -out BLAST_custom_DB/SWGX-WWQZ_protein_grouped_BLAST_DB -dbtype prot -parse_seqids
```
#### 1.2.2 Run the self-BLAST
```{bash}
cd $path_to_wd/prepare1KP
blastp -task blastp-fast -db BLAST_custom_DB/SWGX-WWQZ_protein_grouped_BLAST_DB -query orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > selfblast_SWGX-WWQZ_protein_grouped_blastp_out_SB1.csv
```
The raw result of a self-BLAST needs to be filtered since every sequence will inevitably have a 100% match with itself. So, we filter out all the matches between two sequences belonging to the same locus. Also, we only keep the matches that have either 1) a percentage of identity greater than 60%, an alignment length greater than 50 AA, a bitscore greater than 100 and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence), or 2) a percentage of identity greater than 80%, an alignment length greater than 50 AA and a bitscore greater than 50.
```{bash}
cat selfblast_SWGX-WWQZ_protein_grouped_blastp_out_SB1.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > different_query_and_subject_loci_SB1.csv
cat different_query_and_subject_loci_SB1.csv | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 || $3 > 80 && $4 > 50 && $12 > 50' > filtered_different_query_and_subject_loci_SB1.csv
```
Some of the sequences with high similarity were assigned to different orthogroups. We will thus group these sequences, which mean grouping some supposedly ortholog loci.
### 1.3 Group loci with high sequence similarity
#### 1.3.1 Build a similarity table
Retrieve the pairs of loci that need to be grouped.
```{bash}
cat filtered_different_query_and_subject_loci_SB1.csv | awk -F ',' '{
query = $1
subject = $2
sub(".+-","", query)
sub(".+-","", subject)
print query"," subject
}' > pairs_SB1.csv
```
A locus can have high sequence similarity with more than one other locus, we need to retrieve all the loci associations, and print them in a similarity table (`selfblast_table.txt`).
For every locus in the list of loci from the pairs of loci:
- retrieve the other locus (loci) it matches to, and arrange them into a list
- search if any of the locus in the list is already in the table `selfblast_table.txt`
- if none of the locus is in the similarity table, then append the list at the end of the table
- if any of the locus is found in the similarity table, then extract the matching list of loci in the similarity table, append the list to the extracted list, remove duplicates loci and replace the original list with the newly generated list
```{bash}
query_list=$(cat pairs_SB1.csv | awk -F ',' '{print $1}' | sort | uniq)
:> selfblast_table.txt
for query in $query_list
do
echo $query
subject=$(grep -e $query, pairs_SB1.csv | awk -F ',' 'ORS = "\n" {print $2}' | sort -n | uniq | tr '\n' ',' | sed 's/,$/\n/')
echo $subject
to_search=$(echo $query,$subject | sed 's/,/|/g')
#egrep -qs -e $to_search selfblast_table.txt
if ! egrep -qs -e $to_search selfblast_table.txt; then
echo Query or Subject not in the table
echo $query,$subject >> selfblast_table.txt
else
echo Query or Subject YET in the table
replacement=$(echo $(egrep -e $to_search selfblast_table.txt | sed -z 's/\n/,/g' | sed 's/.$/\n/'),$query,$subject | sed -e $'s/,/\\\n/g' | sort -n | uniq | tr '\n' ',' | sed 's/.$/\n/')
line_numbers=$(egrep -n -e $to_search selfblast_table.txt | sed 's/:.*//g')
#replacement=$(echo $query','$subject | sed -e $'s/,/\\\n/g' | sort -n | uniq | tr '\n' ',' | sed 's/.$/\n/')
for N in $line_numbers;
do
sed -i $N's/.*/'$replacement'/g' selfblast_table.txt
done
cat selfblast_table.txt | sort | uniq > selfblast_table.txt.temp
rm selfblast_table.txt
mv selfblast_table.txt.temp selfblast_table.txt
fi
done
cat selfblast_table.txt > selfblast_table_fromSB1.txt #intermediate backup
#check for no duplicates:
echo Total number of IDs: $(cat selfblast_table.txt | sed 's/,/\n/g' | wc -l)
echo "Number of IDs (dups removed):" $(cat selfblast_table.txt | sed 's/,/\n/g' | sort | uniq | wc -l)
```
Each line of the table `selfblast_table.txt` contains a list of comma-separated matching loci in the form: *`locusCustomID, correspondingLocus1, correspondingLocus2, ...`*. The name retained as *`locusCustomID`* is the one having the lowest suffix number among the list of corresponding loci on a same line.
#### 1.3.2 Associate the loci in the fasta files
Now the list of corresponding loci based on similarity has been established, we need to associate them in the `.FNA` and `.FAA` files.
Transform the table into a list of pairs of *`locusCustomID,correspondingLocus`* (`selfblast_table_paired.txt`).
```{bash}
:> selfblast_table_paired.txt
for line in $(cat selfblast_table.txt)
do
# echo $line
locus=$(echo $line | cut -d ',' -f 1)
echo $locus
others=$(echo $line | sed 's/'$locus',//g' | sed -e $'s/,/\\\n/g')
for other in $others
do
echo $locus,$other >> selfblast_table_paired.txt
done
done
cat selfblast_table_paired.txt > selfblast_table_paired_fromSB1.txt #intermediate backup
```
Transform the comma-separated list into a tab-separated list (`replacement.txt`), that can be read by `seqkit replace` command.
```{bash}
cat selfblast_table_paired.txt | awk -F ',' '{print $2"\t"$1}' > replacement.txt
```
Use `seqkit replace` to replace every correspondingLocus to its matching *`locusCustomID`*.
```{bash}
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key orthogroups/SWGX-WWQZ-translated-prot-grouped.FAA > orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key orthogroups/SWGX-WWQZ-translated-nucleotides-grouped.FNA > orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA
```
Finally sort the fasta files by *`locusCustomID`*.
```{bash}
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA.temp
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA.temp
rm orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA
rename -v '.FAA.temp' '.FAA' ./orthogroups/*
rename -v '.FNA.temp' '.FNA' ./orthogroups/*
```
### 1.4 Final self-BLAST check
#### 1.4.1 Build BLAST database
```{bash}
cd $path_to_wd/prepare1KP
makeblastdb -in orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA -out BLAST_custom_DB/SWGX-WWQZ_nucleotides_grouped_SB1_BLAST_DB -dbtype nucl -parse_seqids
makeblastdb -in orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA -out BLAST_custom_DB/SWGX-WWQZ_protein_grouped_SB1_BLAST_DB -dbtype prot -parse_seqids
```
#### 1.4.2 Run the self-BLAST check
```{bash}
blastp -task blastp-fast -db BLAST_custom_DB/SWGX-WWQZ_protein_grouped_SB1_BLAST_DB -query orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" > check_selfblast_SWGX-WWQZ_protein_grouped_blastp_out.csv
cat check_selfblast_SWGX-WWQZ_protein_grouped_blastp_out.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > check_different_query_and_subject_loci.csv
cat check_different_query_and_subject_loci.csv | csvtk filter -H -f'3>60' | csvtk filter -H -f'4>50' | csvtk filter -H -f'14>50' | csvtk filter2 -H -f '$12 > 100' > check_filtered_different_query_and_subject_loci.csv
```
The file `check_filtered_different_query_and_subject_loci.csv` resulting from the check self-BLAST is empty, meaning no sequence from a locus has similarity with a sequence of any other locus.
## 2. Prepare the mega353 reference database
The mega353 is an expansion of the Angio353 probe set, incorporating sequences from more than 500 transcriptomes from the the 1KP project. Along with the dataset, a script is provided in order to filter the sequences according to the group of interest. Here we will filter the mega353 dataset to keep the sequences from the samples in the Myrtales order.
Download the NewTargets directory from https://github.com/chrisjackson-pellicle/NewTargets and unzip the files. Store the `NewTargets` folder into `$path_to_wd/prepare_mega353`.
A filtering text file is created, in order to filter on the Myrtales order ([`select_file_Myrtales.txt`](prepare_mega353/select_file_Myrtales.txt)).
The `filter_mega353.py` script is then run.
```{bash}
cd $path_to_wd/prepare_mega353/NewTargets
python filter_mega353.py mega353.fasta select_file_Myrtales.txt -filtered_target_file ../mega353_Myrtales.fa
```
The mega353 reference sequences set used is the [`mega353_Myrtales.fa`](prepare_mega353/mega353_Myrtales.fa) file.
## 3. Associate the transcriptome and mega353 references databases into a single set
> The `sbatch` script corresponding to this section is: [`prepare_full_reference_DB.sh`](scripts/prepare_full_reference_DB.sh)
We will now associate the the transcriptome and mega353 references set into a single set of references. The association will be done based on sequence similarity.
### 3.1 Prepare the paths and data
Define the paths to previously prepared databases.
```{bash}
cd $path_to_wd/full_reference_DB
path_to_onekp=$path_to_wd/prepare1KP;
Angio353_FNA=$path_to_wd/prepare_mega353/mega353_Myrtales.fa;
```
Concatenate the transcriptome reference databases in the current directory into the `FULL_REFERENCE_nucl.FNA` and `FULL_REFERENCE_prot.FAA` files. Sequences from the mega353 database will be appended to these files.
```{bash}
cat $path_to_wd/prepare1KP/orthogroups/SWGX-WWQZ-translated-nucleotides-grouped_SB1.FNA > FULL_REFERENCE_nucl.FNA
cat $path_to_wd/prepare1KP/orthogroups/SWGX-WWQZ-translated-prot-grouped_SB1.FAA > FULL_REFERENCE_prot.FAA
nuc_FNA="FULL_REFERENCE_nucl.FNA"
prot_FAA="FULL_REFERENCE_prot.FAA"
```
Copy the mega353 database in the current directory and rename the sequences to remove hyphens that do not separate sequence ID from locus name.
```{bash}
dos2unix $Angio353_FNA
cat $Angio353_FNA | sed 's/ .*//g' | sed 's/-[[:digit:]]\{4\}_exonerate_grafted/_EG/g' | sed 's/grafted_with/GW/g' | sed 's/-v/_v/g' | sed 's/\./_/g' | sed 's/-ASGPB/_ASGPB/g' | sed 's/-iTAG/_iTAG/g' | sed 's/-Genoscope/_Genoscope/g' > Angio353.FNA
Angio353_FNA=$path_to_tmp/Angio353.FNA
```
Define the path to the transcriptome protein BLAST database from Step 1: 1.4.1.
```{bash}
prot_DB=$path_to_wd/prepare1KP/BLAST_custom_DB/SWGX-WWQZ_protein_grouped_SB1_BLAST_DB
```
### 3.2 Associate mega353 loci with SWGX and WWQZ transcriptomes
Extract the list of mega353 loci in a text file. Initialize matching table between mega353 loci and custom locus ID (*`locusCustomID`*).
```{bash}
cd $path_to_wd/full_reference_DB
seqkit seq -n $Angio353_FNA | sed 's/.*-//g' | sort | uniq > Angio353_list.txt
:> Angio353_to_custom_loci_table.txt
```
For each locus in the list of mega353 loci:
- create a sub-directory and set it as current directory
- extract the locus sequence from the mega353 database into a .FNA file, translate it (.FAA)
- search for similar protein sequence into the transcriptome protein database using `blastp` and store the results in a .csv file
- if the .csv file exists and is not empty (i.e. blast found matching sequences), then:
- filter the .csv to keep only query sequences with percentage identity greater than 60% to the matching subject, alignment length greater than 50 amino-acids, bitscore greater than 100, frame greater than 0 (i.e. sequences not in reverse complement) and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence)
- if the filtered .csv file exists and is not empty (i.e. matching sequences remain after filtering), then:
- extract the *`customLocusID`*(s) from the matching sequence(s)
- if more than 1 *`customLocusID`* were found, then:
- print a warning
- append the locus sequences to the full reference database as they are (no renaming)
- print the mega353 locus name and the mention "multiplematches" to the matching table
- else (i.e. only 1 *`customLocusID`* was found), then:
- rename the locus sequence so that the mega353 locus name (4-digit sequence) is enclosed in 2 __ (double underscore) and moved to the sequence ID, and so that the *`customLocusID`* is appended at the end of the sequence
- append the renamed locus sequence to the full reference database
- print the mega353 locus name and its *`customLocusID`* to the matching table
- else (i.e. no matching sequences remain after filtering), then:
- append the locus sequences to the full reference database as they are (no renaming)
- print the mega353 locus name and the mention "nomatch" to the matching table
- set the upper level directory as current directory
```{bash}
for locus in $(cat Angio353_list.txt)
do
echo STARTING FOR $locus
mkdir $locus
cd $locus
seqkit grep -w0 -nrp $locus $Angio353_FNA > $locus.FNA
query=$locus".FNA"
cat $query | seqkit translate -w0 > $locus".FAA"
blastp -task blastp-fast -matrix BLOSUM62 -db $prot_DB -query $locus".FAA" -evalue 1e-06 -outfmt "10 std qframe qcovs qseq sframe sseq" -num_threads $SLURM_CPUS_PER_TASK > $locus"_blastp_out_unfiltered.csv"
#blastx -task blastx-fast -matrix BLOSUM62 -db $prot_DB -query $query -culling_limit 1 -evalue 1e-06 -outfmt "10 std qframe qcovs qseq sframe sseq" -num_threads $SLURM_CPUS_PER_TASK > $locus"_blastx_out_unfiltered.csv"
if [ -s $locus"_blastp_out_unfiltered.csv" ]; then
#cat $locus"_blastx_out_unfiltered.csv" | csvtk filter -H -f"3>60" > $locus"_blastx_out.csv"
cat $locus"_blastp_out_unfiltered.csv" | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 && $13 > 0' > $locus"_blastp_out.csv"
fi
if [ -s $locus"_blastp_out.csv" ]; then
cat $locus"_blastp_out.csv" | awk -F ',' '{print $2}' | sed 's/.*-LOC/LOC/g' | sort | uniq > DB_locus
nb_DB_loci=$(cat DB_locus | wc -l)
if [ $nb_DB_loci -gt 1 ]; then
echo WARNING: several DB loci found for locus $locus. No transcriptome appended. You might want to check manually for $locus.
cat $locus".FNA" >> ../$nuc_FNA
cat $locus".FAA" >> ../$prot_FAA
echo $locus multiplematches >> ../Angio353_to_custom_loci_table.txt
else
DB_locus=$(cat DB_locus)
cat $locus".FNA" | sed '/^>/s/-/__/g' | sed '/^>/s/$/__-'$DB_locus'/g' >> ../$nuc_FNA
cat $locus".FAA" | sed '/^>/s/-/__/g' | sed '/^>/s/$/__-'$DB_locus'/g' >> ../$prot_FAA
echo $locus $(cat DB_locus) >> ../Angio353_to_custom_loci_table.txt
fi
else
echo No match found for $locus.
cat $locus".FNA" >> ../$nuc_FNA
cat $locus".FAA" >> ../$prot_FAA
echo $locus nomacth >> ../Angio353_to_custom_loci_table.txt
fi
cd ..
done
```
### 3.4 Check for sequence similarity
This step checks sequence similarity for sequences from different loci. If highly similar sequences from different loci are found, the loci have to be grouped under the same name. To check for sequence similarity, we run a "self-Blast" using `blastp-fast` (protein to protein alignment).
#### 3.4.1 Build BLAST database
```{bash}
mkdir FULL_REFERENCE_BLAST_DB
makeblastdb -in FULL_REFERENCE_nucl.FNA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_nucl_BLAST_DB -dbtype nucl -parse_seqids
makeblastdb -in FULL_REFERENCE_prot.FAA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB -dbtype prot -parse_seqids
```
#### 3.4.2 Run the self-BLAST
```{bash}
blastp -task blastp-fast -db FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB -query FULL_REFERENCE_prot.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > full_ref_blastp_out.csv
```
The raw result of a self-BLAST needs to be filtered since every sequence will inevitably have a 100% match with itself. So, we filter out all the matches between two sequences belonging to the same locus. Also, we only keep the query sequences with percentage identity greater than 60% to the matching subject, alignment length greater than 50 amino acids, bitscore greater than 100, frame greater than 0 (i.e. sequences not in reverse complement) and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence).
```{bash}
cat full_ref_blastp_out.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > different_query_and_subject_loci.csv
cat different_query_and_subject_loci.csv | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 && $13 > 0' > filtered_different_query_and_subject_loci.csv
```
### 3.5 Associate the loci having sequences with high similarity
Highly similar sequences from different loci were found, these loci have to be grouped under the same name.
#### 3.5.1 Build a similarity table
Retrieve the pairs of loci that need to be grouped. The file contains a list of pairs of query-subject.
```{bash}
cat filtered_different_query_and_subject_loci.csv | awk -F ',' '{
query = $1
subject = $2
sub(".+-","", query)
sub(".+-","", subject)
print query"," subject
}' | sort | uniq > pairs.csv
```
A same locus may have to be associated with more than one other locus. We thus need to retrieve all the loci associations, and print them in a similarity table (`selfblast_table.txt`).
For every query list of the pairs of loci:
- retrieve the other locus (loci) it matches to (subject), and arrange them into a list
- search if any of the locus in the list is already in the table `selfblast_table.txt`
- if none of the locus is in the similarity table, then append the list at the end of the table
- if any of the locus is found in the similarity table, then extract the matching list of loci in the similarity table, append the list to the extracted list, remove duplicates loci and replace the original list with the newly generated list
```{bash}
query_list=$(cat pairs.csv | awk -F ',' '{print $1}' | sort | uniq)
:> selfblast_table.txt
#query="5899"
for query in $query_list
do
echo $query
subject=$(grep -e $query, pairs.csv | awk -F ',' 'ORS = "\n" {print $2}' | sort -n | uniq | tr '\n' ',' | sed 's/,$/\n/')
echo $subject
to_search=$(echo $query,$subject | sed 's/,/|/g')
if ! egrep -qs -e $to_search selfblast_table.txt; then
echo Query or Subject not in the table
echo $query,$subject >> selfblast_table.txt
else
echo Query or Subject YET in the table
replacement=$(echo $(egrep -e $to_search selfblast_table.txt | sed -z 's/\n/,/g' | sed 's/.$/\n/'),$query,$subject | sed -e $'s/,/\\\n/g' | sort -n | uniq | tr '\n' ',' | sed 's/.$/\n/')
line_numbers=$(egrep -n -e $to_search selfblast_table.txt | sed 's/:.*//g')
for N in $line_numbers;
do
sed -i $N's/.*/'$replacement'/g' selfblast_table.txt
done
cat selfblast_table.txt | sort | uniq > selfblast_table.txt.temp
rm selfblast_table.txt
mv selfblast_table.txt.temp selfblast_table.txt
fi
done
```
#### 3.5.2 Associate the loci in the fasta files
Now the list of corresponding loci based on similarity has been established (`selfblast_table.txt`), we need to associate them in the `.FNA` and `.FAA` files.
Transform the table into a list of pairs of *`locusCustomID,correspondingLocus`* (`selfblast_table_paired.txt`).
```{bash}
:> selfblast_table_paired.txt
for line in $(cat selfblast_table.txt)
do
# echo $line
locus=$(echo $line | cut -d ',' -f 1)
echo $locus
others=$(echo $line | sed 's/'$locus',//g' | sed -e $'s/,/\\\n/g')
for other in $others
do
echo $locus,$other >> selfblast_table_paired.txt
done
done
```
Transform the comma-separated list into a tab-separated list (`replacement.txt`), that can be read by `seqkit replace` command.
```{bash}
:> replacement.txt
cat selfblast_table_paired.txt | grep ',[[:digit:]]\{4\}$' | awk -F ',' '{print $2"\t__"$2"__-"$1}' >> replacement.txt
cat selfblast_table_paired.txt | grep -v ',[[:digit:]]\{4\}$' | awk -F ',' '{print $2"\t"$1}' >> replacement.txt
```
Use `seqkit replace` to replace every *`correspondingLocus`* to its matching *`locusCustomID`*.
```{bash}
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key FULL_REFERENCE_prot.FAA | seqkit replace -w0 -p '-_' -r '_' > FULL_REFERENCE_prot_SB1.FAA
seqkit replace -w0 -p '-(.+)$' -r '-{kv}' -k replacement.txt --keep-key FULL_REFERENCE_nucl.FNA | seqkit replace -w0 -p '-_' -r '_' > FULL_REFERENCE_nucl_SB1.FNA
```
Finally sort the fasta files by *`locusCustomID`*.
```{bash}
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' FULL_REFERENCE_prot_SB1.FAA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > FULL_REFERENCE_prot_SB1.FAA.temp
sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' FULL_REFERENCE_nucl_SB1.FNA | seqkit sort -w0 | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > FULL_REFERENCE_nucl_SB1.FNA.temp
rm FULL_REFERENCE_prot_SB1.FAA FULL_REFERENCE_nucl_SB1.FNA
rename -v '.FAA.temp' '.FAA' ./*
rename -v '.FNA.temp' '.FNA' ./*
```
### 3.6 Final self-BLAST check
#### 3.6.1 Build BLAST database
```{bash}
cd $path_to_wd/full_reference_DB
makeblastdb -in FULL_REFERENCE_nucl_SB1.FNA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_nucl_BLAST_DB_FINAL -dbtype nucl -parse_seqids
makeblastdb -in FULL_REFERENCE_prot_SB1.FAA -out FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB_FINAL -dbtype prot -parse_seqids
```
#### 3.6.2 Run the self-BLAST check
```{bash}
blastp -task blastp-fast -db FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB_FINAL -query FULL_REFERENCE_prot_SB1.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > check_selfblast_full_ref_blastp_out.csv
cat check_selfblast_full_ref_blastp_out.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > check_different_query_and_subject_loci.csv
cat check_different_query_and_subject_loci.csv | csvtk filter2 -H -f '$3 > 60 && $4 > 50 && $14 > 50 && $12 > 100 && $13 > 0' > check_filtered_different_query_and_subject_loci.csv
```
The file `check_filtered_different_query_and_subject_loci.csv` resulting from the check self-BLAST is empty, meaning no sequence from a locus has similarity with a sequence of any other locus.
**The full set of reference sequences is ready** in`$path_to_wd/full_reference_DB`: [`FULL_REFERENCE_nucl_SB1.FNA`](full_reference_DB/FULL_REFERENCE_nucl_SB1.FNA.gz) and [`FULL_REFERENCE_prot_SB1.FAA`](full_reference_DB/FULL_REFERENCE_prot_SB1.FAA.gz)
# Step 2: align the Melasto689 template sequences to the set of reference
## 1. Prepare the raw target sequences set (Melasto689)
Download or copy the [original Melastomataceae 689 target sequence set](https://datadryad.org/stash/dataset/doi:10.5061/dryad.8931zcrm2), and decompress the `Template_sequences.zip` folder.
- De-align the sequences
- Remove all the ? characters
```{bash}
cd $path_to_wd/original_template_sequences
cd Template_sequences
sed -i '/^>/!s/-//g' *.fasta
sed -i '/^>/!s/\?//g' *.fasta
```
### 1.1 Clean the KT######.1 loci
#### 1.1.1 KT377070.1 and KT377086.1
KT377070.1 and KT377086.1 are each split into 2 different files (`KT377070.1_alignment.fasta` and `KT377070_alignment.fasta`, and `KT377086.1_alignment.fasta` and `KT377086-Affzeli.fasta`, respectively), supposedly representing 2 different "versions" of the genes with high divergence (Jantzen, pers. comm.). Actually, the alternative "version" is not highly divergent but in reverse complement of what it should be. Here we just merge the fasta files in order to have only 1 file per gene. We arbitrarily keep the ".1" suffix in the names of the genes.
**KT377070.1**
To avoid duplicates in fasta headers, we rename the sequence ">KT377070-Tibouchina" (from `KT377070_alignment.fasta`) to ">KT377070.1-Tibouchina2".
```{bash}
cat KT377070.1_alignment.fasta KT377070_alignment.fasta | sed 's/KT377070-Tibouchina/KT377070-Tibouchina2/g' | sed 's/KT377070-/KT377070.1-/g' > KT377070.1_custom.fasta
rm KT377070.1_alignment.fasta KT377070_alignment.fasta
```
**KT377086.1**
```{bash}
cat KT377086.1_alignment.fasta KT377086-Affzeli.fasta | sed 's/KT377086-/KT377086.1-/g' > KT377086.1_custom.fasta
rm KT377086.1_alignment.fasta KT377086-Affzeli.fasta
```
#### 1.1.2 Other KT######.1 loci
Modify the fasta headers of the template sequences for the KT######.1 loci so that all the headers have the ".1" suffix.
```{bash}
cat KT377102.1_alignment.fasta | sed 's/KT377102-/KT377102.1-/g' > KT377102.1_custom.fasta
cat KT377110.1_alignment.fasta | sed 's/KT377110-/KT377110.1-/g' > KT377110.1_custom.fasta
rm KT377102.1_alignment.fasta KT377110.1_alignment.fasta
```
### 1.2 Rename the sequence names
Need to rename the sequences names to stick to the naming convention.
```{bash}
ls -1 *.fasta | \
while read file
do
cat $file| sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > $file.FNA
done
rename '.fasta.FNA' '.FNA' *
rm *.fasta
```
### 1.3 Concatenate all the sequences into a single fasta file:
```{bash}
cat *.FNA > ../Melastomataceae_689_original.FNA
```
## 2. Associate each template sequence (or portion of) from Melasto689 to a locus from the full reference set
> The `sbatch` script corresponding to this section is: [`melasto689_on_full_reference.sh`](scripts/melasto689_on_full_reference.sh)
### 2.1 Prepare the paths and data
Define the paths to previously prepared databases.
```{bash}
cd $path_to_wd/melasto689_on_full_ref
path_to_Melasto689=$path_to_wd/original_template_sequences;
path_to_full_ref=$path_to_wd/full_reference_DB;
prot_DB=$path_to_full_ref/FULL_REFERENCE_BLAST_DB/FULL_REFERENCE_prot_BLAST_DB_FINAL
path_to_scripts=$path_to_scripts # define the path to the scripts
```
Copy the necessary files to the current directory.
```{bash}
scp -r $path_to_Melasto689 ./
scp -r $path_to_full_ref/FULL_REFERENCE_nucl.FNA ./
```
### 2.2 Remove the Angio353 template sequences from Melasto689
The Melasto389 probe set includes template sequences from the Angio353 dataset (what we call angio353, with a small **a**, in the glossary). They need to be removed beforehand. They are easily identifiable with their name in the form `XXXX-####` where `X` is a letter and `#` a digit.
```{bash}
cd $path_to_wd/melasto689_on_full_ref
seqkit grep -w 0 -n -r -p "[[:upper:]]{4}-" Melastomataceae_689_original.FNA > Angio353_from_Melasto689.FNA;
seqkit grep -w 0 -n -r -v -p "[[:upper:]]{4}-" Melastomataceae_689_original.FNA > Melasto689_trimed.FNA;
Melasto689="Melasto689_trimed.FNA"
seqkit seq -n Melasto689_trimed.FNA > template_sequences_list.txt
```
### 2.3 Align every template sequence to the reference and extract the hits
Each template sequence (TS) from Melasto689 (`Melasto689_trimed.FNA`) will be aligned to the set of references designed in Step 1. The aligner used here is `BLAST` (`blastx-fast`). The regions of the TS that match to a reference (the hits) are then extracted, the name of the reference locus is appended to them, and they are place into the `extracted_hits` folder. The non-matching regions are also separated into the `NMR_hits` folder, and the TS with no match (i.e. no portion of the TS matches a reference) are placed into the `nomatch_TS` folder.
Prepare the sub-folders.
```{bash}
cd $path_to_wd/melasto689_on_full_ref
mkdir extracted_hits
mkdir nomatch_TS
mkdir NMR_hits
```
For every TS in the list of template sequences:
- create a sub-directory and set it as current directory
- extract the TS from the whole Melasto689_trimmed file
- align the TS to the reference with `blastx-fast` and store the result into a table
- if the result table exists and is not empty, then:
- filter the results to keep only the matches with a bitscore greater than 50 and a query coverage greater than 50% (meaning at least 50% of the query covers the subject sequence), or matches with percentage identity greater than 60%, a query coverage greater than 10%, and a bitscore greater than 50
- if the filtered table exists and is not empty, then:
- for every line in the table (i.e. for every hit):
- retrieve the reference locus to which the hit aligns `refLocus`
- retrieve the frame in which the hit aligns to the reference `frame`
- print the matching subject (i.e. the name of the reference TS to which the hit matched) into the file `matching_subject-$refLocus.txt`. This file will be used later during step 2.5
- if the `frame` value is greater than 0 (i.e. the hit is not in reverse complement), then:
- set the hit into `.bed` format
- retrieve the sequence of the hit using `bedtools getfasta` and rename the sequence in order to have the old locus name separated with '__' (double underscore) and the new locus name (`refLocus`) appended at the end of the name
- the hit is saved in `.FNA` (`<hitID>_hit.FNA`)
- if the `frame` value is lower than 0 (i.e. the hit is in reverse complement), then:
- set the hit into `.bed` format
- retrieve the sequence of the hit using `bedtools getfasta`, reverse complement the sequence and rename the sequence in order to have the old locus name separated with '__' (double underscore) and the new locus name (`refLocus`) appended at the end of the name
- the hit is saved in `.FNA` (`<hitID>_hit.FNA`)
- copy the hits (`<hitID>_hit.FNA` files) into `$path_to_wd/extracted_hits`
- retrieve the query coverage (i.e. the percentage of query covered by the alignment)
- if the query coverage is lower than 90 (i.e. at least 10% of the TS did not align to any reference), then:
- merge all the `.bed` files to get the region(s) of the TS that did match
- set a `.bed` file for the whole TS
- extract the coordinates of region(s) of the TS that did not match (non-matching regions, NMR) with `bedtools complement`
- retrieve the sequence of the NMR using `bedtools getfasta`, filter out the sequences less than 30 nucleotides long, append 'NMR' to their locus name, and store them into `<TS_ID>NMR.FNA`
- copy the NMR (`<TS_ID>NMR.FNA` files) into `$path_to_wd/NMR_hits`
- else (i.e. the TS has no match):
- rename the TS to 'nomatch' and store it into `$path_to_wd/nomatch_TS`
- set the upper level directory as current directory
```{bash}
for TS in $(cat template_sequences_list.txt);
do
echo $TS
mkdir $TS
cd $TS
seqkit grep -w0 -nrp $TS ../Melasto689_trimed.FNA > $TS.FNA
blastx -task blastx-fast -matrix BLOSUM62 -db $prot_DB -query $TS.FNA -culling_limit 1 -evalue 1e-06 -ungapped -comp_based_stats F -outfmt "10 std qframe qcovs sframe qseq" > $TS"_blastx_out_unfiltered.csv"
if [ -s $TS"_blastx_out_unfiltered.csv" ]; then
cat $TS"_blastx_out_unfiltered.csv" | csvtk filter2 -H -f '$12 > 50 && $14 > 50 || $3 > 60 && $14 > 10 && $12 > 50' > $TS"_blastx_out.csv"
fi
if [ -s $TS"_blastx_out.csv" ]; then
number_of_seqs_BLASTED=$(cat $TS"_blastx_out.csv" | wc -l)
else
number_of_seqs_BLASTED=(0)
fi
if [ $number_of_seqs_BLASTED -gt 0 ]; then
for line in $(cat $TS"_blastx_out.csv");
do
# Retrieve the hit
refLocus=$(echo $line | awk -F ',' '{print $2}' | sed 's/.*-//g')
frame=$(echo $line | awk -F ',' '{print $13}')
echo $line | awk -F ',' '{print $2}' >> matching_subject-$refLocus.txt
if [ $frame -gt 0 ]; then # HIT IS REGULAR
hit=$(echo $line | awk -F ',' '{print $1":"$7-1"-"$8}' | sed 's/-/__/g' | sed 's/$/-'$refLocus'/g')
echo Hit is: $hit
echo "...(regular hit)"
echo $line | awk -F ',' '{print($1"\t"$7-1"\t"$8)}' | bedtools sort > $hit".bed"
bedtools getfasta -bed $hit".bed" -fi $TS.FNA | sed 's/-/__/g' | sed '/^>/s/$/-'$refLocus'/g' > $hit"_hit.FNA"
fi
if [ $frame -lt 0 ]; then # HIT IS RC
hit=$(echo $line | awk -F ',' '{print $1":"$8-1"-"$7}' | sed 's/-/__/g' | sed 's/$/-'$refLocus'/g')
echo Hit is: $hit
echo "... (reverse complemented hit)"
echo $line | awk -F ',' '{print($1"\t"$8-1"\t"$7)}' | bedtools sort > $hit".bed"
bedtools getfasta -bed $hit".bed" -fi $TS.FNA | seqkit seq -rp -t dna -v -w 0 | sed 's/-/__/g' | sed '/^>/s/$/-'$refLocus'/g' > $hit"_hit.FNA"
fi
done
cp *_hit.FNA ../extracted_hits
# Retrieve the NMR
qcovs=$(cat $TS"_blastx_out.csv" | csvtk summary -H -n0 -f 14:sum)
if [ $qcovs -lt 90 ]; then
echo ...NMR found for $TS, extract NMR...
cat *.bed | bedtools sort | bedtools merge > $TS"_merged_ALL.bed"
seqkit fx2tab -ln $TS".FNA" | awk '{print($1"\t"$2"\t"$2)}' | bedtools sort > $TS".bed"
bedtools complement -L -i $TS"_merged_ALL.bed" -g $TS".bed" > $TS"_nonmatching_regions.bed"
cat $TS"_nonmatching_regions.bed" >> ../TS_nonmatching_regions.bed
bedtools getfasta -bed $TS"_nonmatching_regions.bed" -fi $TS".FNA" | seqkit seq -g -m 30 -w 0 | sed 's/-/_/g' | sed '/^>/s/$/-NMR/g' > $TS"NMR.FNA"
if [ ! -s $TS"NMR.FNA" ]; then
echo ... NMR less than 30 nucleotides, discard it...
rm $TS"NMR.FNA"
else
cp *NMR.FNA ../NMR_hits
fi
fi
else
echo No matching hit for $TS
cat $TS".FNA" > $TS"_nomatch.FNA"
cp *_nomatch.FNA ../nomatch_TS
fi
cd ../
done
```
### 2.4 Merge the hits belonging to a same TS and a same locus
The extracted hits might or might not belong to a same TS and a same locus. This step will merge the extracted hits that do. This is important in order to keep a coherence in the cleaned TS. For example, a TS that has a hit from the position 1 to 30 and another hit from the position 60 to 90 has to be considered as a same TS and not two separated TS. Note that for a same TS and same locus, the hits might actually overlap. We will thus re-align them to the reference and draw a consensus sequence of the hits alignment.
In the folder `extracted_hits`, create a new subfolder `cleaned_hits` that will contain the merged hits.
```{bash}
cd $path_to_wd/melasto689_on_full_ref/extracted_hits
mkdir cleaned_hits
```
Initialize a table containing the matches between the `locusCustomID` and the corresponding Melasto689 locus name.
```{bash}
echo "locusCustomID, correspondingLoci" > $path_to_wd/loci_matching_table.csv
```
Draw the the list of loci (containing `locusCustomID` for the TS that matched to a reference in 2.3, and old locus name in case the TS had no match).
```{bash}
loci_list=$(ls -1 *.FNA | sed 's/.*-//g' | sed 's/_hit.FNA//g' | sort | uniq)
```
For every `locus` in the list of loci:
- create a subfolder for the locus
- copy the hits that belong to the locus in the subfolder
- set the subfolder as current directory
- retrieve the list of TS belonging to this locus `TS_list`
- retrieve the list of old loci from the TS belonging to this locus `TS_loci`
- write the locus and the corresponding old loci list into the table of matching loci
- extract the sequences corresponding to the locus from the reference set
- retrieve the names of these sequences and store them into a text file (later used by the [`align_translated_to_ref_and_draw_consensus.R`](scripts/align_translated_to_ref_and_draw_consensus.R) script)
- for every TS in the `TS_list`:
- retrieve its old locus name `TS_locus`
- retrieve the number of hits belonging to the TS and locus `nb_hits`
- if the number of hits is greater than 1, then:
- concatenate these hits files together and with the reference sequences
- draw a consensus sequence for these hits using the [`align_translated_to_ref_and_draw_consensus.R`](scripts/align_translated_to_ref_and_draw_consensus.R) R script
- remove the gaps from the consensus sequence and rename it so that the old locus name (`TS_locus`) is bordered with '__' and so that the `-locus` name is at the end of the sequence name
- if the number of hits is equal to 1, then:
- rename the hit sequence so that the `-locus` name is at the end of the sequence name
- copy the cleaned hits or merged hits into the `cleaned_hits`
- set the upper level directory as current directory
```{bash}
for locus in $loci_list
do
#echo $locus
mkdir $locus
cp *-$locus"_hit.FNA" $locus
cd $locus
TS_list=$(ls -1 *hit.FNA | sed 's/:.*//g' | sort | uniq)
TS_loci=$(ls -1 *hit.FNA | sed 's/:.*//g' | sed 's/.*__//g' | sort | uniq)
echo $locus, $TS_loci >> $path_to_wd/loci_matching_table.csv
seqkit grep -w0 -nrp $locus $path_to_wd/FULL_REFERENCE_nucl.FNA > $locus"_original.FNA"
seqkit seq -n $locus"_original.FNA" > IDs_to_remove_after_alignment.txt
#TS="Tetrazygia__NM_113708.2"
for TS in $TS_list
do
echo $TS
TS_locus=$(echo $TS | sed 's/.*__//g')
nb_hits=$(ls -1 $TS*_hit.FNA | wc -l)
if [ $nb_hits -gt 1 ]; then
echo "More than 1 hit found for "$TS". Draw consensus sequence."
cat $locus"_original.FNA" $TS*_hit.FNA > $TS"_hits_to_align.FNA"
# Align the hits to the reference (as translated amino acids) and build consensus upon aligned hits
Rscript $path_to_scripts/align_translated_to_ref_and_draw_consensus.R $TS"_hits_to_align.FNA" $TS"_hits_aligned_to_ref.FNA" IDs_to_remove_after_alignment.txt $TS"_hits_consensus" $TS"_hits_consensus.FNA"
# Remove the gaps to clean the consensus sequence
seqkit seq -g -w 0 $TS"_hits_consensus.FNA" | sed '/^>/s/_'$TS_locus'_/_'$TS_locus'__/g' | sed '/^>/s/$/'-$locus'/g' > $TS"_hits_consensus_clean-"$locus".FNA"
elif [ $nb_hits -eq 1 ]; then
echo "Only 1 hit found for "$TS"."
seqkit replace -w 0 -p ':.*' -r '__hit' $TS*_hit.FNA | sed '/^>/s/$/'-$locus'/g' > $TS"_hit_clean-"$locus".FNA"
fi
done
cp *clean-$locus.FNA ../cleaned_hits
cd ..
done
```
At the end of this step we have the folder `cleaned_hits` full of individual files with cleaned hits.
### 2.5 Add reference sequences to the final set
This step aims to add the relevant reference sequences (RS) to the final probe set. We will only add the RS from loci that matched the TS in 2.3.
Some loci have a very high number of transcriptome RS. Instead of appending them all to the final set, we just select the transcriptome RS that previously aligned to a TS.
Concerning the Angio353 RS, no selection was carried out and all of them were appended to the final set, if belonging to a locus present in the final set.
- create the `cleaned_sequences_set` folder
- concatenate all the cleaned hit `.FNA` files into the file `cleaned_hits_all.FNA` in this folder
```{bash}
cd $path_to_wd/melasto689_on_full_ref/extracted_hits/cleaned_hits
mkdir $path_to_wd/cleaned_sequences_set
cat *.FNA > $path_to_wd/cleaned_sequences_set/cleaned_hits_all.FNA
```
- create the `matching_subjects` subfolder in the `cleaned_sequences_set` folder
- fetch all the `matching_subject-$refLocus.txt` files and copy them into `$path_to_wd/cleaned_sequences_set/matching_subjects`
- remove the duplicate lines in every text file
```{bash}
cd $path_to_wd/melasto689_on_full_ref
mkdir cleaned_sequences_set/matching_subjects
find . -type f -name "matching_subject*.txt" -exec cp {} ./cleaned_sequences_set/matching_subjects \; # error messages are normal as the destination folder is embedded within the searching directory
cd cleaned_sequences_set/matching_subjects
for file in $(ls -1 *.txt)
do
cat $file | sort | uniq > $file.temp
done
rm *.txt
rename -v '.txt.temp' '.txt' *
```
- retrieve the list of loci from the concatenated clean hits file
- initialize the `.FNA` file that will contain the final sequences and concatenate cleaned hits in it (`final_sequences_set.FNA`)
- for every locus in the list of loci:
- retrieve the sequences that belong to this locus in the set of reference and get the number of those sequences that are from the trancriptome data (identified with 'scaffold' in their name)
- if the number of these transcriptome sequences is greater than 5, then:
- retrieve only the reference sequences that matched a TS in step 2.3 (whose ID are in the `matching_subject-locus.txt`) and append them to `final_sequences_set.FNA`
- retrieve all the reference sequences for this locus that come from Angio353 and if any was retrieved, append them to `final_sequences_set.FNA`
- else (i.e. number of transcriptome sequences is lower than 5):
- retrieve all the reference sequences for this locus and append them to `final_sequences_set.FNA`
```{bash}
cd $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
seqkit seq -n cleaned_hits_all.FNA | sed 's/.*-//g' | sort | uniq > loci_list.txt
cat cleaned_hits_all.FNA > final_sequences_set.FNA
for locus in $(cat loci_list.txt);
do
#echo $locus
nb_TTS=$(seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA | seqkit seq -n | grep 'scaffold' | wc -l) # nb transcriptomes TS
if [ $nb_TTS -gt 5 ]; then # if number of transcript (from 1KP) reference sequences is above 4 just select a few (ie the matching ones)
# retrieve the matching TS(s) for the locus (SWGX WWQZ sequences transcripts and/or Angio353 seqs)
#cat matching_subjects/*-$locus.txt
seqkit grep -w0 -nrf matching_subjects/*-$locus.txt ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
# retrieve the corresponding Angio353 sequences for the locus
# seqkit grep -w0 -nrp $locus ../FULL_REFERENCE_nucl.FNA | seqkit seq -n | grep 'scaffold'
seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA | seqkit seq -n | grep -v 'scaffold' > $locus"_TS_to_get.txt"
if [ -s $locus"_TS_to_get.txt" ]; then
seqkit grep -w0 -nrf $locus"_TS_to_get.txt" ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
fi
rm $locus"_TS_to_get.txt"
else
seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
fi
done
```
### 2.6 Append the Angio353 loci and remove duplicates
Get the list of names of Angio353 loci that were included in the Melasto689 (i.e. the list of angio353 loci) using file generated in 2.2.
```{bash}
cd $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
seqkit seq -n ../Angio353_from_Melasto689.FNA | sed 's/.*-//g' | sort | uniq > Angio353_from_Melasto689_list.txt
```
For every locus in the list of angio353 loci:
- retrieve the sequences from the reference set that belong to the locus (i.e. either having it as locus name or having it in their TS name) and store them in `final_sequences_set.FNA`.
```{bash}
for locus in $(cat Angio353_from_Melasto689_list.txt);
do
#echo $locus
seqkit grep -w0 -nrp -$locus ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
seqkit grep -w0 -nrp "__"$locus"__" ../FULL_REFERENCE_nucl.FNA >> final_sequences_set.FNA
done
```
Remove the duplicate sequences based on their name and sequence.
```{bash}
seqkit rmdup -w0 -D name_duplicates.txt -n final_sequences_set.FNA > final_sequences_set_rmdups.FNA
seqkit rmdup -w0 -D sequence_duplicates.txt -s final_sequences_set_rmdups.FNA > final_sequences_set.FNA
```
### 2.7 Finalization
Sort the sequences by locus name in the final set.
```{bash}
cd $path_to_wd/melasto689_on_full_ref/cleaned_sequences_set
cat final_sequences_set.FNA | seqkit rmdup -n | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' | seqkit sort -w0 -n | sed -E '/^>/s/(\w*\.*\w*)-(\w*\.*\w*)/\2-\1/' > final_sequences_set.FNA.temp
rm final_sequences_set.FNA
```
Modify the names of some TS to limit the length of the names.
```{bash}
seqkit replace -w0 -p 'hits_consensus' -r 'hc' final_sequences_set.FNA.temp > final_sequences_set.FNA.temp2
seqkit replace -w0 -p 'transcriptome_combo' -r 'tr_combo' final_sequences_set.FNA.temp2 > final_sequences_set.FNA
rm -r *temp*
```
Retrieve the TS names in the final set.
```{bash}
cat final_sequences_set.FNA | seqkit seq -n > TS_full_list.txt
```
### 2.8 self-BLAST check
#### 2.8.1 Build BLAST database
```{bash}
cd $path_to_wd/melasto689_on_full_ref
mkdir BLAST_DB
makeblastdb -in cleaned_sequences_set/final_sequences_set.FNA -out BLAST_DB/final_sequences_set_nucl -dbtype nucl -parse_seqids
makeblastdb -in cleaned_sequences_set/final_sequences_set.FAA -out BLAST_DB/final_sequences_set_prot -dbtype prot -parse_seqids
```
#### 2.8.2 Run the self-BLAST check
```{bash}
blastp -task blastp-fast -db BLAST_DB/final_sequences_set_prot -query cleaned_sequences_set/final_sequences_set.FAA -evalue 1e-06 -outfmt "10 std qframe qcovs qseq" -num_threads $SLURM_CPUS_PER_TASK > selfblast_final_sequences_set_blastp_out_SB.csv
cat selfblast_final_sequences_set_blastp_out_SB.csv | awk -F ',' 'OFS = "," {
temp1 = $1
temp2 =$2
sub(".+-","", temp1)
sub(".+-","", temp2)
if (temp1 != temp2)
print $0
}' > different_query_and_subject_loci_SB.csv
cat different_query_and_subject_loci_SB.csv | csvtk filter -H -f'3>60' > filtered_different_query_and_subject_loci_SB.csv
echo "Done Self-Blast."
```
The file `filtered_different_query_and_subject_loci_SB.csv` resulting from the check self-BLAST contains a few sequences from different two pairs of loci having high similarity.
For the loci LOC12198 and sLOC23791 it concerns reference sequences coming from Angio353 references (QSKP__6363 and AEPI__6320), but the sequences are similar over less than 20% of their length.
For the loci LOC20331 and LOC22857, it concerns reference sequences coming from Angio353 and a TS from Melasto689. Their is relatively high similarity (ca. 62%) over ca. 40% of the TS from Melasto689, but over less than 10% for the reference sequences.
Given the short region of similarity, here we decided to keep them separated.
### 2.9 Remove transcriptome sequences that are supersets of mega353 sequences
The extension of the Angio353 set to the mega353 set appended some sequences (actually part of sequences) from the SWGX and WWQZ transcriptomes in the full reference set (Step 1: 2). These same transcriptomes sequences were also appended in full to the full reference set (Step 1: 1). Because of that, some sequences from the mega353 are subsets of sequences coming from the transcriptomes. To avoid redundancy, we remove the transcriptomes sequences that are supersets of mega353 sequences.