-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path4_regex.Rmd
382 lines (267 loc) · 9.07 KB
/
4_regex.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
---
title: |-
R Crash Course \
Part 4 -- Regular Expressions in R
author: "Rob Colautti"
---
# 1. Overview
Regular Expressions
* aka 'regex' and 'regexp'
* a sort of find-and-replace for nerds
* one of the most powerful data tools I have ever learned
* requires patience and lots of **practice**
# 2. Basic Regex Functions in R
## Four Basic Regex Functions in R:
`grep()` and `grepl()` are equivalent to 'find' in your favorite word processor
* General form: `grep("find this", in.this.object)`
`sub()` and `gsub()` are equivalent to 'find and replace'
* General form: `grep("find this", "replace with this", in.this.object)`
## Three Advanced Regex Functions in R:
`regexpr()` provides more detailed info about the first match
`gregexpr()` provides more detailed results about all matches
More examples [here](https://bioinfomagician.wordpress.com/2013/11/03/regular-expression-tutorial-2-commands-in-r/)
## Start with a simple data frame of species names:
```{r}
Species<-c("petiolata", "verticillatus", "salicaria", "minor")
print(Species)
```
## `grep()` -- returns cell addresses matching query
```{r}
grep("a",Species)
```
## `grepl()` -- returns T/F associated with
```{r}
grepl("a",Species)
```
## `sub()` -- replaces first match (in each cell)
```{r}
sub("l","L",Species)
```
## `gsub()` -- replaces all matches
```{r}
gsub("l","L",Species)
```
# 3. Wildcards
## `\` escape character
## `\\` in R, double-escape is usually needed (because first `\` is used to escape special characters in R, and `\` is itself a special character that needs to be escaped to pass through the function")
## `\\w`= all letters and digits (aka 'words')
```{r}
gsub("w","*","...which 1-100 words get replaced?")
gsub("\\w","*","...which 1-100 words get replaced?")
```
## `\\W` = non-word and non-number (inverse of `\\w`)
```{r}
gsub("\\W","*","...which 1-100 words get replaced?")
```
## `\\s` = spaces
```{r}
gsub("\\s","*","...which 1-100 words get replaced?")
```
## `\\t` = tab character (useful for tab-delimited data)
```{r}
gsub("\\t","*","...which 1-100 words get replaced?")
```
## `\\d` = digits
```{r}
gsub("\\d","*","...which 1-100 words get replaced?")
```
## `\\D` = non-digits
```{r}
gsub("\\D","*","...which 1-100 words get replaced?")
```
## __Two more special wildcards:__
## `\\r` = carriage return
## `\\n` = newline character
Unix/Mac files -- lines usually end with `\\n` only
Windows/DOS files -- lines usually end with `\\r\\n`
# 4. Special characters:
## `|`= 'or'
Example, look for w or e
```{r}
gsub("w|e","*","...which 1-100 words get replaced?")
```
## `.` = any character except new line
```{r}
gsub(".","*","...which 1-100 words get replaced?")
```
So how to search for a period? Use the escape character
```{r}
gsub("\\.","*","...which 1-100 words get replaced?")
```
## Use *, ? + and {} for more complicated searches
## Look at these examples carefully
```{r}
sub("\\w","*","...which 1-100 words get replaced?")
gsub("\\w","*","...which 1-100 words get replaced?")
```
## `+` = 1 or more matches
```{r}
sub("\\w+","*","...which 1-100 words get replaced?")
gsub("\\w+","*","...which 1-100 words get replaced?")
```
## `?` is a 'lazy' search (match the shortest possible)
```{r}
sub("\\w?","*","...which 1-100 words get replaced?")
gsub("\\w?","*","...which 1-100 words get replaced?")
```
## `*` is a 'greedy' search (match the largest possible)
```{r}
sub("\\w*","*","...which 1-100 words get replaced?")
gsub("\\w*","*","...which 1-100 words get replaced?")
```
> NOTE: `?` and `*` can include matches with a length of zero, whereas `+` matches 1 or more. This can be tricky to think about.
## Combine `?` with `+`
This can avoid zero matches. Compare with above.
```{r}
sub("\\w+?","*","...which 1-100 words get replaced?")
gsub("\\w+?","*","...which 1-100 words get replaced?")
```
> Try using `+*`. Why do you get an error message?
## `{n,m}` = between n to m matches
```{r}
gsub("\\w{3,4}","*","...which 1-100 words get replaced?")
```
## `{n}` = exactly n matches
```{r}
gsub("\\w{3}","*","...which 1-100 words get replaced?")
```
## `{n,}`= n or more matches
```{r}
gsub("\\w{4,}","*","...which 1-100 words get replaced?")
```
# 5. List range of options using []
## Find everything in square brackets
```{r}
gsub("[aceihw-z]","*","...which 1-100 words get replaced?")
```
## Find everything in square brackets occurring 1 or more times
```{r}
gsub("[aceihw-z]+","*","...which 1-100 words get replaced?")
```
# 6. ^Start and end of line$
## ^ Start of line
Find species starting with "s"
```{r}
grep("^s",Species)
```
### IMPORTANT: ^ Also 'negates' when used within []
Find species containing any letter other than s
```{r}
grep("[^a]",Species)
```
Replace every letter except s
```{r}
gsub("[^a]","*",Species)
```
## $ End of line
Find species ending with "a"
```{r}
grep("a$",Species)
```
# 7. Capture text with brackets `()`
Capture text using `()` and reprint using `\\1`, `\\2`, etc
Replace each word with its first letter
```{r}
gsub("(\\w)\\w+","\\1","...which 1-100 words get replaced?")
```
Pull out only the numbers and reverse their order
```{r}
gsub(".*([0-9]+)-([0-9]+).*","\\2-\\1","...which 1-100 words get replaced?")
```
Reverse first two letters of each word
```{r}
gsub("(\\w)(\\w)(\\w+)","\\2\\1\\3","...which 1-100 words get replaced?")
```
# 8. 'Scraping' data from online sources
Use `readLines` with `curl()` to 'scrape' data from the internet, then `grep()` and `gsub()` into useful structure.
```{r}
library(curl)
Prot<-readLines(curl("http://www.rcsb.org/pdb/files/1ema.pdb"))
Prot[grep("TITLE",Prot)]
```
> HINT: You may have to use install.packages("curl") the first time you run this, if you don't already have it installed.
# EXAMPLE 1: Genbank Record
In this example, we'll work with a download of the 18S rRNA gene from *Lythrum salicaria* You manually:
1. Open [Lythrum 18S rRNA gene](https://www.ncbi.nlm.nih.gov/nuccore/7595475/) in a web browser
2. Look for as a 'Send To:' button and choose 'File'
3. Choose GenBank format and save to your computer (e.g. you current project folder for this tutorial)
4. You should have a file called `sequence.gb` -- this is just a text file with the `.gb` extention to tell you it's a genbank layout.
5. Import the file into R as an object using the `scan` function
```{r}
Lythrum.18S<-scan("./sequence.gb",what="character",sep="\n")
```
The 18S gene s equence for *Lythrum salicaria* (from Genbank)
```{r}
print(Lythrum.18S)
```
Notice that each line is read in as a separate cell in a vector, with sequences beginning with a number ending with 1. We can take advantage of this to extract just the sequence data
Use `.*` with `()` to delete everything before the DNA sequence
```{r}
Seq<-gsub(".*(1 [gatc])","",paste(Lythrum.18S))
paste(Seq,collapse="")
```
Use the `.*` and space with `+` to eliminate all text before the sequence :
```{r}
Seq<-gsub(".*ORIGIN +","",paste(Seq,collapse=""))
Seq
```
Elimminate spaces and the two `//` at the end
```{r}
Seq<-gsub(" |//","",Seq)
Seq
```
Capital letters look nicer, but requires a PERL qualifier `\\U` that is not standard in R
```{r}
Seq<-gsub("([actg])","\\U\\1",Seq,perl=T)
print(Seq)
```
Look for start codons?
```{r}
gsub("ATG","-->START<--",Seq)
```
Open reading frames?
```{r}
gsub("(ATG([ATGC]{3})+?(TAA|TAG|TGA)+?)","-->\\1<--",Seq)
```
Or go back, and keep non-reading frame in lower case
```{r}
Seq<-gsub(".*ORIGIN","",Lythrum.18S)
Seq<-gsub("[\n0-9 /]+","",Seq) ## Note the single \ in this case since it is inside the square brackets
gsub("(atg([atgc]{3})+?(taa|tag|tga)+?)","\\U\\1",Seq,perl=T)
```
# EXAMPLE 2: Organizing Data
Imagine you have a repeated measures design. 3 transects (A-C) and 3 positions along each transect (1-3)
```{r}
Transect<-data.frame(Species=1:20,A1=rnorm(20),A2=rnorm(20),A3=rnorm(20),B1=rnorm(20),B2=rnorm(20),B3=rnorm(20),C1=rnorm(20),C2=rnorm(20),C3=rnorm(20))
head(Transect)
```
You want to look at only transect A for the first 3 species
```{r}
Transect[1:3,grep("A",names(Transect))]
```
Or look at the first position of each transect for the first 3 species
```{r}
Transect[1:3,grep("1",names(Transect))]
```
Or rows with species IDs containing the number 2
```{r}
Transect[grep("2",Transect$Species),]
```
# PRACTICE EXERCISES
## 1. Consider a vector of email addresses scraped from the internet:
* robert 'dot' colautti 'at' queensu 'dot' ca
* chris.eckert[at]queensu.ca
* lonnie.aarssen at queensu.ca
Use regular expressions to convert all email addresses to the standard format: [email protected]
## 2. Create a random sequence of DNA:
```{r,results="hide"}
My.Seq<-sample(c("A","T","G","C"),1000,replace=T)
```
* Replace T with U
* Find all start codons (AUG) and stop codons (UAA, UAG, UGA)
* Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both sequences are in the same reading frame?)
* Count the length of bp for all open reading frames
## 3. More online examples
[http://regex.sketchengine.co.uk/extra_regexps.html](http://regex.sketchengine.co.uk/extra_regexps.html)
## 4. Regex Golf
Have fun! [LINK](https://alf.nu/RegexGolf)