2.2.2 Subseries GC Content
In this course we will write functions to analyze the GC content of subsequences of a sequence. You can download the sequences right clicking the following link: dengue and choosing “Save as…”, saving in choose the Bioinformatics/workspace
folder that is in the desktop, as always.
Before starting, load the data:
library(seqinr)
dengue <- read.fasta("dengue.fasta")[[1]]
1 GC Content
In the last lecture we defined the GC content of a sequence manually, using the frequencies of G and C in the sequences. The library seqinr
has a function that automatically calculates the GC content for any sequence.
> GC(dengue)
0.4666977
We will use this function instead today.
2 Selecting windows
Different parts of the DNA strand behave differently. It is useful to be able to select subparts of it. For example, to select the first 10 nucleotides:
> dengue[1:10]
[1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t"
To select n
nucleotides, starting from the position s
, the operation to be used is dengue[s,(s+n-1)]
.
3 Choosing some windows
Let’s analyze the data for two windows of size 100. We will get the GC content for each slice, and save it in an vector.
#reate and empty vector
gc <- c()
#ave the first element
gc[1] <- GC(dengue[1:100])
#ave the second element
gc[2] <- GC(dengue[101:200])
Now we have a vector:
> gc
[1] 0.40 0.48
Which we can plot with barplot(gc)
. But how can we do this for all intervals of a given size in a sequence?
4 Getting all the intervals
We want to obtain an array such as c(1,101,201,301...)
, corresponding to the start points for the intervals. but that is too long to write by hand. We will use instead the function seq(start,end,by=size)
where start
is the first element we want, end
the last one, and size
the size of the interval between elements.
In the example above, we want:
size(1,length(dengue)-100,by=100)
5 Going through all the elements
To go through all the elements, we will use a for
loop. For example, to print all the elements in the vector 1:100
:
for(i in 1:100) {
print(i)
}
6 Final result
Putting everything together, we can write a function that results into a vector containing the GC content of each piece of the sequence.
windowGC <- function(data, window) {
starts <- seq(1,length(data)-window,by=window)
gc <- c()
for(i in 1:length(starts)) {
start <- starts[i]
end <- start+window-1
gc[i] <- GC(data[start:end])
}
gc
}
To calculate for windows of size 1000:
> windowGC(dengue,1000)
[1] 0.477 0.453 0.464 0.441 0.471 0.470 0.488 0.470 0.438 0.471
> barplot(windowGC(dengue,1000))
The function above can be used to compare different window sizes. By decreasing the window size, we can see the GC content with more resolution. Try generating each one of these graphs.
Looking at this graph, what part of the DNA would break first when heating it?