2.3.2 Codon frequencies and finding genes
1 Introduction
Today we will be studying codon frequencies, and automatically find genes in DNA sequences. You can download the sequences by clicking the following links: dengue and zika. Right click it, and choose “Save as…” and choose the Bioinformatics/workspace
folder that is in the desktop, as always.
After, copy and paste the following code in the RStudio console, which we will use during the class.
library(seqinr)
library(Biostrings)
# Load vector of nucleotides from file
<- function(file) {
readDNAString DNAString(toupper(c2s(read.fasta(file)[[1]])))
}
# Start and end codons
= "M"
start = "*"
end
# Translate a sequence of bases into codons
<- function(data, index) {
translateAndRegularize <- (length(data)- index+1) %/% 3 * 3
l <- data[index:(l+index-1)]
reg ::translate(reg,no.init.codon = TRUE)
Biostrings
}
# Get first element satisfying constraint
<- function(s, es) {
firstCompatible if(length(es) >= 1) {
for (i in 1:length(es)) {
<- es[i]
e if(e > s) {
return(e)
}
}
}return(NA)
}
# Get possible genes for a sequence
<- function(dna) {
getPossibleGenes <- Biostrings::matchPattern(start, dna)@ranges@start
starts <- Biostrings::matchPattern(end, dna)@ranges@start
ends <- c()
sequences for(i in 1:length(starts)) {
<- firstCompatible(starts[i], ends)
startEnd if(!is.na(startEnd)) {
<- c(sequences,dna[starts[i]:startEnd])
sequences
}
}
sequences
}
# Get all possible genes, using the three frames
<- function(data) {
getAllPossibleGenes <- getPossibleGenes(translateAndRegularize(data,1))
l1 <- getPossibleGenes(translateAndRegularize(data,2))
l2 <- getPossibleGenes(translateAndRegularize(data,3))
l3 <- c(l1,l2,l3)
gs <- sapply(gs,length)
len order(len)]
gs[
}
# Plot frequencies
<- function(data) {
frequencies barplot(table(strsplit(as.character(data),split="")),cex.names=0.6)
}
After that, load the dengue and zika sequences:
<- readDNAString("dengue.fasta")
dengue <- readDNAString("zika.fasta") zika
2 Codon frequencies
Previously, we studied base sequences:
frequencies(dengue)
The same can be done for codons. However, there are three possible positions for codons, so we need to graph these thee:
> frequencies(translateAndRegularize(dengue,1))
> frequencies(translateAndRegularize(dengue,2))
> frequencies(translateAndRegularize(dengue,3))
The first one looks like this:
2.1 Questions
- Look at the graphs for positions 1, 2 and 3. Do they look the same?
- Do the same four graphs (one bases and three for codons) for the zika virus.
3 Gene finding
In the previous classes we learned about the simple method for looking for genes in a sequence. This method is implemented in the getAllPossibleGenes
function. For example:
> getAllPossibleGenes(DNAString("ATGTAG"))
1]]
[[2-letter AAString object
: M* seq
The advantage of this function is that we can do that for larger sequences:
> dengueGenes <- getAllPossibleGenes(dengue)
> dengueGenes
Each identified sequence has a different size, and as discussed during the letter, this is one of the main characteristics used to discern between false and true genes. The resulting list is sorted, look at the sizes for each sequence.
One can plot sizes with:
hist(sapply(dengueGenes,length),100,main="ORF Lengths",xlab="Length")
This plot makes it simpler to draw a boundary between sequences that are two small and those who have a sufficient size.
3.1 Questions
- How many sequences have a size bigger than 100? Show
dengueGenes
and take a look at array indices. - Repeat this analysis for the zika virus.