2.1.3 Gene Sequences
As mentioned in the lecture, there are many places where one can find sequenced genes. One of these is the NCBI Sequence Database which can be accessed via its website.
1 Finding gene sequence data
We are going to look for the data of the dengue virus, a mosquito transmitted virus which causes the dengue fever in tropical humid areas. For this, we look for the NC_001477
sequence in the search bar, and download it using the Download button. It must be saved in the Bioinformatics/workspace
folder with the name dengue.fasta
.
If you open this file in notepad, you will see the following:
After a line of descriptions, you should have the sequence of nucleotides found in the Dengue genome. Make sure to not change this file.
2 Using this file in R
By default, R doesn’t know how to load genome data. But other people wrote code that can load this data, and we can take advantage of this. To install the seqinr
type in console the following:
install.packages("seqinr")
This only needs to be done once. Once the library is installed in the system, you can load it with:
library("seqinr")
And finally, load the data inside the dengue
variable using:
dengue <- read.fasta("dengue.fasta")
You can check the values in this variable by typing dengue
. This will contain all the data set. To get only the nucleotides, we can do:
dengueseq <- c(dengue$NC_001477.1)
this will result in a vector whose entries are the nucleotides. To check the entry, type denqueseq
.
3 Simple Statistics
Now that we have the data in hand, what can we do with it? The first thing we can do is check nucleotide statistics. Let’s count how many of each of A, C, G and T we have.
> n1 <- count(dengueseq,1)
> n1
a c g t
3426 2240 2770 2299
As you can see, the quantity of each nucleotide is not the same. We can see this more clearly with a barplot:
barplot(n1,main="Nucleotides")
Individual elements of the vector can be accessed either by position or by name:
n1[1]
n1['aa']
Since the first is more explicit, it is often preferred.
These same statistics can be done for nucleotide pairs: AA , AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT:
> n2 <- count(dengueseq,2)
> n2
aa ac ag at ca cc cg ct ga gc gg gt ta tc
1108 720 890 708 901 523 261 555 976 500 787 507 440 497
tg tt
832 529
and plot them again:
barplot(n2,main="Nucleotide Pairs",cex.names=0.8)
4 Exercises
Manually, calculate the nucleotide statistics and draw the histogram for the following sequences.
ACTCGACGGT
CGCATTGTAG
ATATCGTAAT
CCGTGGGCCT
TCCGGGTGCC
TAATGCTATA
Are the histograms that are identical? If yes, what is the relation between them? Is this relevant?