2.2.1 GC Content

In this course we will write functions to analyze the GC content of two viruses. You can download the sequences from the following links: dengue and zika. Right-click each link, choose “Save as…”, and save the file in the Bioinformatics/workspace folder on the desktop, as always. Remember to load the seqinr library with library(seqinr).

1 GC Content

As seen in the lecture, an A-T base pair is held together by two hydrogen bonds, while a G-C base pair has three. The G-C pair is more thermostable and allows more stable stacking along the DNA strand. As a result, DNA with low GC content is less stable than DNA with high GC content, especially at high temperatures.

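The GC content is calculated as:

$$\mathrm{GC} = \frac{G + C}{A + C + G + T}$$

where A, C, G and T are the number of times each nucleotide appears in the sequence.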

We will define a function that calculates the GC content from nucleotide counts. The data will have the following format, as obtained from the count function:

a    c    g    t 
3426 2240 2770 2299 
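
To make this format concrete, the sketch below builds a table of the same shape using only base R. The short sequence toySeq is made up for illustration and is not one of the course files, and table() stands in here for seqinr's count(), which is assumed to return the same kind of named counts for single nucleotides.

```r
# A made-up toy sequence (hypothetical; not from dengue.fasta or zika.fasta),
# stored as a vector of single lowercase characters
toySeq <- c("a", "t", "g", "g", "c", "a", "t", "c", "g", "a")

# Count each nucleotide; fixing the levels keeps the order a, c, g, t
# and reports a zero for any base that is absent
toyFreq <- table(factor(toySeq, levels = c("a", "c", "g", "t")))
print(toyFreq)

# Individual counts can be looked up by name
toyFreq["g"]
```

Looking up counts by name, as in the last line, is the only thing the function defined next relies on.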

With counts in this format, we can define:

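myGC <- function(data) {
    # data: named nucleotide counts, as returned by count(sequence, 1)
    cCount <- data["c"]
    gCount <- data["g"]
    total <- sum(data)
    # fraction of nucleotides that are g or c
    (cCount + gCount) / total
}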
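
As a quick sanity check, the function can be applied to the example counts shown above, typed in by hand as a named vector. The names exampleCounts and exampleGC are assumptions made for this sketch. The function is repeated with the same logic so that the block runs on its own, without seqinr or any sequence file.

```r
# Same logic as the function above, repeated so this block runs on its own
myGC <- function(data) {
    cCount <- data["c"]
    gCount <- data["g"]
    total <- sum(data)
    (cCount + gCount) / total
}

# Example counts copied by hand from the table above (hypothetical variable name)
exampleCounts <- c(a = 3426, c = 2240, g = 2770, t = 2299)

# (2240 + 2770) / (3426 + 2240 + 2770 + 2299) = 5010 / 10735
exampleGC <- myGC(exampleCounts)
round(unname(exampleGC), 4)  # 0.4667
```

The result keeps the name "c" from the first count used in the sum, which is why unname() is applied before printing.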

We will use this to calculate the GC content for dengue:

> dengue <- read.fasta("dengue.fasta")[[1]]
> dengueFreq <- count(dengue,1)
> dengueGC <- myGC(dengueFreq)
> dengueGC
0.4666977 

Therefore, the dengue sequence has a GC content of about 47%. Do the same for zika on your own.

1.1 Using seqinr

The seqinr library has a function, GC, that calculates the GC content of a sequence directly:

> GC(dengue)
0.4666977
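
For intuition about what such a function computes, the GC fraction can also be obtained directly from a vector of characters in base R. This is only a sketch and not seqinr's implementation: the helper name simpleGC and the toy sequence are made up for illustration, and the sketch assumes the sequence contains only the lowercase letters a, c, g and t (no ambiguous bases such as n).

```r
# Hypothetical helper: fraction of positions that are g or c.
# Assumes a vector of single lowercase characters containing only a, c, g, t.
simpleGC <- function(sequence) {
    mean(sequence %in% c("g", "c"))
}

# Made-up toy sequence (not one of the course files)
toySeq <- c("a", "t", "g", "g", "c", "a", "t", "c", "g", "a")
simpleGC(toySeq)  # 5 of the 10 bases are g or c, so 0.5
```

On a sequence loaded with read.fasta, this should agree with myGC and GC as long as the sequence contains only unambiguous bases; if other symbols are present, the three approaches may treat them differently.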