1.1 Bioinformatics Basics

1 Introduction

1.1 What is bioinformatics?

Bioinformatics is the use of computer methods for analyzing biological data.

Nowadays, it refers mostly to the computational analysis of gene sequences.

1.2 Bases

DNA is composed of four bases: adenine (A), guanine (G), cytosine (C) and thymine (T).

In DNA, bases match in two pairs: G-C and A-T.
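As a small illustration of base pairing, the sketch below builds the complementary strand of a short sequence in base R. It assumes the standard pairs (A with T, G with C); the function name `complement_strand` and the example sequence are made up for this illustration.

```r
# Hypothetical helper: replace every base by its pairing partner
# (A <-> T, G <-> C). Input is a single DNA string.
complement_strand <- function(dna) {
  chartr("ACGT", "TGCA", toupper(dna))
}

strand <- "ATGC"                       # invented example sequence
partner <- complement_strand(strand)
print(partner)                         # "TACG"
```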

1.3 Base sequences

DNA is a sequence of such bases. The bases of DNA determine the RNA it can encode, and therefore the proteins it can induce the production of.
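The path from DNA to RNA to protein can be sketched in a few lines of base R. This is only an illustration: the helper names are hypothetical, the input is assumed to be the coding strand, and the codon table is deliberately partial (it contains just the four codons used in the example).

```r
# Transcribe the coding strand of DNA into mRNA (T is replaced by U).
transcribe <- function(dna) chartr("T", "U", toupper(dna))

# Split an RNA string into consecutive codons (triplets), dropping leftovers.
split_codons <- function(rna) {
  n <- nchar(rna) %/% 3
  if (n == 0) return(character(0))
  starts <- seq(1, by = 3, length.out = n)
  substring(rna, starts, starts + 2)
}

# Partial codon table, just enough for the example below (an assumption
# of this sketch; the full genetic code has 64 codons).
codon_table <- c(AUG = "Met", UUU = "Phe", GGC = "Gly", UAA = "Stop")

rna <- transcribe("ATGTTTGGCTAA")      # invented example sequence
codons <- split_codons(rna)
protein <- unname(codon_table[codons])
print(rna)                             # "AUGUUUGGCUAA"
print(protein)                         # "Met" "Phe" "Gly" "Stop"
```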

2 DNA sequencing

2.1 What DNA sequencing does

DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions (i.e. clusters of genes or operons), full chromosomes, or entire genomes of any organism.

2.2 History

2.3 Maxam-Gilbert sequencing

DNA labelled at one end is chemically broken at specific bases. By comparing the positions of the break points, one can read off the sequence. This method is rarely used anymore.

2.4 Chain-termination methods

The DNA is amplified (as in PCR) into many partial copies, each ending with a fluorescent chain-terminating marker. The copies are then sorted by length and the markers are read in order.

2.5 Shotgun sequencing

Break the DNA into fragments, sequence each fragment as above, and assemble the fragments using their overlaps.
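The assembly step can be sketched as follows for two fragments. This is a minimal illustration in base R, assuming error-free reads, a known fragment order and a single strand; real assemblers have to cope with none of these being true. The function names and sequences are invented.

```r
# Length of the longest suffix of `a` that is also a prefix of `b`.
overlap_length <- function(a, b) {
  max_len <- min(nchar(a), nchar(b))
  for (len in rev(seq_len(max_len))) {
    suffix <- substr(a, nchar(a) - len + 1, nchar(a))
    prefix <- substr(b, 1, len)
    if (suffix == prefix) return(len)
  }
  0
}

# Merge two fragments, writing the overlapping part only once.
merge_fragments <- function(a, b) {
  len <- overlap_length(a, b)
  paste0(a, substr(b, len + 1, nchar(b)))
}

contig <- merge_fragments("ATGCGTAC", "GTACCTGA")
print(contig)                          # "ATGCGTACCTGA"
```

More than two fragments that are already in order can be chained with `Reduce(merge_fragments, fragments)`.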

2.6 Price

The price dropped from $2,400,000 per million bases to $5-$100, depending on the technique.

Nowadays, a human genome can be sequenced for less than $1000.

3 Applications

3.1 What is done with DNA sequences?

DNA sequencing is a key technology in biology, as well as in medicine, forensics, and anthropology.

3.2 Molecular biology

It is used to study genomes and the proteins they encode.

It allows researchers to identify changes in DNA and their associations with diseases and phenotypes.

3.3 Evolutionary biology

DNA is transmitted from one generation to another, so sequencing is used to study how different organisms are related and how they evolved.

In 2021, DNA from a mammoth over a million years old was sequenced.

3.4 Metagenomics

Metagenomics identifies organisms present in a body of water, sewage, dirt, swab samples from organisms…

DNA sequencing determines which types of microbes may be present in such a microbiome. This is important for ecology, epidemiology…

3.5 Virology

Sequencing is one of the main tools in virology to identify and study viruses, whether they are based on DNA or RNA.

There are more than 2.3 million unique viral sequences in GenBank.

3.6 Medicine

Medical technicians may sequence genes from patients to determine if there is risk of genetic diseases.

DNA sequencing can also be used to diagnose and treat rare genetic diseases.

DNA sequencing can identify specific bacteria, allowing for more precise antibiotic treatments.

3.7 Forensic investigation

DNA sequencing is used for forensic identification and paternity testing.

The DNA patterns found in fingerprints, saliva, hair follicles, etc. distinguish each living organism from another.

4 Bioinformatics

4.1 General idea

DNA sequences are very long. Manual analysis is prohibitively slow, except for very short sequences.

Informatics allows large-scale work to be done.

4.2 DNA sequencing

DNA sequencing is still a non-trivial problem as the raw data may be noisy or affected by weak signals. Algorithms have been developed for base calling.
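To give an idea of what base calling means, here is a toy sketch in base R. It assumes the raw data is a matrix with one signal intensity per base channel at each position, calls the strongest channel, and reports "N" (unknown base) when the signal is too weak. The numbers, the threshold and the function name are invented; real base callers are far more elaborate.

```r
# Toy signal matrix: one row per position, one column per base channel.
# All numbers are made up for illustration.
signal <- matrix(
  c(0.90, 0.05, 0.03, 0.02,
    0.10, 0.80, 0.05, 0.05,
    0.30, 0.25, 0.25, 0.20,
    0.02, 0.03, 0.05, 0.90),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Hypothetical helper: call the strongest channel at each position and
# report "N" when that channel is weaker than `min_signal`.
call_bases <- function(signal, min_signal = 0.5) {
  best <- max.col(signal, ties.method = "first")
  calls <- colnames(signal)[best]
  strength <- signal[cbind(seq_len(nrow(signal)), best)]
  calls[strength < min_signal] <- "N"
  paste(calls, collapse = "")
}

called <- call_bases(signal)
print(called)                          # "ACNT"
```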

Most DNA sequencing techniques produce short fragments of sequence that need to be assembled. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes.

4.3 Genome annotation

Annotation is the process of marking genes and other biological features in a DNA sequence. Annotation is made possible by the fact that genes have recognisable start and stop regions.
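A very reduced version of this idea is sketched below in base R. It assumes that "start" means the codon ATG and "stop" means one of the codons TAA, TAG or TGA read in the same frame, and it looks at one strand only. The function name and the example sequence are hypothetical; real annotation tools use many more signals.

```r
# Hypothetical helper: find the first open reading frame in a DNA string,
# i.e. the first ATG followed by an in-frame stop codon.
find_first_orf <- function(dna) {
  dna <- toupper(dna)
  start <- as.integer(regexpr("ATG", dna, fixed = TRUE))
  if (start == -1) return(NULL)        # no start codon at all
  stops <- c("TAA", "TAG", "TGA")
  pos <- start + 3
  while (pos + 2 <= nchar(dna)) {
    codon <- substr(dna, pos, pos + 2)
    if (codon %in% stops) {
      return(list(start = start, end = pos + 2,
                  sequence = substr(dna, start, pos + 2)))
    }
    pos <- pos + 3                     # stay in the same reading frame
  }
  NULL                                 # no in-frame stop codon found
}

orf <- find_first_orf("CCATGAAATGGTAACC")
print(orf$sequence)                    # "ATGAAATGGTAA"
```

Note that the TGA at positions 4 to 6 of the example is ignored because it is not in the same frame as the start codon.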

5 What will we do?

5.1 Why R?

As explained, genetic data is very large. Analyzing it by hand is not practical, because it is too slow. Excel is not practical either, because it shows all the data all the time.

R is a simple programming language, and very practical for statistical data analysis.

5.2 Genome databases

Data can be found in genome databases such as NCBI (US), EMBL (EU), DDBJ (JP). Many specimens are sequenced.

5.3 Finding data

Data can be found in two ways:

Going to the website and downloading it
Programmatically, from R itself

> library("seqinr")
> choosebank("refseqViruses")
> Dengue1 <- query("Dengue1", "AC=NC_001477")

5.4 Fasta

Usually, sequences are encoded in the FASTA format. We can manipulate this data in R.
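In a FASTA file, each record starts with a header line beginning with ">", followed by one or more lines of sequence. The sketch below parses such text with base R only, so that it runs without extra packages; the function name `parse_fasta` and the toy records are invented. For real files, the seqinr package provides `read.fasta`.

```r
# Hypothetical helper: parse FASTA text (a character vector of lines)
# into a named character vector, one element per record.
parse_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]                # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                  # record index of every line
  headers <- sub("^>", "", lines[is_header])
  seqs <- vapply(seq_along(headers), function(i) {
    paste(lines[!is_header & record == i], collapse = "")
  }, character(1))
  setNames(seqs, headers)
}

# Toy input; lines from a real file could be obtained with readLines().
fasta_lines <- c(">seq1 toy record", "ATGC", "GGTA", ">seq2", "TTAA")
records <- parse_fasta(fasta_lines)
print(records)

# Split one record into single bases, lower case, ready for counting.
bases <- strsplit(tolower(records[["seq2"]]), "")[[1]]
print(bases)                                   # "t" "t" "a" "a"
```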

5.5 Statistics

Using this data, it is very simple to compute statistics on the bases:

> count(dengueseq, 1)
  a    c    g    t
3426 2240 2770 2299

The same works for pairs of bases, and so on:

> count(dengueseq, 2)
 aa   ac   ag   at   ca   cc   cg   ct
1108  720  890  708  901  523  261  555
 ga   gc   gg   gt   ta   tc   tg   tt
 976  500  787  507  440  497  832  529
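Two small follow-ups, sketched in base R. The first computes the GC content (the fraction of bases that are g or c) from the single-base counts shown above. The second is a hypothetical helper, `count_words`, that illustrates how words of length k can be counted with a sliding window; it is not the seqinr implementation, and the toy sequence is invented.

```r
# Base counts copied from the count(dengueseq, 1) output above.
base_counts <- c(a = 3426, c = 2240, g = 2770, t = 2299)

# GC content: fraction of bases that are g or c.
gc_content <- sum(base_counts[c("g", "c")]) / sum(base_counts)
print(round(gc_content, 4))            # 0.4667

# Hypothetical helper: count overlapping words of length k in a vector
# of single bases, using a sliding window.
count_words <- function(bases, k) {
  n <- max(length(bases) - k + 1, 0)
  words <- vapply(seq_len(n), function(i) {
    paste(bases[i:(i + k - 1)], collapse = "")
  }, character(1))
  table(words)
}

toy <- c("a", "c", "g", "a", "c")      # invented example sequence
pair_counts <- count_words(toy, 2)
print(pair_counts)                     # ac: 2, cg: 1, ga: 1
```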

Introduction to Bioinformatics