1.1 Bioinformatics Basics
1 Introduction
1.1 What is bioinformatics?
Bioinformatics is the use of computer methods for analyzing biological data.
Nowadays, it refers mostly to the computational analysis of gene sequences.
1.2 Bases
DNA is composed of four bases: adenine (A), guanine (G), cytosine (C) and thymine (T).
In DNA, the bases match in two pairs: G-C and A-T.
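The pairing rule can be tried out in a few lines of base R. This is only a minimal sketch: `reverse_complement` is an illustrative name, not a function from a package, and it assumes an upper-case string containing only A, C, G and T.

```r
# Illustrative helper (hypothetical name): reverse complement of a DNA string.
# Assumes upper-case input made only of A, C, G and T.
reverse_complement <- function(dna) {
  # Swap each base for its partner: A <-> T and C <-> G.
  complement <- chartr("ACGT", "TGCA", dna)
  # The two strands are read in opposite directions, so reverse the result.
  paste(rev(strsplit(complement, "")[[1]]), collapse = "")
}

reverse_complement("ATGC")   # "GCAT"
```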
1.3 Base sequences
DNA is a sequence of such bases. The base sequence of DNA determines the RNA it can encode, and therefore the proteins whose production it can induce.
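To make the chain from DNA to RNA to protein concrete, here is a minimal sketch in base R. It assumes the standard genetic code and that the input is the coding strand, read from its first base. The names `genetic_code` and `translate_dna` are illustrative, not functions from a package.

```r
# Standard genetic code, built with the bases in the order T, C, A, G.
# expand.grid varies its first argument fastest, so the third codon
# position changes fastest: TTT, TTC, TTA, TTG, TCT, ...
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
genetic_code <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", ""
)[[1]]
names(genetic_code) <- paste0(grid$first, grid$second, grid$third)

# Illustrative helper: translate a coding-strand DNA string into
# one-letter amino acid codes ("*" marks a stop codon).
translate_dna <- function(dna) {
  n <- nchar(dna) %/% 3                       # number of complete codons
  starts <- seq(1, by = 3, length.out = n)
  codons <- substring(dna, starts, starts + 2)
  paste(genetic_code[codons], collapse = "")
}

dna <- "ATGGCCTAA"
chartr("T", "U", dna)   # the RNA uses U instead of T: "AUGGCCUAA"
translate_dna(dna)      # "MA*"
```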
2 DNA sequencing
2.1 What DNA sequencing does
DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions (i.e. clusters of genes or operons), full chromosomes, or entire genomes of any organism.
2.2 History
2.3 Maxam-Gilbert sequencing
This method breaks labelled DNA chemically at specific bases. By comparing the break points, one can deduce the sequence. It is rarely used today.
2.4 Chain-termination methods
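A PCR-like reaction makes many copies of the chain, each ending at a random position with a fluorescent marker that identifies the last base. The copies are then sorted by length and the markers are read in order.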
2.5 Shotgun sequencing
Break the DNA into random fragments, sequence each fragment as above, and assemble the fragments using their overlaps.
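The overlap idea can be sketched in base R. This is a toy illustration, not a real assembler: it assumes the fragments are error-free and already given in the right order, and `overlap_length` and `merge_fragments` are illustrative names.

```r
# Length of the longest suffix of `a` that is also a prefix of `b`.
overlap_length <- function(a, b) {
  max_len <- min(nchar(a), nchar(b))
  for (k in rev(seq_len(max_len))) {          # try the longest overlap first
    if (substring(a, nchar(a) - k + 1) == substr(b, 1, k)) return(k)
  }
  0
}

# Join two fragments, writing the shared part only once.
merge_fragments <- function(a, b) {
  k <- overlap_length(a, b)
  paste0(a, substring(b, k + 1))
}

# Hypothetical fragments, assumed to be in the right order.
fragments <- c("ATGCGTAC", "GTACCTGA", "CTGATTAG")
Reduce(merge_fragments, fragments)   # "ATGCGTACCTGATTAG"
```

Real assemblers also have to find the order of the fragments and tolerate sequencing errors, which is what makes the problem hard for large genomes.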
2.6 Price
The price dropped from $2,400,000 per million bases to $5-$100, depending on the technique.
Nowadays a human genome can be sequenced for less than $1000.
3 Applications
3.1 What is done with DNA sequences?
DNA sequencing is a key technology in biology, as well as in medicine, forensics, and anthropology.
3.2 Molecular biology
It is used to study genomes and the proteins they encode.
It makes it possible to identify changes in DNA and their associations with diseases and phenotypes.
3.3 Evolutionary biology
DNA is transmitted from one generation to another, so sequencing is used to study how different organisms are related and how they evolved.
In 2021, DNA from a mammoth over a million years old was sequenced.
3.4 Metagenomics
Metagenomics identifies the organisms present in a sample: a body of water, sewage, soil, swabs taken from organisms…
DNA sequencing determines which types of microbes may be present in such a microbiome. This is important for ecology, epidemiology…
3.5 Virology
Sequencing is one of the main tools in virology to identify and study viruses, whether their genomes are made of DNA or RNA.
There are more than 2.3 million unique viral sequences in GenBank.
3.6 Medicine
Medical technicians may sequence genes from patients to determine whether there is a risk of genetic diseases.
DNA sequencing can also be used to diagnose and treat rare genetic diseases.
DNA sequencing can identify specific bacteria, allowing more precise antibiotic treatments.
3.7 Forensic investigation
DNA sequencing is used for forensic identification and paternity testing.
The DNA patterns found in fingerprints, saliva, hair follicles, etc. distinguish one individual from another.
4 Bioinformatics
4.1 General idea
DNA sequences are very long. Manual analysis is impractical, except for very short sequences.
Informatics allows large-scale work to be done.
4.2 DNA sequencing
DNA sequencing is still a non-trivial problem as the raw data may be noisy or affected by weak signals. Algorithms have been developed for base calling.
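As a toy illustration of what base calling means (not how any real sequencer's software works), suppose each position gives one signal intensity per base. A minimal sketch picks the strongest channel and reports "N" when the signal is too weak. The intensities, the threshold `min_intensity` and the name `call_bases` are all invented for this example.

```r
# Invented signal intensities: one row per position, one column per base.
signal <- matrix(
  c(0.90, 0.05, 0.03, 0.02,
    0.10, 0.70, 0.10, 0.10,
    0.26, 0.25, 0.25, 0.24,   # weak, ambiguous signal
    0.05, 0.05, 0.10, 0.80),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Illustrative base caller: strongest channel wins, weak signals become "N".
call_bases <- function(signal, min_intensity = 0.5) {
  best <- max.col(signal, ties.method = "first")   # strongest channel per row
  calls <- colnames(signal)[best]
  # Report "N" (unknown base) where even the strongest channel is weak.
  calls[apply(signal, 1, max) < min_intensity] <- "N"
  paste(calls, collapse = "")
}

call_bases(signal)   # "ACNT"
```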
Most DNA sequencing techniques produce short fragments of sequence that need to be assembled. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes.
4.3 Genome annotation
Annotation is the process of marking genes and other biological features in a DNA sequence. Annotation is made possible by the fact that genes have recognisable start and stop regions.
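A very simplified version of this idea can be sketched in base R: look for the first start codon and read codon by codon until a stop codon appears. The sketch assumes the usual start codon ATG and the stop codons TAA, TAG and TGA of the standard genetic code, and it only looks at one strand. `find_orf` is an illustrative name. Real annotation tools use much more evidence than this.

```r
# Illustrative helper: find the first open reading frame in a DNA string.
# Returns NULL if there is no ATG or no stop codon in the same frame.
find_orf <- function(dna) {
  stops <- c("TAA", "TAG", "TGA")
  start <- as.integer(regexpr("ATG", dna, fixed = TRUE))  # -1 if not found
  if (start == -1) return(NULL)
  # Read codon by codon from the start codon onwards.
  pos <- seq(start, nchar(dna) - 2, by = 3)
  codons <- substring(dna, pos, pos + 2)
  stop_index <- which(codons %in% stops)
  if (length(stop_index) == 0) return(NULL)
  end <- pos[stop_index[1]] + 2               # last base of the stop codon
  list(start = start, end = end, sequence = substr(dna, start, end))
}

# Hypothetical sequence: ATG at position 3, in-frame TAG ending at position 14.
orf <- find_orf("CCATGAAATGGTAGCC")
orf$sequence   # "ATGAAATGGTAG"
```

Note that the TGA at positions 4 to 6 is ignored, because it is not in the same reading frame as the start codon.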
5 What will we do?
5.1 Why R?
As explained, genetic data is very large. Analyzing it by hand would be far too slow. Excel is not practical either, because it keeps all the data visible all the time.
R is a simple programming language, and very practical for statistical data analysis.
5.2 Genome databases
Data can be found in genome databases such as NCBI (US), EMBL (EU) and DDBJ (JP). Many organisms have already been sequenced.
5.3 Finding data
Data can be found in two ways:
- Going to the website and downloading it
- Programmatically, from R itself
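> library(seqinr)                               # provides choosebank(), query(), getSequence()
> choosebank("refseqViruses")                   # select the database to search
> Dengue1 <- query("Dengue1", "AC=NC_001477")   # look up the record by accession number
> dengueseq <- getSequence(Dengue1$req[[1]])    # extract the bases as a vector of characters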
5.4 FASTA
Usually, sequences are encoded in the FASTA format. We can manipulate this data in R.
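In a FASTA file, each record starts with a header line beginning with ">", followed by one or more lines of sequence. The seqinr package provides `read.fasta` for reading such files. As a minimal sketch of what the format looks like and how it can be handled, the following base R code parses FASTA text given as a character vector. The records and the name `parse_fasta` are invented for illustration, and the sketch assumes every record has at least one sequence line.

```r
# Invented two-record FASTA content, given here as lines of text.
fasta_lines <- c(
  ">seq1 example record",
  "ATGCGT",
  "ACGT",
  ">seq2 another record",
  "GGCCTA"
)

# Illustrative parser: returns a named character vector, one string per record.
parse_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]               # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                 # record number of every line
  # Join the sequence lines of each record into one string.
  sequences <- tapply(lines[!is_header], record[!is_header],
                      paste, collapse = "")
  # Name each sequence by the first word of its header, without the ">".
  ids <- sub("\\s.*$", "", sub("^>", "", lines[is_header]))
  setNames(as.vector(sequences), ids)
}

seqs <- parse_fasta(fasta_lines)
nchar(seqs)   # seq1 has 10 bases, seq2 has 6
```

With a real file, the lines could come from `readLines("myfile.fasta")`, where the file name is only a placeholder.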
5.5 Statistics
Using this data, it is very simple to compute statistics for the bases:
> count(dengueseq, 1)
   a    c    g    t
3426 2240 2770 2299
The same works for pairs of bases, and so on:
> count(dengueseq, 2)
  aa   ac   ag   at   ca   cc   cg   ct
1108  720  890  708  901  523  261  555
  ga   gc   gg   gt   ta   tc   tg   tt
 976  500  787  507  440  497  832  529
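To see what such a count does, here is a minimal sketch in base R that counts overlapping words of a given size in a short, invented sequence. `count_words` is an illustrative name, not the seqinr function. Unlike the output above, `table` only lists the words that actually occur. The last lines use the single-base counts shown above to compute the total length and the GC fraction of the Dengue sequence.

```r
# Illustrative helper: count overlapping words of `size` bases in a string.
count_words <- function(dna, size) {
  # Start position of every window of `size` bases.
  starts <- seq_len(max(0, nchar(dna) - size + 1))
  table(substring(dna, starts, starts + size - 1))
}

dna <- "atggcatgca"                 # invented example sequence
one <- count_words(dna, 1)          # a: 3, c: 2, g: 3, t: 2
two <- count_words(dna, 2)          # at: 2, ca: 2, gc: 2, gg: 1, tg: 2

# Single-base counts copied from the count(dengueseq, 1) output above.
dengue_counts <- c(a = 3426, c = 2240, g = 2770, t = 2299)
sum(dengue_counts)                                         # 10735 bases
gc <- sum(dengue_counts[c("g", "c")]) / sum(dengue_counts)
round(gc, 3)                                               # 0.467
```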