Biology-based data

**Basic Biology**
The basic unit of life is the cell. The cell membrane functions as a selective barrier to substances entering and leaving the cell. Cells are membrane-bounded compartments filled with a concentrated aqueous solution of chemicals. When mixtures of gases such as CO2, CH4, NH3, and H2 are heated with water and energized by electrical discharge or by ultraviolet radiation, they react to form small organic molecules. The products include representatives of the major classes of small organic molecules found in cells: amino acids, sugars, and the purines and pyrimidines required to make **nucleotides**.

Simple organic molecules such as amino acids and nucleotides can associate to form polymers. One amino acid can join with another by forming a peptide bond, and two nucleotides can join together by a phosphodiester bond. The repetition of these reactions leads to linear polymers known as polypeptides and polynucleotides, respectively. Large polypeptides, known as proteins, and polynucleotides, in the form of both ribonucleic acids (RNA) and deoxyribonucleic acids (DNA), are among the most important constituents of the cell. Each cell is an independent entity, capable of creating copies of itself by growing and dividing into two identical daughter cells. The hereditary information is stored within the DNA molecule. All cells transcribe DNA into RNA and translate RNA into proteins, which determine the cell's structure and function. Organisms can be divided into two classes:
 * [|Prokaryotes]
 * [|Eukaryotes]

**DNA characteristics**

In 1953 Watson and Crick characterized the structure of [|DNA]. They discovered that DNA is a long double-stranded molecule composed of four types of bases: adenine (A), cytosine (C), guanine (G), and thymine (T). C and T are built on a single ring, a structure called a pyrimidine. A and G, however, are built on two fused rings, a structure termed a purine. Each DNA strand has a chemical polarity, meaning that its two ends are chemically different. This polarity is indicated by referring to one end of the chain as the 5' end (pronounced "five prime") and the other as the 3' end (pronounced "three prime"). The 5' end terminates with a phosphate group attached to the fifth carbon of the sugar ring. The 3' end terminates with a hydroxyl group (OH) on the third carbon of the sugar ring. The two strands of the DNA double helix are antiparallel, that is, the polarity of one strand is oriented opposite to that of the other strand.
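The base pairing and antiparallel orientation of the two strands can be illustrated with a small Python sketch. It assumes the standard Watson-Crick pairing (A with T, C with G); the function name and the example sequence are hypothetical and chosen only for illustration.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the partner strand of `strand`, also written 5'->3'."""
    # Reversing accounts for the antiparallel orientation of the partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

top = "AAGCTC"                      # hypothetical strand, written 5'->3'
bottom = reverse_complement(top)    # its partner, also written 5'->3'
print(top, bottom)
```

Because the strands are antiparallel, applying the function twice gives back the original strand.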

**Genes**

A gene is the basic unit of heredity in a living organism. Genes are segments of DNA that determine the characteristics of a species as a whole and the function of the cells within it. A gene can be described as an **open reading frame** (**ORF**, the region of the gene that is transcribed and translated into a protein sequence) together with its transcriptional control elements (promoter and terminator). Genes are usually found at defined positions on a chromosome, and such a position may be referred to as a **locus**. (A locus may correspond to a gene, but some loci are not genes.) Most protein-coding ORFs have the same overall format (Figure 1). They start with a particular triplet of DNA bases (ATG) and end at a stop signal, again a triplet of DNA bases (TGA, TAA, or TAG). The conversion of the information in a gene into a protein is called **expression**.
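The ORF format described above (an ATG start codon followed, in the same reading frame, by a stop codon) can be sketched as a simple scan. This is a minimal illustration, not a gene finder: it looks only at the given strand, ignores introns and control elements, and the function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, sequence) for ORFs on the given strand.

    Positions are 0-based and end-exclusive; the stop codon is included.
    min_codons counts the codons before the stop, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Scan codon by codon until the first in-frame stop codon.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j                   # resume after this stop codon
                        break
            i += 3
    return orfs

print(find_orfs("CCATGAAATGATT"))
```

A complete search would also scan the reverse complement, since genes can lie on either strand.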





**Figure 1. Elements of prokaryotic and eukaryotic DNA (taken from Analysis of Genes and Genomes, R. Reece)**

**Eukaryotic vs. prokaryotic genes**

**Eukaryotic genes** are interrupted by long regions of non-coding sequence called introns. Each contiguous portion of a coding sequence is called an exon.

**Transcription**

**In Prokaryotes**,
 * Transcription factors bind to specific DNA sequences upstream of the start of operons, or sets of related genes.
 * Transcribed mRNA is directly translated by ribosomes.

**In Eukaryotes**,
 * Each gene has its own transcriptional control (no operons).
 * mRNA is processed before translation.

**Post-processing of mRNA**

**In Eukaryotes**, after transcription but before translation, mRNAs are processed. Processing includes:
 * Splicing out introns.
 * Adding a 5' 7-methylguanylate cap (m7Gppp).
 * Polyadenylation, which adds a poly(A) tail.

Processed mRNA is called mature mRNA.

**Gene Prediction**

Gene prediction (gene finding) methods are mainly based on hidden Markov models or on neural networks. Different organisms have different codon preferences and splice junctions, so each genome requires a model trained on its specific characteristics. In prokaryotes, genes can largely be identified through their ORFs, but in eukaryotes the problem is more difficult because of introns, which can be very long. Gene-finding programs based on Markov models and hidden Markov models (HMMs) include GeneMark, GeneMark.hmm, GLIMMER, GenScan / GenomeScan, and Genie; GRAIL is based on neural networks. These programs detect the coding region of a gene using signals such as splice sites, start and stop codons, transcription factor and other protein binding sites like the TATA box, transcription start points, branch points, transcription termination sites, polyadenylation sites (the poly(A) tail protects mRNA from degradation), ribosomal binding sites, topoisomerase I cleavage sites, and topoisomerase II binding sites. Here is a list of [|gene prediction tools].
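To give a feel for how an HMM labels a sequence, here is a toy Viterbi decoder in Python. Everything in it is an assumption made for illustration: the two states, the idea that the "coding" state simply prefers G and C, and all probabilities are invented. It is not how GeneMark, GLIMMER, or GenScan are actually parameterized; real gene finders use much richer, trained models.

```python
import math

STATES = ("N", "C")   # toy labels: N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {              # invented emission probabilities
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

def viterbi(seq: str) -> str:
    """Most probable state path for a non-empty seq under the toy HMM."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []         # one dict of best predecessors per later position
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("AAAAAAGGGGGG"))
```

Log-probabilities are used so that long sequences do not underflow to zero.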

**Biotechnological methods**

New technologies allow us to obtain information about cellular activities such as gene expression and protein-protein interactions. The data can be used as input for bioinformatics tools or as a basis for modeling biological networks. Some of the fundamental methods are described below. Here is a link to the YouTube video on [|Gene technology methods].

**1. Hybridization:** Hybridization is a process in which complementary nucleic acid strands pair to form a double-stranded hybrid. This process occurs naturally when two DNA/RNA strands with complementary segments interact. Detecting hybridization helps to identify segments of DNA.

For example, given a DNA probe containing a known disease-causing mutation, one can test whether the probe hybridizes with DNA extracted from a human cell in order to determine whether that cell carries the mutation.
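The probe example can be mimicked in software by searching a target strand for the sequence that is complementary to the probe. The sketch below assumes perfect Watson-Crick pairing with no mismatches, which is stricter than real hybridization; the function name and sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def hybridization_sites(probe: str, target: str) -> list[int]:
    """0-based positions in target (5'->3') where probe (5'->3') pairs perfectly."""
    # The probe binds antiparallel, so it pairs with its reverse complement.
    site = "".join(COMPLEMENT[b] for b in reversed(probe.upper()))
    target = target.upper()
    return [i for i in range(len(target) - len(site) + 1)
            if target[i:i + len(site)] == site]

print(hybridization_sites("GAGCTT", "TTAAGCTCGGAAGCTC"))
```

An empty result means the probe has no perfectly complementary site on this strand.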


**2. Polymerase chain reaction:** PCR is a method to amplify a DNA sequence, that is, to produce many copies of the same sequence in a short time. The procedure starts with a single DNA sequence of up to 10 kbp, which we want to replicate many times. This sequence is typically a gene, part of a gene, or even a non-coding region. It is used as a template for the reaction.
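The reason PCR yields so many copies so quickly is that, ideally, each cycle doubles the number of template molecules. The short Python sketch below works through that arithmetic. The `efficiency` parameter is a simplifying assumption (1.0 means perfect doubling in every cycle), and the function name is hypothetical.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected copy number after `cycles` rounds of amplification.

    efficiency = 1.0 models perfect doubling; lower values model
    cycles in which only a fraction of the templates is copied.
    """
    return initial_copies * (1.0 + efficiency) ** cycles

for n in (10, 20, 30):
    print(n, pcr_copies(1, n))
```

Under ideal doubling, a single template gives about a thousand copies after 10 cycles and about a billion after 30.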

**3. Gene expression analysis:** The system-level expression of genes into mRNAs, tRNAs, and proteins is important for understanding the various biological processes of the cell. It helps to reveal the logic of how the cell operates.

**Microarrays**: One method to monitor gene expression is to use DNA chips, or microarray assays. A microarray lets us measure the expression of many genes simultaneously. Most genes have different transcription levels under different conditions. From those differences, one can infer their function and essentiality in different environmental conditions. More information about microarrays can be found in the gene expression section.


**4. Protein-protein interactions:** Protein-protein interactions refer to the physical association of protein molecules. Signals from the exterior of a cell are relayed to the inside of that cell by interactions between signalling proteins. Finding interactions between proteins that are involved in common cellular functions gives a broader view of how they work cooperatively in a cell. Methods to study protein-protein interactions can be found [|here]. More on protein-protein interactions can be found in this [|EMBL-EBI course].

**Proteins**

One of the aims of bioinformatics is to predict 3D protein structures from 1D amino acid sequences in order to understand the function and the folding process of proteins. More information about protein structures can be found [|here]. The structural levels of proteins are:
 * 1D: primary structure, the amino acid sequence as assembled on the ribosome using the genetic code to translate mRNA (three mRNA nucleotides = one amino acid).
 * 2D: secondary structure, elements like loops, α-helices, and β-sheets which arise through local hydrogen bonds between amino acids and form a local minimal energy state.
 * 3D: tertiary structure, a global minimal energy state of the amino acid sequence through global interactions among amino acids. Secondary structure elements pack together to form distinct structures called **domains**. **[|Domains]** are fundamental functional three-dimensional structural units of polypeptides. Tertiary structure therefore also describes the relationship of different domains to one another within a protein molecule. The interactions between domains are governed by several forces, including hydrogen bonding, hydrophobic interactions, electrostatic interactions, and van der Waals forces.
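The rule mentioned for the primary structure (three mRNA nucleotides = one amino acid) can be made concrete with a short Python sketch that translates an mRNA into its amino acid sequence using the standard genetic code. The compact table layout and the function name are illustrative choices; the sketch assumes the reading frame starts at the first nucleotide and ignores any requirement for a start codon.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")   # '*' marks a stop codon
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna: str) -> str:
    """Translate mRNA (5'->3') into one-letter amino acid codes, up to the first stop."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):       # step through whole codons
        aa = CODON_TABLE[mrna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGCCUGGUAA"))
```

The output string is the 1D primary structure from which the higher structural levels arise.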


**Chemical properties of the 20 amino acids in proteins**



**Protein folding**

Protein folding is the process by which a string of amino acids (the chemical building blocks of protein) interacts with itself to form a stable three-dimensional structure during production of the protein within the cell. The folding of proteins thus facilitates the production of discrete functional entities, including enzymes and structural proteins, which allow the various processes associated with life to occur. Here is an animation to understand [|protein folding].

**[|Protein folding problem 50 years on]**

The protein-folding problem has come to comprise three main questions: (i) The physical folding code: How is the 3D native structure of a protein determined by the physicochemical properties that are encoded in its 1D amino acid sequence? (ii) The folding mechanism: A polypeptide chain has an almost unfathomable number of possible conformations. How can proteins fold so fast? (iii) Predicting protein structures using computers: Can we devise a computer algorithm to predict a protein's native structure from its amino acid sequence? Such an algorithm might circumvent the time-consuming process of experimental protein-structure determination and accelerate the discovery of protein structures and new drugs. Here is a [|TED talk from Ken A. Dill], who discusses answers to these questions.
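A rough back-of-the-envelope calculation, often associated with Levinthal's paradox, shows why the number of conformations in question (ii) is called unfathomable. The numbers below are illustrative assumptions only: a 100-residue chain, three possible conformations per residue, and a sampling rate of 10^13 conformations per second.

```python
def levinthal_years(residues: int,
                    states_per_residue: int = 3,       # assumed value
                    samples_per_second: float = 1e13   # assumed value
                    ) -> float:
    """Years needed to try every conformation once under the stated assumptions."""
    conformations = states_per_residue ** residues
    seconds = conformations / samples_per_second
    return seconds / (3600 * 24 * 365)

print(f"{3 ** 100:.2e} conformations")
print(f"{levinthal_years(100):.2e} years for an exhaustive search")
```

Under these assumptions an exhaustive search would take far longer than the age of the universe, whereas real proteins fold very quickly, so folding cannot proceed by random search.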

**Importance of folding protein molecules**

Protein folding is essential for producing structures that can perform particular functions in the cell. It also prevents inappropriate interactions between proteins: folding hides elements of the amino acid sequence that, if exposed, would react non-specifically with other proteins.

We have read about the protein folding problem; a [|paper by Cooper et al.] describes a novel way to approach it. The authors built [|Foldit], a multiplayer online game that engages non-scientists in solving hard prediction problems. Foldit players interact with protein structures using direct manipulation tools and user-friendly versions of algorithms from the Rosetta structure prediction methodology, while they compete and collaborate to optimize the computed energy.

**Protein misfolding**

Under favorable conditions, most proteins have no problem quickly folding to their native structures. However, some proteins appear unable to fold without the presence of helper proteins called chaperones. In the absence of chaperones, these proteins fail to achieve their native state and may instead associate with other unfolded polypeptide chains to form large aggregates. Inappropriate folding is one way in which a protein imbalance may arise: the misfolded protein may be nonfunctional or suboptimally functional, or it may be degraded by cellular machinery.

**Protein misfolding diseases**

In many cases, misfolded proteins are recognised as undesirable by a group of proteins called heat shock proteins and are consequently directed to the protein degradation machinery of the cell. This involves conjugation to the protein ubiquitin, which acts as a tag that directs the proteins to proteasomes, where they are degraded into their constituent amino acids. Hence many protein misfolding diseases are characterised by the absence of a key protein, which has been recognised as dysfunctional and eliminated by the cell's own machinery. Diseases caused by misfolding include cystic fibrosis, among others. In addition, some cancers may be associated with misfolding. Many other protein misfolding diseases are characterised not by the disappearance of a protein but by its deposition in insoluble aggregates within the cell.

Here are two papers that discuss various protein misfolding diseases: a) [|Protein Misfolding, Functional Amyloid, and Human Disease] b) [|Protein Misfolding and Human Disease]


**Pathways**

A [|metabolic pathway] is a chain of enzymatic reactions. The pathway is a series of step-by-step modifications: the initial substance, used as a substrate by the first enzyme, is transformed into a product. This product is then the substrate for the next reaction, and so on until the exact chemical structure needed by the cell is reached.

**Example of metabolic pathway**

Aerobic cellular respiration is a series of enzyme-controlled biochemical reactions in which oxygen is involved in the breakdown of glucose to carbon dioxide and water, releasing energy in the form of ATP. Check [|here the animation] and discussion of cellular respiration, covering glycolysis, the Krebs cycle, and the electron transport system.

[|Alexander Pico] surveys key methods used in [|visualizing biochemical & signalling pathways, highlighting CyAnimator], which manages dynamic data, and WikiPathways, a resource he co-developed that allows community editing of pathways. He underscores the need for animations that can show molecular events and transitions more realistically than static pathways. This talk was presented at VIZBI 2012, an international conference series on visualizing biological data ([|http://vizbi.org]) funded by EMBO & NIH.

on Reactome from [|EMBL-EBI]

**Metagenomics**

[|Daniel Huson] [|reviews approaches to the visual analysis of metagenomic data] and demonstrates the utility of his tool MEGAN for data obtained from environmental sequencing campaigns. He also discusses emerging standards being used for visualizing phyla abundance and the presence of particular enzymatic pathways. This talk was presented at VIZBI 2012, an international conference series on visualizing biological data (vizbi.org) funded by EMBO & NIH.

[|Susannah Tringe] talks about [|metagenomics analysis tools] and the challenges of gaining a systems-level understanding of a microbial community. She notes that good tools are available for analysis at the 16S sequence and unassembled metagenome level, but that more effort on analysis and visualisation solutions is required at the microbial community and ecosystem level. This talk was presented at VIZBI 2013.

Here is a video of the [|GOLD Metagenomics database].


**Learning Goals**

1. **Study [|Protein protein Interactions]**
2. **[|Protein Structures]**
3. **Pathways Databases:** [|Reactome]


**References:**

 * [|Analysis of Genes and Genomes, Richard Reece]
 * [|Bioinformatics: Sequence and Genome Analysis]
 * [|Introduction to Protein Structure]
 * [|Molecular Biology of the Cell]