Data Science | DSDHT
In this module, Vineela Gangalapudi walks through the field of human molecular biology, including the "central dogma" of modern biology and how it relates to DNA, genes, proteins, and cells.
Here is a copy of the presentation given in the video.
biology based data.pptx
The basic unit of life is the cell. The cell membrane functions as a selective barrier to substances that enter and exit the cell. Cells are membrane-bounded compartments filled with a concentrated aqueous solution of chemicals. When mixtures of gases such as CO2, CH4, NH3, and H2 are heated with water and energized by electrical discharge or by ultraviolet radiation, they react to form small organic molecules. The small organic molecules found in cells include amino acids, sugars, and the purines and pyrimidines that are required to make nucleotides.
Simple organic molecules such as amino acids and nucleotides can associate to form polymers. One amino acid can join with another by forming a peptide bond, and two nucleotides can join together by a phosphodiester bond. The repetition of these reactions leads to linear polymers known as polypeptides and polynucleotides, respectively. Large polypeptides are known as proteins, and polynucleotides occur as both ribonucleic acids (RNA) and deoxyribonucleic acids (DNA). Each cell is an independent entity, capable of creating copies of itself by growing and dividing into two identical daughter cells. The hereditary information is stored within the DNA molecule. All cells express the information in DNA as proteins, which determine the cell's structure and function.
Organisms can be divided into two classes: prokaryotes and eukaryotes.
In 1953 Watson and Crick characterized the properties of DNA. They discovered that DNA is a long double-stranded molecule composed of four types of bases: adenine (A), cytosine (C), guanine (G), and thymine (T). C and T are composed of a single ring, a structure called a pyrimidine. A and G, however, are composed of two fused rings, a structure termed a purine. Each DNA strand has a chemical polarity, meaning that its two ends are chemically different. This polarity is indicated by referring to one end of the chain as the 5' end (pronounced "five prime") and the other as the 3' end (pronounced "three prime"). The 5' end terminates with a phosphate group attached to the fifth carbon of the sugar ring. The 3' end terminates with a hydroxyl group (OH) on the third carbon of the sugar ring. The two strands of the DNA double helix are antiparallel, that is, the polarity of one strand is oriented opposite to that of the other strand.
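The antiparallel arrangement can be made concrete with a small sketch. It assumes the standard Watson-Crick pairing rule (A with T, C with G), which the text above does not spell out, and the function names are illustrative only.

```python
# Watson-Crick pairing (assumed here): A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Given one strand written 5'->3', return the partner strand, also written 5'->3'.

    Because the strands are antiparallel, the partner is the complement read backwards.
    """
    return strand.upper().translate(COMPLEMENT)[::-1]


def is_purine(base: str) -> bool:
    """A and G are purines (two rings); C and T are pyrimidines (one ring)."""
    return base.upper() in {"A", "G"}


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

Reversing the complemented string is what encodes the opposite polarity: position 1 at the 5' end of one strand pairs with the last position at the 3' end of the other.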
A gene is the basic unit of heredity in a living organism. Genes are segments of DNA that determine the characteristics of a species as a whole and the function of the cells within it. A gene can be described as an open reading frame (ORF, the region of the gene that will be transcribed and translated into a protein sequence) together with its transcriptional control elements (promoter and terminator). Genes are usually found at defined positions on a chromosome, and such a position may be referred to as a locus. (A locus may correspond to a gene, but some loci are not genes.) Most protein-coding ORFs have the same overall format (Figure 1). They start with a particular triplet of DNA bases (ATG) and end at a stop signal, again a triplet of DNA bases (TGA, TAA, or TAG). The transformation of a gene into a protein is called gene expression.
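The ORF format described above (an ATG start, then triplets, then a stop triplet) can be sketched as a simple scan. This is a minimal illustration that only looks at the three reading frames of the given strand and uses the standard stop codons TAA, TAG, and TGA. The function name and the `min_codons` parameter are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons


def find_orfs(dna: str, min_codons: int = 2):
    """Return a list of (start, end, sequence) for ORFs in the three forward frames.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i  # first ATG opens the frame
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

A real gene finder would also scan the reverse-complement strand and, in eukaryotes, account for introns.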
Figure 1. Elements of prokaryotic and eukaryotic DNA (taken from Analysis of Genes and Genomes, R. Reece).
Eukaryotic vs. prokaryotic genes
Eukaryotic genes are interrupted by long regions of non-coding sequence called introns. Each contiguous portion of a coding sequence is called an exon.
Prokaryotes:
- Transcription factors bind to specific DNA sequences upstream of the start of operons, or sets of related genes.
- Transcribed mRNA is directly translated by ribosomes.
Eukaryotes:
- Each gene has its own transcriptional control (no operons).
- mRNA is processed before translation.
Post-processing of mRNA
In eukaryotes, after transcription but before translation, mRNAs are processed. Processing includes:
- Splicing out introns.
- Adding a 5' 7-methylguanylate cap (m7Gppp).
- Polyadenylation, which adds a poly(A) tail.
The processed mRNA is called mature mRNA.
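As a rough sketch of these processing steps, the function below removes introns given as coordinate intervals and appends a poly(A) tail. The cap is a chemical modification, so it is represented only by a text label. The function name, the interval convention, and the tail length are assumptions made for illustration.

```python
def process_pre_mrna(pre_mrna: str, introns, poly_a_length: int = 20) -> str:
    """Return a toy 'mature mRNA' string.

    introns: list of (start, end) intervals, 0-based and end-exclusive,
             assumed not to overlap.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip the intron
    exons.append(pre_mrna[position:])           # final exon
    # "m7Gppp-" is only a label standing in for the 5' cap.
    return "m7Gppp-" + "".join(exons) + "A" * poly_a_length


if __name__ == "__main__":
    exon1, intron, exon2 = "AUGGCC", "GUAAGUUUCAG", "UUUUAA"
    pre = exon1 + intron + exon2
    print(process_pre_mrna(pre, [(len(exon1), len(exon1) + len(intron))], poly_a_length=5))
```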
Gene prediction (gene finding) methods are mainly based on hidden Markov models or on neural networks. Different organisms have different codon preferences and splice junctions, so each genome requires a model trained on its specific characteristics. In prokaryotes, genes can largely be identified through their ORFs, but in eukaryotes the problem is more difficult because of introns, which can span long sequences.
Gene-finding programs built on hidden Markov models (HMMs) and related methods include GeneMark, GeneMark.hmm, GLIMMER, GRAIL, GenScan / GenomeScan, and Genie. These programs detect the coding region of a gene from signals such as splice sites, start and stop codons, transcription factor binding sites, protein binding sites like the TATA box, transcription start points, branch points, transcription termination sites, polyadenylation sites (the poly(A) tail protects mRNA from degradation), ribosomal binding sites, topoisomerase I cleavage sites, and topoisomerase II binding sites. Here is a list of gene prediction tools.
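To give a feel for how an HMM labels a sequence, here is a toy two-state model ("coding" vs. "noncoding") decoded with the Viterbi algorithm. It is not how any of the programs named above is implemented. All probabilities are hypothetical values chosen for illustration, whereas a real gene finder is trained on the target genome and uses many more states (codon positions, splice sites, and so on).

```python
import math

STATES = ("coding", "noncoding")
# Hypothetical parameters for illustration only; not trained on any genome.
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT_P = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # assumed GC-rich
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # assumed AT-rich
}


def viterbi(sequence: str):
    """Return the most probable state path for a non-empty DNA sequence (log space)."""
    scores = [{s: math.log(START_P[s]) + math.log(EMIT_P[s][sequence[0]]) for s in STATES}]
    back = []
    for base in sequence[1:]:
        row, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive in state s.
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS_P[p][s]))
            row[s] = scores[-1][prev] + math.log(TRANS_P[prev][s]) + math.log(EMIT_P[s][base])
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("ATATATATGCGCGCGCGC"))
```

The transition probabilities make switching states costly, so the model changes label only when a long enough run of bases favours the other state.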
New technologies allow us to collect information about cellular activities such as gene expression and protein-protein interactions.
The data can be used as input for bioinformatics tools or as a basis for modeling biological networks. Some of the fundamental methods are described below. Here is a link to the YouTube video for the gene technology methods.
1. Hybridization:-
Hybridization is a process in which complementary nucleic acid strands pair into a double-stranded hybrid. This process occurs naturally when two DNA/RNA strands with complementary segments interact. Hybridization helps identify specific segments in DNA.
For example, given a probe of DNA containing a known disease-causing mutation, one can examine whether this probe hybridizes with a DNA extract from a human cell in order to determine whether the cell carries the mutation.
2. Polymerase chain reaction:-
PCR is a method to amplify a DNA sequence, that is, to replicate many copies of the same sequence in a short time. The procedure starts with a single DNA sequence of up to 10 kbp that we want to replicate many times. This sequence is typically a gene, part of a gene, or a non-coding region. It is used as a template for the reaction.
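The amplification is exponential: in an idealized reaction every template is duplicated in each cycle, so n cycles give 2^n copies per starting molecule. The sketch below adds a hypothetical `efficiency` parameter (the fraction of templates copied per cycle) as an assumption; it is an arithmetic illustration, not a model of a real reaction.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected copy number after a number of PCR cycles.

    efficiency=1.0 is the idealized case where every template doubles each cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, pcr_copies(1, n))
```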
3. Gene expression analysis:-
The system-level expression of genes into mRNAs, tRNAs, and proteins is important for understanding the various biological processes of the cell. It helps to reveal the logic of how the cell operates.
One method to monitor gene expression is to use DNA chips, or microarray assays. A microarray enables us to measure the expression of many genes simultaneously. Most genes have different transcription levels under different conditions. From those differences, one can infer their function and essentiality in different environmental conditions. More information about microarrays can be found in the gene expression section.
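A common way to compare transcription levels between two conditions is the log2 fold change. The sketch below is a simplified illustration: the gene names and intensities are made up, the pseudocount and the threshold of 1.0 (a two-fold change) are assumptions, and real microarray analysis also involves normalization and statistical testing.

```python
import math


def log2_fold_change(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of treated to control; the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def differentially_expressed(expression, threshold: float = 1.0):
    """expression maps gene name -> (control, treated) intensity.

    Returns genes whose absolute log2 fold change reaches the threshold.
    """
    changed = {}
    for gene, (control, treated) in expression.items():
        lfc = log2_fold_change(treated, control)
        if abs(lfc) >= threshold:
            changed[gene] = round(lfc, 2)
    return changed


if __name__ == "__main__":
    # Hypothetical intensities for three hypothetical genes.
    data = {"geneA": (99, 399), "geneB": (100, 110), "geneC": (399, 99)}
    print(differentially_expressed(data))
```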
4. Protein-protein interactions:-
Protein-protein interactions refer to the physical association of protein molecules. Signals from the exterior of a cell are relayed to the inside of that cell by interactions between signalling proteins. Finding interactions between proteins that are involved in common cellular functions is a way to get a broader view of how they work cooperatively in a cell. Methods to study protein-protein interactions can be found here. More on protein-protein interactions can be found in this EMBL-EBI course.
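Interaction data of this kind are often handled as a network in which proteins are nodes and interactions are edges. The sketch below uses placeholder protein names (P1 to P4) and hypothetical helper functions; it only shows the data structure, not any particular database or tool.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def shared_partners(network, a, b):
    """Proteins that interact with both a and b."""
    return network[a] & network[b]


def most_connected(network):
    """Protein with the most interaction partners (a 'hub')."""
    return max(network, key=lambda protein: len(network[protein]))


if __name__ == "__main__":
    # Placeholder interaction pairs, not real data.
    net = build_network([("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")])
    print(most_connected(net), sorted(net["P3"]))
```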
One of the aims of bioinformatics is to predict 3D protein structures from 1D amino acid sequences in order to understand the function and the folding process of the proteins.
The structural levels of proteins are:
1D: primary structure, the amino acid sequence as assembled on the ribosome using the genetic code to translate mRNA (three mRNA nucleotides = one amino acid).
2D: secondary structure, elements like loops, α-helices, and β-sheets which arise through local hydrogen bonds between amino acids and form a local minimal energy state.
3D: tertiary structure, a global minimal energy state of the amino acid sequence through global interactions among amino acids.
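The "three mRNA nucleotides = one amino acid" rule of the primary structure can be sketched with the standard genetic code. The table below is built from the conventional compact ordering (bases T, C, A, G at each codon position). The function name is illustrative, and translation is assumed to begin at the first base of the input.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA (or coding DNA) sequence, stopping at the first stop codon."""
    seq = mrna.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUUUUAA"))
```

The result is a one-letter amino acid string, which is the 1D representation that structure prediction methods take as input.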
Secondary structure elements fold together to form distinct structures called domains. Domains are fundamental functional and three-dimensional structural units of polypeptides. Therefore, tertiary structure also describes the relationship of different domains to one another within a protein molecule.
The interactions of different domains are governed by several forces, including hydrogen bonding, hydrophobic interactions, electrostatic interactions, and van der Waals forces.
More information about protein structures can be found here.
Chemical properties of the 20 amino acids in proteins
Table taken from David Mount - Bioinformatics sequence and genome analysis.
Protein folding is the process by which a string of amino acids (the chemical building blocks of protein) interacts with itself to form a stable three-dimensional structure during production of the protein within the cell. The folding of proteins thus facilitates the production of discrete functional entities, including enzymes and structural proteins, which allow the various processes associated with life to occur.
Here is an animation to understand protein folding.
Protein folding problem 50 years On
The protein-folding problem came to be three main questions: (i) The physical folding code: How is the 3D native structure of a protein determined by the physicochemical properties that are encoded in its 1D amino-acid sequence? (ii) The folding mechanism: A polypeptide chain has an almost unfathomable number of possible conformations. How can proteins fold so fast? (iii) Predicting protein structures using computers: Can we devise a computer algorithm to predict a protein’s native structure from its amino acid sequence? Such an algorithm might circumvent the time-consuming process of experimental protein-structure determination and accelerate the discovery of protein structures and new drugs.
Here is a TED talk from Ken A. Dill, who discusses the answers to the above questions.
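The "almost unfathomable number of possible conformations" in question (ii) can be illustrated with a back-of-the-envelope estimate in the spirit of Levinthal's paradox. Every number below is an assumption chosen for illustration: 100 residues, 3 conformations per residue, and 10^13 conformations sampled per second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_estimate(residues: int = 100,
                               conformations_per_residue: int = 3,
                               samples_per_second: float = 1e13):
    """Return (number of conformations, years needed to try them all one by one)."""
    total = conformations_per_residue ** residues
    years = total / samples_per_second / SECONDS_PER_YEAR
    return total, years


if __name__ == "__main__":
    total, years = exhaustive_search_estimate()
    print(f"{total:.2e} conformations, about {years:.2e} years to enumerate")
```

Under these assumptions a random search would take vastly longer than proteins actually need, which is why folding cannot be an exhaustive search.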
Importance of folding protein molecules
Protein folding is essential for the production of structures that can perform particular functions in the cell. It also prevents inappropriate interactions between proteins, because folding hides elements of the amino acid sequence which, if exposed, would react non-specifically with other proteins.
We have read about the protein folding problem. Recently, a paper by Cooper et al. took a unique approach to solving it. They made a computer game, Foldit, which is a multiplayer online game that engages non-scientists in solving hard prediction problems. Foldit players interact with protein structures using direct manipulation tools and user-friendly versions of algorithms from the Rosetta structure prediction methodology, while they compete and collaborate to optimize the computed energy.
Under favorable conditions, most proteins have no problem quickly folding to their native structures. However, there are some proteins which appear unable to fold without the presence of other helper proteins, called chaperones. In the absence of chaperones, these proteins will fail to achieve their native state and instead may associate with other unfolded polypeptide chains to form large aggregate structures. Inappropriate folding is one way in which a protein imbalance may arise: the misfolded protein may be nonfunctional or suboptimally functional, or it may be degraded by cellular machinery.
Protein misfolding diseases
In many cases, misfolded proteins are recognised as undesirable by a group of proteins called heat shock proteins, and consequently directed to the protein degradation machinery in the cell. This involves conjugation to the protein ubiquitin, which acts as a tag that directs the proteins to proteasomes, where they are degraded into their constituent amino acids.
Hence many protein misfolding diseases are characterised by the absence of a key protein, as it has been recognised as dysfunctional and eliminated by the cell's own machinery. Diseases caused by misfolding include cystic fibrosis and other diseases. In addition, some cancers may be associated with misfolding. Many other protein misfolding diseases are characterised not by the disappearance of a protein but by its deposition in insoluble aggregates within the cell.
Here are two papers that discuss various protein misfolding diseases:
Protein Misfolding, Functional Amyloid, and Human Disease
Protein Misfolding and Human Disease
A metabolic pathway is a chain of enzymatic reactions. The pathway is a collection of step-by-step modifications: the initial substance, used as substrate by the first enzyme, is transformed into a product. This product is then the substrate for the next reaction, and so on until the exact chemical structure necessary for the cell is reached.
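The idea that each product becomes the next substrate can be sketched as an ordered list of steps. The example uses the first three reactions of glycolysis as data (cofactors such as ATP are omitted for brevity), and the function name and tuple layout are illustrative choices only.

```python
# First three steps of glycolysis as (enzyme, substrate, product); cofactors omitted.
GLYCOLYSIS_START = [
    ("hexokinase", "glucose", "glucose-6-phosphate"),
    ("phosphoglucose isomerase", "glucose-6-phosphate", "fructose-6-phosphate"),
    ("phosphofructokinase", "fructose-6-phosphate", "fructose-1,6-bisphosphate"),
]


def run_pathway(steps, initial_substrate: str):
    """Follow a linear pathway and return the list of intermediates.

    Raises ValueError if an enzyme's substrate is not the previous product.
    """
    current = initial_substrate
    trace = [current]
    for enzyme, substrate, product in steps:
        if substrate != current:
            raise ValueError(f"{enzyme} cannot act on {current}")
        current = product  # the product is the substrate of the next step
        trace.append(current)
    return trace


if __name__ == "__main__":
    print(" -> ".join(run_pathway(GLYCOLYSIS_START, "glucose")))
```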
Example of metabolic pathway
Aerobic cellular respiration is a series of enzyme-controlled biochemical reactions in which oxygen is involved in the breakdown of glucose to carbon dioxide and water, releasing energy as ATP. Check here the animation and discussion of cellular respiration, covering glycolysis, the Krebs cycle, and the electron transport system.
This talk surveys key methods used in visualizing biochemical and signalling pathways, highlighting CyAnimator, which manages dynamic data, and WikiPathways, a resource the speaker co-developed, which allows community editing of pathways. He underscores the need for animations that can show molecular events and transitions more realistically than static pathways.
This talk was presented at VIZBI 2012, an international conference series on visualizing biological data funded by EMBO & NIH.
on Reactome from
This talk reviews approaches to the visual analysis of metagenomic data, and the speaker demonstrates the utility of his tool MEGAN for data obtained from environmental sequencing campaigns. He also discusses emerging standards being used for visualizing phyla abundance and the presence of particular enzymatic pathways. This talk was presented at VIZBI 2012, an international conference series on visualizing biological data funded by EMBO & NIH.
This talk discusses metagenomics analysis tools and the challenges of gaining a systems-level understanding of a microbial community. She notes that there are some good tools available for analysis at the 16S sequence and unassembled metagenome level, but that more effort on analysis and visualisation solutions is required at the microbial community and ecosystem level. This talk was presented at VIZBI 2013.
Here is a video of the GOLD Metagenomics database.
Protein protein Interactions
Pathways Databases :
Analysis of Genes and Genomes Richard Reece
Bioinformatics: Sequence and Genome Analysis
Introduction to Protein Structure
Molecular Biology of the Cell