Data Science | DSDHT
Mapping Structure to function
Full slides of the talk
Mapping protein to function
Computational Protein Function Prediction
Ontology based methods
Proteins are biological macromolecules responsible for a wide range of activities such as constitution of the organs (structural proteins), the catalysis of biochemical reactions necessary for metabolism (enzymes), and the maintenance of the cellular environment (transmembrane proteins).
The early approaches to predicting protein function were experimental and usually focused on a specific target gene or protein, or a small set of proteins forming natural groups such as protein complexes. These approaches included gene knockout, targeted mutation and the inhibition of gene expression. However, it is still a daunting task to correctly assign functional annotations to newly sequenced genomes based on their sequence information, and it is not feasible to run conventional experimental procedures on the entire set of sequences to recover the functional information. Researchers have therefore focused on using homology or sequence similarity to transfer annotations to newly sequenced proteins, using popular homology search algorithms such as BLAST and FASTA. Although homology is a sound basis for inferring function in the light of evolution, in practice it is not always trivial to extract correct function information from a sequence database search result.
The concept of protein function is highly context-sensitive and not very well-defined. In fact, this concept typically acts as an umbrella term for all types of activities that a protein is involved in, be it cellular, molecular or physiological. One such categorization of the types of functions a protein can perform has been suggested by Bork et al.:
Molecular function: The biochemical functions performed by a protein, such as ligand binding, catalysis of biochemical reactions and conformational changes.
Cellular function: Many proteins come together to perform complex physiological functions, such as operation of metabolic pathways and signal transduction, to keep the various components of the organism working well.
Phenotypic function: The integration of the physiological subsystems, consisting of various proteins performing their cellular functions, and the interaction of this integrated system with environmental stimuli determines the phenotypic properties and behavior of the organism.
The three categories are not independent, but rather are hierarchically related. To make function prediction computable, the descriptive biological knowledge has to be transformed into qualitative and quantitative models, which requires a robust and accessible biological information system. Protein functions or annotations have long been described with vocabularies that are conventional only within a particular research community or research group.
In recent years, controlled sets of functional vocabularies have been developed, including the Gene Ontology (GO), Enzyme Commission (EC) numbers, the MIPS Functional Catalogue (FunCat), the Transporter Classification System, and KEGG Orthology.
The Gene Ontology (GO) Consortium of collaborating databases has developed a structured controlled vocabulary to describe gene function. GO vocabulary terms are arranged hierarchically in a Directed Acyclic Graph (DAG) and are separated into three categories: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Cellular component terms indicate the anatomical part of the cell to which the protein belongs, for example ribosome (GO:0005840) or nucleus (GO:0005634). Biological process terms indicate assemblies of molecular functions that achieve a well-defined task through a series of cellular events; examples are carbohydrate metabolism (GO:0005975) and regulation of transcription (GO:0045449). Molecular function terms represent activities carried out at the molecular level by proteins or complexes, for example catalytic activity (GO:0003824) or DNA binding (GO:0003677). Each GO term thus has a category and an identifier in the format GO:xxxxxxx, along with a definition that explains its meaning. All terms in GO other than the root term have an is_a, part_of, positively_regulates, or negatively_regulates relationship with some other, more general term. For example, glucose transport (GO:0015758) is_a hexose transport (GO:0008645), which ultimately is_a transport (GO:0006810). Because of this relationship, when a protein is annotated with term X it is automatically annotated with all ancestor terms of X.
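The automatic annotation with ancestor terms can be sketched in a few lines of Python. This is a minimal illustration, not a GO library: the `IS_A` dictionary is a hand-written toy fragment built from the three terms named above, and the direct edge from hexose transport to transport is an assumption that stands in for the intermediate terms the real ontology places between them. The function names `ancestors` and `propagate` are hypothetical.

```python
# Toy fragment of the GO graph: child term -> list of parent terms.
# ASSUMPTION: the real ontology has intermediate terms between
# hexose transport and transport; they are collapsed into one edge here.
IS_A = {
    "GO:0015758": ["GO:0008645"],  # glucose transport is_a hexose transport
    "GO:0008645": ["GO:0006810"],  # hexose transport (ultimately) is_a transport
    "GO:0006810": [],              # transport, treated as the top of this fragment
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents):
    """Extend a set of direct annotations with all of their ancestor terms."""
    full = set(annotations)
    for term in annotations:
        full |= ancestors(term, parents)
    return full


# A protein annotated only with glucose transport also receives
# hexose transport and transport.
protein_terms = propagate({"GO:0015758"}, IS_A)
```

Because a DAG allows a term to have several parents, the traversal keeps a `seen` set so that shared ancestors are visited only once.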
Partial Gene Ontology hierarchy describing the ancestors of terms Glucoside transport and Glucose transport. Double lined arrows show the path to the Lowest Common Ancestor (LCA) of the two terms.
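Since the figure itself is not reproduced here, the idea of a Lowest Common Ancestor can be sketched in code instead. The hierarchy below is an assumed, simplified toy version of the relevant part of GO (term names only, with illustrative edges), so the result it produces should not be read as the LCA shown in the original figure. The LCA is taken here to be the deepest term that is an ancestor of both inputs, with depth measured as the longest path to the top of the fragment; both choices are assumptions of this sketch.

```python
# ASSUMPTION: a simplified, illustrative hierarchy; the edges are not
# copied from the current Gene Ontology release.
PARENTS = {
    "transport": [],
    "carbohydrate transport": ["transport"],
    "monosaccharide transport": ["carbohydrate transport"],
    "hexose transport": ["monosaccharide transport"],
    "glucose transport": ["hexose transport"],
    "glucoside transport": ["carbohydrate transport"],
}


def ancestors_with_self(term, parents):
    """Return the term together with all terms reachable via parent edges."""
    seen = {term}
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def depth(term, parents):
    """Length of the longest path from the term up to a parentless term."""
    term_parents = parents[term]
    if not term_parents:
        return 0
    return 1 + max(depth(p, parents) for p in term_parents)


def lowest_common_ancestor(a, b, parents):
    """Return the deepest term that is an ancestor of both a and b."""
    common = ancestors_with_self(a, parents) & ancestors_with_self(b, parents)
    return max(common, key=lambda t: depth(t, parents))


lca = lowest_common_ancestor("glucoside transport", "glucose transport", PARENTS)
```

In this toy hierarchy the two terms share only "carbohydrate transport" and "transport" as ancestors, and the deeper of the two is returned.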
GO possesses the properties that are desirable in a functional classification system; in fact, its design incorporated them from the start. The main properties are:
a) Wide coverage
b) Standardized format
c) Hierarchical structure
d) Disjoint categories
e) Multiple functions
f) Dynamic nature
The conceptual foundations of GO have produced great results, both in a quantitative and qualitative sense. The wide availability of software tools for GO [GO Consortium 2006] has also substantially enhanced its utility for experimental and computational biologists.
GO is the way to go for the field of protein function prediction.
Sequence and Structure based methods for protein function prediction
Another approach to protein function prediction relies on transferring functional knowledge from sequences similar to the query, found with search tools such as BLAST, FASTA or SSEARCH. Prediction of protein function from sequence can be categorized into three classes: sequence homology-based approaches, subsequence-based approaches and feature-based approaches.
Sequence homology-based approaches: Simple transfer of annotation from the most homologous sequence.
Sub-sequence based methods: This approach treats segments or subsequences (motifs and domains) of proteins as features of a protein sequence and constructs models that map these features to protein function.
Feature-based approaches: The final category of approaches exploits the perspective that the amino acid sequence is a unique characterization of a protein and determines several of its physical and functional features. These approaches transform each sequence into a set of such features, use standard classification algorithms to learn models of functional classes from the transformed features, and then use these models to make predictions for uncharacterized proteins. The most commonly used classifiers in this class of approaches are support vector machines (SVM), neural networks (NN) and the naive Bayesian classifier.
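As a rough sketch of the feature-based idea, the example below turns each sequence into amino acid counts and trains a small multinomial naive Bayes classifier with add-one smoothing. Everything in it is an assumption made for illustration: the choice of amino acid composition as the feature set, the class name `CompositionNaiveBayes`, the made-up training sequences, and the two toy class labels, which are not real functional classes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Feature vector: count of each of the 20 standard amino acids."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    return [counts[a] for a in AMINO_ACIDS]


class CompositionNaiveBayes:
    """Multinomial naive Bayes over amino acid counts (add-one smoothing)."""

    def fit(self, sequences, labels):
        totals = {}
        for seq, label in zip(sequences, labels):
            vec = totals.setdefault(label, [0] * len(AMINO_ACIDS))
            for i, count in enumerate(composition(seq)):
                vec[i] += count
        label_counts = Counter(labels)
        self.log_prior = {
            label: math.log(n / len(labels)) for label, n in label_counts.items()
        }
        self.log_likelihood = {}
        for label, vec in totals.items():
            denom = sum(vec) + len(AMINO_ACIDS)
            self.log_likelihood[label] = [math.log((c + 1) / denom) for c in vec]
        return self

    def predict(self, seq):
        feats = composition(seq)
        scores = {
            label: self.log_prior[label]
            + sum(f * ll for f, ll in zip(feats, self.log_likelihood[label]))
            for label in self.log_prior
        }
        return max(scores, key=scores.get)


# HYPOTHETICAL training data: invented sequences and toy labels.
train_sequences = [
    "LLVVAIILFFGLLAVIG",
    "AVLLIIFFWLLGAVVIL",
    "KRKRDEKKRRSEEKRDK",
    "RKKDEERKRKSRRDEKK",
]
train_labels = ["hydrophobic_toy", "hydrophobic_toy", "charged_toy", "charged_toy"]

model = CompositionNaiveBayes().fit(train_sequences, train_labels)
prediction = model.predict("LIVFALLGIV")
```

In practice the feature set would be far richer (physicochemical properties, predicted secondary structure and so on), and an SVM or neural network could be substituted for the naive Bayes model without changing the overall workflow.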
Enhanced sequence based methods
The Protein Function Prediction (PFP) algorithm extends a conventional PSI-BLAST search. Along with strong PSI-BLAST hits that have significant E-values, PFP also uses weak hits that are not generally considered for transferring annotations. Weakly similar hits that are not recognized as homologous to the query sequence are used in PFP because they may share common functional domains or some functional similarity at a broader level. GO terms extracted from the retrieved sequences are ranked by a score that takes into account the E-values assigned to those sequences.
Flow chart of the PFP algorithm.
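The scoring equation did not survive on this page, so the following is only a sketch of E-value-weighted ranking under stated assumptions, not the PFP equation itself. It assumes that every retrieved hit contributes a weight of -log10(E-value) + b to each GO term it carries, with an offset b = 2 so that weak hits with E-values up to 100 still contribute a non-negative amount. The function name `rank_go_terms`, the input layout, and the E-value floor are hypothetical, and any additional weighting the published method applies is left out.

```python
import math
from collections import defaultdict


def rank_go_terms(hits, b=2.0, min_e_value=1e-300):
    """Rank GO terms by summed E-value weights over all search hits.

    hits: iterable of (e_value, go_terms) pairs, one per retrieved sequence.
    b: ASSUMED offset that keeps weak hits (E-value <= 10**b) non-negative.
    min_e_value: ASSUMED floor that avoids log10(0) for reported E-values of 0.
    """
    scores = defaultdict(float)
    for e_value, go_terms in hits:
        weight = -math.log10(max(e_value, min_e_value)) + b
        for term in go_terms:
            scores[term] += weight
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# HYPOTHETICAL search result: one strong hit and one weak hit.
example_hits = [
    (1e-50, ["GO:0003677"]),
    (10.0, ["GO:0003677", "GO:0003824"]),
]
ranking = rank_go_terms(example_hits)
```

The point of the offset is visible in the weak hit: with an E-value of 10 it would normally be discarded, yet here it still adds a small positive weight to both of its terms.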
Protein Domain prediction
SMART (Simple Modular Architecture Research Tool) is a research tool and database that provides domain identification and annotation on the web. SMART detects extracellular domains and domains that mediate cell signaling events. Domains in both eukaryotic and prokaryotic regulatory systems can also be detected. Sequences are additionally searched for signal peptides, coiled coils, low complexity regions and transmembrane domains.
Computational Approach for Protein Function prediction: A survey
Protein function prediction (PFP)