Full slides of the talk

Computational Protein Function Prediction

Ontology based methods
Proteins are biological macromolecules responsible for a wide range of activities, such as forming the structure of organs (structural proteins), catalyzing the biochemical reactions necessary for metabolism (enzymes), and maintaining the cellular environment (transmembrane proteins).
The early approaches to predicting protein function were experimental and usually focused on a specific target gene or protein, or on a small set of proteins forming natural groups such as protein complexes. These approaches included gene knockout, targeted mutation and the inhibition of gene expression. However, correctly assigning functional annotations to newly sequenced genomes from their sequence information remains a daunting task, and it is not feasible to run conventional experimental procedures on every sequence to recover its function. Researchers have therefore focused on using homology, or sequence similarity, to transfer annotations to newly sequenced proteins with popular homology search algorithms such as BLAST and FASTA. Although homology is a sound basis for inferring function in the light of evolution, in practice it is not always trivial to extract correct function information from a sequence database search result.

The concept of protein function is highly context-sensitive and not very well-defined. In fact, this concept typically acts as an umbrella term for all types of activities that a protein is involved in, be it cellular, molecular or physiological. One such categorization of the types of functions a protein can perform has been suggested by Bork et al. [1998]:
  • Molecular function : The biochemical functions performed by a protein, such as ligand binding, catalysis of biochemical reactions and conformational changes.
  • Cellular function : Many proteins come together to perform complex physiological functions, such as operation of metabolic pathways and signal transduction, to keep the various components of the organism working well.
  • Phenotypic function : The integration of the physiological subsystems, consisting of various proteins performing their cellular functions, and the interaction of this integrated system with environmental stimuli determines the phenotypic properties and behavior of the organism.
The three categories are not independent, but rather are hierarchically related. To make function prediction computationally manageable, we need to transform descriptive biological knowledge into qualitative and quantitative models, which requires a robust and accessible biological information system. Protein functions, or annotations, have long been described with vocabularies that are conventional only within a particular research community or research group.
In recent years, controlled functional vocabularies have been developed, including the Gene Ontology (GO), Enzyme Commission (EC) numbers, the MIPS Functional Catalogue (FunCat), the Transporter Classification System and KEGG Orthology.

Gene Ontology
The Gene Ontology (GO) Consortium [17] of collaborating databases has developed a structured controlled vocabulary to describe gene function. GO terms are arranged hierarchically in a Directed Acyclic Graph (DAG) and are separated into three categories: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Cellular component terms indicate the anatomical part of the cell to which a protein belongs, for example ribosome (GO:0005840) or nucleus (GO:0005634). Biological process terms describe assemblies of molecular functions that achieve a well-defined task through a series of cellular events, for example carbohydrate metabolism (GO:0005975) or regulation of transcription (GO:0045449). Molecular function terms represent activities carried out at the molecular level by proteins or complexes, for example catalytic activity (GO:0003824) or DNA binding (GO:0003677). Each GO term thus has a category, an identifier in the format GO:xxxxxxx, and a definition that explains its meaning.
All terms in GO other than the root term are linked to some more general term by an is_a, part_of, positively_regulates, or negatively_regulates relationship. For example, glucose transport (GO:0015758) is_a hexose transport (GO:0008645), which ultimately is_a transport (GO:0006810). Because of these relationships, a protein annotated with term X is automatically annotated with all ancestor terms of X.
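The ancestor rule can be illustrated with a small sketch. It assumes a toy child-to-parents map containing only the is_a chain named in the text, with the intermediate terms between hexose transport and transport collapsed into one edge. The names GO_PARENTS, ancestors and propagate are hypothetical and not part of any GO library.

```python
# Toy child -> parents map. Only the is_a chain mentioned in the text is
# included; intermediate terms between hexose transport and transport are
# collapsed into a single edge (a simplification, not the real GO graph).
GO_PARENTS = {
    "GO:0015758": ["GO:0008645"],  # glucose transport is_a hexose transport
    "GO:0008645": ["GO:0006810"],  # hexose transport (ultimately) is_a transport
    "GO:0006810": [],              # transport, the top of this toy graph
}


def ancestors(term, parents=GO_PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=GO_PARENTS):
    """Expand a set of direct annotations with all of their ancestor terms."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded


# A protein annotated only with glucose transport also receives the
# hexose transport and transport terms.
print(sorted(propagate({"GO:0015758"})))
```

Because a DAG term can have several parents, the traversal keeps a `seen` set so that shared ancestors are visited only once.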

Figure: Partial Gene Ontology hierarchy describing the ancestors of the terms glucoside transport and glucose transport. Double-lined arrows show the path to the Lowest Common Ancestor (LCA) of the two terms.
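One way to compute a lowest common ancestor in a DAG is to intersect the ancestor sets of the two terms and keep the deepest shared term. The sketch below takes that approach. The edges in TOY_PARENTS are invented for illustration and do not reproduce the figure or the real GO graph, and all function names are hypothetical.

```python
# Hypothetical toy DAG (child -> parents). The edges are made up for
# illustration and are NOT the real GO relationships.
TOY_PARENTS = {
    "root": [],
    "transport": ["root"],
    "carbohydrate transport": ["transport"],
    "hexose transport": ["carbohydrate transport"],
    "glucose transport": ["hexose transport"],
    "glucoside transport": ["carbohydrate transport"],
}


def ancestors_with_self(term, parents):
    """Return `term` together with all terms reachable via parent edges."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term, parents):
    """Length of the longest path from `term` up to a root."""
    direct = parents[term]
    return 0 if not direct else 1 + max(depth(p, parents) for p in direct)


def lowest_common_ancestor(a, b, parents):
    """Deepest term shared by the ancestor sets of `a` and `b`.

    Assumes a single-rooted graph, so the intersection is never empty.
    Ties between equally deep terms are broken arbitrarily.
    """
    common = ancestors_with_self(a, parents) & ancestors_with_self(b, parents)
    return max(common, key=lambda term: depth(term, parents))


print(lowest_common_ancestor("glucose transport", "glucoside transport", TOY_PARENTS))
```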

GO possesses the properties desirable in a functional classification system, and its design deliberately incorporated them. The major properties are: a) wide coverage, b) standardized format, c) hierarchical structure, d) disjoint categories, e) multiple functions, and f) dynamic nature.
The conceptual foundations of GO have produced strong results, both quantitatively and qualitatively. The AmiGO browser [GO Consortium 2006] has substantially enhanced the utility of GO for experimental and computational biologists. GO is therefore the annotation scheme of choice for protein function prediction.
Sequence and Structure based methods for protein function prediction
Another family of methods predicts protein function by transferring functional knowledge from sequences similar to the query, found using BLAST, FASTA or SSEARCH. Prediction of protein function from sequence can be categorized into three classes: sequence homology-based approaches, subsequence-based approaches and feature-based approaches.
  • Homology-based approaches : Simple transfer of annotation from the most homologous sequence.
  • Subsequence-based approaches : These treat segments or subsequences (motifs and domains) of proteins as features of a protein sequence and construct models that map these features to protein function.
  • Feature-based approaches : These exploit the perspective that the amino acid sequence is a unique characterization of a protein and determines several of its physical and functional features. They use standard classification algorithms to learn models of functional classes from the transformed set of features, and then use these models to make predictions for uncharacterized proteins. The most commonly used classifiers in this class of approaches are support vector machines (SVM), neural networks (NN) and the naive Bayesian classifier.
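As a rough illustration of the feature-based idea, the sketch below turns each sequence into amino acid counts and fits a small multinomial naive Bayes classifier in plain Python. The training sequences and the labels "membrane" and "dna_binding" are made-up toy data, and the function names are hypothetical. Real systems use far richer features and larger training sets.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train(labelled_sequences):
    """Fit Laplace-smoothed residue frequencies and a prior for each class."""
    residue_counts = {}
    class_counts = Counter()
    for sequence, label in labelled_sequences:
        class_counts[label] += 1
        residue_counts.setdefault(label, Counter()).update(
            aa for aa in sequence if aa in AMINO_ACIDS
        )
    total = sum(class_counts.values())
    model = {}
    for label, counts in residue_counts.items():
        denominator = sum(counts.values()) + len(AMINO_ACIDS)
        model[label] = {
            "log_prior": math.log(class_counts[label] / total),
            "log_likelihood": {
                aa: math.log((counts[aa] + 1) / denominator) for aa in AMINO_ACIDS
            },
        }
    return model


def predict(model, sequence):
    """Return the class with the highest log-posterior for `sequence`."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][aa] for aa in sequence if aa in AMINO_ACIDS
        )
    return max(model, key=score)


# Made-up toy data: hydrophobic-rich versus lysine/arginine-rich sequences.
TRAINING = [
    ("LLVVAILLFFGAVLIL", "membrane"),
    ("IVLLAFFVGLLAIVAL", "membrane"),
    ("KRKRRKKSRKRGKKRR", "dna_binding"),
    ("RRKKSGRKRKKRRSKK", "dna_binding"),
]

model = train(TRAINING)
print(predict(model, "LLIVAVLFLG"))
```

Residue composition is only one possible transformation of the sequence. The same train and predict split applies when the features are physicochemical properties and the classifier is an SVM or a neural network.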

Enhanced sequence based methods
The protein function prediction (PFP) algorithm extends a conventional PSI-BLAST search. Along with strong PSI-BLAST hits that have significant E-values, PFP also uses weak hits that are not generally considered for transferring annotations. Weakly similar hits that are not recognized as homologous to the query sequence are used in PFP because they may share common functional domains or some functional similarity at a broader level. GO terms extracted from the retrieved sequences are ranked by a score that takes into account the E-values assigned to those sequences. The flow chart of PFP is given below.

Figure: Flow chart of the PFP algorithm.
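The E-value-weighted ranking can be sketched as follows. This is a simplified illustration, not the published PFP implementation: it assumes each GO term is scored by summing -log10(E) + b over the retrieved sequences that carry it, with an offset b = 2 chosen here so that hits with E-values up to 100 still contribute. Any further weighting used by PFP, such as associations between GO terms, is omitted. The hit list and function name are hypothetical.

```python
import math
from collections import defaultdict


def rank_go_terms(hits, b=2.0):
    """Rank GO terms by the summed weight of the hits that carry them.

    `hits` is a list of (e_value, go_terms) pairs. Each hit contributes
    -log10(e_value) + b to every GO term it is annotated with; the offset
    `b` (an assumption of this sketch) lets weak hits add a small score.
    """
    scores = defaultdict(float)
    for e_value, go_terms in hits:
        # Guard against an E-value reported as exactly 0.
        weight = -math.log10(max(e_value, 1e-300)) + b
        if weight <= 0:
            continue  # hit too weak to contribute under this offset
        for term in set(go_terms):
            scores[term] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical search results: one strong hit, one moderate hit, one weak hit.
hits = [
    (1e-50, ["GO:0003677"]),
    (1e-3, ["GO:0003677", "GO:0003824"]),
    (10.0, ["GO:0003824"]),
]

for term, score in rank_go_terms(hits):
    print(term, round(score, 2))
```

In this toy input the weak hit with an E-value of 10 would be discarded by a conventional significance cutoff, but here it still adds a small amount to the score of its GO term.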
Protein Domain prediction
SMART (Simple Modular Architecture Research Tool) is a tool and database that provides domain identification and annotation on the web. SMART detects extracellular domains and domains that mediate cell signaling events. Domains in both eukaryotic and prokaryotic regulatory systems can also be detected. Sequences are also searched for signal peptides, coiled coils, low complexity regions and transmembrane domains.
Learning Goal
Computational Approach for Protein Function prediction: A survey

References: Protein function prediction (PFP)