Data Science | DSDHT
Predictive modeling of drug-protein interactions
In this module, we look at how mathematical and supervised machine learning methods like Linear Regression, Random Forests, Support Vector Machines and Bayesian methods can be used with chemical structure data, descriptors, and experimental biological training data to predict the interaction of chemical compounds and drugs with protein targets.
Predictive modeling and QSAR in cheminformatics
The establishment of Structure-Activity Relationships (SAR) in medicinal chemistry predates the use of computers in chemistry, and relies on correlating structural features with experimental results for multiple compounds, usually in the same series. It is common in medicinal chemistry to use synthesis techniques to create several related compounds (e.g. methyl-, ethyl-, butyl- forms), and then to investigate the effect of these synthetic changes on a particular property or biological activity (so we might find, for instance, that extending the alkyl chain reduces a particular activity). The relationship between structure and activity may or may not be quantified.
Quantitative Structure-Activity Relationships (QSAR) were originally designed as an attempt to add some mathematical basis to this process, particularly to define the activity as some function of descriptors (note that when the activity is a property or a toxicity, this is sometimes referred to as QSPR and QSTR respectively). If we develop a function that relates descriptors to a particular activity, we can then use the function predictively for compounds where the activity is unknown but the descriptors can be calculated.
The earliest examples of QSAR were Hansch analysis and Free-Wilson analysis, which are actually applications of linear regression. Hansch analysis pertained to property descriptors, and Free-Wilson, which we shall discuss here, to structural descriptors. Free-Wilson defined a function that equates activity (defined as the log of 1/C, where C is the concentration) with weighted descriptors, the weightings, or coefficients, being determined by linear regression. That is, we have the equation:
Log (1/C) = a1x1 + a2x2 + a3x3 ...
where C is the concentration required for activity, x1, x2, x3, etc. are the descriptor values (usually 1 or 0 to represent presence or absence of features), and a1, a2, a3, etc. are the coefficients derived from linear regression. Linear regression is a general technique that optimizes the coefficients applied to the independent variables so that the predicted value of the dependent variable (in this case Log (1/C)) most closely matches the observed value for each compound in the data set. Thus one can think of a regression equation being trained using data with known dependent values, and then being applied predictively to data with unknown dependent values. Linear regression works by minimizing the sum of the squared differences between the values predicted by the equation and the actual observations. This is nicely illustrated in a
If a regression equation is to be used predictively, then we need some way of gauging its accuracy. The simplest way to do this is with r-squared, which is the proportion of the variance in the dependent variable that is explained by the regression equation (i.e. if r-squared = 1.0, then all the actual points lie on the regression line; if r-squared = 0.0, then the variance around the regression line is as high as the overall variance of the dependent variable).
There is a problem with r-squared, though: the same data that is used to build the equation is also used to evaluate it. This can be addressed using q-squared (sometimes called cross-validated r-squared). Here, we make n versions of the equation, each built leaving out one of the n original known values (it is thus an example of leave-one-out validation), and use each version to predict the value that was left out. q-squared is then calculated in the same way as r-squared, but from the errors of these left-out predictions rather than from the fitted residuals. q-squared is therefore generally lower than r-squared.
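The regression, r-squared and q-squared steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the procedure used in any particular QSAR package: the descriptor matrix and activity values are invented, the function names are hypothetical, and a constant term (not shown in the equation above) is added so that r-squared is well defined.

```python
def solve_linear_system(A, b):
    """Solve A.w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit_least_squares(X, y):
    """Return [constant, a1, a2, ...] minimising the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in X]  # leading 1.0 is the assumed constant term
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)


def predict(coefs, x):
    """Apply the regression equation to one descriptor vector."""
    return coefs[0] + sum(a * xi for a, xi in zip(coefs[1:], x))


def r_squared(X, y):
    """Proportion of variance explained, evaluated on the training data itself."""
    coefs = fit_least_squares(X, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(coefs, x)) ** 2 for x, yi in zip(X, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def q_squared_loo(X, y):
    """Leave-one-out q-squared: each value is predicted by a model built without it."""
    mean_y = sum(y) / len(y)
    press = 0.0  # sum of squared errors of the left-out predictions
    for i in range(len(y)):
        coefs = fit_least_squares(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coefs, X[i])) ** 2
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


# Hypothetical Free-Wilson-style data: each row flags presence (1) or absence (0)
# of two substituents; the activities are invented Log (1/C) values.
descriptors = [[0, 0], [1, 0], [0, 1], [1, 1], [1, 0], [0, 1]]
activities = [4.0, 5.1, 4.6, 5.8, 4.9, 4.5]

coefficients = fit_least_squares(descriptors, activities)
r2 = r_squared(descriptors, activities)
q2 = q_squared_loo(descriptors, activities)
print("coefficients:", [round(c, 3) for c in coefficients])
print("r-squared: %.3f  q-squared: %.3f" % (r2, q2))
```

For real data sets a numerical library would normally replace the hand-written solver; it is written out here only to keep the sketch self-contained.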
Nonlinear approaches to QSAR
The main drawback of these early approaches is that they assume that the activity varies linearly with the descriptor values that affect it. However, this is usually not the case. Nonlinear approaches still try to correlate descriptors and outcomes, but do not make this assumption. They are thus at least theoretically more useful, although there is usually some trade-off (such as speed, scalability or interpretability). Nonlinear approaches are generally an example of supervised learning (as opposed to unsupervised learning such as clustering; however, unsupervised methods may also be employed). The method used will also sometimes depend on the kind of QSAR that is to be determined. In particular, there is a difference between classification problems (such as predicting whether compounds are active or inactive) and quantitative prediction problems (where we want to predict an activity value). Some of the most frequently used nonlinear methods for QSAR are:
Neural Networks
Decision Trees (such as Recursive Partitioning and Random Forests)
Support Vector Machines
Different methods have different strengths and weaknesses: for example, neural nets are a "black box" approach and thus are not useful if we want to know why a particular prediction was made. Decision Trees are used mainly for classification problems, although regression-tree variants exist for quantitative prediction.
Regardless of the method used, building a model will generally be done in three phases: training (presenting known data to build the model); validation (testing the model with known data that has not been presented to build the model, such as a validation set); and prediction (using the model for truly unknown data). This is illustrated below.
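In addition, the three phases can be sketched in code. The example is a minimal illustration, not a method taken from this module: it swaps in a simple k-nearest-neighbour classifier with Tanimoto similarity (a technique not discussed above) because it fits in a few lines, and the fingerprints, labels and class name are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NearestNeighbourClassifier:
    """Hypothetical toy model: majority vote among the k most similar training compounds."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []

    def train(self, fingerprints, labels):
        # Phase 1 (training): present compounds whose class is known.
        self.examples = list(zip(fingerprints, labels))

    def predict(self, fingerprint):
        ranked = sorted(self.examples,
                        key=lambda ex: tanimoto(fingerprint, ex[0]),
                        reverse=True)
        votes = [label for _, label in ranked[:self.k]]
        return max(set(votes), key=votes.count)


# Invented fingerprints: bits 1 and 2 mark the "active" scaffold, bits 7 and 8 the "inactive" one.
actives = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 3, 4}, {1, 2, 6}, {1, 2, 3, 5}]
inactives = [{7, 8, 3}, {7, 8, 4}, {7, 8, 5}, {7, 8, 3, 4}, {7, 8, 6}, {7, 8, 3, 5}]

# Hold back the last two compounds of each class as a validation set.
train_x = actives[:4] + inactives[:4]
train_y = ["active"] * 4 + ["inactive"] * 4
valid_x = actives[4:] + inactives[4:]
valid_y = ["active"] * 2 + ["inactive"] * 2

model = NearestNeighbourClassifier(k=3)
model.train(train_x, train_y)

# Phase 2 (validation): score the model on known compounds it has not seen.
valid_predictions = [model.predict(fp) for fp in valid_x]
accuracy = sum(p == t for p, t in zip(valid_predictions, valid_y)) / len(valid_y)
print("validation accuracy:", accuracy)

# Phase 3 (prediction): apply the model to a compound of unknown activity.
unknown_compound = {1, 2, 9}
print("predicted class:", model.predict(unknown_compound))
```

The fixed split here keeps the sketch deterministic; in practice the training and validation sets would usually be chosen randomly or by a diversity-based selection.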
Evaluation of predictive models
If predictive models are to be properly evaluated there are a few basic principles that should be adhered to:
For publication, public datasets should be used, and the method and descriptors used should be made freely available or be described well enough that a reader could replicate the experiment
A validation set should always be used, and any success statistics should be based on the validation set, not the training set
For classification problems, always create a confusion matrix. From this, you can derive measures like sensitivity and specificity, and precision and recall. For large sets, particularly for virtual screening applications, it is appropriate to show a ROC curve (one can also calculate AUC, or area under the curve).
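For a classification model, the counts of true and false positives and negatives (a confusion matrix) and the measures derived from them can be computed as in the following sketch. The labels, predictions and scores are invented validation-set results, the function names are hypothetical, and the AUC is computed with the pairwise ranking formulation (the fraction of active/inactive pairs in which the active compound receives the higher score), which is equal to the area under the ROC curve.

```python
def confusion_matrix(actual, predicted, positive="active"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def ratio(numerator, denominator):
    """Safe division; undefined ratios are reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def derived_measures(cm):
    """Sensitivity, specificity, precision and recall from a confusion matrix."""
    return {
        "sensitivity": ratio(cm["tp"], cm["tp"] + cm["fn"]),
        "specificity": ratio(cm["tn"], cm["tn"] + cm["fp"]),
        "precision": ratio(cm["tp"], cm["tp"] + cm["fp"]),
        "recall": ratio(cm["tp"], cm["tp"] + cm["fn"]),  # same quantity as sensitivity
    }


def roc_auc(actual, scores, positive="active"):
    """Area under the ROC curve via pairwise comparison of scores (ties count 0.5)."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented validation-set results for eight compounds.
actual = ["active"] * 4 + ["inactive"] * 4
predicted = ["active", "active", "active", "inactive",
             "active", "inactive", "inactive", "inactive"]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]  # model's confidence in "active"

cm = confusion_matrix(actual, predicted)
measures = derived_measures(cm)
auc = roc_auc(actual, scores)
print(cm)
print(measures)
print("AUC:", auc)
```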
The use of predictive models is growing, since they aim to provide fast, reliable and reasonably accurate estimates of chemicals' activity. These features also make them suitable for legislative purposes, which is why they have been included as an alternative tool for risk assessment in the new European legislation on chemical production, called REACH (Registration, Evaluation, Authorisation and Restriction of Chemicals).
Check out the series of videos on QSARs.
ORCHESTRA is an EU project, funded to disseminate recent research on in silico (computer-based) methods for evaluating the toxicity of chemicals.
A free online QSAR course is also available.
Here is a video describing how to perform QSAR in R.
History of QSAR (a must read)
Modeling methods in QSAR/QSPR
Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection