Assignment 1: Predictive Modeling for Chemical Compound Toxicity

What you will do in this assignment.
In this assignment, you will build a predictive model in R or KNIME that predicts how chemical compounds interfere with biochemical pathways, using only chemical structure data. Models of this kind could, for example, become decision-making tools that help government agencies determine which environmental chemicals and drugs are of greatest potential concern to human health.

You will do this using the Tox21 Data Challenge 2014 dataset, which contains known biological activity data for a set of compounds. The dataset is provided here, or you can register and download the full training dataset. For individual datasets, please use this link. In the datasets, "1" means active and "0" means inactive. The final evaluation dataset is also available for download.

First, watch this video by Abhik, which runs through the process you will carry out for this assignment.

You can also use KNIME to prepare the model; a video on how to use KNIME for predictive modeling is given below. You need to extend this workflow by including more machine learning classifiers and by taking the features into account.

Preparation and Data Cleaning
First, remove the salts and solvents from the training dataset; you can use the RDKit cartridge to do that. The data as downloaded is not yet ready for training. After this cleanup, generate descriptors, remove constant columns, and use a low variance / correlation filter to remove further columns. The videos show how to build this in KNIME.

Data Modeling
You can generate the descriptors in KNIME and then continue to use KNIME for the modeling. If you are not comfortable with KNIME, I recommend using R for the modeling. I have not generated the descriptors for you, so you need to generate them yourself in KNIME and export the data in CSV format for further processing in R.

There is no strict requirement to use KNIME; you can use R for your project too. KNIME lets you design the workflow with less effort than R requires.

Preparation in R
If you are using R, you will need to load the following packages: ROCR, rpart, randomForest, party.
The code is available on GitHub.

Descriptor Generation in KNIME
To generate descriptors for the chemical compounds, use KNIME nodes such as the RDKit Fingerprint node (Morgan [radius = 4, number of bits = 2048], MACCS) and the CDK Fingerprint node for the PubChem fingerprint. If you want to use R for machine learning, you need to generate three kinds of fingerprints (Morgan, MACCS and PubChem) and export the data to CSV for processing in R.
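Once the fingerprints are exported, loading them into R might look like the sketch below. The file name maccs.csv, the outcome column name Activity, and the helper load_fingerprints are assumptions for illustration only; use whatever names your KNIME export actually produced. To stay self-contained, the example first writes a tiny invented file to a temporary folder.

```r
# Hypothetical helper: read a fingerprint CSV exported from KNIME and split it
# into the fingerprint bits (x) and the 0/1 activity label (y).
load_fingerprints <- function(path, outcome = "Activity") {
  df <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot(outcome %in% names(df))          # fail early if the label is missing
  bits <- df[, setdiff(names(df), outcome), drop = FALSE]
  list(x = bits, y = df[[outcome]])
}

# Tiny invented file so the sketch runs on its own (not Tox21 data).
demo_path <- file.path(tempdir(), "maccs.csv")
write.csv(data.frame(bit1 = c(0, 1, 1), bit2 = c(1, 1, 0), Activity = c(0, 1, 0)),
          demo_path, row.names = FALSE)

maccs <- load_fingerprints(demo_path)
dim(maccs$x)     # rows = compounds, columns = fingerprint bits
table(maccs$y)   # class balance: 1 = active, 0 = inactive
```

Checking dim() and table() right after loading is a quick way to confirm that the export contains the expected number of bits and that both classes are present.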

Data Cleaning
An important part of modeling is "cleaning" the data, i.e. making sure extraneous data is removed, the data set is complete and consistent, and the variables are independent. In this case, we need to remove constant (or near-constant) columns and highly correlated columns. After downloading the two files for your chosen assay, apply the R code below to remove the constant or near-constant columns, and then filter out the correlated columns. Note that you need to apply this to both the PubChem and MACCS sets: the code below uses pubchem, so replace it with the name you gave when you read in the data (pubchem, maccs).
## Removing Constant or Near-Constant Columns
# Use dim() to check the dimensions of the data frame first, e.g. dim(pubchem).
# Columns 1:881 are the PubChem fingerprint bits; column 882 is the outcome column.
d <- data.frame(pubchem[, 1:881])
# Flag columns in which more than 90% of the values are 0 (NA values count as non-zero).
dropc <- vapply(d, function(x) sum(x == 0, na.rm = TRUE) / length(x) > 0.9, logical(1))
d <- d[, !dropc, drop = FALSE]
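The correlation filter mentioned above can be sketched in base R as follows. This is one simple approach under stated assumptions, not the code from the course repository: the helper name drop_correlated and the 0.9 cutoff are illustrative choices, and the toy data is invented. It assumes constant columns have already been removed, since their correlation is undefined.

```r
# Hypothetical helper: keep the first column of every highly correlated group
# and drop any later column whose absolute Pearson correlation with an
# earlier column exceeds `cutoff`.
drop_correlated <- function(d, cutoff = 0.9) {
  cm <- abs(cor(d))
  cm[upper.tri(cm, diag = TRUE)] <- 0   # keep only pairs (i, j) with i > j
  # Row i now holds the correlations of column i with all earlier columns.
  dropc <- apply(cm, 1, function(r) any(r > cutoff))
  d[, !dropc, drop = FALSE]
}

# Invented toy data: a_copy duplicates a, while b is uncorrelated with a.
toy <- data.frame(
  a      = c(0, 1, 0, 1, 0, 1, 0, 1),
  a_copy = c(0, 1, 0, 1, 0, 1, 0, 1),
  b      = c(0, 0, 1, 1, 0, 0, 1, 1)
)
filtered <- drop_correlated(toy)
names(filtered)   # "a" "b"
```

This filter is slightly aggressive: a column can be dropped because it correlates with another column that was itself dropped. For fingerprint data that is usually acceptable, but the cutoff is worth reporting alongside your results.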
Before doing the final prediction and plotting, perform cross-validation on your dataset. (Note: cross-validation is not explained in the video.)

3. Cross-Validation Test
Now that the data is clean, we will use it for cross-validation tests. You may want to know which of several algorithms, such as Naive Bayes, SVM and Random Forest, performs best; cross-validation is one way to check. We will use 10-fold cross-validation on the full dataset to check the performance of the model. Code is provided that performs K-fold cross-validation with the Random Forest algorithm; it is set to K = 5, so adjust K accordingly. You can modify the code to see how the other algorithms perform. Select the algorithm with the highest AUC score and then move on to the next step.
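To make the comparison of algorithms concrete, here is a minimal sketch of K-fold cross-validation that reports one AUC per fold. It is not the provided course code. The names auc_score, cv_auc, glm_learner and rf_learner are hypothetical, the data is synthetic, and the AUC is computed with the rank-sum identity in base R instead of ROCR so that the block runs without extra packages. Logistic regression is used as a stand-in learner for the demonstration; rf_learner shows how Random Forest could be plugged in, assuming the randomForest package is installed.

```r
# AUC via the rank-sum (Mann-Whitney) identity; labels must be 0/1.
auc_score <- function(scores, labels) {
  r <- rank(scores)                        # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# K-fold cross-validation. fit_predict(train_x, train_y, test_x) must return
# one score per test row (higher = more likely active).
cv_auc <- function(x, y, k, fit_predict) {
  folds <- sample(rep(seq_len(k), length.out = nrow(x)))  # random fold labels
  sapply(seq_len(k), function(i) {
    test <- folds == i
    scores <- fit_predict(x[!test, , drop = FALSE], y[!test],
                          x[test, , drop = FALSE])
    auc_score(scores, y[test])
  })
}

# Stand-in learner from base R: logistic regression.
glm_learner <- function(train_x, train_y, test_x) {
  fit <- glm(y ~ ., data = cbind(train_x, y = train_y), family = binomial)
  predict(fit, newdata = test_x, type = "response")
}

# Random Forest learner (only works if the randomForest package is installed).
rf_learner <- function(train_x, train_y, test_x) {
  fit <- randomForest::randomForest(x = train_x, y = factor(train_y))
  predict(fit, test_x, type = "prob")[, "1"]
}

# Synthetic data for illustration only (not Tox21).
set.seed(42)
x <- data.frame(f1 = rnorm(200), f2 = rnorm(200))
y <- as.integer(x$f1 + rnorm(200) > 0)

aucs <- cv_auc(x, y, k = 5, fit_predict = glm_learner)
mean(aucs)   # average AUC across folds; compare this number between learners
```

Passing the learner as a function keeps the fold logic identical for every algorithm, so differences in mean AUC reflect the algorithm and not the split. Note that a fold containing only one class gives an undefined AUC, which can happen with very imbalanced assays and small folds.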

4. Data Splitting and Modeling
After performing cross-validation, use the best algorithm to build the final predictive model. Split the dataset into a training set containing 80% of the data and a test set containing the remaining 20%. Use the default parameters of the algorithm to build your final model. (Cross-validation can also be used to tune model parameters, but this is not required in this assignment. However, if you would like to test the performance with different parameters, you are welcome to do so.)
We have provided the full R code to do this on GitHub, and instructions are given in the video above.
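As an illustration of the 80/20 split, the sketch below draws the training rows separately within each class, so that the rarer active class appears in both sets. Stratifying is an assumption on my part, not something the assignment requires; a plain sample() over all rows would also satisfy the instruction. The helper name split_train_test and the labels are invented.

```r
# Hypothetical helper: stratified random split of row indices by 0/1 label.
split_train_test <- function(y, train_frac = 0.8) {
  train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
    # Sample train_frac of the rows within this class.
    idx[sample.int(length(idx), size = floor(train_frac * length(idx)))]
  }))
  train_idx <- sort(unname(train_idx))
  list(train = train_idx, test = setdiff(seq_along(y), train_idx))
}

set.seed(123)
y <- c(rep(0, 90), rep(1, 10))   # invented labels, 10% active
parts <- split_train_test(y)

length(parts$train)    # 80 rows for training
length(parts$test)     # 20 rows for testing
table(y[parts$test])   # both classes are present in the test set
```

The returned indices can then be used to subset the fingerprint table, for example d[parts$train, ] for training and d[parts$test, ] for testing.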

5. Prediction and Plotting
We can now use the model to predict the test set and see how well it performs. For each model, generate predictions for the test set and evaluate them using the predict, prediction and performance functions. Calculate the classification statistics for your model. Plot the ROC curve for your model for two different types of fingerprints: MACCS and Morgan (radius = 4, number of bits = 2048). Use the varImpPlot function from the randomForest package to plot the important variables (variables of interest) for your trained model.
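The prediction and plotting step might look like the following sketch for a single fingerprint set. It assumes the randomForest and ROCR packages are installed, and it uses synthetic bits in place of real fingerprints, so the variable names and data are illustrative and not taken from the provided GitHub code. Repeat the same steps for each fingerprint type you want to compare.

```r
library(randomForest)
library(ROCR)

# Synthetic stand-in for a fingerprint table (invented, not Tox21 data).
set.seed(7)
n <- 300
x <- as.data.frame(matrix(rbinom(n * 10, 1, 0.5), nrow = n))
y_num <- as.integer(x$V1 + x$V2 + rnorm(n, sd = 0.5) > 1)   # 1 = active

train <- sample(n, size = round(0.8 * n))                   # 80/20 split
rf <- randomForest(x = x[train, ], y = factor(y_num[train]))  # default parameters

# Predicted probability of class "1" for the held-out 20%.
prob <- predict(rf, x[-train, ], type = "prob")[, "1"]

pred <- prediction(prob, y_num[-train])          # ROCR prediction object
roc  <- performance(pred, "tpr", "fpr")          # ROC curve coordinates
auc  <- performance(pred, "auc")@y.values[[1]]   # area under the curve

plot(roc, main = sprintf("ROC (AUC = %.2f)", auc))
abline(0, 1, lty = 2)                            # chance line for reference
varImpPlot(rf)                                   # most influential bits
```

Using type = "prob" matters here: an ROC curve needs continuous scores, and the default class predictions from predict would give only a single point on the curve.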

6. Discuss which models are good and which are poor
For each model (Morgan, PubChem and MACCS), calculate the confusion matrix (true/false positives and negatives), accuracy, sensitivity and specificity, and state at the end which model you think is better and why. A classificationsummary.R script is provided to help you calculate the confusion matrix, accuracy, specificity and sensitivity.
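For reference, the statistics named above can be computed by hand as in the sketch below. This is not the provided classificationsummary.R script; the function name classification_summary and the example labels are invented. It assumes "1" (active) is the positive class and that predicted probabilities have already been turned into 0/1 calls, for example with a 0.5 threshold.

```r
# Hypothetical helper: confusion matrix and summary statistics for 0/1 labels.
classification_summary <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)   # true positives
  tn <- sum(predicted == 0 & actual == 0)   # true negatives
  fp <- sum(predicted == 1 & actual == 0)   # false positives
  fn <- sum(predicted == 0 & actual == 1)   # false negatives
  list(
    confusion = matrix(c(tp, fn, fp, tn), nrow = 2,
                       dimnames = list(predicted = c("1", "0"),
                                       actual    = c("1", "0"))),
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),   # fraction of actives that were found
    specificity = tn / (tn + fp)    # fraction of inactives correctly rejected
  )
}

# Invented example: 3 actives, 7 inactives.
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1)
res <- classification_summary(actual, predicted)
res$confusion
res$accuracy      # 0.7
res$sensitivity   # 2/3
res$specificity   # 5/7
```

Because inactive compounds usually far outnumber actives in toxicity assays, accuracy alone can look good for a model that finds few actives, so sensitivity and specificity should be discussed together.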

7. Submit your results
Submit a PDF file for Assignment 1 that contains:
(i) Your ROC plots for the Morgan, PubChem and MACCS fingerprints,
(ii) The confusion matrices,
(iii) A short description of which model you think is the best and why,
(iv) Your R code or KNIME workflow in ZIP format.