Data Source
Datasets from three confirmatory bioassay screens, based on the microplate Alamar Blue assay for identifying inhibitors of Mycobacterium tuberculosis in 7H12 broth, were obtained from the PubChem database of the National Center for Biotechnology Information (NCBI) [20]. These correspond to assay identifiers AID1626, AID1949 and AID1332 [21, 22]. All three assays were conducted through the Tuberculosis Antimicrobial Acquisition and Coordinating Facility (TAACF) [23]. The total number of compounds tested in each assay, along with the number of compounds identified as active, inactive or inconclusive, is listed in Additional file 3. Compounds that showed > 30% inhibition at at least one concentration in the antimicrobial activity dose response were defined as "Active". If the inhibition at all doses in the Mtb assay was < 30%, the compound was defined as "Inactive". Compounds with a percent inhibition > 80% that were not selected for follow-up dose response were labeled "Inconclusive". Inconclusive compounds were excluded from the training dataset to avoid introducing uncertainty into the predictive ability of the models. All three confirmatory screens were downloaded in SDF format, and the corresponding bioactivity data were obtained from PubChem as comma-separated files. Figure 4 depicts the general flow of the strategy employed for data processing, analysis, model building and validation.
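This labeling rule can be written as a simple decision procedure. The sketch below is purely illustrative; the method and its inputs are hypothetical names of ours, not fields of the PubChem records:

```java
public class ActivityLabel {

    /**
     * Illustrative sketch of the labeling rule quoted above. Returns
     * "Active", "Inactive" or "Inconclusive"; null for compounds the
     * quoted rule does not cover. All names here are hypothetical.
     */
    public static String label(double[] doseResponseInhibition,
                               double primaryScreenInhibition) {
        if (doseResponseInhibition == null) {
            // No follow-up dose response was run for this compound
            return primaryScreenInhibition > 80.0 ? "Inconclusive" : null;
        }
        for (double inhibition : doseResponseInhibition) {
            if (inhibition > 30.0) {
                return "Active";   // >30% inhibition at >=1 concentration
            }
        }
        return "Inactive";         // <30% inhibition at all doses
    }
}
```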
Descriptor generation and dataset preparation
2D molecular descriptors for all compounds were generated with the molecular visualization software PowerMV [24]. Owing to the large number of molecules in datasets AID1626 and AID1949, these files were split into smaller files using the Perl script SplitSDFiles from MayaChemTools [25] prior to loading into PowerMV. Bioactivity values were appended to the data as the class attribute (label: Outcome). A total of 179 descriptors were generated for each dataset. Details of the various descriptors are provided in Additional file 4, and a comparative account of the contribution of each descriptor to the molecular properties of all compounds is summarized in Additional file 5. Bit-string fingerprint attributes with no variation, i.e. those containing only 0s or only 1s throughout the dataset, were filtered out to reduce the dimensionality of the dataset. The datasets were ordered by class, and a bespoke Perl script was used to randomly split them into two parts: 80% of the data was used as the training-cum-validation set and 20% as an independent test set. In n-fold cross-validation, the dataset is randomly reordered and then split into n folds of equal size. In each iteration, one fold is used for testing and the other n-1 folds are used for training the classifier. The test results are collected and averaged over all folds, giving the cross-validated estimate of the accuracy. Accounting for computational complexity, 5-fold cross-validation was performed on the larger datasets and 10-fold cross-validation on the smaller dataset.
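The original split was done with a bespoke Perl script, but the same preparation can be reproduced in Weka's Java API along the following lines; the file name, random seeds and the simple (non-stratified) split are illustrative assumptions of ours:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.RemoveUseless;

public class PrepareAndCrossValidate {
    public static void main(String[] args) throws Exception {
        // Load descriptors plus the Outcome class attribute
        // (file name is a placeholder for the PowerMV output)
        Instances data = DataSource.read("aid1332_descriptors.arff");
        data.setClassIndex(data.numAttributes() - 1);   // "Outcome" is last

        // Drop constant fingerprint bits: RemoveUseless discards
        // attributes that never vary across the dataset
        RemoveUseless constantFilter = new RemoveUseless();
        constantFilter.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, constantFilter);

        // 80/20 split into training-cum-validation and independent test sets
        filtered.randomize(new Random(42));             // seed is arbitrary
        int trainSize = (int) Math.round(filtered.numInstances() * 0.8);
        Instances train = new Instances(filtered, 0, trainSize);
        Instances test  = new Instances(filtered, trainSize,
                                        filtered.numInstances() - trainSize);

        // n-fold cross-validation on the training set (10 folds shown here)
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(new NaiveBayes(), train, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```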
Machine Learning
We used the Weka data mining toolkit (version 3.6) for analysis of the data and for the classification experiments [26]. Weka incorporates a comprehensive collection of machine learning algorithms for data mining tasks, together with tools for data pre-processing, classification, regression, clustering, association rule mining and visualization. The toolkit is also well suited for developing new machine learning schemes.
Classification Algorithms
Classification refers to an algorithmic procedure that attempts to assign each input instance to one of a given set of classes. The classification process requires building a classifier (model), a mathematical function that assigns class labels (e.g. active/inactive) to instances defined by a set of attributes (e.g. descriptors). In the present study, we compared four state-of-the-art classifiers, namely Naïve Bayes, Random Forest, SMO and J48. The salient features of each classifier are described below.
Naïve Bayes (NB) is based on Bayes' rule [27], which gives the probability of an event occurring given the probability of another event that has already occurred. The Naïve Bayes classifier learns, from the training dataset, the conditional probability of each attribute given the class label. The approach assumes that all descriptors are statistically independent (i.e. the presence of one has no effect on another) and therefore considers each of them individually. The probability of a molecule belonging to one or the other class is taken to be proportional to the proportion of members of that class sharing the descriptor value, and the overall probability of activity is computed as the product of the individual probabilities.
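In equation form, the class assigned to a molecule with descriptor values x1, ..., xn is the one that maximizes the product of the class prior and the individual conditional probabilities:

```latex
\hat{c} = \arg\max_{c \,\in\, \{\text{active},\,\text{inactive}\}} P(c) \prod_{i=1}^{n} P(x_i \mid c)
```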
Random Forest (RF) is an ensemble of decision trees [28]. To grow the ensemble, random vectors are generated from the training set by randomly selecting a subset of descriptors for each tree, so that a different set of descriptors is used to build each tree. After a large number of trees have been grown, the prediction is made by a vote for the most popular class.
Sequential Minimal Optimization (SMO) is an implementation of Support Vector Machines (SVM) [29]. An SVM is a hyperplane that separates members of one class from members of the other (actives and inactives in this case) with maximum margin. Unlike standard SVM training, which requires solving a single large quadratic programming (QP) optimization problem, SMO breaks the problem into the smallest possible QP sub-problems and solves them analytically. This makes SMO less costly in computation time and able to handle large datasets, so it is practical to apply to very large datasets.
J48 is an implementation of the C4.5 decision tree learner [30]. A decision tree model is constructed from root to leaves in a top-down fashion by choosing the most appropriate attribute at each internal (decision) node; each leaf indicates a class.
Training Classifiers
The majority of bioassay datasets are imbalanced, with one class represented by a large number of molecules and the other by few [31], as is also observed in the present datasets (Additional file 3). This class imbalance problem cannot be dealt with successfully by standard data mining methods, as most classifiers assume equal weighting of the classes and equal costs for all misclassification errors; this assumption does not hold in real-world applications [32]. Cost-sensitive learning is often used to deal with datasets with very imbalanced class distributions. Introducing a misclassification cost makes standard classifiers cost-sensitive and gives them the ability to predict the class that leads to the lowest expected cost [33]. The setting of the misclassification cost is almost always arbitrary, as no general rule exists for it. It has previously been shown that the misclassification cost depends more on the base classifier used than on the minority class ratio or the number of attributes [10].
There are two ways to introduce misclassification cost into classifiers: one is to design cost-sensitive algorithms directly, and the other is to use a wrapper that converts existing base classifiers into cost-sensitive ones. The latter is called meta-learning [32]. In Weka, meta-learners are used to make the base classifiers cost-sensitive. MetaCost [34] first uses bagging on decision trees to obtain reliable probability estimates for the training examples, relabels the classes of the training examples accordingly, and then uses the relabeled training instances to build a cost-insensitive classifier. CostSensitiveClassifier [35] uses a cost-insensitive algorithm to obtain probability estimates for each test instance and then predicts the class label that minimizes the expected misclassification cost.
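Formally, given a cost matrix C(i, j) specifying the cost of predicting class j when the true class is i, the cost-sensitive prediction for an instance x is the class that minimizes the expected cost:

```latex
\hat{c}(x) = \arg\min_{j} \sum_{i} P(i \mid x)\, C(i, j)
```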
For our datasets, which use a two-class schema (i.e. active/inactive), cost sensitivity is introduced through a 2 × 2 cost matrix for binary classification. The four cells of the cost matrix can be read as: True Positives (TP): actives classified as actives; False Positives (FP): inactives classified as actives; True Negatives (TN): inactives classified as inactives; False Negatives (FN): actives classified as inactives. A key point considered during development of the classifiers is that the percentage of false negatives matters more than the percentage of false positives for compound selection. To attain this, one can minimize the number of false negatives at the expense of increasing the false positives: increasing the misclassification cost for false negatives leads to an increase in both false positives and true positives. The percentage of false positives can easily be kept in check by setting an upper limit on the FP rate; in this case the limit was set to a maximum of 20%. The misclassification cost for false negatives could then be increased until this limit was reached. In cases where the standard classifiers already produced this result, cost-sensitivity was not used and only the default classifiers were employed.
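This tuning procedure can be sketched in Weka's API as follows; the doubling step size, the 128× cost ceiling, the choice of Naïve Bayes as base classifier and the assumption that class index 0 is "active" are illustrative choices of ours, not part of the original protocol:

```java
import java.util.Random;

import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.core.Instances;

public class FalseNegativeCostSearch {

    /** Raises the FN cost until the cross-validated FP rate hits the 20% cap. */
    public static CostMatrix tuneFalseNegativeCost(Instances train) throws Exception {
        CostMatrix best = null;
        for (double fnCost = 1.0; fnCost <= 128.0; fnCost *= 2) {
            // Rows = true class, columns = predicted class.
            // Class index 0 is assumed to be "active" (depends on the ARFF header).
            CostMatrix matrix = new CostMatrix(2);
            matrix.setCell(0, 1, fnCost); // false negative: active -> inactive
            matrix.setCell(1, 0, 1.0);    // false positive: inactive -> active

            CostSensitiveClassifier csc = new CostSensitiveClassifier();
            csc.setClassifier(new NaiveBayes()); // any base classifier
            csc.setCostMatrix(matrix);

            Evaluation eval = new Evaluation(train, matrix);
            eval.crossValidateModel(csc, train, 5, new Random(1));

            if (eval.falsePositiveRate(0) > 0.20) {
                break; // 20% FP ceiling exceeded; keep the previous cost
            }
            best = matrix;
        }
        return best; // highest FN cost whose FP rate stayed within the limit
    }
}
```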
Default Weka options were used for building the NB and RF models, whereas for J48 the unpruned tree option and for SMO the buildLogisticModels option were employed. For J48, MetaCost was used, as it works better for unpruned trees; it treats the underlying classifier as a black box, requiring no knowledge of its functioning. With NB, RF and SMO, the standard CostSensitiveClassifier was used.
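Concretely, these four configurations correspond to Weka setups along the following lines (a sketch only; the cost matrix is assumed to come from a tuning step such as the one above):

```java
import weka.classifiers.Classifier;
import weka.classifiers.CostMatrix;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.meta.MetaCost;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class ClassifierSetup {

    /** Wraps a base classifier in the standard CostSensitiveClassifier. */
    static Classifier costSensitive(Classifier base, CostMatrix costs) {
        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(base);
        csc.setCostMatrix(costs);
        return csc;
    }

    static Classifier[] buildAll(CostMatrix costs) {
        // NB and RF with default Weka options
        Classifier nb = costSensitive(new NaiveBayes(), costs);
        Classifier rf = costSensitive(new RandomForest(), costs);

        // SMO with logistic models fitted to its outputs
        SMO smo = new SMO();
        smo.setBuildLogisticModels(true);
        Classifier smoCs = costSensitive(smo, costs);

        // Unpruned J48 wrapped in MetaCost (bagging-based relabeling)
        J48 j48 = new J48();
        j48.setUnpruned(true);
        MetaCost metaCost = new MetaCost();
        metaCost.setClassifier(j48);
        metaCost.setCostMatrix(costs);

        return new Classifier[] { nb, rf, smoCs, metaCost };
    }
}
```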
Model Performance Evaluators
Various statistical binary classification performance measures were used to evaluate the results. The True Positive Rate (TPR) is the proportion of actual positives that are predicted positive, TP/(TP+FN). The False Positive Rate (FPR) is the ratio of predicted false actives to the actual number of inactives, FP/(FP+TN). Accuracy indicates the proximity of the results to the true values and is calculated as (TP+TN)/(TP+TN+FP+FN). Sensitivity, TP/(TP+FN), relates to the test's ability to identify positive results, whereas Specificity, TN/(TN+FP), relates to its ability to identify negative results; a test with high sensitivity and specificity has a low error rate. The Balanced Classification Rate (BCR), 0.5 × (sensitivity + specificity), is the mean of sensitivity and specificity and gives a combined measure that provides a balanced accuracy for unbalanced datasets. In addition to BCR, the Matthews correlation coefficient (MCC), which ranges from -1 to +1, is also employed to judge performance on unbalanced datasets. A Receiver Operating Characteristic (ROC) curve is a graphical plot of TPR vs. FPR for a binary classification system; ROC space is defined by FPR and TPR on the X and Y axes respectively. The area under the ROC curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
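All of these quantities follow directly from the confusion-matrix counts. The helper below is a minimal sketch of ours (Weka's Evaluation class also reports TPR, FPR and ROC AUC directly), including the standard MCC formula:

```java
/** Minimal sketch: binary-classification measures from confusion-matrix counts. */
public class BinaryMetrics {
    public static double sensitivity(double tp, double fn) { return tp / (tp + fn); }
    public static double specificity(double tn, double fp) { return tn / (tn + fp); }

    public static double accuracy(double tp, double tn, double fp, double fn) {
        return (tp + tn) / (tp + tn + fp + fn);
    }

    /** Balanced Classification Rate: mean of sensitivity and specificity. */
    public static double bcr(double tp, double tn, double fp, double fn) {
        return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp));
    }

    /** Matthews correlation coefficient, in [-1, +1]. */
    public static double mcc(double tp, double tn, double fp, double fn) {
        double denom = Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return denom == 0 ? 0 : (tp * tn - fp * fn) / denom;
    }
}
```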