Exploration of interaction scoring criteria in the CANDO platform

Objective: Ascertain the optimal interaction scoring criteria for the Computational Analysis of Novel Drug Opportunities (CANDO) platform for shotgun drug repurposing to improve benchmarking performance, thereby enabling more accurate prediction of novel therapeutic drug-indication pairs.
Results: We have investigated and enhanced the interaction scoring criteria in the bioinformatic docking protocol in the newest version of our platform (v1.5), with the best performing interaction scoring criterion yielding increased benchmarking accuracies from 11.7% in v1 to 12.8% in v1.5 at the top10 cutoff (the most stringent one) and correspondingly from 24.9% to 31.2% at the top100 cutoff.
Electronic supplementary material: The online version of this article (10.1186/s13104-019-4356-3) contains supplementary material, which is available to authorized users.


Generation of compound-proteome interaction signatures
We quantify the compound-protein interaction strength using combinations of the OBscore and/or BSscore described above. When applied to the corresponding libraries, this generates a compound-protein interaction matrix (Figure 1c), where each row of this matrix, i.e. the compound-proteome interaction signature, describes how each compound interacts with the entire multiorganism protein library.
In CANDO v1.5, we use the OBscore and BSscore to populate the interaction matrix for the following pipelines: Best OB, Best BS, Best OB+BS, and Best OBxBS. The matrix values for each compound-protein interaction in the first two pipelines use the OBscore: Best OB is the highest OBscore between the compound and all predicted binding site ligands for each protein, while Best BS is the OBscore that corresponds to the best local binding site prediction using COFACTOR. The last two pipelines add or multiply the OBscore and BSscore for each compound-protein interaction; the highest sum or product between the compound and the predicted binding site ligands is chosen as the interaction score.
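The four scoring criteria can be illustrated with a minimal sketch for a single compound-protein pair. The function name, the dictionary keys, and the representation of each predicted binding site ligand as an (OBscore, BSscore) pair are assumptions made for illustration only and are not taken from the CANDO code base.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one (OBscore, BSscore) pair per predicted
# binding site ligand of a single protein, scored against one compound.
LigandScores = List[Tuple[float, float]]


def pipeline_scores(ligands: LigandScores) -> Dict[str, float]:
    """Return one interaction score per pipeline for a compound-protein pair."""
    if not ligands:
        # Assumption: no predicted binding site ligand means no interaction.
        return {"best_ob": 0.0, "best_bs": 0.0, "best_sum": 0.0, "best_prod": 0.0}
    return {
        # Best OB: highest OBscore over all binding site ligands.
        "best_ob": max(ob for ob, _ in ligands),
        # Best BS: OBscore of the ligand with the highest BSscore.
        "best_bs": max(ligands, key=lambda pair: pair[1])[0],
        # Best OB+BS: highest sum of the two scores.
        "best_sum": max(ob + bs for ob, bs in ligands),
        # Best OBxBS: highest product of the two scores.
        "best_prod": max(ob * bs for ob, bs in ligands),
    }


# Toy values, not real CANDO scores.
scores = pipeline_scores([(0.30, 1.8), (0.55, 0.9), (0.10, 2.0)])
```

With these toy values, Best OB and Best BS select different ligands, which shows why the two pipelines can produce different interaction signatures for the same compound.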

Calculating interaction signature similarities
Every compound-proteome interaction signature is compared to all other signatures (Figure 1d) using the root-mean-square deviation (RMSD). This procedure generates a symmetric matrix (with zeroes along the diagonal) of similarity scores, where a lower RMSD indicates more similar signatures. These scores are hypothesised to represent how functionally similar each compound is to all the others in the context of the protein structure library.
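As a minimal sketch of this step, the all-against-all RMSD comparison can be written in pure Python as follows. The function names and the toy signatures are illustrative assumptions; a production implementation would likely use vectorised array operations for a matrix of this size.

```python
import math
from typing import List, Sequence


def rmsd(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square deviation between two equal-length signatures."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rmsd_matrix(signatures: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric compound-compound matrix with zeroes on the diagonal."""
    n = len(signatures)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = rmsd(signatures[i], signatures[j])
            matrix[i][j] = matrix[j][i] = d  # fill both triangles
    return matrix


# Toy signatures over three proteins (hypothetical values).
toy_signatures = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
dist = rmsd_matrix(toy_signatures)
```

In this toy case the first two compounds have identical signatures, so their RMSD is zero, while the third differs from both.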

Ranking drug lists and benchmarking metrics
The RMSDs in each row of the compound-compound similarity matrix are sorted to yield ranked similarity lists for each compound (Figure 1e). Each drug associated with an indication is held out in turn, and we check whether any of the remaining drugs associated with that indication is captured within a certain cutoff of its ranked list (Figure 1f). The cutoffs typically used are top10, top25, top50, and top100, reflecting the 10 to 100 compounds ranked most similar to a given drug.
This procedure is repeated iteratively for all drugs associated with every indication for a particular cutoff, resulting in the indication accuracy. Mathematically, indication accuracy is calculated using the formula $\frac{c}{d} \cdot 100$, where c is the number of times at least one drug with the same indication was captured within a particular cutoff and d is the total number of drugs approved for that indication. Taking the mean of these accuracies (for all 1439 indications with at least two approved drugs) gives the average indication accuracy for a pipeline at a particular cutoff.
The other benchmarking metrics used are the average pairwise accuracy, which is a weighted average of all indication accuracies based upon the number of approved drugs for each indication, and the indication coverage, which is the number of indications with a non-zero accuracy (i.e., at least one approved drug that was left out was successfully recaptured within a cutoff).
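The leave-one-out benchmark and the three metrics can be sketched as below. This is an illustration under stated assumptions rather than the platform's implementation: compounds are referred to by integer index, the function names are hypothetical, and the average pairwise accuracy is assumed to weight each indication accuracy by its number of approved drugs.

```python
from typing import Dict, List, Sequence


def ranked_neighbors(dist: Sequence[Sequence[float]], i: int) -> List[int]:
    """All other compounds, sorted by increasing RMSD to compound i."""
    return sorted((j for j in range(len(dist)) if j != i), key=lambda j: dist[i][j])


def indication_accuracy(dist, drugs: List[int], cutoff: int) -> float:
    """c / d * 100 for one indication, holding out each drug in turn."""
    hits = 0
    for drug in drugs:
        top = set(ranked_neighbors(dist, drug)[:cutoff])
        # A hit: at least one other drug for the indication is in the cutoff.
        if any(other in top for other in drugs if other != drug):
            hits += 1
    return 100.0 * hits / len(drugs)


def benchmark(dist, indications: Dict[str, List[int]], cutoff: int) -> Dict[str, float]:
    # Only indications with at least two approved drugs are benchmarked.
    accs = {name: indication_accuracy(dist, drugs, cutoff)
            for name, drugs in indications.items() if len(drugs) >= 2}
    total_drugs = sum(len(indications[name]) for name in accs)
    return {
        "average_indication_accuracy": sum(accs.values()) / len(accs),
        # Assumed weighting: each accuracy weighted by its number of drugs.
        "average_pairwise_accuracy":
            sum(accs[n] * len(indications[n]) for n in accs) / total_drugs,
        "indication_coverage": sum(1 for a in accs.values() if a > 0),
    }


# Toy example: five compounds placed on a line, distance = |difference|.
positions = [0.0, 0.1, 1.0, 1.1, 5.0]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_indications = {"A": [0, 1], "B": [2, 4], "C": [1, 3, 4]}
result = benchmark(toy_dist, toy_indications, cutoff=1)
```

In the toy example, indication A is fully recaptured, B is never recaptured, and C is recaptured for one of its three drugs, so two of the three indications contribute to coverage.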

Generation of random controls
We devised two types of random controls. First, we generated random compound-proteome interaction matrices to compare the efficacy of the v1 and v1.5 pipelines against a control. For each compound-protein interaction score, we randomly selected a value from a uniform distribution between 0.0 and 1.0 to populate a 3,733 (compounds) by 46,784 (proteins) interaction matrix. We benchmarked this matrix, as described above, to ascertain the benchmarking metrics (average indication accuracy, average pairwise accuracy, and indication coverage) for all cutoffs (top10, top25, top50, and top100). This protocol was repeated 100 times and the resulting averaged metrics were used as the random control.
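A minimal sketch of the first control is given below, scaled down to a toy matrix so that it runs quickly. The function names are hypothetical, and `benchmark_fn` stands in for whatever benchmarking routine is applied to each random matrix; here a placeholder that returns the mean matrix value is used purely to exercise the averaging loop.

```python
import random
from typing import Callable, Dict, List

Matrix = List[List[float]]


def random_interaction_matrix(n_compounds: int, n_proteins: int,
                              rng: random.Random) -> Matrix:
    """Matrix of scores drawn uniformly between 0.0 and 1.0."""
    return [[rng.uniform(0.0, 1.0) for _ in range(n_proteins)]
            for _ in range(n_compounds)]


def averaged_control(benchmark_fn: Callable[[Matrix], Dict[str, float]],
                     n_compounds: int, n_proteins: int,
                     repeats: int = 100, seed: int = 0) -> Dict[str, float]:
    """Benchmark `repeats` random matrices and average each metric."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    totals: Dict[str, float] = {}
    for _ in range(repeats):
        metrics = benchmark_fn(random_interaction_matrix(n_compounds, n_proteins, rng))
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / repeats for name, total in totals.items()}


def placeholder_benchmark(matrix: Matrix) -> Dict[str, float]:
    """Stand-in metric (mean score); not a CANDO benchmarking metric."""
    values = [v for row in matrix for v in row]
    return {"mean_score": sum(values) / len(values)}


# Toy dimensions; the text uses 3,733 compounds by 46,784 proteins.
control = averaged_control(placeholder_benchmark, n_compounds=5, n_proteins=20)
```

Passing the benchmarking routine in as a callable keeps the control generation independent of how the metrics are computed.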
As a second random control, we calculated the hypergeometric distribution for the leave-one-out benchmarking protocol at each cutoff using Equation 1.
Here, p is the probability of recapturing at least one drug (k) approved for the same indication as the "left out" drug within the topn cutoff, given a population (N) of 3732 compounds, of which K are approved for the indication. The probability was calculated and averaged across all indications, for which the number of approved drugs, K, varies.
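Equation 1 itself is not reproduced in this text. As a hedged illustration, the sketch below assumes the standard hypergeometric form for drawing at least one success, p = 1 - C(N-K, n) / C(N, n), where n is the cutoff size; this may differ in detail from the published equation. Whether K should include the left-out drug is not specified here, so the sketch takes K exactly as supplied by the caller. The function names are hypothetical.

```python
from math import comb
from typing import Iterable


def p_at_least_one(N: int, K: int, n: int) -> float:
    """P(at least one of K successes appears in n draws from N, no replacement)."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    # comb(N - K, n) counts the draws that contain no success at all;
    # it is 0 when n > N - K, which gives a probability of 1.
    return 1.0 - comb(N - K, n) / comb(N, n)


def average_probability(N: int, Ks: Iterable[int], n: int) -> float:
    """Mean of the per-indication probabilities for one cutoff."""
    probs = [p_at_least_one(N, K, n) for K in Ks]
    return sum(probs) / len(probs)


# Toy values of K for three hypothetical indications at the top10 cutoff.
p_single = p_at_least_one(N=3732, K=1, n=10)
p_mean = average_probability(N=3732, Ks=[1, 5, 20], n=10)
```

For K = 1 the expression reduces to n / N, which gives a quick sanity check on the implementation.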

Additional results
Increasing the signal in the matrices and pipeline in CANDO v1.5
Using the v1.5 pipeline, we decreased the noise in the compound-proteome interaction matrix relative to v1. Specifically, in v1 more than 800 compounds in our library had completely null interaction signatures (as shown in Figure S1), i.e., every compound-protein interaction for these ≈ 800 compounds received a score of 0.0. We reduced that number to fewer than 50 compounds, all of which are chemical ions that fail in the current version of the bioinformatic docking protocol. In addition, ≈ 66% of all compound-protein interaction scores in v1 were assigned a score of 0: each compound-proteome interaction signature in v1 contained, on average, 38,492/46,784 (median = 36,778) null or zero interaction scores, whereas the average number of null interaction scores in v1.5 has decreased significantly to 792 (median = 40).
These changes resulted in an increase from 11.7% in v1 to 12.8% in v1.5 for the top10 average indication accuracy, corresponding to a 9.4% relative increase between versions (1.1/11.7). Furthermore, a greater improvement in accuracy between versions was observed at higher cutoffs, with a 25% relative increase for the top100 average indication accuracy (6.3/24.9). This indicates a greater capability of our platform to recapture known drugs for each indication, as well as to more accurately predict putative repurposable drugs for all indications.

Variation of OBscore and BSscore threshold values
In v1, we used a threshold value of 1.1 for the ROCSscore and BSscore to determine if a protein-compound interaction would occur, based on an analysis of structure-ligand complexes [16]. For v1.5, we benchmarked the Best OB matrix for each incremental increase in OBscore (0 to 1) and BSscore (0 to 2) values individually to determine how these thresholds influence the overall benchmarking result.
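The threshold scan can be sketched as follows, using the increments of 0.1 for the OBscore and 0.2 for the BSscore reported for Figure S2. The function names are hypothetical, and keeping scores that are exactly equal to the threshold is an assumption made here.

```python
from typing import List

Matrix = List[List[float]]


def apply_threshold(matrix: Matrix, threshold: float) -> Matrix:
    """Keep scores at or above the threshold; set all others to zero."""
    # Assumption: a score equal to the threshold is retained.
    return [[s if s >= threshold else 0.0 for s in row] for row in matrix]


def threshold_grid(max_value: float, step: float) -> List[float]:
    """Evenly spaced thresholds from 0.0 to max_value inclusive."""
    count = round(max_value / step)
    # Rounding removes floating-point noise such as 0.30000000000000004.
    return [round(i * step, 10) for i in range(count + 1)]


def count_nonzero(matrix: Matrix) -> int:
    """Number of interaction scores that survive thresholding."""
    return sum(1 for row in matrix for s in row if s != 0.0)


ob_thresholds = threshold_grid(1.0, 0.1)  # OBscore: 0.0 to 1.0
bs_thresholds = threshold_grid(2.0, 0.2)  # BSscore: 0.0 to 2.0

# Toy OBscore matrix (hypothetical values) scanned over the OBscore grid.
toy_matrix = [[0.2, 0.8], [0.5, 1.0]]
surviving = [count_nonzero(apply_threshold(toy_matrix, t)) for t in ob_thresholds]
```

Each thresholded matrix would then be benchmarked in the usual way, so that the metrics can be plotted against the threshold value.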
As shown in Figure S2, both OBscore and BSscore thresholds were increased incrementally and independently from 0 to their corresponding theoretical maxima of 1 and 2, respectively. As the OBscore threshold is increased, the resulting accuracies and coverages decrease, eventually approaching zero (Fig. S2). This is because the ligand and compound must be nearly identical chemically to have a high OBscore. However, few of the compounds in the CANDO library are chemically similar to the ligands present in the binding sites of the template proteins from COFACTOR, let alone identical (where the OBscore would be 1). The results in Figure S3 indicate that as the OBscore threshold increases, the number of nonzero OBscores in the interaction matrix decreases, and the average indication accuracies and coverage decrease correspondingly (Figure S2).
Additional file 1: Figure S1. Increased signal in v1.5 interaction matrix (axis label: interaction score). In v1, about 800 compounds had all-zero signatures, whereas in v1.5 all of these compounds have non-zero signatures. The average indication accuracy increased by 1.1 percentage points (from 11.7% to 12.8%) for the top10 cutoff, and by 6.3 percentage points for the top100 cutoff (from 24.9% to 31.2%). Upgrading the CANDO v1 pipeline to v1.5 increased the signal-to-noise ratio (i.e., more interactions calculated) by about 20% for the compound-proteome interaction matrix.
In contrast, incrementally increasing the BSscore threshold does not significantly affect benchmarking performance (Fig. S2). A large portion of the protein structure library used by the CANDO platform consists of solved PDB structures that overlap highly with the COFACTOR template library. As a result, a considerable amount of signal remains at the highest thresholds (Figure S3). However, the indication coverage appears to decrease by ≈ 30 as the threshold is increased: at high cutoffs the average indication accuracies remain consistent, but the accuracy is averaged over fewer indications.
Figure S2. Change in benchmarking performance when varying the OBscore or BSscore thresholds. The OBscore and BSscore thresholds were set to a value between 0.0-1.0 and 0.0-2.0, respectively; scores above the threshold value were used to populate the compound-proteome interaction matrix and scores below were set to zero. Matrix generation and benchmarking were performed over the full range at increments of 0.1 for OBscore and 0.2 for BSscore. The average indication accuracy, average pairwise accuracy, and indication coverage are shown for four compound-compound similarity list cutoffs: top10 (purple), top25 (magenta), top50 (red), and top100 (yellow). The plots for the OBscore show that as the threshold is increased towards the theoretical maximum of 1.0, the resulting accuracies diminish to nearly 0%. This is because as the threshold becomes more stringent, the number of compounds that have a strong similarity (OBscore greater than the threshold) to the binding site ligands approaches zero, so more of the drugs have near-zero proteome signatures. Without signal to discern compound-proteome interaction signature similarity, the benchmarking produces decreasing accuracies with each increment to the threshold value. In contrast, the BSscore plots show that as the threshold is increased toward the theoretical maximum of 2.0, there is negligible fluctuation in the average indication and pairwise accuracies because there exist predicted binding sites for most proteins in our library that receive a BSscore of 2.0 from COFACTOR. Based upon these results, we used a lower cutoff for the OB and BSscores to obtain the best benchmarking performance.