Enzyme mechanism prediction: a template matching problem on InterPro signature subspaces

Background: We recently reported that one may be able to predict with high accuracy the chemical mechanism of an enzyme by employing a simple pattern recognition approach: a k Nearest Neighbour rule with k = 1 (k1NN) and 321 InterPro sequence signatures as enzyme features. The nearest-neighbour rule is known to be highly sensitive to errors in the training data, in particular when the available training dataset is small. This was the case in our previous study, in which our dataset comprised 248 enzymes annotated against 71 enzymatic mechanism labels from the MACiE database. In the current study, we have carefully re-analysed our dataset and prediction results to “explain” why a high variance k1NN rule exhibited such remarkable classification performance.

Results: We find that enzymes with different chemical mechanism labels in this dataset reside in barely overlapping subspaces in the feature space defined by the 321 features selected. These features contain the appropriate information needed to accurately classify the enzymatic mechanisms, rendering our classification problem a basic look-up exercise. This observation dovetails with the low misclassification rate we reported.

Conclusion: Our results provide explanations for the “anomaly”, namely a basic nearest-neighbour algorithm exhibiting remarkable prediction performance for enzymatic mechanism despite the fact that the feature space was large and sparse. Our results also dovetail well with another finding we reported, namely that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also suggest simple rules that might enable one to inductively predict whether a novel enzyme possesses any of our 71 predefined mechanisms.


Findings
Identification of unknown protein functions is essential for understanding biological processes and beyond [1,2]. Enzymes are proteins whose function is to catalyse chemical reactions in a living cell. Ascertaining enzymatic mechanisms can have important applications for pharmaceutical and industrial processes in which catalysts are involved [1]. For example, identifying the catalytic mechanism(s) of an enzyme could lead to designing new biocatalysts that give significant cost savings over non-biological alternatives in sectors such as laundry, deodorants, foods and agriculture [1].
Unlike predicting enzymatic functions at the level of the chemical reaction performed [2][3][4], the problem of predicting by which molecular mechanism a particular enzyme operates has not been well researched [1]. Two of us, De Ferrari and Mitchell, have recently looked into this question. In that work, we utilised a pattern recognition approach to predict chemical mechanisms from enzyme sequences [1]; to the best of our knowledge, that study was the first attempt to predict enzymatic mechanism in this way.
One notable aspect of that work was the excellent prediction success rate of over 96 % for 248 test enzymes, albeit in a leave-one-out setting, even though the simple k1NN rule [5,6] was the algorithm employed for pattern classification. The k1NN rule is well known to be highly sensitive to errors in the training set [7], in particular when the training dataset is small [7][8][9]. For example, the number of training examples required for a k1NN rule to achieve high classification or prediction accuracy grows exponentially with the number of irrelevant features (noise) [7,9].
In the light of the "anomaly" described above, we have re-analysed that mechanism dataset and our previous classification results, mainly to understand and explain, if possible, the high prediction success rate achieved.
In the following section, we briefly describe our previous work. The "Results" section presents our new findings, and the final section gives our concluding remarks.
To our knowledge, our study was the first attempt at bulk prediction of enzymatic mechanism from protein sequence [1]. The predictive model was an empirical and observational model [10] based on the concept of pattern classification.
Formally, a pattern classification problem deals with the optimal assignment of an object to one of $J$ predefined classes, categories or labels, $\Omega = \{\omega_1, \omega_2, \ldots, \omega_J\}$, whereby it is assumed that the object is adequately characterized by $L$ features, $x_i$ with $i = 1, 2, \ldots, L$. Typically, the object is represented by an $L$-dimensional vector $\mathbf{x}$, whose elements $(x_1, x_2, \ldots, x_L)$ are discriminatory features that ideally can identify the object with a low misclassification error rate. In this regard, the classification task is equivalent to establishing a mapping from the feature space $\chi$ into the class space $\Omega$, such that $\mathbf{x} \in \chi$ is assigned to its appropriate class label $\omega_j \in \Omega$, where $j = 1, 2, \ldots, J$. Each point in the class space has one or more corresponding regions (subspaces) in the feature space defined by the $L$ features.
In our previous study, the feature $x_i$ denotes absence (0) or presence (1) of an InterPro signature for an enzyme sequence, i.e., $x_i \in \{0, 1\}$. In other words, $\chi$ was a binary feature space, $\chi = \{0, 1\}^{L}$. The class space $\Omega$ comprised $J$ discrete points, each representing one of the enzyme mechanism labels $\omega_j$, extracted from Version 3.0 of the MACiE (Mechanism, Annotation and Classification in Enzymes) database [11][12][13].
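As a rough illustration of this binary encoding, the following Python sketch builds 0/1 feature vectors from per-enzyme signature sets. The enzyme names and signature identifiers are invented placeholders rather than real UniProt or InterPro entries, and the helper `encode` is hypothetical.

```python
def encode(signatures_per_enzyme, vocabulary):
    """Return one 0/1 vector per enzyme over a fixed signature vocabulary."""
    index = {sig: i for i, sig in enumerate(vocabulary)}
    vectors = {}
    for enzyme, sigs in signatures_per_enzyme.items():
        x = [0] * len(vocabulary)          # absence (0) by default
        for sig in sigs:
            if sig in index:
                x[index[sig]] = 1          # presence (1) of this signature
        vectors[enzyme] = x
    return vectors


# Toy data: placeholder names only (assumption, not taken from the dataset).
toy_signatures = {"enzA": {"SIG_1", "SIG_3"}, "enzB": {"SIG_2"}}
vocabulary = ["SIG_1", "SIG_2", "SIG_3"]   # here L = 3 instead of 321
vectors = encode(toy_signatures, vocabulary)
```

In the real dataset the vocabulary would hold the 321 selected signatures, so each enzyme would become a point in the 321-dimensional binary space.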
The mapping $f : \chi \to \Omega$ (Eq. 1) was implemented by the simple k1NN classifier. This algorithm can essentially be viewed as a dictionary search [14]. That is to say, all the data points allotted for training are stored in memory (a dictionary in $\chi$), and a test data point is assigned the class label or labels $\omega_j$ of the closest point in the dictionary, i.e., in $\chi$. The specific implementation used in our calculations was Mulan's BRKNN algorithm [5,15].
Generally speaking, the integration process carried out by InterPro's curators removes many of the redundant signature matches that might otherwise occur. This results in a relatively small number of InterPro signatures being present for the typical sequence in this dataset. Thus, the squared nearest neighbour distance often takes small integer values, and it is common to find several nearest neighbours at an equal distance. In this case, the label (or label set) most common amongst the ring of nearest neighbours is assigned.
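The dictionary search and the tie rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mulan BRKNN implementation: the function names are hypothetical, each enzyme carries a single hashable label (a label set could be passed as a `frozenset`), and a tie in the vote itself is broken by first occurrence.

```python
from collections import Counter


def squared_distance(x, y):
    # For 0/1 vectors the squared Euclidean distance is a small integer
    # equal to the number of positions in which the vectors differ.
    return sum((a - b) ** 2 for a, b in zip(x, y))


def predict_1nn(query, dictionary):
    """dictionary: list of (vector, label) pairs stored for look-up."""
    distances = [(squared_distance(query, vec), label)
                 for vec, label in dictionary]
    d_min = min(d for d, _ in distances)
    # All neighbours tied at the minimum distance form the "ring".
    ring = [label for d, label in distances if d == d_min]
    # Assign the most common label in the ring.
    return Counter(ring).most_common(1)[0][0]


# Toy dictionary with placeholder labels (assumption).
toy_dictionary = [([1, 0], "M_A"), ([0, 1], "M_B"), ([0, 1], "M_B")]
prediction = predict_1nn([0, 0], toy_dictionary)
```

In the toy call all three stored points lie at squared distance 1 from the query, so the ring holds one `M_A` and two `M_B` votes.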
The mechanism dataset consists of 248 enzymes annotated against 71 MACiE labels, where each enzyme is represented by 321 InterPro signatures, i.e., $L = 321$ and $J = 71$. We employed a leave-one-out validation scheme: 247 of the enzymes whose mechanisms were known were utilised as a "dictionary" and the mechanism(s) of the one remaining enzyme were predicted, this process being repeated 248 times. The simple pattern recognition approach yielded an excellent prediction success rate of over 96 % for the 248 test enzymes.
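The leave-one-out scheme can be sketched as follows. This is an illustrative Python sketch on invented toy data, assuming single labels per enzyme and a plain nearest-neighbour vote; the function names are hypothetical and the code is not the pipeline used in the original study.

```python
from collections import Counter


def nearest_label(query, dictionary):
    # Count differing positions (equals squared distance for 0/1 vectors).
    dists = [(sum(a != b for a, b in zip(query, vec)), label)
             for vec, label in dictionary]
    d_min = min(d for d, _ in dists)
    ring = [label for d, label in dists if d == d_min]
    return Counter(ring).most_common(1)[0][0]


def leave_one_out_accuracy(data):
    """Hold out each item in turn and predict it from all the others."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        dictionary = data[:i] + data[i + 1:]   # the remaining n - 1 items
        if nearest_label(vec, dictionary) == label:
            hits += 1
    return hits / len(data)


# Toy data with placeholder labels (assumption): two well-separated blocks.
toy = [([1, 1, 0, 0], "M_A"), ([1, 0, 0, 0], "M_A"),
       ([0, 0, 1, 1], "M_B"), ([0, 0, 0, 1], "M_B")]
accuracy = leave_one_out_accuracy(toy)
```

One consequence visible in such a sketch is that a mechanism represented by a single enzyme can never be recovered under leave-one-out, because its only example is removed from the dictionary when it is tested.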

Methods
In the present work, we are not directly concerned with the question of defining enzyme mechanisms; instead, we just use the mechanism dataset. We focus on finding the reasons why the k1NN rule gave us such good classification results for this small dataset, its size being limited by the considerable experimental effort required to characterise enzyme mechanisms.
While directly visualising the 321-dimensional feature space $\chi = \{0, 1\}^{L}$, with $L = 321$, would be impossible, we were able to go through the dataset manually. The mechanism dataset was represented by a 248-by-323 matrix whose rows were the 248 enzymes, and the first and last columns contained the enzyme names (the enzyme sequence's UniProt accession number) and their associated mechanism class labels, respectively. The remaining 321 columns denoted the 321 InterPro signature features.
We systematically swapped the 321 columns containing the InterPro signature features while keeping the rows and the first and last columns of the matrix fixed.
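The column swapping was done by hand and the exact sequence of swaps is not specified here. As one possible automated stand-in, the Python sketch below orders the feature columns by the first row in which each feature is present; it assumes that rows sharing a mechanism label are already adjacent, and the function name and toy matrix are hypothetical.

```python
def reorder_columns(matrix):
    """Return (order, reordered) for a 0/1 matrix given as a list of rows.

    Rows are left untouched; only the columns are permuted.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])

    def first_row(col):
        for r in range(n_rows):
            if matrix[r][col]:
                return r
        return n_rows                      # all-zero columns go last

    # sorted() is stable, so columns with the same first row keep their order.
    order = sorted(range(n_cols), key=first_row)
    reordered = [[row[c] for c in order] for row in matrix]
    return order, reordered


# Toy feature matrix (assumption): rows 0-1 share one label, rows 2-3 another.
toy_matrix = [[0, 1, 0, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 0]]
order, reordered = reorder_columns(toy_matrix)
```

On the toy matrix the features of the first two rows move to the left and those of the last two rows to the right, giving two blocks along the diagonal.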

Results
After a number of iterations, we ended up with a block diagonal version of the original data matrix, see Fig. 1. The figure, a heat map of the data matrix, seems to explain why k1NN yielded the excellent classification results [1]. In the figure, the abscissa denotes InterPro signatures, whereas the vertical axis represents the enzyme sequence's UniProt accession number and the corresponding MACiE enzymatic mechanism labels of the form M0123. The colour yellow signifies that feature $x_i$ (InterPro signature) is present for the enzyme, while the red colour indicates that feature $x_i$ is absent for the enzyme.
According to Fig. 1, the 321 InterPro signatures are highly discriminating features. Enzymes that possess the same enzymatic mechanism $\omega_j$ reside in a subspace (region) in $\chi = \{0, 1\}^{L}$, with $L = 321$, which barely overlaps with neighbouring regions. The inset in Fig. 1 depicts the heatmap of the portion of the dataset that corresponds to the enzymes (and their InterPro signature features) that have MACiE enzymatic mechanism label M0218, i.e. $\omega_j$ = M0218. Note that a subspace for a given mechanism can be a composite (union) of non-overlapping "sub-subspaces". The sharing of the M0218 label by two separate non-homologous sequences illustrates the presence of two distinct proteins, firstly pancreatic lipase and secondly colipase, in the reactive complex.
Out of our 71 regions, only the two regions representing enzymes with MACiE mechanisms $\omega_{30}$ = M0348 and $\omega_{35}$ = M0269 completely overlap. The same four InterPro signature features represent the enzymes that show mechanisms M0348 and M0269, highlighted in red in Table 1.
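A check of this kind could be automated roughly as below. The Python sketch collects, for each mechanism label, the set of features present in any of its enzymes and then reports label pairs whose feature sets intersect. The labels, data and function names are placeholders, and treating "complete overlap" as identical feature sets is our simplifying assumption.

```python
from itertools import combinations


def label_footprints(data):
    """Map each label to the set of feature indices present in its enzymes."""
    footprints = {}
    for vec, label in data:
        present = {i for i, v in enumerate(vec) if v}
        footprints.setdefault(label, set()).update(present)
    return footprints


def overlapping_pairs(footprints):
    """List (label_a, label_b, kind) for every pair sharing a feature."""
    result = []
    for a, b in combinations(sorted(footprints), 2):
        if footprints[a] & footprints[b]:
            kind = "complete" if footprints[a] == footprints[b] else "partial"
            result.append((a, b, kind))
    return result


# Toy data with placeholder labels (assumption).
toy = [([1, 1, 0, 0], "M_A"), ([1, 0, 0, 0], "M_A"),
       ([0, 0, 1, 1], "M_B"), ([0, 0, 1, 1], "M_C")]
pairs = overlapping_pairs(label_footprints(toy))
```

Labels that end up in a "complete" pair cannot be separated by any classifier using these features alone, which is why such pairs account for unavoidable misclassifications.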
We suggest that our block data-matrix could be employed as an enzymatic mechanism prediction tool: a template against which to match novel enzymes to ascertain their potential enzymatic mechanisms in regard to the 71 mechanisms in the mechanism dataset.

Conclusions
In this work, our mechanism dataset was re-analysed to ascertain why a simple but high variance classifier yielded such excellent classification results.
We hope that we have provided a reasonable explanation; the mechanism dataset matrix is block diagonal in the feature and class spaces. In other words, the features (almost) uniquely codify the chemical mechanism of a given enzyme.
Based on these observations, we have also made the suggestion that one might be able to utilise the dataset matrix as an enzymatic mechanism prediction tool.
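One way such template matching might look in practice is sketched below in Python. The rule used here, namely that a novel enzyme is a candidate for a mechanism if it shares at least `min_shared` signatures with that mechanism's block, is a hypothetical illustration rather than a rule taken from this study; all names and data are placeholders.

```python
def build_templates(data):
    """Map each mechanism label to the union of signatures of its enzymes."""
    templates = {}
    for signatures, label in data:
        templates.setdefault(label, set()).update(signatures)
    return templates


def match(novel_signatures, templates, min_shared=1):
    """Return candidate labels, best match first; empty list if none."""
    novel = set(novel_signatures)
    hits = []
    for label, template in templates.items():
        shared = len(template & novel)
        if shared >= min_shared:
            hits.append((shared, label))
    # Sort by number of shared signatures (descending), then by label.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [label for _, label in hits]


# Toy annotated enzymes with placeholder signatures and labels (assumption).
toy = [({"SIG_1", "SIG_2"}, "M_A"), ({"SIG_1"}, "M_A"),
       ({"SIG_3", "SIG_4"}, "M_B")]
templates = build_templates(toy)
candidates = match({"SIG_3"}, templates)
```

An empty result would suggest that the novel enzyme falls outside all of the predefined mechanism blocks, which is the case the template cannot resolve.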