Harnessing the evolutionary information on oxygen binding proteins through Support Vector Machines based modules

Objectives: With the arrival of free oxygen on the globe, aerobic life became possible. It has since become clear that oxygen-binding proteins are widespread in the biosphere and are found in all groups of organisms, including prokaryotes and eukaryotes such as fungi, plants, and animals. The exponential growth and availability of freshly annotated protein sequences in the databases motivated us to develop an improved version of “Oxypred” for identifying oxygen-binding proteins.
Results: In this study, we propose a method for identifying oxy-proteins using two different sequence similarity cutoffs, 50 and 90%. Different amino acid composition based Support Vector Machine models were developed, including models that use evolutionary profiles in the form of a position-specific scoring matrix (PSSM). Fivefold cross-validation was applied to evaluate the prediction performance. We also compared our models with existing methods, which show nearly 97% recognition, whereas our newly developed models recognized almost 99.99 and 100% with the oxy-50 and oxy-90 similarity models, respectively. Our results show that our approaches are faster and achieve better prediction performance than the existing methods. The web-server Oxypred2 was developed as an alternative method for identifying oxy-proteins, with additional modules including PSSM, and is available at http://bioinfo.imtech.res.in/servers/muthu/oxypred2/home.html.
Electronic supplementary material: The online version of this article (10.1186/s13104-018-3383-9) contains supplementary material, which is available to authorized users.


Introduction
Oxygen is an essential part of the atmosphere and is necessary to sustain most terrestrial life, as it is used in respiration and in the regulation of a variety of cellular functions. The oxygen binding proteins (oxy-proteins) of various organisms differ considerably from one another and are classified, mainly by their structure and physicochemical properties, as hemoglobin, hemocyanin, hemerythrin, myoglobin, leghemoglobin, and erythrocruorin. Each oxy-protein has its own functional characteristics and structure, with a unique oxygen-binding capacity [1][2][3][4][5][6][7][8][9][10][11].
A number of computational methods have been proposed for identifying functional proteins from their primary sequences using machine learning approaches [12][13][14]. Such methods always need to be improved, or extended with new features, for identifying protein families and their classes and sub-classes, in order to avoid incorrect negative predictions and to reduce false positive rates.
In 2007, Muthukrishnan et al. developed the Oxypred method for predicting oxygen-binding proteins using simple amino acid composition (AC) and dipeptide composition (DC). The growth of protein sequence databases and the availability of newly annotated oxy-protein sequences in the post-genomic era encouraged us to introduce an improved version of the Oxypred method. We included a recently generated, highly non-redundant dataset in the development of Oxypred2, together with different protein features [15]. Recently, it has been observed that using an evolutionary profile in the form of a position-specific scoring matrix (PSSM) predicts various functional proteins with higher accuracy [16,17]. Hence, we applied several approaches, including the PSSM based evolutionary profile, to improve the prediction quality for oxy-proteins.
In this study, two recently generated non-redundant datasets, with similarity cutoffs of 50 and 90%, were used to develop Oxypred2. Compared with the previous study, the current study adds the PSSM and Hybrid approaches, confusion matrix analysis, prediction score graphs, and ROC analysis as extra features.
Different prediction features are important for understanding the functional behavior of proteins [18][19][20][21]. Here, we compared the prediction performance of the 50 and 90% similarity datasets in all modules to find the best identification of oxy-proteins. The prediction results and their complete analysis show that the developed method, Oxypred2, is an improved version of and an alternative method for identifying oxy-proteins.

PSSM-profile
The PSSM profile provides evolutionary information about residue conservation at a given position in a protein sequence. The PSSM profile was generated using the GPSR package, available at http://www.imtech.res.in/raghava/gpsr/. We applied GPSR programs for PSI-BLAST searches against the non-redundant (nr) database using different numbers of iterations with an E-value cutoff of 0.001 [34,35]. Further, each value was normalized to the range between 0 and 1 by min-max scaling, normalized value = (value − minimum) / (maximum − minimum), so that the minimum score becomes "0" and the maximum score becomes "1".
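The 0 to 1 normalization described above can be illustrated with a minimal sketch. It assumes plain min-max scaling taken over the whole matrix (the text only states that the minimum maps to 0 and the maximum to 1; per-row or per-column scaling is equally possible), and the function name and toy matrix are hypothetical, not part of GPSR.

```python
def normalize_pssm(pssm):
    """Min-max scale every score of a PSSM (a list of rows) into [0, 1].

    Assumption: minimum and maximum are taken over the whole matrix.
    """
    flat = [value for row in pssm for value in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:
        # A constant matrix has no spread; map everything to 0.
        return [[0.0 for _ in row] for row in pssm]
    span = highest - lowest
    return [[(value - lowest) / span for value in row] for row in pssm]


# Toy 2 x 3 matrix standing in for PSI-BLAST scores (real PSSMs are L x 20).
toy_pssm = [[-3, 0, 5], [2, -1, 7]]
normalized = normalize_pssm(toy_pssm)
```

In this toy case the lowest score (−3) becomes 0 and the highest (7) becomes 1, with all other scores placed proportionally in between.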

Evaluation models
We applied fivefold cross-validation, as done by many investigators, with SVM as the prediction engine. In this technique, the dataset was divided into five sets consisting of a nearly equal number of sequences, where four sets were used for training and the remaining set for testing. Training and testing were carried out five times in such a way that each set was used once for testing, and the whole process was repeated 20 times.
The objective of our classifiers is to discriminate oxy-proteins from negative (non-oxy) sequences, and the following terminology is used to evaluate our classifier:
• True positive (TP): a protein is identified as an oxy-protein by both the classifier and the oxy-protein model.
• True negative (TN): a protein is not identified as an oxy-protein by either the classifier or the oxy-protein model.
• False positive (FP): a protein is identified as an oxy-protein by the classifier, but not by the oxy-protein model.
• False negative (FN): a protein is identified as an oxy-protein by the oxy-protein model, but not by the classifier.
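The fivefold procedure can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the toy dataset size of 23, and the choice to reshuffle with a new seed for each of the 20 repeats are assumptions, since the text does not say how the repeats were randomized.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# One fivefold run on a toy dataset of 23 sequences.
splits = list(k_fold_splits(23, k=5, seed=1))

# Assumed reading of "repeated 20 times": reshuffle with a new seed each time.
repeats = [list(k_fold_splits(23, k=5, seed=r)) for r in range(20)]
```

Each sequence index appears in exactly one test fold per run, so every sequence is used once for testing and four times for training.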
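The four counts above are what accuracy, sensitivity, and specificity are computed from. The sketch below uses the standard definitions of these measures; the function names and the toy label vectors are hypothetical, with 1 marking an oxy-protein and 0 a non-oxy protein.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = oxy-protein, 0 = non-oxy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity


# Toy example: 4 oxy-proteins and 4 non-oxy proteins.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
acc, sen, spe = evaluate(*counts)
```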

Results
Determining the relative amino acid composition gives a characteristic profile for a protein [39]. Here, we calculated the average amino acid composition (AC) of oxy-proteins according to their median scores. We observed that the residues Ala and Phe are more than 0.5% higher in oxy-50 sequences than in non-oxy-50 sequences. In oxy-90, the residues Ala, Phe, His, and Lys are more than 0.5% higher than in non-oxy-90 sequences. In the oxy-50 classification dataset, the residues Ala, Lys, and Val are higher by more than 2, 3, and 2% in Leg, Hemo, and Myo, respectively.
Ala and Arg residues are markedly lower (−3%) in Hcy-50 and Leg-50 sequences, respectively. In the 90% oxy-datasets, Ala is 2% higher in Ery-90 and Leg-90, and Glu, Lys, and Val are 3% higher in Heme, Myo, and Leg proteins, respectively. Ala, Glu, and Arg are 2% lower in Hcy, Ery, and Leg proteins, respectively; the results are shown in Fig. 1, Additional file 1: Figure S1 and Additional file 2: Figure S2. For the sub-classes, the sequence length profiles of oxy-50 and oxy-90 were compared, and most of the Heme and Hemo sequences were found to fall in the range between 101 and 200. The other proteins are distributed over different length ranges (Additional file 3: Figure S3).
With the AC approach, we achieved maximum accuracies of 82.05 and 87.79% on the oxy-50 and oxy-90 datasets, respectively. With the DC method, the maximum accuracies were 80.42 and 84.81% on oxy-50 and oxy-90, respectively. The complete prediction results are shown in Additional file 4: Table S1, and the classification results in Table 1. The evolutionary profile based PSSM method has been applied to the prediction of many functional proteins [40,41]. The PSSM method achieved maximum accuracies of 85.10 and 81.81% on the oxy-50 and oxy-90 datasets, respectively. We observed that, in classification, the prediction accuracy of the PSSM method was slightly higher for Ery, Hcy, Heme, Leg, and Myo in the oxy-50 datasets than in the oxy-90 datasets.
Further, to improve the prediction accuracy, Hybrid approach based modules were developed [42]. The prediction accuracy was 81.73 and 83.51% on oxy-50 and oxy-90, respectively. In classification, the Hcy, Heme, and Hemo accuracies were slightly higher in oxy-90 than in oxy-50. Overall, the DC and Hybrid methods give similar prediction results on oxy-50 and oxy-90, with no significant differences (Table 1).
To verify the prediction performance of the developed models, we also performed ROC analysis on our original data and achieved areas under the curve (AUC) of 0.894 and 0.959 for oxy-50 and oxy-90, respectively (Additional file 5: Figure S4); the classification AUCs are shown in Additional file 4: Table S2 and Fig. 2. In addition, confusion matrix based prediction score graphs were generated [43] to cross-check the performance of the developed models on the original data. According to our results, no misclassifications occurred in the proposed models, meaning that no positive sequence was identified as negative and no negative sequence was identified as positive. Our developed models are therefore good at recognizing positive and negative sequences. The classification based models also perform well in recognizing positive and negative sequences. Even so, some sequences could not be identified by their own class models and were instead identified by other class models. In the oxy-50 datasets, 3 Ery, 10 Hemo, and 5 Myo sequences were not recognized by their own models across all approaches, but were recognized by other sub-class models. In the oxy-90 datasets, 2, 4, and 2 sequences of Ery, Hemo, and Myo, respectively, were confused and not recognized by their own models, but were identified by other models. Interestingly, some sequences of Ery, Hemo, and Myo were identified neither by their own models nor by other models. The complete confusion matrix results for both oxy-50 and oxy-90 are shown in Additional file 4: Table S3. The prediction score graphs were developed mainly to show how well the models separate positive and negative sequences. According to the graphs, the DC, PSSM, and Hybrid approaches show separation with maximum margins. However, the confusion matrix results show that some sequences are very similar among Ery, Hemo, and Myo, and these sequences may be evolutionarily important (Additional file 6: Figure S5 and Additional file 7: Figure S6).
We also compared the prediction profiles of accuracy (ACC), sensitivity (Sen), and specificity (Sep) across thresholds. We found that most classes perform better in the 0 to 1.5 threshold range. In most cases the ACC, Sen, and Sep curves meet at a particular threshold, but a few show no such meeting point at any threshold. The Ery-50 and Ery-90 AC data show no meeting point of ACC, Sen, and Sep, whereas in the DC and PSSM approaches both Ery-50 and Ery-90 meet at negative thresholds. Interestingly, in the Hybrid approach, Ery-50 meets at a negative threshold but Ery-90 at a positive threshold. In the Hcy class, AC-90 meets at a negative threshold, and all other approaches at positive thresholds (0 to 1.5). All Heme and Hemo class data meet at positive thresholds in all approaches. In the Leg class, only AC-50 meets at a negative threshold, and all other approaches at positive thresholds. In the Myo class, AC-50 shows no crossing, DC-90 and PSSM-50 cross at the "0" threshold, Hybrid-90 crosses at a positive threshold, and all other approaches at negative thresholds. Moreover, in most cases, the accuracy and specificity data are similar (Additional file 8: Figure S7).
In the Oxypred2 study, we averaged ACC, Sen, and Sep over thresholds from −1.5 to +1.5 and compared the performance of the oxy-50 and oxy-90 sub-classes in all approaches. We observed that the Ery and Myo sensitivity is higher in oxy-90 than in oxy-50. Moreover, all sub-classes show more than 80% ACC, Sen, and Sep in oxy-90. In the oxy-90 classification, the specificity of Heme and Hemo is below 80% in PSSM and Hybrid, but it is slightly better than the oxy-50 average. In all approaches, the Ery class sensitivity is higher in oxy-90 than in oxy-50 (Additional file 9: Figure S8). With the PSSM method, prediction accuracy was higher than with the AC and DC methods.
To compare our new method with the existing method (Oxypred), we used blind data containing 502 oxy-proteins that were not present in our datasets. The Oxypred AC and DC methods identified 96.61% (485) and 97.81% (491) of them, respectively. In contrast, the Oxypred2 oxy-50 models identified 98, 99, 99, and 99%, and the oxy-90 models recognized 99.20, 100, 100, and 100%, with the AC, DC, PSSM, and Hybrid methods, respectively.
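The AC and DC features behind the accuracies reported above can be sketched as fixed-length vectors of 20 and 400 values. The snippet below is a minimal illustration, not the Oxypred2 code: expressing the values as percentages, counting non-standard residues in the length but not in any feature, and all names are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return 20 percentages, one per standard amino acid."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    length = len(sequence)
    return [100.0 * sequence.count(aa) / length for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 percentages, one per ordered pair of adjacent residues."""
    sequence = sequence.upper()
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {}
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(sequence) - 1
    return [100.0 * counts.get(dp, 0) / total for dp in DIPEPTIDES]


# Toy peptide: Ala makes up half of it, so its AC value is 50%.
ac = amino_acid_composition("AAKV")
dc = dipeptide_composition("AAKV")
```

Vectors of this kind have the same length for every protein regardless of sequence length, which is what allows them to be passed directly to an SVM.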
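For readers unfamiliar with the AUC values quoted above, the sketch below computes the area under the ROC curve directly from classifier scores using the rank-based (Mann-Whitney) formulation, which is equivalent to integrating the ROC curve. The function name and the toy scores are hypothetical and are not taken from the Oxypred2 outputs.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels and real-valued scores.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy SVM-style decision scores: 3 oxy-proteins followed by 3 non-oxy proteins.
labels = [1, 1, 1, 0, 0, 0]
scores = [1.2, 0.4, -0.1, 0.3, -0.8, -1.1]
auc = roc_auc(labels, scores)
```

Here one positive (score −0.1) ranks below one negative (score 0.3), so 8 of the 9 positive-negative pairs are ordered correctly.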
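The threshold analysis can be illustrated with a minimal sketch that sweeps a decision threshold from −1.5 to +1.5 and averages accuracy, sensitivity, and specificity over the sweep. The step size of 0.5, the rule that a score at or above the threshold counts as positive, and the toy data are assumptions; the text does not specify them.

```python
def metrics_at_threshold(labels, scores, threshold):
    """Return (accuracy, sensitivity, specificity) at one score threshold."""
    tp = tn = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity


def sweep(labels, scores, start=-1.5, stop=1.5, step=0.5):
    """Return one (threshold, acc, sen, spe) row per threshold."""
    n_steps = int(round((stop - start) / step))
    thresholds = [round(start + i * step, 10) for i in range(n_steps + 1)]
    return [(t, *metrics_at_threshold(labels, scores, t)) for t in thresholds]


labels = [1, 1, 1, 0, 0, 0]
scores = [1.2, 0.4, -0.1, 0.3, -0.8, -1.1]
rows = sweep(labels, scores)

# Average of each metric over all thresholds in the sweep.
avg_acc = sum(r[1] for r in rows) / len(rows)
avg_sen = sum(r[2] for r in rows) / len(rows)
avg_spe = sum(r[3] for r in rows) / len(rows)
```

At the lowest threshold every sequence is called positive (sensitivity 1, specificity 0), and at the highest none is, so the two curves necessarily move in opposite directions; the threshold where they meet is the kind of crossing point discussed in the text.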

Discussion
Here, we presented an improved version of Oxypred for identifying oxy-proteins using various features [44,45], applied to two datasets with different similarity cutoffs. All methods recognize 100% of the positive and negative sequences. The Hemocyanin, Hemerythrin, and Leghemoglobin classes are recognized at 100% in all approaches. The oxy-50 models positively identified 89, 98.9, and 87.5%, and the oxy-90 models 98, 99.8, and 98.4%, of the individual erythrocruorin, hemoglobin, and myoglobin sequences, respectively.
Further, we compared performance with existing methods on the newly retrieved dataset, on which they show nearly 97% recognition. Our newly developed models, however, were able to identify almost 99.99 and 100% with the oxy-50 and oxy-90 models, respectively. According to our prediction results, the oxy-90 models make better predictions than the oxy-50 models, and the PSSM based approaches show better performance in identifying oxy-proteins in both cases. We also found a lower error rate according to the confusion matrix analysis. The present Oxypred2 method achieves better prediction than the previous method in identifying oxy-proteins. This study provides an alternative method for identifying oxy-proteins, and we hope it will be useful to the scientific community.

Limitations
• The exponential growth and availability of freshly annotated protein sequences in databases motivated us to develop an improved version.
• Two different sequence similarity cutoffs, 90 and 50%, were used with various features for predicting oxy-proteins.
• The oxy-90 models make better predictions than the oxy-50 models, and our approaches are faster and achieve better prediction performance than the existing method.
• Finally, a web-server, Oxypred2, has been developed for identifying oxygen-binding proteins.

Additional files
Additional file 1: Figure S1. Amino acid distribution chart of oxy-proteins along with non-oxy proteins, showing the difference between the 50 and 90 data.
Additional file 3: Figure S3. Sequence length profile of oxy-classes. Sequence length ranges shown as a histogram organized by oxy-subclass. X-axis: sequence length range; Y-axis: number of sequences.
Additional file 4: Table S1. Table S2. Performance of various SVM modules by ROC analysis. The area under the curve (AUC) for the different approaches for the classification of oxy-proteins. Table S3. Confusion matrix. Performance of the best Oxypred2 models by confusion matrix, cross-checked against the original oxy-class sequences as predicted by their own and other models.
Additional file 5: Figure S4. ROC curves for oxy versus non-oxy in all approaches. The performance of the Oxypred2 models shown by receiver operating characteristic (ROC) plots in all approaches. The area under the curve (AUC) was measured for all developed models. The plots mainly show the relationship between sensitivity and 1-specificity at each threshold of the real-valued outputs.