Fe(2)OG: an integrated HMM profile-based web server to predict and analyze putative non-haem iron(II)- and 2-oxoglutarate-dependent dioxygenase function in protein sequences

Objective Non-haem iron(II)- and 2-oxoglutarate-dependent dioxygenases (i2OGdd) are a taxonomically and functionally diverse group of enzymes. The active site comprises ferrous iron in a hexa-coordinated distorted octahedron with the apoenzyme, 2-oxoglutarate and a displaceable water molecule. Current information on novel i2OGdd members is sparse and relies on computationally derived annotation schema. The dissimilar amino acid composition and variable active site geometry result in differing reaction chemistries amongst i2OGdd members. Researchers additionally need a curated list of sequences with putative i2OGdd function that can be probed further for empirical data.
Results This work reports the implementation of Fe(2)OG, a web server with dual functionality and an extension of previous work on i2OGdd enzymes (Fe(2)OG ≡ {H2OGpred, DB2OG}). Fe(2)OG, in this form, is completely revised and updated (URL, scripts, repository) and will strengthen the knowledge base of investigators on i2OGdd biochemistry and function. Fe(2)OG utilizes the predictive power of HMM-profiles of laboratory-validated i2OGdd members to predict probable active site geometries in user-defined protein sequences. Fe(2)OG also provides researchers with a pre-compiled list of analyzed and searchable i2OGdd-like sequences, many of which may be clinically relevant. Fe(2)OG is freely available (http://204.152.217.16/Fe2OG.html) and supersedes all previous versions, i.e., H2OGpred and DB2OG.

The work presented revises, updates and integrates the functionality of two servers, i.e., Fe(2)OG ≡ {H2OGpred, DB2OG} [18,19]. Fe(2)OG can be used by researchers as a single-point web resource to screen protein sequence(s) for potential i2OGdd activity and to shortlist putative i2OGdd members from the available pre-compiled sequence repository. The latter is searchable on the basis of taxonomy, cellular compartment and the HMM-profiles of the sequences. A novel feature of Fe(2)OG is the inclusion of clinically relevant non-haem iron(II)- and 2OG-dependent dioxygenases. This includes links and preliminary analyses to several

Rationale for incorporating empirical data into a profile-based search application
Non-haem iron(II)- and 2OG-dependent dioxygenases are characterized by variable reaction chemistry and a broad spectrum of substrates. The reverse mapping of substrate descriptors to the active site of known enzymes is well documented and can be utilized to repurpose pharmacological agents. Several theoretically sound statistical tools, such as multi-class support vector machines (SVMs), artificial neural networks (ANNs) and hidden Markov models (HMMs), have been utilized to garner insights into the active site geometry of an enzyme in the presence of a pharmacophore [18–22]. Although HMMs are non-committal as a predictive modality, this can be rectified by mathematical filters. The transformed output can then be utilized by clustering algorithms and ANNs to generate unambiguous predictors [21,23,24]. In fact, a rigorously derived integrated HMM-ANN algorithm has been presented and used to characterize sequences that are few and closely related, such as those from an enzyme family or subfamily [23,24].

Mathematical basis for the algorithms deployed by Fe(2)OG
Although a detailed description of the computational pipeline and its relevance has already been published, its mathematical basis has not been addressed [18,19]. Briefly, HMM-profiles of catalytically relevant clusters and laboratory-validated enzymes of the i2OGdd superfamily (a_i ∈ A ⊆ H) are utilized to score regions of an amino acid sequence. The empirical data considered are the presence of one or more 3D structures, kinetic and mutagenesis data, and mRNA expression levels [18]. A suitable mathematical representation is as follows:
Theorem: A unique set of HMM profiles (A, B ⊆ H) can exist iff there is at least one unique sub-profile.

Proof:
(1)
Fe(2)OG, then, is an implementation of a particular instance of the combined HMM of sequences and available structures (A = {a_i | 1 ≤ i ≤ 28, 2 ≤ #a_i ≤ 4}) [19]; URL: http://janelia.org. The lower limit of the number of sequences in each profile (min(#a_i)), Eq. (1), is implied by definition. The upper limit, however, is estimated as a proportion of the total number of sequences,

Description and utilization of Fe(2)OG

i) Fe(2)OG, a predictor of the catalytic spectrum of an unknown or single function enzyme
The algorithm and code that Fe(2)OG utilizes to predict the dominant profile in user-defined sequence(s) have been described in detail [18]. Briefly, i2OGdd enzymes (n > 220) with available empirical data (structure, kinetic, mRNA expression) are clustered on the basis of the substrates catalyzed and/or the reaction chemistry (Fig. 1) [18]. The enzymes present in each 'functional'-group (2 ≤ #a_i ≤ 4; Eqs. (1) and (2)) are then aligned and assigned an HMM-profile (Figs. 1 and 2) [18]. A database of these HMM-profiles is used to probe the catalytic spectrum of a user-defined sequence as per the stringency specified.

Figure caption (opening truncated in source): [...] Fe(2)OG is modified and will analyze user-defined sequence(s) for the complete set of HMM-profiles. Users may use the "Examples" option to add sample sequences or "Paste" their own selection in the text area. On submission, these yield a chart of closely matched HMM-profiles across the full length of the sequence. Since the output depends on the chosen threshold parameters (Evalue, Bitscore), these must be selected carefully and entered manually before submission. The database, too, is updated and has been extensively curated. Here, putative i2OGdd members are classified by taxonomy and dominant cellular compartment and arranged in an easy-to-select matrix format. The sequences of this database can also be queried and recursively analyzed using the logical operators (AND, OR) for the presence of combinations of HMM-profiles. The results of these searches can be downloaded and investigated further. The list of sequences not analyzed is also presented along with links to several human putative i2OGdd members. Abbreviations: 2OG, 2-oxoglutarate; HMM, Hidden Markov Models; i2OGdd, non-haem iron(II)- and 2OG-dependent dioxygenases; H2OGpred, HMM-based prediction of putative 2-oxoglutarate function; DB2OG, database of sequences that results from a generic-2OG HMM-based query of UniprotKB
Unlike H2OGpred, Fe(2)OG compares the query sequence(s) with all HMM-profiles rather than with isolated ones (Fig. 2) [18]. The rationale for this alteration is that, since the catalytic profile of an unknown sequence is debatable, a generic analysis is a better indicator of i2OGdd-like activity than a specific one. Furthermore, sequences with known function can also be investigated for other reaction chemistries. In both cases, the analysis with individual profiles is superfluous and may be omitted (Table 1A). The tabulated list of relevant cognate substrates for each profile is also available and may be used as a reference (Figs. 1 and 2). In addition to following the on-page directions for use, users can sample the functionality of Fe(2)OG by clicking the "Examples" button (Step P1) (Fig. 2). This loads bona fide i2OGdd sequences into the text area, which can then be analyzed in accordance with the subsequent steps. These include the choice of threshold parameter (Evalue, Bitscore) and the assignment of a suitable numerical value (Steps P2, P3) (Fig. 2). The output comprises a tabular summary of suitably matched profiles with detailed statistics and exhaustive pair-wise alignments of all supra-threshold matches (Fig. 2). Since Fe(2)OG has dual functionality, the user can submit this query independently of the database search (Steps P1 − P3 → Submit) (Fig. 2).
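The thresholding step described above (choose Evalue or Bitscore, assign a cut-off, keep the supra-threshold matches across all profiles) can be illustrated with a minimal Python sketch. This is not the server's code: the `ProfileHit` record, the `supra_threshold` function, the profile labels and the scores are all hypothetical and serve only to show the filtering logic, assuming that lower E-values and higher bit scores indicate better matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileHit:
    """One match of a query sequence against one HMM-profile (hypothetical record)."""
    profile: str     # illustrative profile label, not a real Fe(2)OG identifier
    evalue: float    # lower is better
    bitscore: float  # higher is better


def supra_threshold(hits, parameter, cutoff):
    """Return the hits that pass the chosen threshold, best match first."""
    if parameter == "Evalue":
        kept = [h for h in hits if h.evalue <= cutoff]
        return sorted(kept, key=lambda h: h.evalue)
    if parameter == "Bitscore":
        kept = [h for h in hits if h.bitscore >= cutoff]
        return sorted(kept, key=lambda h: h.bitscore, reverse=True)
    raise ValueError("parameter must be 'Evalue' or 'Bitscore'")


# Made-up scores for a single query compared against all profiles.
hits = [
    ProfileHit("profile_07", 2e-4, 21.5),
    ProfileHit("profile_01", 1e-30, 110.2),
    ProfileHit("profile_19", 3.1, 4.0),
]

by_evalue = supra_threshold(hits, "Evalue", 1e-3)
by_bitscore = supra_threshold(hits, "Bitscore", 50.0)
print([h.profile for h in by_evalue])
print([h.profile for h in by_bitscore])
```

A stricter cut-off shortens the summary table, which is why the choice of parameter and value matters before submission.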

ii) Fe(2)OG, a repository of i2OGdd-like sequences
The second component of Fe(2)OG is a flat-file database. This comprises a pre-compiled and updated list of i2OGdd-like sequences (n_AB = 4496) (Fig. 2). The list is compiled by constructing a generic HMM from representative i2OGdd enzymes (n ∼ 80) drawn from each 'functional'-group. This is then used to query UniprotKB for probable matches (n_AB) [19]. The downloaded sequences are analyzed and assigned a dominant cellular compartment (n_A) [19]. Sequences that are not amenable to these preliminary investigations are annotated as such (n_B). Users can download updated lists of these sequences (n_A = 3429, n_B = 1067) (Fig. 2). This is facilitated by arranging the sequences as a matrix of compartments (p) and taxonomy (q): AB = {y_pqr ∈ ab_pq; p = 10, q = 7, r ∈ ℕ}. Fe(2)OG also uses the logical operators {AND, OR} to formulate an advanced HMM profile-based query that partitions the sequences (Step S1; Fig. 2) [19]. Another modification introduced in Fe(2)OG is the omission of the "All sequences" option (Step S1) (Fig. 2). The rationale for this amendment is that users may require sequences specific to one or more HMM-profiles (Figs. 1 and 2). Since each profile is based on a specific reaction chemistry, users will also possess, a priori, a definitive list of probable ligands with which to characterize the kinetics of their search results (Fig. 1, Table 1A). Furthermore, the entire classified database (n_A) remains accessible with the "OR" and "Include these profile(s)" options, if the user so chooses (Steps S1, S2) (Fig. 2). The remaining fraction could not be classified further and is presented only in terms of the respective taxonomies (n_B = 1067). Here, too, the user can submit the query independently (Steps S1, S2 → Submit) (Fig. 2).
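The repository search described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the server's implementation: the record layout, the function names `query` and `matrix_cell`, the sequence identifiers, and the compartment, taxonomy and profile labels are all hypothetical. AND is taken to mean that a sequence carries every selected profile, and OR that it carries at least one.

```python
def query(db, profiles, operator="OR"):
    """Return IDs of sequences whose HMM-profile set satisfies the logical query."""
    wanted = set(profiles)
    matched = []
    for seq_id, record in db.items():
        found = record["profiles"]
        if operator == "AND":
            ok = wanted <= found        # every selected profile is present
        elif operator == "OR":
            ok = bool(wanted & found)   # at least one selected profile is present
        else:
            raise ValueError("operator must be 'AND' or 'OR'")
        if ok:
            matched.append(seq_id)
    return sorted(matched)


def matrix_cell(db, compartment, taxonomy):
    """Return IDs of sequences in one (compartment, taxonomy) cell of the matrix."""
    return sorted(
        seq_id
        for seq_id, record in db.items()
        if record["compartment"] == compartment and record["taxonomy"] == taxonomy
    )


# Toy flat-file content; all identifiers and labels are made up.
db = {
    "seq1": {"compartment": "cytosol", "taxonomy": "Bacteria", "profiles": {"p01", "p07"}},
    "seq2": {"compartment": "mitochondrion", "taxonomy": "Eukaryota", "profiles": {"p07"}},
    "seq3": {"compartment": "cytosol", "taxonomy": "Eukaryota", "profiles": {"p19"}},
}

both = query(db, ["p01", "p07"], "AND")
either = query(db, ["p01", "p07"], "OR")
cell = matrix_cell(db, "cytosol", "Eukaryota")
print(both, either, cell)
```

In this toy model, selecting every profile with OR returns all classified sequences, which mirrors how the full n_A set stays reachable without an "All sequences" option.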

Comparative analysis and biomedical relevance of Fe(2)OG
Despite the similarity in algorithms and general usage, Fe(2)OG offers several new and upgraded features (Table 1). These include links to i2OGdd members that are uncharacterized and clinically relevant, as well as a tool with which researchers can extend the catalytic profiles of known enzymes. Additionally, the list of sequences with putative i2OGdd function is updated and non-redundant. The i2OGdd are amongst the largest groups of non-haem dioxygenases and arguably rival in importance the more established cytochrome P450 (CYP450) superfamily of haem monooxygenases (Fig. 1). The differential activity of i2OGdd members in response to fluctuating concentrations of oxygen and iron also suggests a system-level function in sensing, and thence regulating, the uptake, utilization and release of these micronutrients [25,26]. In fact, clinical data are available for several i2OGdd enzymes. These include phytanoyl-CoA hydroxylase, hypoxia-inducible proline hydroxylases, collagen modifiers (proline and lysine hydroxylases) and DNA/mRNA demethylases (Table 1B) [27–38]. The analysis by Fe(2)OG assigns a small subset of these enzymes (≈ 24%, n = 17) to mitochondrial, cytosolic and extracellular fractions (Additional file 1: Text S1a). However, a larger proportion (≈ 76%, n = 53) remains unclassified and merits deeper investigation (Additional file 1: Text S1b).

Limitations
Fe(2)OG is an online web resource dedicated to expanding the knowledge base of the non-haem iron(II)- and 2OG-dependent dioxygenase superfamily of enzymes amongst scientists and clinicians. Fe(2)OG can predict whether an unknown protein sequence(s) possesses i2OGdd activity. It also provides preliminary analyses (taxonomy, cellular compartment) and an analytic tool (sequence-based, logical) to shortlist enzyme candidates from a pre-compiled list of sequences. Since newer sequences are constantly becoming available, Fe(2)OG will require regular updates to its core of HMM-profiles, and to the raw sequences that are queried for putative function, to remain relevant to the biomedical community. However, since this information depends on available empirical data, an annual update might suffice. Fe(2)OG is also not exhaustive and lacks structural models and simulation data for its members. These shortcomings will be addressed in future studies.