Model selection in the reconstruction of regulatory networks from time-series data
- Eugene Novikov^{1}Email author and
- Emmanuel Barillot^{1}
https://doi.org/10.1186/1756-0500-2-68
© Novikov et al; licensee BioMed Central Ltd. 2009
Received: 24 January 2009
Accepted: 05 May 2009
Published: 05 May 2009
Abstract
Background
A widely used approach to reconstruct regulatory networks from time-series data is based on the first-order, linear ordinary differential equations. This approach is justified if it is applied to system relaxations after weak perturbations. However, weak perturbations may not be informative enough to reveal network structures. Other approaches are based on specific models of gene regulation and therefore are of limited applicability.
Findings
We have developed a generalized approach for the reconstruction of regulatory networks from time-series data. This approach uses elements of control theory and the state-space formalism to approximate interactions between two observable nodes (e.g. measured genes). This leads to a reconstruction model formulated in terms of integral equations with flexible kernel functions. We propose a library of kernel functions that can be used for the first insights into network structures.
Conclusion
We have found that the appropriate kernel function significantly increases the accuracy of network reconstruction. The best kernel can be selected using prior information on a few nodes' interactions. We have shown that it may be already possible to select models ensuring reasonable performance even with as small as two known interactions. The developed approaches have been tested with simulated and experimental data.
Keywords
Findings
Two sources of experimental data are generally used in the reconstruction of regulatory networks: steady-state and time-series experiments. Steady-state data [1, 2] are generated by measuring the expression levels of every gene (or protein concentrations) when a system relaxes into a steady state after a perturbation. There are many publications [3–5] reporting different methods for the network reconstruction from the steady-state data. Time-series data represent the expression levels measured at a number of time points following global or local perturbations of a system [6, 7]. If these perturbations do not bring the system far from a steady state, the relaxation into the steady state is approximated by a set of the first-order, linear ordinary differential equations (LODE) [6, 8, 9]. Time-series experiments do not require as many perturbations as steady-state experiments, thus avoiding perturbations that may be not easy to design [10, 11]. Moreover, analysis of time-series data allows us to investigate the dynamics of regulatory interactions, which is not possible from the steady-state data.
However, it has been shown [4, 5] that the network reconstruction is more difficult from the time-series data than from the steady-state data. The authors have envisaged two possibilities to improve the reconstruction. One is to collect more time series from additional perturbations. The other one is to perform time-series experiments where an investigated system demonstrates richer dynamics. The latter case is advantageous because it may generate more informative data without performing extra experiments. This can be done either by applying stronger perturbations or by monitoring system dynamics controlled by internal factors (e.g. cell-cycle processes). In both cases, the LODE models can hardly be justified as it is difficult to ensure that a system does not strongly deviate from a steady state. More sophisticated system dynamics needs more detailed formalizations on gene/molecular interactions. Many attempts to improve the basic LODE model can be found in recent publications [12–14]. In most cases, the authors suggest to model the combined regulatory effect of a number of regulatory factors by a particular non-linear function. Additionally, the second-order differential equations are sometimes invoked to reproduce gene expression profiles [14, 15].
In this paper, we are looking for a generic approach to approximate interactions between the observable nodes in a network. The generic approach allows us to systematically apply specific models and, eventually, to define the most appropriate model using available experimental data and, possibly, prior knowledge on the nodes' interactions. The developed approaches were tested with simulated and experimental data.
Mathematical framework
where F_{ i }is a functional reproducing behaviour, Y_{ I }(·), of a set of observable nodes I based on signals, Y_{ O }(·), from a, possibly different, set of observable nodes O, and W_{ i }is a vector of "internal" parameters of control node i. Note that some non-trivial behaviour can be assigned to the observable nodes as well. It may account for instrumental distortions, specifics of image processing, normalization, etc.
with w_{ ij }(t) = C_{ ij }exp(tA_{ ij })B_{ ij }representing the influence of node j on the regulation of node i. Although every link (control node) is unique and should be modelled in a specific way, little prior knowledge on molecular interactions does not allow us to postulate specific models for every link. Therefore, we are looking for universal models that can approximate any control node.
This model approximates system relaxation into a steady state after a small perturbation. However, it is difficult to confirm that perturbations are small enough to justify model (5).
where L is the number of terms, u_{l, ij}are the coefficients encoding for the regulation of node i by node j and τ_{ l }are the characteristics times that can be either set as prior values or estimated from experimental data. The background functions b_{ i }(t, t_{ 0 }) can also be developed, but we will keep them constant as, with little data, more complicated models for b_{ i }(t, t_{ 0 }) can fit the data without identifying any link.
Kernel functions
Equation | w _{ ij } (t) | Model |
---|---|---|
(6) | u _{1, ij} | P1 |
u_{1, ij}+ u_{2, ijt} | P2 | |
(7) | u_{1, ij}exp{-t/(0.1T)} | E1 |
u_{1, ij}exp{-t/(0.9T)} | E2 | |
u_{1, ij}exp{-t/(0.1T)} + u_{2, ij}exp{-t/(0.9T)} | E3 | |
(8) | u_{1, ij}(1 + t/(0.1T))^{-1} | I1 |
u_{1, ij}(1 + t/(0.9T))^{-1} | I2 | |
u_{1, ij}(1 + t/(0.1T))^{-1} + u_{2, ij}(1 + t/(0.9T))^{-1} | I3 |
Discussion on the parameter identifiability for the developed models can be found in [Additional file 2].
Network reconstruction is done by fitting the developed models to experimental data. Among different fitting strategies [17], the forward selection (FS) algorithm has shown reasonable performance, in particular for sparse networks, and therefore, it has been adopted in this paper. We refer to [18] for the details on the implementation of the FS algorithm. A more robust modification of the FS algorithm has also been tested as described in [Additional file 3].
We can use prior knowledge on the nodes' interactions to select the best network reconstruction model from the pre-defined library (Table 1). We look for the kernel function w_{ ij }(t) that reconstructs the prior links with the highest accuracy. The description of the adaptive model selection (AMS) algorithm can be found in [Additional file 4].
Testing
We compared the performances of the eight kernel functions from Table 1 as well as the LODE regulatory model (5) using simulated and experimental data. Three artificial systems were used for testing: the oscillating network in E. coli, called repressilator [19], the mitogen-activated protein kinase (MAPK) cascade [20] and the glycolysis pathway in yeast [21]. We also used the yeast (Saccharomyces cerevisiae) cell cycle microarray time-series data [22] to demonstrate applicability of the developed approach to real experimental data. The positive predictive value (PPV) and sensitivity (Se) were applied to estimate the performance. Further details on the artificial and real systems used for testing and description of the testing procedure can be found in [Additional file 5].
Se at 50 generated links for the three artificial systems (E. COLI repressilator (A), MAPK cascade (B) and yeast glycolysis pathway (C)) and three yeast cell cycle microarray time-series datasets
Models | A | B | C | alpha | elu | cdc15 |
---|---|---|---|---|---|---|
LODE | 0.46 | 0.12 | 0.16 | 0.23 | 0.19 | 0.27 |
P1 | 0.32 | 0.19 | 0.20 | 0.35 | 0.42 | 0.27 |
P2 | 0.41 | 0.23 | 0.18 | 0.35 | 0.31 | 0.35 |
E1 | 0.47 | 0.25 | 0.16 | 0.38 | 0.31 | 0.23 |
E2 | 0.32 | 0.24 | 0.20 | 0.31 | 0.35 | 0.31 |
E3 | 0.60 | 0.27 | 0.17 | 0.15 | 0.27 | 0.08 |
I1 | 0.35 | 0.18 | 0.18 | 0.31 | 0.23 | 0.15 |
I2 | 0.32 | 0.24 | 0.21 | 0.27 | 0.35 | 0.27 |
I3 | 0.59 | 0.23 | 0.16 | 0.19 | 0.19 | 0.12 |
For the yeast cell cycle time-series data, the polynomial models (P1 and P2) were the most powerful. For the alpha dataset and for the elu dataset, P1 had the highest performance whereas P2 was the most accurate for cdc15. Note that, in each case, the best performing models (P1 and P2) also outperformed the LODE model. Comparing different experiments, we see that cdc15 led to less accurate predictions. This indicates that this experiment requires more elaborated reconstruction models or more representative datasets.
From Fig. 2 and Table 2, we can conclude that the "optimal" models were different for the artificial and real systems. The obtained results suggest that no unique model exists to ensure reasonable performance for different systems and therefore the most appropriate models should be searched for each system.
However, in some cases with two prior links, the AMS algorithm relatively often selected the models that were rather poor as judged by the results presented in Fig. 2. For example, for the artificial yeast glycolysis pathway or real alpha dataset, the bi-exponential E3 model was selected almost as often as other, better performing, models. This indicates that the E3 model was more adequate just for certain links and not for any link in the networks. Therefore, we can conclude that the network reconstruction model should be link-specific, that is different models may be assigned to different links.
The performance of the AMS algorithm using independent set of artificial data described in [5] is presented in [Additional file 6].
Conclusion
We have presented a generalized approach for the regulatory network reconstruction, that gives us an easy possibility to create and to test different inference models and, potentially, to identify appropriate models from experimental data. We have shown that even with as small as two prior links it is already possible to select models ensuring reasonable performance. Further discussion and perspectives for further research are given in [Additional file 7].
Declarations
Acknowledgements
This work was supported by the Institut Curie and the Ligue Nationale Contre le Cancer. E.N. and E.B. are members of the Equipe Biologie des Systèmes from the Service de Bioinformatique of Institut Curie, équipe labellisée par La Ligue Nationale Contre le Cancer.
Authors’ Affiliations
References
- Wagner A: How to reconstruct a large genetic network from N gene perturbations in fewer than N2 easy steps. Bioinformatics. 2001, 17: 1183-1197. 10.1093/bioinformatics/17.12.1183.View ArticlePubMedGoogle Scholar
- Ideker TE, Thorsson V, Karp RM: Discovery of regulatory interactions through perturbation: inference and experimental design. Pac Symp Biocomput. 2000, 5: 302-313.Google Scholar
- Werhli AV, Grzegorczyk M, Husmeier D: Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics. 2006, 22: 2523-2531. 10.1093/bioinformatics/btl391.View ArticlePubMedGoogle Scholar
- Soranzo N, Bianconi G, Altafini C: Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data. Bioinformatics. 2007, 23: 1640-1647. 10.1093/bioinformatics/btm163.View ArticlePubMedGoogle Scholar
- Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D: How to infer gene networks from expression profiles. Molecular Systems Biology. 2007, 3: 78-PubMed CentralView ArticlePubMedGoogle Scholar
- Bansal M, Gatta GD, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 2006, 22: 815-822. 10.1093/bioinformatics/btl003.View ArticlePubMedGoogle Scholar
- MacCarthy T, Pomiankowski A, Seymour R: Using large-scale perturbations in gene network reconstruction. BMC Bioinformatics. 2005, 6: 11-10.1186/1471-2105-6-11.PubMed CentralView ArticlePubMedGoogle Scholar
- D'haeseleer P, Liang S, Somogyi R: Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000, 16: 707-726. 10.1093/bioinformatics/16.8.707.View ArticlePubMedGoogle Scholar
- Kim J, Bates DG, Postlethwaite I, Heslop-Harrison P, Cho KH: Least-squares methods for identifying biochemical regulatory networks from noisy measurements. BMC Bioinformatics. 2007, 8: 8-10.1186/1471-2105-8-8.PubMed CentralView ArticlePubMedGoogle Scholar
- Basso K, Margolin AA, Stolovitzky G, Klein U, Della-Favera R, Galifano A: Reverse engineering of regulatory networks in human B cells. Nature Genetics. 2005, 37: 382-390. 10.1038/ng1532.View ArticlePubMedGoogle Scholar
- Dojer N, Gambin A, Mizera A, WilczyMski B, Tiuryn J: Applying dynamic Bayesian networks to perturbed gene expression data. BMC Bioinformatics. 2006, 7: 249-10.1186/1471-2105-7-249.PubMed CentralView ArticlePubMedGoogle Scholar
- Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, Thorsson V: The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biology. 2006, 7: R36-10.1186/gb-2006-7-5-r36.PubMed CentralView ArticlePubMedGoogle Scholar
- Vu TT, Vohradsky J: Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae. Nucleic Acids Research. 2007, 35: 279-287. 10.1093/nar/gkl1001.PubMed CentralView ArticlePubMedGoogle Scholar
- Chang WC, Li CW, Chen BS: Quantitative inference of dynamic regulatory pathways via microarray data. BMC Bioinformatics. 2005, 6: 44-10.1186/1471-2105-6-44.PubMed CentralView ArticlePubMedGoogle Scholar
- Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'Alché-Buc F: Gene networks inference using dynamic Bayesian networks. Bioinformatics. 2003, 19: ii138-ii148. 10.1093/bioinformatics/btg1071.View ArticlePubMedGoogle Scholar
- Sontag ED: Mathematical Control Theory: Deterministic Finite Dimensional Systems. 1998, Springer, New York, 2View ArticleGoogle Scholar
- van Someren EP, Wessels LFA, Reinders MJT, Backer E: Searching for limited connectivity in genetic network models. Proceedings of the Second International Conference on Systems Biology. 2001, 222-230.Google Scholar
- Novikov E, Barillot E: Regulatory network reconstruction using an integral additive model with flexible kernel functions. BMC Systems Biology. 2008, 2: 8-10.1186/1752-0509-2-8.PubMed CentralView ArticlePubMedGoogle Scholar
- Elowitz MB, Leibler S: A synthetic oscillatory network of transcriptional regulators. Nature. 2000, 403: 335-338. 10.1038/35002125.View ArticlePubMedGoogle Scholar
- Huang CHF, Ferrell JE: Ultrasensitivity in the mitogen-activated protein kinase cascade. Proc Natl Acad Sci USA. 1996, 93: 10078-10083. 10.1073/pnas.93.19.10078.PubMed CentralView ArticlePubMedGoogle Scholar
- Pritchard L, Kell DB: Schemes of flux control in a model of Saccharomyces cerevisiae glycolysis. Eur J Biochem. 2002, 269: 3894-3904. 10.1046/j.1432-1033.2002.03055.x.View ArticlePubMedGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.