- Technical Note
Model selection in the reconstruction of regulatory networks from time-series data
BMC Research Notesvolume 2, Article number: 68 (2009)
A widely used approach to reconstruct regulatory networks from time-series data is based on the first-order, linear ordinary differential equations. This approach is justified if it is applied to system relaxations after weak perturbations. However, weak perturbations may not be informative enough to reveal network structures. Other approaches are based on specific models of gene regulation and therefore are of limited applicability.
We have developed a generalized approach for the reconstruction of regulatory networks from time-series data. This approach uses elements of control theory and the state-space formalism to approximate interactions between two observable nodes (e.g. measured genes). This leads to a reconstruction model formulated in terms of integral equations with flexible kernel functions. We propose a library of kernel functions that can be used for the first insights into network structures.
We have found that the appropriate kernel function significantly increases the accuracy of network reconstruction. The best kernel can be selected using prior information on a few nodes' interactions. We have shown that it may be already possible to select models ensuring reasonable performance even with as small as two known interactions. The developed approaches have been tested with simulated and experimental data.
Two sources of experimental data are generally used in the reconstruction of regulatory networks: steady-state and time-series experiments. Steady-state data [1, 2] are generated by measuring the expression levels of every gene (or protein concentrations) when a system relaxes into a steady state after a perturbation. There are many publications [3–5] reporting different methods for the network reconstruction from the steady-state data. Time-series data represent the expression levels measured at a number of time points following global or local perturbations of a system [6, 7]. If these perturbations do not bring the system far from a steady state, the relaxation into the steady state is approximated by a set of the first-order, linear ordinary differential equations (LODE) [6, 8, 9]. Time-series experiments do not require as many perturbations as steady-state experiments, thus avoiding perturbations that may be not easy to design [10, 11]. Moreover, analysis of time-series data allows us to investigate the dynamics of regulatory interactions, which is not possible from the steady-state data.
However, it has been shown [4, 5] that the network reconstruction is more difficult from the time-series data than from the steady-state data. The authors have envisaged two possibilities to improve the reconstruction. One is to collect more time series from additional perturbations. The other one is to perform time-series experiments where an investigated system demonstrates richer dynamics. The latter case is advantageous because it may generate more informative data without performing extra experiments. This can be done either by applying stronger perturbations or by monitoring system dynamics controlled by internal factors (e.g. cell-cycle processes). In both cases, the LODE models can hardly be justified as it is difficult to ensure that a system does not strongly deviate from a steady state. More sophisticated system dynamics needs more detailed formalizations on gene/molecular interactions. Many attempts to improve the basic LODE model can be found in recent publications [12–14]. In most cases, the authors suggest to model the combined regulatory effect of a number of regulatory factors by a particular non-linear function. Additionally, the second-order differential equations are sometimes invoked to reproduce gene expression profiles [14, 15].
In this paper, we are looking for a generic approach to approximate interactions between the observable nodes in a network. The generic approach allows us to systematically apply specific models and, eventually, to define the most appropriate model using available experimental data and, possibly, prior knowledge on the nodes' interactions. The developed approaches were tested with simulated and experimental data.
We apply elements of control theory  to develop a generalized model of the network dynamics. A regulatory network (Fig. 1) is represented as a bipartite graph with two types of nodes: observable nodes reproducing measurable characteristics (e.g. gene expression levels), and non-observable, or control, nodes controlling the interactions between the observable nodes. Each control node i can be modelled as:
where F i is a functional reproducing behaviour, Y I (·), of a set of observable nodes I based on signals, Y O (·), from a, possibly different, set of observable nodes O, and W i is a vector of "internal" parameters of control node i. Note that some non-trivial behaviour can be assigned to the observable nodes as well. It may account for instrumental distortions, specifics of image processing, normalization, etc.
The goal of the network reconstruction is to identify parameters W i encoding for the interactions between the observable nodes. For that, functional F i in (1) has to be further developed. It is frequently assumed that the cooperative regulatory contribution from different observable nodes is a sum of the contributions from each node, so that equation (1) can be written as:
where n is the number of observable nodes, y i (t) is the measured response of observable node i, F ij is a functional characterized by a set of parameters W ij converting measured profile, y j (t), at node j to measured profile, y i (t), at node i, and b i (t, t 0 ) is the output of non-regulated observable node i. We consider pair-wise controls F ij as linear, continuous, time-invariant, finite-dimensional, single input-single output control systems that can be modelled using the state-space formalism:
where X ij (t) is the state vector and A ij is the state matrix, y j (t) is the input value and B ij is the input vector, y j (t) is the output value and C ij is the output vector. We also assume that F ij are in a steady state prior to the input perturbation y j (t) starting at time t = t 0 , that is X ij (t 0 ) = 0. Integrating (3) and combining n regulatory inputs as in (2) yields
with w ij (t) = C ij exp(tA ij )B ij representing the influence of node j on the regulation of node i. Although every link (control node) is unique and should be modelled in a specific way, little prior knowledge on molecular interactions does not allow us to postulate specific models for every link. Therefore, we are looking for universal models that can approximate any control node.
This model approximates system relaxation into a steady state after a small perturbation. However, it is difficult to confirm that perturbations are small enough to justify model (5).
Equation (4) allows us to create a number of less restrictive models that can cover broader spectrum of dynamical behaviours. These models can integrate prior knowledge or can be further refined in experimental data analysis. In this report, we use the following representations for w ij (t):
where L is the number of terms, ul, ijare the coefficients encoding for the regulation of node i by node j and τ l are the characteristics times that can be either set as prior values or estimated from experimental data. The background functions b i (t, t 0 ) can also be developed, but we will keep them constant as, with little data, more complicated models for b i (t, t 0 ) can fit the data without identifying any link.
Discussion on the parameter identifiability for the developed models can be found in [Additional file 2].
Network reconstruction is done by fitting the developed models to experimental data. Among different fitting strategies , the forward selection (FS) algorithm has shown reasonable performance, in particular for sparse networks, and therefore, it has been adopted in this paper. We refer to  for the details on the implementation of the FS algorithm. A more robust modification of the FS algorithm has also been tested as described in [Additional file 3].
We can use prior knowledge on the nodes' interactions to select the best network reconstruction model from the pre-defined library (Table 1). We look for the kernel function w ij (t) that reconstructs the prior links with the highest accuracy. The description of the adaptive model selection (AMS) algorithm can be found in [Additional file 4].
We compared the performances of the eight kernel functions from Table 1 as well as the LODE regulatory model (5) using simulated and experimental data. Three artificial systems were used for testing: the oscillating network in E. coli, called repressilator , the mitogen-activated protein kinase (MAPK) cascade  and the glycolysis pathway in yeast . We also used the yeast (Saccharomyces cerevisiae) cell cycle microarray time-series data  to demonstrate applicability of the developed approach to real experimental data. The positive predictive value (PPV) and sensitivity (Se) were applied to estimate the performance. Further details on the artificial and real systems used for testing and description of the testing procedure can be found in [Additional file 5].
The dependencies of PPV on the total number of links are presented in Fig. 2. The Se values at 50 generated links are collected in Table 2. Among the three artificial systems, the choice of a model was the most critical for the E. coli repressilator. In this case, the best reconstruction was achieved with the bi-exponential E3 model. The LODE model performed better than random reconstruction but still worse than E3. All tested kernels were significantly better than random link assignment for the MAPK cascade. All kernels also outperformed the LODE model in this case. However, there is still a notable (and statistically significant) difference between the kernels. The yeast glycolysis network (Fig. 2c) was the most difficult to reconstruct because many times series were similar and hardly distinguishable by the reconstruction algorithm. Nevertheless, several models (P1, P2, E2, E3, and I2) demonstrated the performance different from random. The LODE model could not outperform the random prediction in this case.
For the yeast cell cycle time-series data, the polynomial models (P1 and P2) were the most powerful. For the alpha dataset and for the elu dataset, P1 had the highest performance whereas P2 was the most accurate for cdc15. Note that, in each case, the best performing models (P1 and P2) also outperformed the LODE model. Comparing different experiments, we see that cdc15 led to less accurate predictions. This indicates that this experiment requires more elaborated reconstruction models or more representative datasets.
From Fig. 2 and Table 2, we can conclude that the "optimal" models were different for the artificial and real systems. The obtained results suggest that no unique model exists to ensure reasonable performance for different systems and therefore the most appropriate models should be searched for each system.
We applied the AMS algorithm [Additional file 4] to the same three artificial systems and three experimental datasets. As at each run the prior links were different, the selected model might also be different. Therefore, we counted number of times each model from Table 1 was selected in the 100 runs. The results for 2 and 10 prior links are shown in Fig. 3. We found that the higher performing models from Fig. 2 were selected more often than the lower performing ones. Moreover, reasonable model recognition could be already achieved with only two prior links. As expected, the increase in the number of prior links led to better model identification.
However, in some cases with two prior links, the AMS algorithm relatively often selected the models that were rather poor as judged by the results presented in Fig. 2. For example, for the artificial yeast glycolysis pathway or real alpha dataset, the bi-exponential E3 model was selected almost as often as other, better performing, models. This indicates that the E3 model was more adequate just for certain links and not for any link in the networks. Therefore, we can conclude that the network reconstruction model should be link-specific, that is different models may be assigned to different links.
As the AMS algorithm may select poor performing models, the overall performance of the network reconstruction is lower than for the best performing model. However, even with as small as two prior links, AMS is already better than random model selection, as illustrated in Fig. 4. If the performance of different models is not very different (as for the MAPK cascade), the prediction of the AMS algorithm is close to random. If, however, a certain model demonstrates clear advantage (as, for example, for the E. coli repressilator), the AMS algorithm can identify this model leading to the performance substantially higher than by random selection.
We have presented a generalized approach for the regulatory network reconstruction, that gives us an easy possibility to create and to test different inference models and, potentially, to identify appropriate models from experimental data. We have shown that even with as small as two prior links it is already possible to select models ensuring reasonable performance. Further discussion and perspectives for further research are given in [Additional file 7].
Wagner A: How to reconstruct a large genetic network from N gene perturbations in fewer than N2 easy steps. Bioinformatics. 2001, 17: 1183-1197. 10.1093/bioinformatics/17.12.1183.
Ideker TE, Thorsson V, Karp RM: Discovery of regulatory interactions through perturbation: inference and experimental design. Pac Symp Biocomput. 2000, 5: 302-313.
Werhli AV, Grzegorczyk M, Husmeier D: Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics. 2006, 22: 2523-2531. 10.1093/bioinformatics/btl391.
Soranzo N, Bianconi G, Altafini C: Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data. Bioinformatics. 2007, 23: 1640-1647. 10.1093/bioinformatics/btm163.
Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D: How to infer gene networks from expression profiles. Molecular Systems Biology. 2007, 3: 78-
Bansal M, Gatta GD, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 2006, 22: 815-822. 10.1093/bioinformatics/btl003.
MacCarthy T, Pomiankowski A, Seymour R: Using large-scale perturbations in gene network reconstruction. BMC Bioinformatics. 2005, 6: 11-10.1186/1471-2105-6-11.
D'haeseleer P, Liang S, Somogyi R: Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000, 16: 707-726. 10.1093/bioinformatics/16.8.707.
Kim J, Bates DG, Postlethwaite I, Heslop-Harrison P, Cho KH: Least-squares methods for identifying biochemical regulatory networks from noisy measurements. BMC Bioinformatics. 2007, 8: 8-10.1186/1471-2105-8-8.
Basso K, Margolin AA, Stolovitzky G, Klein U, Della-Favera R, Galifano A: Reverse engineering of regulatory networks in human B cells. Nature Genetics. 2005, 37: 382-390. 10.1038/ng1532.
Dojer N, Gambin A, Mizera A, WilczyMski B, Tiuryn J: Applying dynamic Bayesian networks to perturbed gene expression data. BMC Bioinformatics. 2006, 7: 249-10.1186/1471-2105-7-249.
Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, Thorsson V: The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biology. 2006, 7: R36-10.1186/gb-2006-7-5-r36.
Vu TT, Vohradsky J: Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae. Nucleic Acids Research. 2007, 35: 279-287. 10.1093/nar/gkl1001.
Chang WC, Li CW, Chen BS: Quantitative inference of dynamic regulatory pathways via microarray data. BMC Bioinformatics. 2005, 6: 44-10.1186/1471-2105-6-44.
Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'Alché-Buc F: Gene networks inference using dynamic Bayesian networks. Bioinformatics. 2003, 19: ii138-ii148. 10.1093/bioinformatics/btg1071.
Sontag ED: Mathematical Control Theory: Deterministic Finite Dimensional Systems. 1998, Springer, New York, 2
van Someren EP, Wessels LFA, Reinders MJT, Backer E: Searching for limited connectivity in genetic network models. Proceedings of the Second International Conference on Systems Biology. 2001, 222-230.
Novikov E, Barillot E: Regulatory network reconstruction using an integral additive model with flexible kernel functions. BMC Systems Biology. 2008, 2: 8-10.1186/1752-0509-2-8.
Elowitz MB, Leibler S: A synthetic oscillatory network of transcriptional regulators. Nature. 2000, 403: 335-338. 10.1038/35002125.
Huang CHF, Ferrell JE: Ultrasensitivity in the mitogen-activated protein kinase cascade. Proc Natl Acad Sci USA. 1996, 93: 10078-10083. 10.1073/pnas.93.19.10078.
Pritchard L, Kell DB: Schemes of flux control in a model of Saccharomyces cerevisiae glycolysis. Eur J Biochem. 2002, 269: 3894-3904. 10.1046/j.1432-1033.2002.03055.x.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297.
This work was supported by the Institut Curie and the Ligue Nationale Contre le Cancer. E.N. and E.B. are members of the Equipe Biologie des Systèmes from the Service de Bioinformatique of Institut Curie, équipe labellisée par La Ligue Nationale Contre le Cancer.
The authors declare that they have no competing interests.
EN developed the model, performed software implementation and drafted the manuscript. EB conceived of the study and participated in coordination. All authors read and approved the final manuscript.