 Research note
 Open Access
 Published:
Statistical inference for ordinal predictors in generalized additive models with application to Bronchopulmonary Dysplasia
BMC Research Notes volume 15, Article number: 112 (2022)
Abstract
Objective
Discrete but ordered covariates are quite common in applied statistics, and some regularized fitting procedures have been proposed for proper handling of ordinal predictors in statistical models. Motivated by a study from neonatal medicine on Bronchopulmonary Dysplasia (BPD), we show how quadratic penalties on adjacent dummy coefficients of ordinal factors proposed in the literature can be incorporated in the framework of generalized additive models, making tools for statistical inference developed there available for ordinal predictors as well.
Results
The approach presented allows to exploit the scale level of ordinally scaled factors in a sound statistical framework. Furthermore, several ordinal factors can be considered jointly without the need to collapse levels even if the number of observations per level is small. By doing so, results obtained earlier on the BPD data analyzed could be confirmed.
Introduction
Bronchopulmonary Dysplasia (BPD) is a chronic lung disease often found in preterm infants with lungs not fully developed. Disturbance of lung development and severity of BPD is caused by various peri and postnatal factors including prematurity itself, as well as pre and postnatal infections [1]. BPD is measured on ordinal scale with grades 0, 1, 2, 3, but often dichotomized as 0: ‘no/mild BPD’ and 1: ‘moderate/severe BPD’. One goal of the study reported here is to investigate whether the time after birth some specific bacteria were found for the first time in the children’s upper airways has an effect on BPD. Initially, \(n = 102\) preterm infants with a birth weight < 1000 g and gestational age \(\le\) 32 \(+\) 0 weeks were analyzed within a retrospective cohort study at the tertiary perinatal center of JustusLiebigUniversity Giessen (Germany) between January 2014 and June 2017. Two infants, however, had to be excluded from further analyses at some point due to missing information on some bacterial colonization. Earlier analyses already showed that the later bacteria were found, the lower the risk of developing BPD [2]. However, it is not fully understood yet which specific bacteria have an effect, and in which way. Therefore we will draw special attention to the time period/week (after birth) three types of bacteria—gram negative/positive and pathogenic—were found for the first time in the upper airway of the respective child. Although ‘time’ is supposed to be a continuous variable, information is only available in a discretized way here, because samples were only obtained once a week. Furthermore, the last category ‘week > 6’ is open/censored. If an observation is falling in the last category, we only know that until week six the respective germ had not been found yet. So, the corresponding covariate may only be considered as categorical but ordinal.
Besides the information on bacteria, some additional risk factors need to be taken into account, such as the weight and sex of the child, the number of days antibiotics and steroids were given, or information on multiples. For doing so, a logit model with categorical/ordinal predictor ‘bacteria/week’ and additional, potentially confounding covariates may be fit. Typically, a categorical predictor is included as dummycoded factor, ignoring, however, the information on the categories’ ordering (if so). In the case presented, additional problems are caused by the fact that some categories/levels only have a few observations, and sometimes all those observations are falling in the same response category. Consequently, some coefficients in a logit model fit via usual maximum likelihood tend towards \(\pm \infty\).
For preventing numerical problems like inflating regression coefficients, penalization can be a viable solution [3]. Furthermore, a penalty term can be used for exploiting/respecting a covariate’s ordinal scale level. Following [4,5,6,7], for instance, a difference/smoothing penalty might be put on adjacent dummy coefficients of the ordinal factor when fitting the model. An approach that has already been applied successfully in medical research; see, e.g., [8, 9]. In the BPD application, however, the question naturally arises how to test for significance of the ordinal predictor in the penalized setting. In a linear model with normal errors, this can be done using a (restricted) likelihood ratio test [10,11,12,13], after rewriting the ordinal penalty as a mixed model [14,15,16]. However, the corresponding test is not available for generalized linear models, such as the logit model considered here. In this note, we will illustrate how technology developed for generalized additive models [17, 18] can be used to fit generalized linear and additive models with ordinal smoothing penalty, and conduct further statistical inference.
Main text
Methods
Given a response y with distribution from a simple exponential family, and a set of covariates \(x_1,\ldots ,x_p\), a generalized additive model [17] has the form:
where \(\mu\) is the (conditional) mean of y given the covariates, h is a (known) response function, and \(\eta\) is comparable to the linear predictor in generalized linear models [19, 20]. The difference to a generalized linear model is that nonlinear functions \(f_j\), \(j=1,\ldots ,p\), are allowed in \(\eta\), but still the structure of \(\eta\) is additive. Of course, if \(f_j\) are restricted to be linear, a generalized linear model is obtained as a special case. In a (generalized) additive model, however, it is usually only assumed that functions \(f_j\) are reasonably smooth; and one way to fit such models, as for instance implemented in the popular R package mgcv [18, 21], is to specify a set of basis functions for each predictor and to employ an appropriate, quadratic smoothing penalty on the corresponding basis coefficients. That means, we assume that
with \(B_{j1}(x),\ldots ,B_{jq_j}(x)\) being a reasonably rich set of basis functions chosen for function \(f_j\), and \(\beta _{j1},\ldots ,\beta _{jq_j}\) are the corresponding basis coefficients. When fitting those basis coefficients to the data, a penalty term \(J_j(\beta _j)\) is typically added for each covariate \(x_j\), penalizing wiggly basis coefficients and thus wiggly functions \(f_j\). The strength of the penalty and hence the amount of smoothing is controlled through a tuning parameter, often denoted by \(\lambda _j\).
Now suppose you have a categorical predictor \(x_j\) with levels \(1,\ldots ,k_j\). Then, there is a somewhat natural basis: the basis of (dummy) functions (\(l=1,\ldots ,k_j\))
Since we know that \(x_j\) can only take values \(1,\ldots ,k_j\), we do not need to think about the type and number of basis functions, placing of knots, etc., as we usually do with continuous covariates. If now a (quadratic) firstorder difference penalty
is put on the basis/dummy coefficients \(\beta _j = (\beta _{j1},\ldots ,\beta _{jk_j})^\top\), this gives exactly the smoothing penalty as mentioned above [4,5,6,7]. Alternatively, the secondorder penalty
can be used [14]. One of the benefits of considering ordinal predictors along with quadratic difference penalties in the framework of generalized additive models is that after implementing basis (3) in the appropriate way, gam() from mgcv can be used directly to fit a generalized linear/additive model with ordinal predictor(s) as needed for the BPD data. Besides pure model fitting, however, this provides us with additional tools; in particular, builtin estimation of the penalty/smoothing parameter(s) via (restricted) maximum likelihood ((RE)ML), further statistical inference, such as confidence intervals, and checking significance of smooth terms. Those tools utilize the mixed model and Bayesian interpretation of quadratic smoothing penalties on basis coefficients such as (4) and (5); compare [18, 22, 23] for details.
Addon functions implementing the ordinal basis for use within mgcv have been made publicly available through R package ordPens [24]. After installing and loading ordPens, the gam() function from mgcv can be used with smooth terms s(..., bs = “ordinal”, m = 1) or s(..., bs = “ordinal”, m = 2) for the first and secondorder penalty, respectively. See the ordPens manual (R function ordSmooth()) for details and examples. To investigate whether the pvalues of Waldtype tests with respect to smooth terms as provided by summary.gam() are reliable if using the ordinal basis/smoothing penalty, we used the confounder model, i.e., the model with information on bacteria removed, to estimate BPD probabilities. That means, the null hypothesis that the effect of (ordinal, bacteriaspecific predictor) x is zero, is true by construction in this hypothetical model, because fitted BPD probabilities do not depend on x, given the other covariates. Using those probabilities, we simulated ‘new’ BPD response data, fit the model with smooth ordinal x added (and smoothing parameter estimated by REML), and stored the pvalue of x. For x, we used information on gram negative/positive and pathogenic bacteria, respectively. As noted above, the corresponding ordinal factor gives the week colonization by the respective type of (oral) bacteria was detected for the first time. For each type of x, this was repeated 1000 times.
Results
Figure 1 shows QQplots of the pvalues observed on the simulated data employing the first and secondorder penalty, respectively. Since the distribution of smaller pvalues is particularly relevant when testing with usual \(\alpha \le 0.1\), we restrict plotting to that area. It is seen that pvalues obtained when employing the firstorder penalty (red) are typically too small. Problems with the firstorder penalty can be explained by the fact that the null space of the corresponding smooth term has dimension zero (compare the mgcv manual). In other words, the null hypothesis in the framework of mixed models (which is used for estimation here), a zero variance component, is on the boundary of the parameter space, which means that standard theory does not apply [10, 11]. Results for the secondorder penalty (blue), by contrast, look very encouraging. Consequently, we will only report results on the actually observed BPD data for the secondorder penalty below. In earlier analyses [2], separate models were fit for each type of bacteria, and the first two weeks were collapsed to make (unpenalized) model fitting with dummycoded, ordinal factors feasible. Thanks to the penalties presented here, we are now able to include all three predictors jointly while using all information with the resolution available.
Table 1 (top) shows the results for the parametric terms if using the secondorder penalty (5) for the smooth terms. In particular, it is seen/confirmed that low birth weight is a risk factor for BPD, and also male infants and multiples have an increased risk of developing BPD. Antenatal steroids, by contrast, may decrease the risk. Results for the different types of bacterial colonization, which are included as ordinal predictors with smooth effects, are also given in Table 1 and Fig. 2. We see that the only significant effect is detected for pathogenic bacteria. The fitted function (Fig. 2, right) gives the impression that early detection is associated with increased risk of BPD. Statistical uncertainty, however, is very large (due to the small number of samples with week/level 1) as indicated by the confidence interval. When excluding information on gram negative and positive bacteria from the model, results for the remaining terms (Table 1, bottom) as well as fitted functions/coefficients for pathogenic bacteria (not shown) look very similar as before. In summary, our results using the ordinal smoothing approach are in line with earlier analyses [2], but allow for considering all three ordinal predictors (gram negative/positive, pathogenic bacteria) jointly, without the need to collapse levels.
Limitations
With respect to the application/BPD data, the main limitation is the small number of samples in week 1 for pathogenic bacteria. Since the shape of the function in Fig. 2 (right) depends on the coefficient for week/level 1, this shape should not be overinterpreted here. From a technical point of view, if using the secondorder penalty (5), a problem can occur with confidence intervals in terms of undercoverage if the fitted coefficient function is close to being linear (compare Fig. 2, left). This problem is also found for (generalized) additive models with continuous covariates, and the suggested fix is to change the target of inference to the smooth term plus the overall model intercept [18, 25]. Furthermore, our implementation does not include extensions like ordinal smoothing spline isotonic regression [7]. Finally, it should be noted that all statements made and conclusions drawn in this article refer to statistical inference if smoothing parameters are estimated by REML (or ML). When using GCV (which is the default in mgcv!), results should be treated with caution.
Availability of data and materials
The software presented together with examples is publicly available on CRAN through open source R addon package ordPens [24]. The data [2] that support the findings of this study are available on reasonable request from H.E. (harald.ehrhardt@paediat.med.unigiessen.de). An earlier yet extended version of this article providing further technical details and simulation studies is available at https://arxiv.org/abs/2102.01946.
Abbreviations
 BPD:

Bronchopulmonary Dysplasia
 (RE)ML:

(Restricted) maximum likelihood
 GCV:

Generalized crossvalidation
References
Gronbach J, Shahzad T, Radajewski S, Chao CM, Bellusci S, Morty RE, Reicherzer T, Ehrhardt H. The potentials and caveats of mesenchymal stromal cellbased therapies in the preterm infant. Stem Cells Int. 2018;2018:9652897.
Lauer T, Behnke J, Oehmke F, Bäcker J, Gentil K, Chakraborty T, Schloter M, Gertheiss J, Ehrhardt H. Bacterial colonization within the first six weeks of life and pulmonary outcome in preterm infants \(<\) 1000g. J Clin Med. 2020;9:2240.
Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
Gertheiss J, Tutz G. Penalized regression with ordinal predictors. Int Statis Rev. 2009;77:345–65.
Tutz G, Gertheiss J. Rating scales as predictors—the old question of scale level and some answers. Psychometrika. 2014;79:357–736.
Tutz G, Gertheiss J. Regularized regression for categorical data (with discussion). Statis Model. 2016;16:161–200.
Helwig NH. Regression with ordered predictors via ordinal smoothing splines. Front Appl Math Statis. 2017;3:15.
Cieza A, Oberhauser C, Bickenbach J, Chatterji S, Stucki G. Towards a minimal generic set of domains of functioning and health. BMC Public Health. 2014;14:218.
Glass SM, Ross SE. Modified functional movement screening as a predictor of tactical performance potential in recreationally active adults. Int J Sports Phys Ther. 2015;10:612–21.
Crainiceanu CM, Ruppert D. Likelihood ratio tests in linear mixed models with one variance component. J R Statis Soc B. 2004;66:165–85.
Crainiceanu CM, Ruppert D, Claeskens G, Wand MP. Exact likelihood ratio tests for penalized splines. Biometrika. 2005;77:91–103.
Greven S, Crainiceanu CM, Küchenhoff H, Peters A. Restricted likelihood ratio testing for zero variance components in linear mixed models. J Comput Graph Statis. 2008;17:870–91.
Scheipl F, Greven S, Küchenhoff H. Size and power of tests for a zero random effect variance or polynomial regression in additive and linear mixed models. Comput Statis Data Anal. 2008;52:3283–99.
Gertheiss J, Oehrlein F. Testing relevance and linearity of ordinal predictors. Electron J Statis. 2011;5:1935–59.
Gertheiss J. Anova for factors with ordered levels. J Agricult Biol Environ Statis. 2014;19:258–77.
Sweeney E, Crainiceanu C, Gertheiss J. Testing differentially expressed genes in dose–response studies and with ordinal phenotypes. Statis Appl Genet Mol Biol. 2016;15:213–35.
Hastie T, Tibshirani R. Generalized additive models. London: Chapman & Hall; 1990.
Wood SN. Generalized additive models: an introduction with R. 2nd ed. Boca Raton: CRC Press; 2017.
Nelder JA, Wedderburn RWM. Generalized linear models. J R Statis Soc A. 1972;135:370–84.
McCullagh P, Nelder JA. Generalized linear models. 2nd ed. New York: Chapman & Hall; 1989.
R Core Team: R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2021). R Foundation for Statistical Computing. https://www.Rproject.org/.
Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Statis Soc B. 2011;73:3–36.
Wood SN. On pvalues for smooth components of an extended generalized additive model. Biometrika. 2013;100:221–8.
Gertheiss, J., Hoshiyar, A.: ordPens: selection, fusion, smoothing and principal components analysis for ordinal variables. (2021). R package version 1.0.0. https://CRAN.Rproject.org/package=ordPens.
Marra G, Wood SN. Coverage properties of confidence intervals for generalized additive model components. Scand J Statis. 2012;39:53–74.
Acknowledgements
The authors would like to thank the Editor and an anonymous reviewer for their constructive feedback that helped us to improve the paper.
Funding
Open Access funding enabled and organized by Projekt DEAL. This work was supported in part by Deutsche Forschungsgemeinschaft (DFG) under Grant GE2353/21.
Author information
Affiliations
Contributions
JG conducted the data analysis/numerical experiments and wrote the manuscript. FS implemented the method for use within mgcv. TL and HE collected the data and were involved in the discussion of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The results presented only refer to a secondary data analysis. The original study [2] had been conducted following the rules of the Declaration of Helsinki of 1975, revised in 2013. The retrospective analysis was approved by the ethics committee of the JustusLiebigUniversity Giessen (Az 97/14).
Consent for publication
Not applicable (secondary data analysis).
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Gertheiss, J., Scheipl, F., Lauer, T. et al. Statistical inference for ordinal predictors in generalized additive models with application to Bronchopulmonary Dysplasia. BMC Res Notes 15, 112 (2022). https://doi.org/10.1186/s13104022059954
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13104022059954
Keywords
 Chronic lung disease
 Logit model
 Ordinal data
 Regularization
 Smoothing penalty