 Research note
 Open Access
Robustness of zero-augmented models over generalized linear models in analysing fertility data in Nigeria
BMC Research Notes volume 12, Article number: 815 (2019)
Abstract
Objective
Fertility is a count outcome that is usually right-skewed and contains more zeros than the distributional assumptions of generalized linear models (GLMs) allow. This study examined the robustness of zero-augmented models over GLMs in fitting fertility data across regions in Nigeria. The 2013 Nigeria Demographic and Health Survey data were used. The fertility models fitted were: Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, hurdle Poisson and hurdle negative binomial. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were used to identify the model with the best fit (α = 0.05).
Results
The percentages of zero counts in the fertility responses were 21.3, 23.9, 31.1, 30.7, 37.6 and 42.4 in the North West, North East, North Central, South West, South South and South East regions respectively. In all six regions of Nigeria, the zero-augmented models fitted better than the generalized linear models, except in North Central. In general, the zero-augmented negative binomial based models fitted better than their Poisson based counterparts or, in rare cases, were indistinguishable from them. However, a specific family of zero-augmented models is recommended for each region in Nigeria.
Introduction
Count events occur frequently in all disciplines. In demography, count data such as the number of children ever born, the number of deaths and the number of migrations have previously been modelled by Poisson regression [1]. One of the important assumptions guiding the use of the Poisson distribution is the equality of mean and variance, which may not hold in practice. If this assumption is violated, the estimation method will produce biased estimates, inefficient standard errors, and misleading confidence intervals and p-values [2]. Because of this limitation, researchers have recommended the negative binomial distribution, which has an additional parameter that accounts for the overdispersion commonly seen in count outcomes, thus relaxing the constraint of equal mean and variance [3].
Researchers have also argued that count events are often characterized by a large number of zeros [4,5,6,7], which makes modelling count data with either the Poisson or the negative binomial model inappropriate. Although the Poisson and negative binomial distributions allow for zero counts, data may contain more zero responses than either distribution implies; this violation of their distributional assumptions is often referred to as the excess zero problem. Several studies have modelled fertility experience based on the distribution of the fertility pattern in different countries [3, 8,9,10,11,12,13,14] with a view to identifying factors influencing fertility. In Nigeria, the determinants of fertility have been examined using Poisson regression to account for the count nature of the variable [9, 11] and negative binomial regression to account for overdispersion or heterogeneity [3, 8]. Aside from the limitations of the Poisson and negative binomial models for fertility data in Nigeria, the analysis is often conducted at national level, thus neglecting some of the consequences of cultural diversity at regional level.
Nigeria has six regions defined by sociocultural differences which have implications for fertility. Striking variation in fertility exists across these regions, ranging from a total fertility rate (TFR) of 4.3 in South South to 6.7 in North West [15]. Nigeria is the most populous country in Africa, with a population of about 200 million; the population of each of its six regions exceeds that of countries such as Togo, Republic of Benin, Liberia and Malawi, to mention a few [16]. Thus, modelling fertility data at national level with a single model is likely to be fraught with hidden errors, because the number of zeros and the level of skewness differ across regional data structures. Different models may therefore be suitable for fertility in different regions. The current study extends [7] by modelling fertility data in each region of Nigeria with six different distributions and evaluating the suitability of the models in each region.
Main Text
Methods
Data collection and utilization
The 2013 Nigeria Demographic and Health Survey (NDHS) dataset was used to fit the models. Data collection followed a multistage cluster sampling technique. Prior to the survey, Nigeria was demarcated into smaller units known as enumeration areas (EAs), also called clusters. This demarcation took state boundaries into consideration so that no cluster spanned more than one state. The respondents were selected from each cluster based on a rural–urban allocation of specific numbers of clusters in the country. The current study used the individual recode data, with information provided by women of childbearing age (15–49 years). Further information about the sampling strategy used for data collection can be found on the data originator’s website [15].
Data management
The outcome variable of interest was fertility, measured by the number of children ever born (CEB) and obtained from a total sample of 38,948 women. The data were weighted and the clustering effect was adjusted for in the various count models, but the data were left unweighted for the skewness test and the descriptive summaries of children (Additional file 1). To examine the correlation between CEB and the background characteristics of women, a pairwise correlation test with Bonferroni correction [17] was conducted for each region. Twelve variables were used for the model fit: residence, women's educational level, religion, ethnicity, wealth index, contraceptive use, currently residing with partner, number of other wives, age at first sex, husband's educational level, women's working status and husband's/partner's age. All these independent variables were retained for North Central and North West. For South East, South South and South West, residing with partner, number of other wives and partner's education were removed, with one additional variable, women's working status, excluded for North East due to collinearity. All analyses were performed using Stata 15.0 at the 0.05 level of significance.
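As a rough illustration of the Bonferroni step, the sketch below multiplies each unadjusted p-value by the number of tests and caps the result at 1. The p-values are hypothetical and are not taken from the study, and the function name `bonferroni_adjust` is introduced here for illustration only; how many tests enter the correction in the authors' Stata analysis is not stated in the text.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(p_values)  # number of tests in the family
    return [min(1.0, p * m) for p in p_values]

# Hypothetical unadjusted p-values for correlations between CEB and
# three background characteristics (illustrative only).
raw = [0.001, 0.02, 0.30]
adjusted = bonferroni_adjust(raw)

# Compare the adjusted p-values with the 0.05 level used in the paper.
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

With these illustrative inputs, only the first correlation remains significant after adjustment, although the second was below 0.05 before it.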
Generalized linear models
Poisson model
The most common technique employed to model count data is Poisson regression. Its defining feature is the equality of mean and variance. Its probability mass function is given as:
\[ P(Y_{i} = y_{i}) = \frac{e^{-\mu_{i}} \mu_{i}^{y_{i}}}{y_{i}!}, \quad y_{i} = 0, 1, 2, \ldots \]
where \(y_{i}\) denotes the count response, that is, the number of children ever born, and \(\mu_{i}\) is its mean [18, 19].
Negative binomial model
The negative binomial (NB) distribution is a two-parameter distribution combining the Poisson distribution and the Gamma distribution (Gamma–Poisson mixture). It relaxes the assumption of equality of mean and variance, thus accounting for unobserved heterogeneity in count data [19,20,21,22]. Its probability mass function is given as:
\[ P(Y = y \mid \mu, \alpha) = \frac{\Gamma(y + \alpha^{-1})}{\Gamma(y + 1)\,\Gamma(\alpha^{-1})} \left( \frac{1}{1 + \alpha\mu} \right)^{\alpha^{-1}} \left( \frac{\alpha\mu}{1 + \alpha\mu} \right)^{y}, \quad y = 0, 1, 2, \ldots \]
The mean and variance of the negative binomial distribution are E[y | µ, α] = µ and V[y | µ, α] = µ(1 + αµ), where α > 0 is the dispersion parameter and µ > 0. Special cases of the negative binomial include the Poisson (α = 0) and the geometric (α = 1) [19].
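The mean and variance relationship stated above can be checked numerically. The following minimal sketch assumes the mean-dispersion (NB2) parameterization, in which the Gamma shape is 1/α; the values µ = 3 and α = 0.5 are illustrative and are not estimates from the paper.

```python
import math

def nb_pmf(y, mu, alpha):
    """NB probability of count y with mean mu and dispersion alpha > 0."""
    inv = 1.0 / alpha  # shape of the Gamma mixing distribution
    log_p = (math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
             + inv * math.log(inv / (inv + mu))
             + y * math.log(mu / (inv + mu)))
    return math.exp(log_p)

mu, alpha = 3.0, 0.5  # illustrative values only

# Sum over a range wide enough that the omitted tail is negligible.
support = range(400)
mean = sum(y * nb_pmf(y, mu, alpha) for y in support)
var = sum((y - mean) ** 2 * nb_pmf(y, mu, alpha) for y in support)

print(round(mean, 6), round(var, 6), mu * (1 + alpha * mu))

# Zero probability under the NB versus a Poisson with the same mean.
print(nb_pmf(0, mu, alpha), math.exp(-mu))
```

The numerical variance agrees with µ(1 + αµ), and the negative binomial places more mass at zero than a Poisson with the same mean, which is why overdispersion alone can absorb some, but not necessarily all, of the excess zeros.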
Zero-inflated models
For the zero-inflated Poisson (ZIP), the first process consists of a Poisson distribution that generates counts, some of which may be zero (sampling zeros), and the second process is governed by a binary distribution (logit or probit) that generates zero values only (structural zeros) [23]. Given a variable \(y_{i}\), the ZIP model probability mass function has two components as follows:
\[ P(Y_{i} = y_{i}) = \begin{cases} p + (1 - p)\, e^{-\mu_{i}}, & y_{i} = 0 \\ (1 - p)\, \dfrac{e^{-\mu_{i}} \mu_{i}^{y_{i}}}{y_{i}!}, & y_{i} > 0 \end{cases} \]
The outcome variable \(y_{i}\) is a non-negative integer, \(\mu_{i}\) is the expected Poisson count for the ith individual and \(p\) is the probability of extra zeros.
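Similarly to the ZIP, the zero-inflated negative binomial (ZINB) model is employed to account for both overdispersion and excess zeros. For a dependent variable \(y_{i}\) with many zeros, the ZINB model probability mass function is given as:
\[ P(Y_{i} = y_{i}) = \begin{cases} p + (1 - p) \left( \dfrac{1}{1 + \alpha\mu_{i}} \right)^{\alpha^{-1}}, & y_{i} = 0 \\ (1 - p)\, \dfrac{\Gamma(y_{i} + \alpha^{-1})}{\Gamma(y_{i} + 1)\,\Gamma(\alpha^{-1})} \left( \dfrac{1}{1 + \alpha\mu_{i}} \right)^{\alpha^{-1}} \left( \dfrac{\alpha\mu_{i}}{1 + \alpha\mu_{i}} \right)^{y_{i}}, & y_{i} > 0 \end{cases} \]
where α ≥ 0 is an overdispersion parameter [22].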
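To make the two-process structure concrete, the sketch below evaluates a zero-inflated Poisson in which a structural zero occurs with probability p and a Poisson count is drawn otherwise. The values µ = 3 and p = 0.2 are illustrative assumptions, not estimates from the NDHS data; replacing the Poisson term with a negative binomial term would give the ZINB analogue.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zip_pmf(y, mu, p):
    """Zero-inflated Poisson: structural zero with probability p,
    otherwise a draw from Poisson(mu), which may also be zero."""
    base = (1.0 - p) * poisson_pmf(y, mu)
    return p + base if y == 0 else base

mu, p = 3.0, 0.2  # illustrative values only

support = range(200)
total = sum(zip_pmf(y, mu, p) for y in support)          # should be about 1
zip_mean = sum(y * zip_pmf(y, mu, p) for y in support)   # (1 - p) * mu
zero_share = zip_pmf(0, mu, p)                           # structural + sampling zeros

print(round(total, 6), round(zip_mean, 6), round(zero_share, 4))
```

The zero share combines both sources of zeros, p + (1 − p)·exp(−µ), and is therefore always at least as large as the Poisson zero probability when p ≥ 0.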
Hurdle models
In the hurdle Poisson (HP) model, the first part is the hurdle at zero, which accommodates fewer or more zero outcomes than the Poisson distribution implies, and the second part governs the zero-truncated part, that is, the positive outcomes [2, 19, 23]. Given a variable \(y_{i}\), the HP probability distribution is given as:
\[ P(Y_{i} = y_{i}) = \begin{cases} 1 - p, & y_{i} = 0 \\ p\, \dfrac{e^{-\mu} \mu^{y_{i}}}{\left( 1 - e^{-\mu} \right) y_{i}!}, & y_{i} > 0 \end{cases} \]
where µ is the mean of the untruncated Poisson model and \(p\) is the probability of a positive count. When \(\left( 1 - p \right) > \exp\left( -\mu \right)\), the data contain more zeros relative to the Poisson model.
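The hurdle negative binomial (HNB) is used when the hurdle model is appropriate and the data exhibit overdispersion [19, 24]. The HNB model is given as:
\[ P(Y_{i} = y_{i}) = \begin{cases} 1 - p, & y_{i} = 0 \\ p\, \dfrac{f_{NB}(y_{i}; \mu, r)}{1 - f_{NB}(0; \mu, r)}, & y_{i} > 0 \end{cases} \qquad f_{NB}(y; \mu, r) = \frac{\Gamma(y + r)}{\Gamma(y + 1)\,\Gamma(r)} \left( \frac{r}{r + \mu} \right)^{r} \left( \frac{\mu}{r + \mu} \right)^{y} \]
The mean and variance of the untruncated negative binomial component are µ and µ(1 + µ/r) respectively; the quantity 1/r is a measure of dispersion [22].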
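The hurdle construction can be sketched in a few lines. This example assumes, following the condition (1 − p) > exp(−µ) in the text, that p denotes the probability of a positive count, so that zeros occur with probability 1 − p and positive counts follow a zero-truncated Poisson. The values µ = 3 and p = 0.7 are illustrative only; a hurdle negative binomial would replace the truncated Poisson term with a truncated negative binomial term.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def hurdle_poisson_pmf(y, mu, p):
    """Hurdle Poisson: zero with probability 1 - p; positive counts
    follow a Poisson(mu) truncated at zero, scaled by p."""
    if y == 0:
        return 1.0 - p
    return p * poisson_pmf(y, mu) / (1.0 - math.exp(-mu))

mu, p = 3.0, 0.7  # illustrative values only

total = sum(hurdle_poisson_pmf(y, mu, p) for y in range(200))
excess_zeros = (1.0 - p) > math.exp(-mu)  # more zeros than Poisson implies

# When 1 - p equals exp(-mu), the hurdle model reduces to the plain Poisson.
p_equal = 1.0 - math.exp(-mu)
same_as_poisson = abs(hurdle_poisson_pmf(2, mu, p_equal) - poisson_pmf(2, mu)) < 1e-12

print(round(total, 6), excess_zeros, same_as_poisson)
```

Unlike the zero-inflated form, all zeros here come from a single process, which is why the hurdle model can also represent fewer zeros than the Poisson distribution implies.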
Model assessment and evaluation
Model selection was based on the maximum likelihood estimates of the model parameters, using the log-likelihood and two information criteria (IC), namely the Akaike (AIC) and Bayesian (BIC) criteria. A lower IC value implies that the model is of better fit [25, 26]. A difference in IC values greater than 10 implies that the model with the smaller IC is superior, a difference of 4 to 10 suggests moderate superiority of one model over the other, and a difference of less than 4 implies that the competing models are indistinguishable [26].
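The decision rule above can be sketched as follows, using the standard definitions AIC = 2k − 2·logL and BIC = k·ln(n) − 2·logL, where k is the number of estimated parameters and n the sample size. The log-likelihoods, parameter counts and sample size below are hypothetical and are not taken from Table 2; the function names are introduced for illustration.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion for k estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

def compare(ic_a, ic_b):
    """Classify the gap between two IC values using the thresholds in the text."""
    diff = abs(ic_a - ic_b)
    if diff > 10:
        return "superior"
    if diff >= 4:
        return "moderately superior"
    return "indistinguishable"

# Hypothetical fits of two competing models to the same n = 500 observations.
n = 500
model_a = {"loglik": -1000.0, "k": 5}
model_b = {"loglik": -990.0, "k": 7}

aic_a, aic_b = aic(**model_a), aic(**model_b)
bic_a, bic_b = bic(n=n, **model_a), bic(n=n, **model_b)

print(aic_a, aic_b, compare(aic_a, aic_b))
print(round(bic_a, 2), round(bic_b, 2), compare(bic_a, bic_b))
```

Because BIC penalizes additional parameters more heavily than AIC whenever ln(n) > 2, the two criteria can rank the same pair of models differently, as they do for North Central in this study.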
Results
Socioeconomic and demographic characteristics of respondents
In Nigeria, 29.5% of women aged 15 to 49 years had no child; this percentage was highest in South South (42.4) and lowest in North West (21.3) (Fig. 1). The mean number of children ever born was highest in North West (3.89 ± 3.36) and lowest in South South (2.32 ± 2.58). As presented in Table 1, the age at first sex was lower in the Northern part of the country than in the Southern part, namely South East (18.96 ± 4.35), South West (18.69 ± 3.6) and South South (17.27 ± 3.22), with the exception of North Central (18.06 ± 3.78). More women with no education were recorded in the Northern regions, and women's wealth quintiles were higher in the Southern regions than in the Northern regions. About 16% of women in Nigeria used any method of contraception, and this varied across regions.
Model selection criteria for the fitted model
The model assessments for each region are presented in Table 2, using the AIC and BIC values as the basis for evaluation. The hurdle negative binomial model was of best fit for North West (AIC = 45,421.19, BIC = 45,775.64) and South East (AIC = 13,767.37, BIC = 14,026.82), while the zero-inflated negative binomial provided a better fit for North East (AIC = 24,565.28, BIC = 24,828.33). In South South, the zero-inflated negative binomial had only moderate superiority over the hurdle negative binomial (AIC = 16,138.5, BIC = 16,411.23). For the South West region, both AIC and BIC suggest that ZINB and ZIP are indistinguishable as best fit (\(ZINB \le ZIP < HNB \le HP < NB < Poisson\)) and that no superiority exists between the zero-inflated models and their hurdle analogs. In all cases, the zero-modified models were better than the GLMs, except for North Central, where the BIC suggests that NB is of best fit (\(NB < HNB < ZINB < HP < ZIP < Poisson\)), contrary to the AIC and the log-likelihood (\(HNB < ZINB < HP < ZIP < NB < Poisson\)). Similarly, the models that include an overdispersion parameter were better than their corresponding models that do not account for overdispersion.
Discussion
This study examined the effectiveness of zero-augmented models compared to the standard Poisson and negative binomial models widely used for modelling fertility in Nigeria [3, 9, 11]. The current analysis was conducted separately in each of the six regions in Nigeria.
The results, using the AIC and BIC as model selection criteria, revealed that both the hurdle negative binomial and the zero-inflated negative binomial provide a better fit for fertility data with a large number of zeros and overdispersion. In general, the AIC and BIC estimates from the zero-augmented negative binomial based models (HNB and ZINB) indicated a better fit than their Poisson based counterparts or, in rare cases, were indistinguishable from them. Consequently, accounting for both excess zeros and overdispersion is recommended for fertility modelling, not only at national level but also at regional level. These findings are similar to those of other studies with a similar data generating mechanism containing a large number of zeros [24, 27, 28]. Previous studies have noted that zero-inflated models are statistically appropriate in low fertility population studies, especially when there is a large number of women with no children [13, 29].
The adjudged best model for each of the regions was used to identify the determinants of fertility peculiar to each region. For North Central, women with at least secondary level of education, partners with secondary education and women not working are factors driving low fertility. Secondary education, Igbo ethnicity and higher age at first sex are factors determining low fertility in the North East. Residing in rural areas, secondary education, tertiary education, poorer women compared to poor women, no other wives, higher age at first sex and women not working are factors determining a low level of fertility in the North West. Urban residence, women not working and increasing women's educational level are factors responsible for a low level of fertility in the South East. Increasing level of women's education, wealth index, higher age at first sex and women not working are drivers of low fertility in South South. Secondary and higher level of education, urban residence and women not working are factors contributing to a low fertility level in the South West (Additional file 2).
In conclusion, the assessment in this paper provides evidence that fertility count data, which are usually right-skewed with excess zeros, should be modelled using zero-augmented models of the negative binomial variant.
Limitation
Children ever born (CEB) was captured in the NDHS based on the reported full birth history of women of reproductive age. There is a likelihood of gross underreporting of CEB due to cultural beliefs and norms around reporting the actual number of births.
Availability of data and materials
This study used a secondary dataset from the Measure DHS program. The dataset can be accessed after due permission from the DHS program archive and can be downloaded at https://dhsprogram.com/data/dataset/Nigeria_StandardDHS_2013.cfm?flag=0.
Abbreviations
AIC: Akaike Information Criterion
BIC: Bayesian Information Criterion
CEB: children ever born
DF: degree of freedom
EAs: enumeration areas
HNB: hurdle negative binomial
HP: hurdle Poisson
IC: information criterion
NB: negative binomial
SD: standard deviation
TFR: total fertility rate
ZIP: zero-inflated Poisson
ZINB: zero-inflated negative binomial
References
1. Hilbe JM. Modeling count data. In: Lovric M, editor. International Encyclopedia of Statistical Science. Berlin: Springer; 2011.
2. Winkelmann R, Zimmermann KF. Recent developments in count data modelling: theory and application. J Econ Surv. 1995;9(1):1–24.
3. Alaba OO, Olubusoye OE, Olaomi JO. Spatial patterns and determinants of fertility levels among women of childbearing age in Nigeria. South Afr Fam Pract. 2017;59(4):143–7.
4. Hur K, Hedeker D, Henderson W, Khuri S, Daley J. Modeling clustered count data with excess zeros in health care outcomes research. Heal Serv Outcomes Res Methodol. 2002;3(1):5–20.
5. Yusuf OB, Afolabi RF, Ayoola AS. Modelling excess zeros in count data with application to antenatal care utilisation. Int J Stat Probab. 2018;7(3):22.
6. Samsudin S, Moffatt PG. Modelling count data with excess zeros: an application to health care utilisation data. Malaysian J Econ Stud. 2014;51(2):201–15.
7. Kareem YO, Yusuf OB. Statistical modeling of fertility experience among women of reproductive age in Nigeria. J Stat Appl. 2018;8(1):23–33.
8. Adebowale AS. Ethnic disparities in fertility and its determinants in Nigeria. Fertil Res Pract. 2019;5(3):1–16.
9. Akpa OM, Ikpotokin O. Modeling the determinants of fertility among women of childbearing age in Nigeria: analysis using generalized linear modeling approach. Int J Humanit Soc Sci. 2012;2(18):7–11.
10. Dana DD. Binary logistic regression analysis of identifying demographic, socioeconomic, and cultural factors that affect fertility among women of child bearing age in Ethiopia. Sci J Appl Math Stat. 2019;6(3):65.
11. Fagbamigbe AF, Adebowale AS. Current and predicted fertility using Poisson regression model: evidence from 2008 Nigerian demographic health survey. Afr J Reprod Heal. 2014;18(1):71–83.
12. Pandey R, Kaur C. Modelling fertility: an application of count regression models. Chin J Popul Resour Environ. 2015;13(4):349–57.
13. Poston DLJ, McKibben SL. Using zero-inflated count regression models to estimate the fertility of U.S. women. J Mod Appl Stat Methods. 2003;2(2):10.
14. Silva JMCS, Covas F. A modified hurdle model for completed fertility. J Popul Econ. 2000;13:173–88.
15. National Population Commission (NPC) [Nigeria] and ICF International. Nigeria Demographic and Health Survey 2013. Abuja, Nigeria, and Rockville, Maryland: NPC and ICF International; 2014.
16. United Nations. World Population Prospects: The 2012 Revision. New York: Population Division; 2013.
17. Armstrong RA. When to use the Bonferroni correction. Ophthalmic Physiol Opt. 2014;34(5):502–8.
18. Rodriguez G. Poisson Models for Count Data, Chapter 4. 2007. p. 1–50. http://data.princeton.edu/wws509/notes/c4.pdf. Accessed 30 Dec 2017.
19. Cameron AC, Trivedi PK. Essentials of Count Data Regression (Chapter 15). A Companion to Theoretical Econometrics. Malden: Blackwell Publishing Ltd.; 1999.
20. Reese R. The Poisson and negative binomial distributions. 2016. Available from: http://stats.stackexchange.com/questions/37814/poissonistoexponentialasgammapoissonistowhat http://math.usu.edu/jrstevens/biostat/PoissonNB.pdf
21. Baum CF. Models for count data and categorical response data. Adelaide: Boston College and DIW Berlin, University of Adelaide; 2010.
22. Yesilova A, Kaydan MB, Kaya Y. Modeling insect-egg data with excess zeros using zero-inflated regression models. Hacettepe J Math Stat. 2010;39(2):273–82.
23. Lam KF, Xue H, Cheung YB. Semiparametric analysis of zero-inflated count data. Biometrics. 2006;62:996–1003.
24. Chipeta MG, Ngwira BM, Simoonga C, Kazembe LN. Zero adjusted models with applications to analysing helminths count data. BMC Res Notes. 2014;7:856.
25. Vrieze SI. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol Methods. 2012;17(2):228–43. https://doi.org/10.1037/a0027127.
26. Pan W. Akaike’s information criterion in generalized estimating equations. Biometrics. 2001;57(1):120–5.
27. Hu MC, Pavlicova M, Nunes EV. Zero-inflated and hurdle models of count data with extra zeros: examples from an HIV-risk reduction intervention trial. Am J Drug Alcohol Abuse. 2011;37(5):367–75.
28. Desjardins CD. Modeling zero-inflated and overdispersed count data: an empirical study of school suspensions. J Exp Educ. 2016;84(3):449–72.
29. Melkersson M, Rooth DO. Modeling female fertility using inflated count data models. J Popul Econ. 2000;13(2):189–203. http://www.jstor.org/stable/20007710.
Acknowledgements
The authors acknowledge the kind permission of Measure Demographic and Health Survey to use the data for this study.
Funding
This research received no grant from any funding agency in the public, commercial or not-for-profit sectors.
Author information
Contributions
YOK conceived the original idea of the study, designed the study, analyzed the data and drafted the manuscript. IMB contributed to the design of the study, interpretation of findings and revision of the manuscript. ASA contributed to the conception of the study, interpretation, and revision of the manuscript. JOA contributed to statistical analysis, interpretation and revision of the manuscript. OBY contributed to the conception and design of the study, statistical analysis and revision of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Yusuf Olushola Kareem.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Kareem, Y.O., Morhason-Bello, I.O., Adebowale, A.S. et al. Robustness of zero-augmented models over generalized linear models in analysing fertility data in Nigeria. BMC Res Notes 12, 815 (2019) doi:10.1186/s1310401948525
Keywords
 Zero-augmented models
 Fertility
 Best fit model
 Nigeria