Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption

Background: Uganda, like other Sub-Saharan African countries, has a high under-five child mortality rate. To inform policy on intervention strategies, sound statistical methods are required to identify factors strongly associated with under-five child mortality. The Cox proportional hazards model has been a common choice for analysing such data, taking age as the time-to-event variable. However, because of its restrictive proportional hazards (PH) assumption, covariates of interest that do not satisfy the assumption are often excluded from the analysis to avoid mis-specifying the model; including covariates that clearly violate the assumption would yield invalid results.

Methods: Survival trees and random survival forests are increasingly popular for analysing survival data, particularly large survey data, and are attractive alternatives to models with the restrictive PH assumption. In this article, we adopt random survival forests, which have not previously been used to study factors affecting under-five child mortality rates in Uganda using Demographic and Health Survey data. The first part of the analysis uses the classical Cox PH model, and the second part uses random survival forests in the presence of covariates that do not necessarily satisfy the PH assumption.

Results: Random survival forests and the Cox proportional hazards model agree that the sex of the household head, the sex of the child and the number of births in the past year are strongly associated with under-five child mortality in Uganda, given that all three covariates satisfy the PH assumption. Random survival forests further demonstrated that covariates excluded from the earlier analysis because they violated the PH assumption were important in explaining under-five child mortality rates. These covariates include the number of children under the age of five in a household, the number of births in the past 5 years, the wealth index, the total number of children ever born and the child's birth order. The results further indicated that random survival forests built using covariates including those that violate the PH assumption had higher predictive performance than random survival forests built using only covariates that satisfy the PH assumption.

Conclusions: Random survival forests are appealing methods for analysing public health data to understand factors strongly associated with under-five child mortality rates, especially in the presence of covariates that violate the proportional hazards assumption.

Electronic supplementary material: The online version of this article (doi:10.1186/s13104-017-2775-6) contains supplementary material, which is available to authorized users.

For a binary covariate X, the hazard ratio (HR) compares the hazard of individuals in category X = 1 with that of individuals in category X = 0. HR = 1 indicates that individuals in the two categories are at the same hazard of experiencing the event. HR > 1 indicates that individuals in category X = 1 are at a higher hazard of experiencing the event. Lastly, HR < 1 indicates that individuals in category X = 0 are at a higher hazard of experiencing the event.
To use the Cox proportional hazards model, all covariates entered in the model must satisfy the proportional hazards assumption; where the PH assumption is violated, the model can give invalid results.

The split-rule used in step 2 of the survival tree algorithm (Algorithm 1) is central to tree building. In this article, we use the log-rank and the log-rank score split-rules.

Method 2 Survival trees and random survival forests
Algorithm 1: Survival tree algorithm
1: At each node, randomly select √p covariates from the p available covariates as candidates for splitting the node into two daughter nodes.
2: Compute the impurity measure, based on a predetermined split-rule, at the node for each covariate selected in step 1.
3: Split the node into two daughter nodes (α and β) using the value of the impurity measure. The best covariate split maximizes the difference between the two daughter nodes.
4: Recursively repeat steps 2 and 3 by treating each daughter node as a root node.
5: Stop splitting when a node cannot be split further without producing a daughter node with fewer than d_0 > 0 unique observed events; such a node is terminal.

The log-rank split-rule
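Suppose a node h can be split into two daughter nodes α and β. The best split at node h, on a covariate x at a split point c*, is the one that gives the largest log-rank statistic between the two daughter nodes [ciampi1987recursive]. The log-rank statistic for a split on x at a given covariate value c* is defined as:

$$L(x, c^*) = \frac{\sum_{j=1}^{J} \left( d_{\alpha,j} - E(D_{\alpha,j}) \right)}{\sqrt{\sum_{j=1}^{J} \mathrm{Var}(D_{\alpha,j})}}$$

where $d_{\alpha,j}$ is the number of events in daughter node α at time point j and the sums run over the J distinct event time points. The expected number of events in daughter node α, $E(D_{\alpha,j})$, and its variance are given by:

$$E(D_{\alpha,j}) = \frac{d_j R_{\alpha,j}}{R_j}, \qquad \mathrm{Var}(D_{\alpha,j}) = \frac{R_{\alpha,j}}{R_j} \left( 1 - \frac{R_{\alpha,j}}{R_j} \right) \left( \frac{R_j - d_j}{R_j - 1} \right) d_j$$

where $d_j$ is the total number of observed events at time point j, $R_{\alpha,j}$ is the number of individuals at risk in node α at time point j, and $R_j$ is the combined number at risk in daughter nodes α and β. The algorithm for building a survival tree using the split-rule based on the log-rank statistic is given in Algorithm 2 below.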
Algorithm 2: The log-rank survival tree algorithm
1: At each node, randomly select √p covariates from the p available covariates as candidates for splitting the node into two daughter nodes.
2: At a node h, compute the log-rank statistic defined above for the daughter nodes α and β formed by all possible splits on all covariates considered for splitting at the node.
3: Choose the covariate and split point that give the largest log-rank statistic among the candidate splits, and partition the node into two daughter nodes at that split.
4: Recursively repeat steps 2 and 3 by treating each daughter node as a root node.
5: Stop splitting when a node cannot be split further without producing a daughter node with fewer than d_0 > 0 unique observed events; such a node is terminal.
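Steps 2 and 3 of the log-rank survival tree algorithm can be illustrated with a minimal Python sketch. It assumes a single numeric covariate, splits of the form x ≤ c, and the standard hypergeometric variance for the log-rank statistic; the function names `logrank_statistic` and `best_split` are hypothetical and are not taken from the software used in the study.

```python
import numpy as np

def logrank_statistic(time, event, in_alpha):
    """Log-rank statistic comparing daughter node alpha (in_alpha == True) with beta."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    in_alpha = np.asarray(in_alpha, dtype=bool)
    numerator, variance = 0.0, 0.0
    for t in np.unique(time[event]):               # distinct event time points j
        at_risk = time >= t
        R_j = at_risk.sum()                        # at risk in alpha and beta combined
        R_aj = (at_risk & in_alpha).sum()          # at risk in alpha
        d_j = (event & (time == t)).sum()          # events at time point j
        d_aj = (event & (time == t) & in_alpha).sum()
        numerator += d_aj - d_j * R_aj / R_j       # observed minus expected in alpha
        if R_j > 1:                                # variance term undefined for R_j == 1
            p = R_aj / R_j
            variance += p * (1 - p) * (R_j - d_j) / (R_j - 1) * d_j
    return numerator / np.sqrt(variance) if variance > 0 else 0.0

def best_split(x, time, event):
    """Return (c, |statistic|) for the split x <= c with the largest absolute statistic."""
    x = np.asarray(x, dtype=float)
    best_c, best_stat = None, 0.0
    for c in np.unique(x)[:-1]:                    # each c leaves both daughters non-empty
        stat = abs(logrank_statistic(time, event, x <= c))
        if stat > best_stat:
            best_c, best_stat = c, stat
    return best_c, best_stat

# Toy data: children with x == 1 die earlier than children with x == 2.
x = [1, 1, 1, 2, 2, 2]
time = [1, 2, 3, 4, 5, 6]
event = [1, 1, 1, 1, 1, 1]
print(best_split(x, time, event))
```

In a full tree, `best_split` would be called for each of the √p randomly selected covariates at a node, and the covariate with the largest absolute statistic would define the split.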

The log-rank score split-rule
The log-rank score split-rule [hothorn2003exact] is a modification of the log-rank split-rule defined above. It uses the log-rank scores [lausen1992maximally]. Let $r = (r_1, r_2, \ldots, r_N)$ be the rank vector of the survival times with their indicator variables $(T, \delta) = ((T_1, \delta_1), (T_2, \delta_2), \ldots, (T_N, \delta_N))$, and let $a = a(T, \delta) = (a_1(r), a_2(r), \ldots, a_N(r))$ denote the score vector, which depends on the ranks in r. Assume that the ranks order the predictor variable in such a way that $x_1 < x_2 < \ldots < x_N$. The log-rank score for an observation at $T_l$ is given by:

$$a_l = \delta_l - \sum_{k=1}^{\Gamma_l} \frac{\delta_k}{N - \Gamma_k + 1}$$

where $\Gamma_k = \#\{t : T_t \le T_k\}$ is the number of individuals that have died or been censored before or at time $T_k$. The log-rank score statistic is defined as:

$$i(x, c) = \frac{\sum_{x_l \le c} a_l - n_1 \bar{a}}{\sqrt{n_1 \left( 1 - \frac{n_1}{N} \right) S_a^2}}$$

where $n_1$ is the number of observations with $x_l \le c$, and $\bar{a}$ and $S_a^2$ are the mean and sample variance of the scores $\{a_j : j = 1, 2, \ldots, N\}$. The best split is the one that maximizes $|i(x, c)|$ over all covariates $x_j$ and possible split points c.
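The log-rank scores and the resulting split statistic can be computed directly, as in the following Python sketch. It assumes no tied survival times, uses the sample variance with divisor N − 1, and considers a single covariate with a split of the form x ≤ c; the names `logrank_scores` and `logrank_score_statistic` are hypothetical.

```python
import numpy as np

def logrank_scores(time, event):
    """Log-rank score a_l for every observation (assumes no tied times)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    N = len(time)
    # Gamma_k: number of individuals that died or were censored at or before T_k
    gamma = np.array([(time <= t).sum() for t in time])
    scores = np.empty(N)
    for l in range(N):
        earlier = time <= time[l]                  # observations with T_k <= T_l
        scores[l] = event[l] - np.sum(event[earlier] / (N - gamma[earlier] + 1))
    return scores

def logrank_score_statistic(x, time, event, c):
    """Standardised sum of the scores falling in the daughter node x <= c."""
    a = logrank_scores(time, event)
    left = np.asarray(x, dtype=float) <= c
    n1, N = left.sum(), len(a)                     # requires 0 < n1 < N
    return (a[left].sum() - n1 * a.mean()) / np.sqrt(n1 * (1 - n1 / N) * a.var(ddof=1))

scores = logrank_scores([1, 2, 3], [1, 1, 1])
print(scores, scores.sum())
print(logrank_score_statistic([1, 2, 3], [1, 2, 3], [1, 1, 1], c=1))
```

Without ties the scores sum to zero, so the statistic reduces to a standardised sum of the scores in one daughter node; large absolute values indicate a split that separates early from late events.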

Random survival forests
Single trees are generally unstable, and hence researchers have recommended growing an entire forest [breiman2001random, dietterich2002ensemble]. Random survival forests [ishwaran2008random, ishwaran2014randomforestsrc] are considered to be the solution to the problems of using a single survival tree. The random survival forests algorithm [ishwaran2008random] is given as:

Algorithm 3: Random survival forest algorithm
1: Draw B bootstrap samples from the original data set. Each bootstrap sample excludes, on average, about 37% of the data; the excluded observations are called out-of-bag (OOB) data.
2: Grow a survival tree for each bootstrap sample. At each node, randomly select √p of the p covariates as candidates for splitting. Split the node on the covariate that maximizes the difference between daughter nodes under a predetermined split-rule.
3: Grow the tree to full size under the constraint that a terminal node should have no less than d_0 > 0 unique deaths.
4: Calculate the cumulative hazard (CH) for each tree. Average over the trees to obtain the ensemble cumulative hazard.
5: Using the OOB data, calculate the prediction error for the ensemble cumulative hazard.
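Step 1 of the random survival forest algorithm can be sketched in Python as follows. For a bootstrap sample of size n drawn with replacement, the expected share of observations left out is (1 − 1/n)^n, which is roughly 0.37 for large n. The function name `bootstrap_with_oob`, the sample size and the number of bootstrap samples below are illustrative assumptions, not values from the study.

```python
import numpy as np

def bootstrap_with_oob(n, B, seed=0):
    """Return B pairs (in_bag, oob) of row indices for a data set with n rows."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(B):
        in_bag = rng.integers(0, n, size=n)            # draw n rows with replacement
        oob = np.setdiff1d(np.arange(n), in_bag)       # rows never drawn are out-of-bag
        samples.append((in_bag, oob))
    return samples

n, B = 1000, 200                                       # illustrative sizes only
samples = bootstrap_with_oob(n, B)
oob_fraction = np.mean([len(oob) / n for _, oob in samples])
print(round(oob_fraction, 3), round((1 - 1 / n) ** n, 3))
```

Each tree would be grown on its `in_bag` rows, and the prediction error in step 5 would use, for every observation, only the trees for which that observation is out-of-bag.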
Integrated Brier scores [graf1999assessment] are used to compare the predictive performance of all the split-rules used in random survival forests for this study. At a given time point t, the Brier score for a single subject is defined as the squared difference between the observed survival status (1 = alive at time t, 0 = dead at time t) and a model-based prediction of surviving beyond time t. Using the test sample of size $N_{test}$, the Brier score at time t is given by:

$$BS(t) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left[ \frac{\left( 0 - \hat{S}(t \mid x_i) \right)^2 \mathbf{1}(T_i \le t, \delta_i = 1)}{\hat{G}(T_i \mid x_i)} + \frac{\left( 1 - \hat{S}(t \mid x_i) \right)^2 \mathbf{1}(T_i > t)}{\hat{G}(t \mid x_i)} \right]$$

where $\hat{S}(t \mid x_i)$ is the predicted probability that subject i survives beyond time t and $\hat{G}(t \mid x) \approx P(C > t \mid X = x)$ is the Kaplan-Meier estimate of the conditional survival function of the censoring times. The integrated Brier score (IBS) is given by:

$$IBS = \frac{1}{\max_i(T_i)} \int_{0}^{\max_i(T_i)} BS(t)\, dt$$

It is common practice to hold out part of the available data to validate the model. This avoids the overfitting that arises from using the same data set to train and test the model. The available data, however, are often not large enough to divide into separate training and test sets. We therefore used a 10-fold cross-validation approach, in which the data set is split into 10 subsets of approximately equal size and the IBS is calculated on each left-out fold while the model is trained on the other 9 subsets.