A training manual for event history data management using Health and Demographic Surveillance System data

Objective: The objective of this research note is to introduce a training manual for event history data management. The manual provides a first comprehensive guide to longitudinal Health and Demographic Surveillance System (HDSS) data management, with a step-by-step description of the process of structuring and preparing a dataset for the calculation of demographic rates and event history analysis. The research note provides some background information on the INDEPTH Network and the iShare data repository, and describes the need for a manual to guide users on how to handle HDSS datasets correctly.
Results: The approach outlined in the manual is flexible and can be applied to other longitudinal data sources. It facilitates the development of standardised longitudinal data management and the harmonization of datasets to produce a comparative set of results.
Electronic supplementary material: The online version of this article (doi:10.1186/s13104-017-2541-9) contains supplementary material, which is available to authorized users.


Introduction
Computing individual exposure and basic rates requires a core residency file in a standard format. In the case of the INDEPTH Network, this format allows member Centres to provide surveillance data in a standard way for repository and analytical purposes. However, the proposed longitudinal data format is not limited to HDSS data: it is applicable to any type of longitudinal data, i.e. data obtained from surveillance, registers, or retrospective or panel surveys. In fact, much of the data collection conducted by INDEPTH Network member Centres involves part surveillance data (often in the form of retrospective data) and part survey data (often in the form of panel data).
In addition to its first goal, computing rates and conducting analysis, the event history analysis (EHA) or core residency file format allows for the calculation of longitudinal data quality metrics:
1. The very process of creating the core residency file first involves checking basic inconsistencies in dates and types of events: out-of-range values, coding errors, unusual frequencies, etc.
2. The second and most crucial step in checking quality involves the construction of a matrix crossing (current) events with following events, referred to as the event consistency matrix, to check the coherence of event sequences. The ordering of events is easily checked through this matrix, and details of order inconsistencies can be produced to make corrections in the base files before event history data is processed further.
3. The first and second steps are internal consistency checks. The third step involves computing the basic rates defining entry into and exit from the population under study and checking these rates against known trends (external consistency checks) from the same population, from similar populations or from larger spatial units (region, country). In the context of surveillance systems or registers, these basic rates are birth and in-migration rates, and death and out-migration rates. It is usually not possible to check these rates for retrospective data sources because the dead and the emigrants are not accounted for.
Employing a standardised format facilitates pooling data for multi-site analysis. It further allows variable naming, labelling of categorical variables and display formatting to be applied consistently. Standardisation helps create a common language and enables the sharing of code for data management and analysis. Because the longitudinal data structure is the same, statistical programmes can be understood by a large community.

Constructing the core residency file
The surveillance data associated with a particular individual over the course of her exposure to demographic surveillance are represented by a series of event records. The first event for any individual will be either enumeration, birth, or in-migration. This will be followed by a number of records corresponding to each observed event involving that individual, such as out-migration, death, or end of observation (censoring). All these basic events define entry or exit from a population under observation, i.e. a population defined within time and spatial boundaries. In other words, the basic events define the residency status of each individual recorded in the database: this is why the longitudinal data comprising these events is named the "core residency file". Each event has a set of standard attributes (Table 1) that are common to all events, followed by a series of attributes specific to the event. All variables in bold font are compulsory for inclusion in the core residency file.

Destination (same as for Origin)
Location exit (EXT = 5): The event of leaving a residential location within the surveillance area to take up residence in another residential location within the surveillance area.
LocationId_dest: The LocationId of the destination location within the surveillance area to which the individual moved.
Location entry (ENT = 6): The event of taking up residence in a residential location within the surveillance area following a location exit event. Note that location exit and entry are actually two parts of the same action of changing residential location, and as such they happen on the same event date.
Table 3 presents an example of an episode file, with start event and end event in each record. The entries in italic font, i.e. after an individual's death or out-migration, do not usually appear in the original record. When no event occurred by the end of observation, the variables EndEventCode and EndEventDate might be missing (individual A). It is necessary to define a censoring date, i.e. a date at which observation ends (here 31 Dec 2010), and to create an extra record spanning the time between the last recorded event and the end of observation date. In the Stata programme 2 (2.5.2), the same censoring date is proposed for all individuals, but this consistency across individuals is not absolutely necessary (although much easier to manage). What is absolutely necessary is that for a particular individual, the censoring date is the same in all data files.
Table 4 presents an example of a basic EHA file, i.e. a core residency file. Extracted from the original "residency episode file" (with start and end events for each exposure episode), the following core residency file is obtained by sorting recorded events for each individual by dates of occurrence (Stata programme 1). Note that all records for an individual end with an OBE (end of observation) record. After defining a censoring time (OBE) for all individuals (Stata programme 2), the "residence" variable can be computed (Stata programme 3). This file format is called the long format.

Check-list for core residency file
1. The aim is to get a file with variables as specified in Table 1 (variables in bold font are mandatory). The Stata programme 1 (2.5.1) helps to convert an "episode file" (where each episode of residence is marked with a start event and an end event in a single record) into a core residency file where all episodes are ordered according to the date of events.
2. Missing values: make a distinction between 'not applicable' (NA) because of logical conditions (e.g. no education level for children under 6), 'don't know' (DK) as a response category (i.e. the respondent answered but does not know), 'refusal' (RE) as a response category (i.e. the respondent does not wish to answer), and 'missing', which means a response should have been recorded but was not, due to data collection, data entry, or data processing hazards.
3. The variable EventCode in the EHA file should include:
a. at least the following events: BTH, ENU, IMG, OMG, DTH, OBE (definitions in Table 2)
b. preferably internal moves: EXT, ENT (see item 5 below)
4. All 'hanging cases' (when the residency status or even the survival of a resident absent at the moment of interview is unknown) should have been dealt with. This is why an OBE is compulsory to define the last reliable date of observation (right-censoring date). The Stata programme 2 (2.5.2) is provided to create the censoring observation (OBE) if not readily available in the original dataset. The section below explains how to create an OBE using the "stset" and "stsplit" commands.
5. Some sites do not record internal moves. If they do, an entry ENT should follow an exit EXT, as these codes mark the change of household within the HDSS. An EXT not followed by an ENT is considered a 'hanging case'.
6. It is best to include "events" coded OBS and DLV. However, these events are not required for analysing mortality or migration.
7. To achieve this 'long format', all individuals must have at least two records sorted by date, from the earliest to the latest.
8. After sorting the dataset at the end of Step 2, a consistency matrix described in the next section (2.4) will help identify inconsistencies in the order of events.
9. The "residence" variable depends on the logical sequence of core events and on the time criterion for residence (Step 4, 2.5.4). The variable "residence" is a binary variable that defines the period of exposure in the population under study, which is essential to all event history analysis. It is constructed using two sources of information:
- The logical sequence of core events: ENU, BTH, DTH, IMG, OMG, OBE; as well as the following incidental events: EXT, ENT, OBS, OBL (accounted for in the Stata programme).
- The time criterion for residence, which is specific to each site. Continuous registration systems (such as HDSS, registers etc.) use different time criteria for residency: some require a 1-month duration of residence in the site to be considered a true resident, some a 6-month duration, and some another time threshold. For comparative analysis purposes, it is highly advisable to use a common time criterion. Six months is generally the preferred time criterion as it has become almost an international standard in demographic analysis.
This time criterion applies to the entry into or exit from the HDSS that is used to determine residency status. It does not apply to the duration from birth to out-migration or death.
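The distinction between kinds of missing values in item 2 of the check-list can be kept in Stata with extended missing values (.a to .z), which remain excluded from calculations while staying distinguishable from one another. The sketch below is only an illustration under stated assumptions: the variables age and educ_raw and the raw codes 98 and 99 are hypothetical and do not come from the manual.

```stata
* Hypothetical example data: educ_raw uses 98 = don't know, 99 = refusal
clear
input long IndividualId byte age int educ_raw
1  4  .
2 30 98
3 41 99
4 25  2
5 33  .
end

* Start from the raw value; plain "." is kept for truly missing responses
generate educ = educ_raw
replace educ = .a if age < 6                    // not applicable (NA)
replace educ = .b if educ_raw == 98 & age >= 6  // don't know (DK)
replace educ = .c if educ_raw == 99 & age >= 6  // refusal (RE)

* Extended missing values can carry their own value labels
label define educlab .a "NA" .b "DK" .c "RE"
label values educ educlab
tabulate educ, missing
```

With this coding, a condition such as "if !missing(educ)" drops all four kinds of missing value at once, while "educ == .b" still isolates one of them.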

Time ordering in milliseconds
By default, Stata assumes that successive events are ordered in time and not simultaneous (a time gap should separate each event from the next). With time in milliseconds (%tc display format, double storage format), all events that were originally entered in days (as usual in HDSS data collection) will be set at 00:00 of that day (i.e. the very start of that day). Stata does not offer a time format in minutes or hours, so it is necessary to set time in milliseconds to order events that occurred on the same recorded day. A different time of day must be set for different types of events to avoid confusion between successive events that occurred on the same day. The advised procedure is the following:
1. Set all events to 12:00 (i.e. midday)
2. Set date of birth to 12:00
3. Set all events that mean entering the population to 06:00 (i.e. ENU, BTH, IMG)
4. If applicable, set EXT events to 06:00
5. Set all events that mean exiting the population to 18:00 (i.e. OMG, DTH)
6. If applicable, set ENT events to 18:00
That way, babies who died on the same day they were born will always die 12 hours after they were born. Also, individuals who out-migrated on the same day they in-migrated will always out-migrate 12 hours after they in-migrated. Other events that do not relate to entering or exiting the population (i.e. OBE, DLV, OBS, OBL, IPT, PER, AGE) will be dated at 12:00.
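The procedure above can be sketched as follows. This is a minimal illustration rather than the manual's own Stata programme: the toy data and the string variable d are invented, and the numeric event codes follow the value label used in the Stata programmes (1 ENU, 2 BTH, 3 IMG, 4 OMG, 5 EXT, 6 ENT, 7 DTH, 9 OBE).

```stata
* Toy data: a same-day birth and death, and a same-day in- and out-migration
clear
input long IndividualId int EventCode str9 d
1 2 "01jan2010"
1 7 "01jan2010"
2 3 "15mar2010"
2 4 "15mar2010"
2 9 "31dec2010"
end

* Convert daily dates to milliseconds; %tc values must be stored as double
generate double EventDate = cofd(date(d, "DMY"))
format EventDate %tc

* One hour expressed in milliseconds
local hour = 60*60*1000

* All events at 12:00, then entries (and EXT) moved to 06:00
* and exits (and ENT) moved to 18:00
replace EventDate = EventDate + 12*`hour'
replace EventDate = EventDate - 6*`hour' if inlist(EventCode, 1, 2, 3, 5)
replace EventDate = EventDate + 6*`hour' if inlist(EventCode, 4, 6, 7)

sort IndividualId EventDate
list IndividualId EventCode EventDate, sepby(IndividualId)
```

Storing EventDate as double matters here: a float cannot hold millisecond values exactly, so times would be silently rounded.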

Creating OBE using "stsplit" command after "stset" command
The easiest way to create an OBE event (end of observation) for all individuals is to use the "stset" command and then the "stsplit" command. The "stset" command is meant for event history analysis: it creates the censoring variable using the event variable (EventCode) and time variables (EventDate, datebeg) conditional on residence in the HDSS (residence==1).
The programme that runs "stset" and "stsplit" is in Section 2.6.1 below. Despite the time changes explained in the previous section 2.4, "stset" may produce the warning:
entry on or after exit (datebeg>EventDate) PROBABLE ERROR
Stata actually means (datebeg>=EventDate). Because datebeg should never be greater than EventDate, this is an indication that datebeg is actually equal to EventDate on a particular record, i.e. that two successive records end on the same date, at the same hour. The direct consequence is that the record where datebeg=EventDate will be discarded. Usually this has no consequence for the analysis (for example when a DLV occurred on the same day as an OBS or AGE), but it is important that these potential errors are checked and corrected at least for the events that relate to entering or exiting the population (i.e. ENU, BTH, IMG, OMG, DTH) and for internal moves (i.e. EXT, ENT).
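Records likely to trigger this warning can be listed before running "stset". The sketch below assumes that datebeg is simply the EventDate of the individual's previous record, which may differ from how the manual's programme builds it; the toy data and the variable names dt and zero_length are invented for illustration.

```stata
* Toy data: a delivery (10) and the end of observation (9) share one timestamp
clear
input long IndividualId int EventCode str18 dt
1  1 "01jan2009 06:00:00"
1 10 "10may2010 12:00:00"
1  9 "10may2010 12:00:00"
end
generate double EventDate = clock(dt, "DMYhms")
format EventDate %tc

* Assumed definition: each record starts where the previous one ended
sort IndividualId EventDate
by IndividualId: generate double datebeg = EventDate[_n-1]
format datebeg %tc

* A missing datebeg (first record) counts as larger than any date in Stata,
* so it must be excluded explicitly
generate byte zero_length = datebeg >= EventDate & !missing(datebeg)
list IndividualId EventCode datebeg EventDate if zero_length
```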
Once the "stset" command has been executed, then the "stsplit" command can be used to set an OBE date for all individuals (see programme in section 2.6.1 for details). After each "stsplit", it is always safe to sort the data again by IndividualId and EventDate and, at the time of analysis, compute again the censoring variable (see section 5.3.2.).

Producing and interpreting the event consistency matrix
Beyond the errors in dates explained in the previous section, logical errors in the order of events have to be corrected. It is highly probable that, when the dataset is first compiled, some inconsistencies are present in the order of events. The event consistency matrix helps to identify these:
a. The first event should always be ENU, IMG or BTH.
b. The last event should be 'missing' or OBE, OMG or DTH.
c. Common errors include events that occur twice in a row, e.g. two successive OMG (a resident would need to IMG before OMG again).
d. Some sequences are impossible, e.g. DTH before BTH, IMG, or OMG.

Step 3 provides the code for producing the event consistency matrix.
To produce this matrix one needs a variable for the preceding event, "EventPrec", which is crossed with the current event, "EventCode". The program is quite straightforward; its interpretation is less so. The following matrix will help in identifying the inconsistencies. For core residency events, the matrix is fairly simple:
The crossed cells indicate errors: where events are recorded consistently, there should be no observations in these cells. The cell with a question mark suggests that deaths might be reported by proxy respondents after out-migration, but this is highly improbable and should be checked against the original data files. All other cells will be populated with varying numbers of observations. Note that in cohort studies there should be no missing event before IMG (cell marked with "c"), since observation of all individuals starts with BTH or ENU, depending on whether the cohort is a birth cohort or not; conversely, there should be a missing event before ENU or BTH. Also, in retrospective surveys (cell and column marked with "r"), there should be no OBE after OMG, and there should be no DTH events.
To check inconsistencies, it is best to produce a listing of identifiers, e.g. for cases of BTH following ENU:
The matrix can be more complex when other events (EXT, ENT, OBS, OBL…) are added, but the same interpretation applies. The cells with question marks for EXT as a preceding event indicate 'hanging cases' that should be checked against the original data files:
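The matrix and the listing of identifiers described in this section could be produced along the following lines. This is a minimal sketch and not the manual's Step 3 programme: the toy data (with arbitrary numeric EventDate values) are invented, and the event codes follow the value label used in the Stata programmes (1 ENU, 2 BTH, 3 IMG, 4 OMG, 7 DTH, 9 OBE).

```stata
* Toy data: individual 2 has an impossible BTH recorded after ENU
clear
input long IndividualId int EventCode double EventDate
1 1 1000
1 4 2000
1 3 3000
1 9 4000
2 1 1000
2 2 1500
2 9 4000
end
label define eventlab 1 "ENU" 2 "BTH" 3 "IMG" 4 "OMG" 7 "DTH" 9 "OBE"
label values EventCode eventlab

* Preceding event within each individual (missing on the first record)
sort IndividualId EventDate
by IndividualId: generate EventPrec = EventCode[_n-1]
label values EventPrec eventlab

* Rows: preceding event; columns: current event
tabulate EventPrec EventCode, missing

* List the identifiers for one inconsistent cell: BTH following ENU
list IndividualId EventDate if EventPrec == 1 & EventCode == 2
```

The "missing" option keeps the row of first records (no preceding event) in the table, which is needed to check that every sequence starts with ENU, BTH or IMG.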

Stata programmes
Several steps are necessary to generate the core residency file described above:
- Step 1: Converting episode file into core residency file
Step 1 is not necessary if a core residency file was provided from the beginning. Check the nature of the input file first.
label define eventlab 1 "ENU" 2 "BTH" 3 "IMG" 4 "OMG" ///
5 "EXT" 6 "ENT" 7 "DTH" 9 "OBE" 10 "DLV", modify
lab val EventCode eventlab
* If EventDate and DoB are stored as integer variable %td or %d
* transform the format of the variable into "double": %tc
local formatdate: format EventDate
if ("`formatdate'"=="%d" | "`formatdate'"=="%td") {
replace EventDate=cofd (
* => baby who died on the same day they were born
* will always die 12 hours after they were born
* => people who out-migrate on the same day they in-migrated
* will always out-migrate 12 hours after they in-migrated
* Other events that do not relate to entering or exiting the population
* will be dated at 12:00, i.

Adding events linked to ego, e.g. fertility events
The events that change the residency status of the individual form the structure of the core residency file. They are sufficient to compute birth, death, in-and out-migration rates. However, other events may be added to complete the biography of each individual: these are incidental events defined as events that do not change the individual's residency status. They may be changes in activity, marital status, socioeconomic status, education, etc. One of these key biographical events is the delivery of a child.
This section uses the example of deliveries, although any type of biographical event that relates to the individual could be used, e.g. field visit, hospital visit, vaccination, marriage, spouse death, spouse migration, etc. Deliveries (live births in particular) are needed to compute fertility rates, which is why they are presented here as an example. To apply the procedure described below to another type of event, you just need to replace "deliveries" with the other event. If, however, the event date has to be imputed, it is best to refer to section 4.
Whatever the event, the crucial aspect to understand in this section is merging. Adding events that do not change residency status requires using a special form of merging, i.e. merging according to time. A command has been designed to achieve this using a Stata ado file, "tmerge.ado", with the corresponding help menu "tmerge.hlp". The Stata programmes are available in the Annexes of this document.
IndividualId (individual identifier): A number that uniquely identifies individuals. Here, the identifier is that of the mother. The mother may not necessarily be a resident of the site.
FatherId: Optional, in case the identifier of the father is known. The father may not necessarily be a resident of the site.
ChildId (child identifier)
Delivery (DLV): The event of a pregnancy ending after 28 weeks of gestation, which may or may not result in the birth of one or more individuals. If the delivery results in the birth of an individual, then it will be represented in the core residency file by a BTH event. If only live births have been recorded in the system, it is usually easier to extract these births from the core residency file using the condition that they can be linked to a mother and possibly a father.
Table 8 presents an example of an EHA file that includes deliveries of children. To save space, this file does not show the compulsory variables ChildId and ChildSex. These births are attached to identifiers of mothers, and possibly fathers. Complete birth histories might have been collected retrospectively, thus resulting in births that occurred before the onset of observation or outside the spatial limits of the site. Regardless, all births occurring within the site should be recorded. It is possible that some births in the site are attached to non-resident mothers (a child may be born to a mother who never resided in the site, admittedly a rare case). Whatever the case (births within site only or complete birth histories), all deliveries relating to mothers (and possibly fathers) who ever resided in the site should be included in the "residency event file". Comments on deliveries for each of the 6 individuals A to F:

Example of EHA file with birth history
A. The simplest case: no delivery in or out of the site.
B. Most common case: all deliveries occur in the site.
C. More complex case: some deliveries occur in the site, others do not. The last delivery was simultaneous with the death of the mother. When analysing female adult mortality, it is best to artificially separate by one day (or some hours) the delivery and the maternal death (duplicate the observation and add one day or some hours to the date of death of the mother).
D. Some deliveries may have occurred before enumeration if retrospective birth histories were collected. Because a delivery may occur on the same day as a residential event (e.g. entry into a new household), it is best to artificially separate by one day (or some hours) the entry and the delivery.
E. Guess what is wrong here! Check for date or event inconsistencies.
F. It might be the case that no delivery actually took place within the site; a delivery may even take place after out-migration, when the mother is not (or no longer) resident in the site but still on the identification system. The delivery may take place in the site while the mother was a temporary resident (e.g. she came to deliver at her own mother's home). Note that not all of a woman's deliveries may have been recorded, so the birth rank here starts at 2 for the first recorded delivery.

Procedure to add fertility events
Three steps are necessary to generate the file described above: -

3.2.3
Step 3: Merge file A and B into file C according to time using "tmerge.ado" (see Annexes):
10. Copy the Stata programs "tmerge.ado" and "tmerge.sthlp" (or "tmerge.hlp") into your "…\StataXX\ado\base\t\" sub-directory.
11. Merge file A and B using command "tmerge", replacing 'yoursite' by the name of your site:
tmerge IndividualId residency2(EDate) deliveries(dateDLV) yoursite_coreDLV(EventDate)
The new file yoursite_coreDLV will contain the core residency file with delivery events. If all individuals who have had deliveries have an end of observation OBE and if all these individuals are represented in both file A and B, there should be no error message after "tmerge". If you do have an error message and thus cannot complete step 3, then you have to check steps 1 and 2. Do not attempt to go further if "tmerge" does not work properly.
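Before running "tmerge", it can help to confirm that every individual in the deliveries file also appears in the core residency file, since a mismatch is one of the conditions mentioned above for an error message. The sketch below uses Stata's built-in "merge" command on invented data held in a temporary file; it does not call "tmerge" itself, the file and variable names other than IndividualId and dateDLV are hypothetical, and it should be run as a do-file so that the tempfile macro persists.

```stata
* File A stand-in: one row per individual present in the core residency file
clear
input long IndividualId
1
2
3
end
tempfile resid_ids
save `resid_ids'

* File B stand-in: one row per delivery (dateDLV kept in days for brevity)
clear
input long IndividualId str9 d
1 "05may2009"
1 "11aug2010"
4 "02feb2010"
end
generate dateDLV = date(d, "DMY")
format dateDLV %td

* _merge == 1 flags deliveries whose mother has no core residency record
merge m:1 IndividualId using `resid_ids'
list IndividualId dateDLV if _merge == 1
```

Rows with _merge == 2 are residents without any recorded delivery, which is expected and not an error.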

Adding other biographical events defining changes of status
Core event dates are recorded directly by the surveillance system. This is the case of birth, death, and migrations. Some other events such as delivery, field visit, hospital visit, vaccination, marriage, spouse death, spouse migration, etc., may also be recorded to the day (see previous section). However, many other events need to be reconstructed from differences in status variables recorded at different field visits. This would be the case for education status, employment status, matrimonial status, etc.
For example, education status may be recorded every three years. If a change in education level is recorded for an individual with no precise date of change, then it may be necessary to impute a date of change using the available information. Suppose a change from completed primary to completed secondary is recorded between rounds three years apart. The procedure could be to attribute a date of change using age and calendar information: the change probably occurred on the day the last school year results were announced, when the individual reached 18 years of age.
For other events that are not tied to a school year or any particular seasonal calendar, it is better to attribute a mid-term date between two rounds. This would be the case for employment status or matrimonial status. The following procedure is given for this more general example.
Table 10 presents an example of an EHA file that includes change of status at mid-term between rounds. Only changes of status between rounds are recorded, i.e. there could be several changes between rounds but only the difference between two successive rounds is recorded.
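The mid-term rule can be sketched as follows, assuming one record per individual and round. The names obsround, status and changed and the toy data are illustrative only, while ObservationDate and EventDate follow the variable names that appear in the "tmerge" call for status changes.

```stata
* Toy data: status changes between the second and third rounds
clear
input long IndividualId int obsround str9 d byte status
1 1 "01jan2008" 1
1 2 "01jan2011" 1
1 3 "01jan2014" 2
end
generate double ObservationDate = cofd(date(d, "DMY"))
format ObservationDate %tc

* Flag a change of status relative to the previous round
sort IndividualId ObservationDate
by IndividualId: generate byte changed = status != status[_n-1] if _n > 1

* Impute the date of change at mid-term between the two rounds
by IndividualId: generate double EventDate = ///
    (ObservationDate[_n-1] + ObservationDate) / 2 if changed == 1
format EventDate %tc
list IndividualId obsround status EventDate if changed == 1
```

Only the net difference between two successive rounds is detected this way; several changes within the same interval collapse into one imputed event.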

4.2.2
Step 2: Merge file A and B into file C according to time using "tmerge.ado" (see Annexes):
8. If not done before, copy the Stata programs "tmerge.ado" and "tmerge.sthlp" (or "tmerge.hlp") into your "…\StataXX\ado\base\t\" sub-directory.
9. Merge file A and B using command "tmerge", replacing 'yoursite' by the name of your site:
tmerge IndividualId residency2(EDate) status(ObservationDate) yoursite_corestatus(EventDate)
The new file yoursite_corestatus will contain the core residency file with change of status. If all individuals with change of status observations have an end of observation OBE, and if all these individuals are represented in both file A and B, there should be no error message after "tmerge". If you do have an error message and thus cannot complete step 2, then you have to check step 1. Do not attempt to go further if "tmerge" does not work properly.

Format the new EventDate as a date variable:
format EventDate %tc

Adding duration events
Categorising periods after or before some specific events can be extremely useful for longitudinal analysis. Often, time-to-event is period-dependent or duration-dependent. For example, the six months before delivery define a period of confirmed pregnancy. Similarly, the few months after delivery are characterised by post-partum amenorrhea, whose length depends on breastfeeding and other factors. Also, the changes in economic, social, and political macro-contexts from one calendar period to the next can be superimposed on biographies to better explain individuals' behaviour. Finally, while analysis time is often age (i.e. duration since birth), duration since a particular biographical event (e.g. duration since migration, marriage or first delivery) may serve as analysis time. In that case, controlling for age (or age group) as an independent variable can be useful.
In all cases, the procedures below create time-varying covariates from the files compiled using procedures from previous sections. In other words these procedures do not necessitate extra data: they consist of handling time using data at hand only and the resulting variables are powerful tools for data analysis.

Duration specific to individual's own biography
Given the importance of migration in HDSS settings, and its potential impact on other behaviours, the following procedures are meant to create three types of variables: the first one to define periods following in-migration, the second to define time spent outside the HDSS after out-migration and before returning to the HDSS, and the third to make a distinction between new in-migrants and return migrants. The rationale would be the same to define periods before or after any other events such as deliveries, marriage or employment spells.
With regard to migration status, the respondents can only be qualified as "permanent resident" at the time of enumeration if data were collected at enumeration on the duration of residence in the HDSS area. Otherwise, one should in principle define permanent residence after some threshold, say 3 or 5 years, and begin the analysis only after 5 years of running the HDSS. In practice, the uncertainty about permanent residency diminishes with time after enumeration, and one can accept some uncertainty starting 3 years after enumeration. In the absence of data on duration of residence at enumeration, our advice is to impose left-censoring at least 3 years after first enumeration (round 0), as the migration variables are not reliable when the HDSS is less than 3 years old. Therefore, in the absence of precise data collected at enumeration, the "permanent resident" status is an approximation to be interpreted cautiously in the first 3 years of the HDSS and with more confidence as one moves away from the enumeration date.
Out-migrated after more than 2 years but less than 5 years after in-migrating.

Create in-migration status
In the example described above, periods of 6 months, 2 years and 5 years after in-migration are defined. Three steps are necessary to generate the file:
- Step 1: Generate count variable of periods following in-migration.
- Step 2: Generate periods according to the duration of residence since last in-migration.