Are disadvantaged children more likely to be excluded from analysis when applying global positioning systems inclusion criteria?

Objective
When using global positioning systems (GPS) to assess an individual’s exposure to their environment, a first step in data cleaning is to establish minimum GPS ‘inclusion criteria’ (a set of rules used to determine which GPS data are able to be included in analyses). Care is needed at this stage to avoid any data exclusion (data loss) systematically biasing results in terms of characteristics of the environment and participants. The extent of potential systematic bias in sample retention due to GPS data loss and application of GPS inclusion criteria is unknown. The aim of this study was to describe differences in sample size and socio-demographic characteristics of the retained sample when applying three different GPS inclusion criteria. The study assessed 7-day GPS data collected from children (aged 9–13 years) recruited from nine schools in Auckland, New Zealand as part of the Kids in the City study.

Results
Participants from ethnic minorities and those attending schools in lower socioeconomic areas were disproportionately excluded from the retained samples. This highlights potential equity implications in basing the assessment of exposure, which ultimately influences research results on the relationship between environment and health, on non-representative GPS data.


Introduction
Increasingly, researchers are using global positioning systems (GPS) to track where people go, and to more precisely assess exposure to the environment compared to self-report or the residential neighbourhood [1]. Researchers have used GPS to explore relationships between the environment and diverse outcomes such as diet [2], unhealthy food purchasing [3], physical activity [4], and alcohol use [5]. GPS has also been used to assess exposure to pollution [6], routes travelled [7], independent mobility [8], and time spent indoors or outdoors [9].
Missing and erroneous data are a known issue with GPS [10][11][12]. GPS data may not be recorded for a number of reasons including signal drop out due to loss of satellite visibility, signal acquisition times, dead batteries, or data loss during download [12][13][14][15][16]. Recorded GPS data may be erroneous due to participants not wearing/losing the GPS device and signal scatter due to loss in satellite visibility [12][13][14][15][16]. Some data loss could be associated with participant characteristics and lead to systematic bias in study results [17].
Similarly, there is potential systematic bias due to application of GPS inclusion criteria used to determine whether a participant has sufficient data to reliably estimate behaviours of interest. Despite this, few GPS studies report their GPS inclusion criteria, and there are no standards among those that do [17].
For other wearable devices such as accelerometers, applying inclusion criteria has resulted in significant differences between the characteristics of the samples retained for analysis and those excluded [18][19][20]. However, no research has investigated the impact of applying inclusion criteria to GPS data. Furthermore, Meseck et al. [11] is the only study that has evaluated bias associated with GPS data loss. Therefore, this study aims to compare descriptive differences in sample size and sociodemographic characteristics of excluded/included participants when applying three different GPS inclusion criteria.

Methods
Data from the Kids in the City (KITC) study were used. Detailed methods are described elsewhere [21]. Children aged 9–13 years (109 males, 141 females) from Auckland, New Zealand were recruited from nine schools with diverse built environment characteristics and school socio-economic status (SES).
Participant demographic characteristics (sex, age, ethnicity, number of household cars) were collected from parents/caregivers in a computer-aided telephone interview. Number of household cars was a proxy for household SES. School SES data were sourced from the New Zealand Ministry of Education.
The shortest road network distance between each participant's home and the nearest school entrance was calculated using geographic information systems (GIS). Home addresses were geocoded and school entrance points manually digitised based on entrance locations visible in satellite imagery. A 2011 'improved road centreline' dataset was downloaded from http://www.koordinates.com. Non-walkable road segments (motorways and on-ramps) were removed before analysis. GIS analyses were undertaken in ArcGIS 9.3 (ESRI Inc, Redlands, CA).
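The study computed these distances in ArcGIS; the following is not that workflow but a minimal Python sketch of the same idea, using Dijkstra's algorithm on a toy graph. The node names, edge lengths, and the function name `shortest_network_distance` are all hypothetical.

```python
import heapq


def shortest_network_distance(graph, source, targets):
    """Return the path length (metres) from source to the nearest target.

    graph: dict mapping node -> list of (neighbour, length_m) tuples.
    targets: set of nodes, e.g. school entrance points snapped to the network.
    Returns None if no target is reachable.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        if node in targets:
            return dist  # the first target popped is the nearest one
        for neighbour, length_m in graph.get(node, []):
            candidate = dist + length_m
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


# Hypothetical toy network; undirected edges are listed in both directions.
# Non-walkable segments (motorways, on-ramps) are simply left out of the graph.
toy_graph = {
    "home": [("a", 120.0), ("b", 300.0)],
    "a": [("home", 120.0), ("gate_north", 450.0), ("b", 100.0)],
    "b": [("home", 300.0), ("a", 100.0), ("gate_south", 200.0)],
    "gate_north": [("a", 450.0)],
    "gate_south": [("b", 200.0)],
}

distance_m = shortest_network_distance(
    toy_graph, "home", {"gate_north", "gate_south"}
)
```

In this toy network the nearest entrance is reached via `a` and `b` rather than the direct `home` to `b` segment, which illustrates why a network distance can differ from a straight-line or single-segment distance.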
Seven consecutive days of GPS data were collected using QStarz BT-Q1000 and BT-Q1000XT units (Qstarz International Inc., Taiwan). The only relevant difference between the units was the greater storage capacity of the BT-Q1000XT units. Both units had sufficient storage for the study.
Data were collected during school terms in 2011 and 2012. GPS units were worn on a belt and collected data every 10 s. Participants recorded when they put on and took off the belt. During weekdays the research team visited the school to download the previous day's GPS data and charge units. On Fridays the children were given chargers and instructed to charge the units each weekend night. Weekend GPS data were downloaded by the research team on Monday at school.
Three GPS inclusion criteria were developed, applied and assessed.

Inclusion criterion 1
Inclusion criterion 1 was as inclusive as possible while also requiring minimally valid GPS and address data.
1. The home address was able to be geocoded; and
2. Participants reported a single home address; and
3. GPS data were recorded at the home address; and
4. Three or more hours of GPS data were collected during the 7-day data collection period.
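The four conditions above can be sketched as a simple check in Python. This is an illustration only: the `ParticipantSummary` fields are hypothetical, the rule for deciding that a GPS point lies "at the home address" is not specified here and is assumed to have been applied upstream, and hours are derived from point counts by assuming one point per 10 s epoch.

```python
from dataclasses import dataclass

EPOCH_SECONDS = 10  # GPS logging interval reported in the Methods


@dataclass
class ParticipantSummary:
    # All field names are hypothetical summaries of one participant's data.
    home_geocoded: bool      # home address could be geocoded
    n_home_addresses: int    # number of home addresses reported
    n_points_at_home: int    # GPS points recorded at the home address
    n_points_total: int      # GPS points over the 7-day collection period


def hours_from_points(n_points, epoch_seconds=EPOCH_SECONDS):
    """Convert a count of GPS points to hours, assuming one point per epoch."""
    return n_points * epoch_seconds / 3600


def meets_criterion_1(p):
    """Apply the four conditions of inclusion criterion 1."""
    return (
        p.home_geocoded
        and p.n_home_addresses == 1
        and p.n_points_at_home > 0
        and hours_from_points(p.n_points_total) >= 3
    )
```

Under the one-point-per-epoch assumption, three hours corresponds to 1,080 recorded points, so a participant with fewer points than that over the whole week would be excluded.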

Inclusion criterion 2
Investigating spatio-temporal location patterns from GPS data requires sufficient data points on different days of the week and times of the day. Ideally, this would mean using an inclusion criterion with a high minimum number of hours per day for different days of the week. However, participants with missing data (e.g., due to spending time in locations with poor satellite visibility) may also have periods of high-quality GPS data, and strict inclusion criteria may exclude otherwise potentially useful data. Therefore, the following approach was taken. First, the GPS data were divided into three categories: weekdays before school, weekdays after school, and weekends. Weekdays before school included GPS points recorded on weekdays, starting from the time the GPS was put on and ending at the start of school (based on school start time). Weekdays after school included GPS points recorded on weekdays from the end of school (based on school end time) and ending at the time the GPS was removed for the day. The different school start and end times were taken into account when categorising the GPS data. Weekends included all GPS data recorded on a Saturday or Sunday.
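The three-way categorisation described above can be illustrated with a short Python sketch. The function name, the category labels, and the handling of points that fall exactly on the school start or end time are assumptions; points are also assumed to have been trimmed already to the period between putting on and removing the belt.

```python
from datetime import datetime, time


def categorise_point(timestamp, school_start, school_end):
    """Assign one GPS timestamp to a time category.

    Returns "weekend", "weekday_before_school", "weekday_after_school",
    or None for weekday points recorded during school hours.
    school_start and school_end are the times for the participant's own school.
    """
    if timestamp.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday, Sunday
        return "weekend"
    clock = timestamp.time()
    if clock < school_start:
        return "weekday_before_school"
    if clock >= school_end:
        return "weekday_after_school"
    return None


# Hypothetical school hours of 08:55 to 15:00 on Monday 2 May 2011.
label = categorise_point(datetime(2011, 5, 2, 8, 15), time(8, 55), time(15, 0))
```

Passing the school times as arguments reflects the point made above that start and end times differed between schools.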
Next, the following additional inclusion criteria were applied to the complete GPS dataset:
1. At least 2 weekdays with at least 30 min before school data; and
2. At least 2 weekdays with at least 2 h after school data; and
3. At least 5 h of total weekend data.
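As a rough sketch, these three additional conditions could be checked as follows in Python. The input structure (minutes of data per weekday, keyed by day) is hypothetical, and the sketch assumes that the qualifying before-school days and after-school days need not be the same days, which the text does not specify.

```python
def meets_criterion_2(before_school_min, after_school_min, weekend_min):
    """Check the three additional conditions of inclusion criterion 2.

    before_school_min, after_school_min: dicts mapping a weekday label to
    minutes of GPS data in that category (hypothetical structure).
    weekend_min: total minutes of weekend GPS data.
    """
    days_before = sum(1 for m in before_school_min.values() if m >= 30)
    days_after = sum(1 for m in after_school_min.values() if m >= 120)
    return days_before >= 2 and days_after >= 2 and weekend_min >= 5 * 60


# Hypothetical participant: two qualifying mornings, two qualifying
# afternoons, and exactly five hours of weekend data.
included = meets_criterion_2(
    {"mon": 30, "tue": 45, "wed": 10},
    {"mon": 120, "thu": 200},
    300,
)
```

Because the conditions are described as additional, a full implementation would presumably apply them only to participants who already satisfy criterion 1.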
The number of valid days and the duration of valid GPS data were determined by considering the population, the purpose of the broader KITC study, and building on criteria used in published literature [22,23].

Inclusion criterion 3
The third criterion was based on inclusion criteria that had been applied to accelerometer data in the KITC study [24]:
1. Weekdays required at least three non-school hours of GPS data and weekends required at least 7 h of GPS data; and
2. Each participant required at least two valid days of weekday data and one valid weekend day.
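A minimal Python sketch of this criterion is given below. It assumes that the 7 h weekend threshold applies to each weekend day separately (which the reference to a "valid weekend day" suggests) and uses hypothetical per-day lists of hours as input.

```python
def meets_criterion_3(weekday_non_school_hours, weekend_day_hours):
    """Check inclusion criterion 3 from per-day hours of GPS data.

    weekday_non_school_hours: list of non-school GPS hours, one per weekday.
    weekend_day_hours: list of GPS hours, one per weekend day.
    """
    valid_weekdays = sum(1 for h in weekday_non_school_hours if h >= 3)
    valid_weekend_days = sum(1 for h in weekend_day_hours if h >= 7)
    return valid_weekdays >= 2 and valid_weekend_days >= 1


# Hypothetical participant with two valid weekdays and one valid weekend day.
included = meets_criterion_3([3.0, 4.5, 1.0, 0.0, 2.9], [7.0, 2.0])
```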
The number and percentage of included participants within categories of important demographic characteristics (school, sex, age, ethnicity, number of cars, distance to school) were calculated for each sample (full, criterion 1, criterion 2, criterion 3). Percentage retention for each category (e.g., number of males in criterion 3/number of males in full sample), and the percentage in each category compared to the total participants in the criterion (e.g., number of males in criterion 3/number of participants in criterion 3) were calculated for each criterion. For each characteristic, Pearson chi-square tests were used to compare the proportions between the full sample and each of the criteria.
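The two percentages and the chi-square comparison can be illustrated with a short Python sketch. The counts below are hypothetical and are not taken from Table 1, and the 2 x k table layout (full sample in one row, retained sample in the other) is only one possible reading of the comparison described above. The sketch returns the test statistic and degrees of freedom; a p-value would come from the chi-square distribution, for example via a statistics library.

```python
def retention_summary(full_counts, retained_counts):
    """Per category: percentage retained, and share of the retained sample."""
    retained_total = sum(retained_counts.values())
    summary = {}
    for category, n_full in full_counts.items():
        n_kept = retained_counts.get(category, 0)
        summary[category] = {
            "pct_retained": 100 * n_kept / n_full,
            "pct_of_sample": 100 * n_kept / retained_total,
        }
    return summary


def pearson_chi_square(full_counts, retained_counts):
    """Pearson chi-square statistic for a 2 x k table of category counts.

    Rows are the full sample and the retained sample; columns are categories.
    Returns (statistic, degrees_of_freedom).
    """
    categories = list(full_counts)
    rows = [
        [full_counts[c] for c in categories],
        [retained_counts.get(c, 0) for c in categories],
    ]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(categories) - 1


# Hypothetical counts for a two-category characteristic.
full = {"group_a": 100, "group_b": 100}
retained = {"group_a": 40, "group_b": 10}

summary = retention_summary(full, retained)
statistic, dof = pearson_chi_square(full, retained)
```

In this made-up example the two groups are equally represented in the full sample, but `group_a` makes up 80% of the retained sample, which is the kind of shift in representation the second percentage is designed to reveal.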

Results
One participant did not supply any demographic or GPS data, leaving 253 participants included in this analysis. Table 1 presents characteristics of the full sample alongside those for the sample under each of the three GPS criteria. Increasingly strict inclusion criteria reduced sample size (up to 81% loss for criterion 3).
With the exception of sex, the percentage of the sample retained under criteria 1-3 varied across the socio-demographic characteristics assessed. Different distance to school and age categories had similar percentage retentions under criterion 1, but retention varied more under criteria 2 and 3. There was no clear pattern between distance to school and percentage retained, nor between age and percentage retained.
The most marked variation in percentage retained was for school attended and ethnicity. Of participants in school 2, 95.7%, 17.4%, and 0% were retained under criteria 1, 2, and 3 respectively, compared with 96.7%, 40.0%, and 26.7% of participants from school 6. None of the Māori participants and a relatively low percentage of Samoan (13.2%) and Other Pacific Island (9.1%) participants were retained when applying criterion 3, compared to 35.1% of Europeans and 23.9% of Indian/Asian/other. Table 1 also presents the percentage of participants retained in each socio-demographic category in relation to the total number of participants in each criterion, revealing how the loss of numbers in the sample affects the representation in the sample. There was little change in the representation of females/males when each criterion was applied. However, the same could not be said for the other socio-demographic characteristics, with the most notable differences occurring again for school attended and ethnicity. Table 1 also presents p-values from the chi-square tests to provide an estimate of bias. There was evidence of a difference in proportions between the full sample and at least one of the criteria within ethnicity (criteria 1, 2, 3), school (criteria 2, 3) and age (criterion 3) categories.

Discussion
This study aimed to describe the impact of applying different GPS inclusion criteria. While it is obvious that the application of increasingly strict inclusion criteria will reduce the sample size, this study highlighted the dramatic reduction in sample size in our GPS dataset of New Zealand children. Of greater concern was the finding that the sample retention exhibited sociodemographic bias, and likely environmental bias due to the location of schools in diverse environments. Yet inclusion criteria are important to ensure data are as representative of participants' behaviour as possible. Ultimately, there is a trade-off between ideal criteria and maximising the retained sample size. Improving compliance of different subgroups, more comprehensive analysis of this trade-off, and the development of standardised GPS inclusion criteria are important knowledge gaps for researchers to address in future research.
As demonstrated here, applying certain inclusion criteria can result in small sample sizes, emphasising the importance of taking care to minimise data loss. Bias due to data loss may occur due to participant and device factors, some of which may be reduced by researchers following strategies such as testing GPS devices prior to use [12], setting up the devices to only collect necessary data (and save memory) with appropriate epochs [12], using devices that don't require participants to charge them, checking the device is working during data collection [21], providing participants with clear instructions [12], sending reminder messages to participants to charge the device [12,18], and providing a voucher as an incentive to participants [18].
Results highlighted differences in sample retention between schools. When applying criterion 3 the percentages of participants retained from schools 1, 2, 3, and 4 were lower than the other five schools.
Our descriptive results demonstrated striking differences in retention of participants by ethnicity, adding impetus to addressing a widely acknowledged challenge within child health research: that of engaging children and families from lower socioeconomic backgrounds and minority ethnic populations [25][26][27]. Māori and Pacific Island participants at schools with lower socioeconomic status were disproportionately excluded when applying stricter inclusion criteria. Māori and Pacific Islanders, and those with lower socio-economic status, also have poorer health [28,29], highlighting potential equity implications in basing the assessment of exposure (which impacts research results) on non-representative GPS data.

Conclusion
GPS allows researchers to measure exposure to the environment more precisely than self-report or using the residential neighbourhood as a proxy for exposure. In doing so, it is important to ensure that the GPS data represent the population and behaviours of interest. Researchers using GPS data should consider and report application of GPS inclusion criteria where relevant. In deciding on appropriate inclusion criteria, it is important to consider the research question and use of GPS data. Appropriate criteria may vary for different research questions and study populations. Assessment of socioeconomic and environmental biases in missing GPS data is needed to ensure appropriate interpretation of results.

Limitations
There may have been bias in the selection of participants into the study, which we were unable to account for. While our findings are sample specific, they highlight a potential issue that future studies could test by collecting, analysing, and reporting details of GPS data loss. This study did not assess environmental attributes. However, since the schools were located in different environments, it is likely that there would have been an environmental bias in the retained samples.
GPS data were only collected for 7 consecutive days, a period that is arguably not representative of typical behaviour. However, GPS data quality reduces with longer measurement periods [13].