The General Social Survey-National Death Index: an innovative new dataset for the social sciences

Background
Social epidemiology seeks in part to understand how social factors (ideas, beliefs, attitudes, actions, and social connections) influence health. However, national health datasets have not kept up with the evolving needs of this cutting-edge area in public health. Sociological datasets that do contain such information, in turn, provide limited health information.
Findings
Our team has prospectively linked three decades of General Social Survey data to mortality information through 2008 via the National Death Index. In this paper, we describe the sample, the core elements of the dataset, and analytical considerations.
Conclusions
The General Social Survey-National Death Index (GSS-NDI), to be released publicly in October 2011, will help shape the future of social epidemiology and other frontier areas of public health research.


Addressing new paradigms in public health
As new knowledge is accumulated, the field of public health periodically undergoes paradigm shifts. For instance, the sanitation and clean water revolution was brought about in part by our understanding of the germ theory [1]. This, augmented by the development of vaccines, antibiotics, and removal of disease vectors, ultimately led to an epidemiologic transition from an environment in which infectious disease took lives at a young age to one in which chronic disease in older age became the major focus [2].
This new paradigm required an entirely new way of thinking about health and disease intervention. This challenge was met with longitudinal studies, such as the Framingham Heart Study, which ultimately showed that smoking, drinking, poor diet, and a sedentary lifestyle were major risk factors for poor health [3]. This realization in part led to the development of large national datasets, such as the National Health Interview Survey. Such surveys not only allow researchers to better understand and track health risks nationwide, but they are also valuable resources for generating and testing the hypotheses that move public health forward. These datasets have, in turn, helped researchers recognize that social factors, such as income and education, are among the most important determinants of population health [4,5]. This discovery provided a small but important push toward the current public health paradigm shift: the view that economic characteristics and social environments are key to human health and longevity [5].
However, a dataset has not yet been built that allows researchers to study the relationship between these characteristics and mortality. The General Social Survey-National Death Index (GSS-NDI) is an important first step toward creating a dataset that will help researchers understand how such social environments shape health.

The evolution of social epidemiology
Social risk factors for poor health, such as poverty, failing schools, community characteristics, and weak social ties, appear to be powerful determinants of one's health and life expectancy [5][6][7][8][9][10][11][12]. Such risks are often called "upstream risks" because they form the source of later health threats, and these later health threats, in turn, cascade into further health problems [13]. For example, low educational attainment is associated with behavioral risk factors later in life because of a poor understanding of those risks. Low educational attainment also makes it difficult to obtain a good job that affords health insurance and a home in a low-crime neighborhood. These risks can also ripple to further risks "downstream." Exposure to crime, for instance, can lead to psychological stress, which increases allostatic load and may, in turn, lead to heart disease [14]. Stress and crime victimization can also alter one's emotional states (e.g., lead to pessimism or misanthropy), further damaging one's health and overall sense of well-being [15][16][17]. Psychosocial risk factors have also been linked to immune and endocrine system disruption, putting us at risk for infections like the common cold and leading to premature aging of our cells [14,18,19]. In this way, a single event early in life can damage health and well-being over the entire life course, with one event leading to a cascade of other events. The net effect of this cascade is a shorter, less healthy life.
Psychological risks also extend outward into the social spheres around us. Perceived racism may cause health problems among minorities who feel stigmatized [20,21], but individuals who hold racist beliefs also create distress within the larger communities in which they live, and may therefore be harming themselves and nonminorities within their communities [22]. Even more intriguing, our social networks may not only determine whether we drink or smoke; they may even have a profound effect on the shape and size of our bodies [23,24]. Thus, we might infer that social environments influence the nature of our social ties, which in turn shape our health risks in complex ways.
The idea that social risks cause disease goes back well over a hundred years, but has only recently taken root [25]. In the mid-19th century, for instance, Rudolf Virchow posited that poor social policies were a major root cause of diseases such as typhus [26]. However, research on social factors as a cause of disease has been a piecemeal effort. This may be in part because linkages between income and health or education and health were never followed with more powerful survey tools for exploring the social causation of disease.

The need for a social epidemiology dataset
As they stand, our national health datasets were designed to identify basic correlates of disease, so that these risks can be better described and tracked. But they lack critical information on one's social position, social networks, thoughts, feelings, attitudes, beliefs, and participation in civil society.
If the GSS had been designed as a health dataset from the start it would have had the potential to advance social epidemiologic studies of health disparities beyond mere identification and description to a deeper understanding of the underlying mechanisms. It does contain a few health measures, such as self-rated health and basic information on behavioral risk factors such as smoking (albeit not in all years). However, it was primarily designed to answer traditional questions in sociology and political science. Fortunately, it is possible to link the GSS to prospective mortality data by cause of death for the years 1978 onward via the NDI, so it can be used to answer many new health questions. The GSS-NDI is rich in sociological variables and can provide some of the critical information necessary to better explore the relationship between mortality and social position, beliefs, social cognition, and social relationships.

Construction and content
The GSS-NDI links data from 18 waves of the GSS (1978 through 2002) to death certificate data from 1979 through 2008. We have not yet linked more recent years of the GSS to the NDI because very few deaths occur among subjects in the 4- to 6-year period following the survey. It is straightforward and inexpensive to create further linkages as additional mortality data become available.

Sampling
Up through 2006, the GSS sampled only English-speaking subjects 18 and over in the non-institutionalized population. Spanish was added as a survey language from that year onward, but Spanish-language interviews will not be available in the current GSS-NDI dataset. The GSS employs a multi-stage probability sample. Interviewers conduct face-to-face canvassing after 3 PM on weekdays and on weekends and holidays [27]. Over the 1978-2002 GSS surveys, response rates ranged from 70% to 82%. Information on nonrespondents is available from the National Opinion Research Center (NORC) at the University of Chicago, host to the GSS. This information can be used to test for non-response bias.

National Death Index matching
Each subject within the GSS from 1978 through 2002 has a paper record securely stored with identifying information at NORC. To generate a matching file for the NDI, it was necessary to pull paper records for the GSS. Quality control checks included a comparison of names in the household enumeration file with names in the questionnaire and a comparison of entered values with GSS public use data. NORC submitted 33,091 records from the GSS to the NDI for linkage. Of these records, 38 were rejected outright and dropped from the dataset because of incomplete or inaccurate information. An additional 223 records were dropped because there were multiple matches between GSS and NDI records. The final sample was therefore 32,830 subjects.
The NDI uses a probabilistic matching algorithm to link identifiers from user-provided data with death certificate information. This algorithm is designed to maximize the number of correct matches between a survey record and a death certificate, and to minimize the number of incorrect matches. The NDI can conduct multiple matches between record files and death certificates. By attempting various permutations of potential errors in a given record file, the probability of a match is increased.
Therefore, prior to submission of GSS records to the NDI, NORC created duplicates of the 660 GSS subject records that had an incomplete birth month, incomplete birth year, or a possible inconsistency in spelling of first or last name of the respondent, yielding a total of 9,824 duplications. These inconsistencies were identified only for those cases in which the paper record was initially entered incorrectly into the computer system (and subsequently caught on the second round of cross-checking), under the assumption that these permutations were common. In cases where the month of birth was missing, NORC created 12 copies of the original record, one with each of the twelve months inserted. Where the age did not match the birth year on the GSS, both birth years were entered.
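The duplication step described above can be illustrated with a minimal Python sketch. It is not NORC's actual procedure: the function name `expand_record`, the field names, and the example values are all hypothetical, and the sketch only assumes that one submitted copy is created for every combination of candidate birth month, birth year, and name spelling.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence


def expand_record(
    record_id: int,
    birth_month: Optional[int],
    birth_years: Sequence[int],
    name_spellings: Sequence[str],
) -> List[Dict[str, object]]:
    """Return one candidate record per plausible identifier combination.

    Hypothetical illustration: a missing birth month (None) is expanded
    into all twelve months, and every candidate birth year and name
    spelling is crossed with every month.
    """
    months = [birth_month] if birth_month is not None else list(range(1, 13))
    return [
        {"record_id": record_id, "birth_month": m, "birth_year": y, "name": n}
        for m, y, n in product(months, birth_years, name_spellings)
    ]


# A record with no birth month yields 12 copies, one per month.
copies = expand_record(1, None, [1940], ["Smith"])

# A record whose reported age and birth year disagree yields one copy per year.
two_years = expand_record(2, 6, [1940, 1941], ["Smith"])
```

Because the copies multiply, a handful of uncertain fields on one record can produce dozens of submissions, which is consistent with a few hundred records generating several thousand duplicates.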
One issue that affected the match between the GSS identifiers and those in the NDI is that the GSS does not have Social Security numbers for the majority of its subjects. In the GSS, the provision of one's Social Security number was voluntary, and only collected from 1993 onward. Of the GSS records returned as potential death certificate matches, only about 21% had a valid Social Security number.
Roughly 83 to 92 percent of deceased individuals, and 92 to 99 percent of living persons, would be expected to be correctly identified for datasets missing Social Security numbers [28]. The present dataset would be expected to have a higher success rate because Social Security numbers were available for some subjects. This not only improves the likelihood of correct matching for those subjects with Social Security numbers, but also provides an important internal check to help determine the extent to which missing data is or is not influencing the number of successful or incorrect matches.

Match assessment
Records flagged by the NDI as potential matches were not necessarily interpreted as deaths. Rather, because the NDI utilizes a lenient set of criteria to identify potential matches, these matches are regarded as a set containing both true and false death certificate matches. Of the 22,062 GSS records (or 67% of all records) that the NDI identified as having at least one potential death record match, 5,561 GSS records were matched with a single NDI record. However, 16,501 of the GSS records matched more than one NDI record, with one GSS case being linked to as many as 1,200 NDI records.
To guide the user in assessing the death record matches, the NDI provides a probabilistic matching score. This score was generated by the NDI using previous matching data in which the decedents were known, and provides a reliable way of ascertaining each participant's vital status. Using this algorithm, the NDI generates a suggested vital status (either dead or alive) for each matched NDI record along with the probabilistic matching score.
However, the NDI's matching algorithm uses a weighted set of key identifiers (Social Security number, first and last name, date of birth, gender, race, state of birth, and state of residence) to assign the probabilistic matching score. This procedure disproportionately weights the number of correct digits in the Social Security number, which was not appropriate for matching GSS records (because many such numbers were missing).
After providing a probabilistic matching score, the NDI produces a class rating that includes five possible classes, dependent on the level of agreement on specific identifiers (Table 1). According to this system, all class 1 matches are considered by the NDI to be an exact match (a death), and all class 5 matches are considered by the NDI to be a false match (alive).
For the GSS-linked records, there were no class 1 matches and a limited number of class 2 matches because of the lack of Social Security information and the fact that the GSS did not collect data on participants' state of birth. Consequently, even for those few records that matched exactly on Social Security number, lack of state of birth information demoted those from class 1 to class 2, and matches that would have met criteria for class 3 were assigned to class 4.
To correct for our missing matching variables, we used a modified cutoff score that lowered the threshold for obtaining a status 1 match (vital status 'deceased') by the exact amount that the NDI gives records matching on seven or eight digits of the Social Security number. Where the GSS records had Social Security number available, the NDI's suggested cutoff score was accepted as accurate.
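The modified cutoff can be expressed as a short Python sketch. The function name, the numeric values, and the use of `>=` at the boundary are assumptions made for illustration; the actual NDI cutoff scores and Social Security number weights are not given in this paper.

```python
def vital_status(score: float, ndi_cutoff: float,
                 partial_ssn_weight: float, has_ssn: bool) -> str:
    """Classify a candidate match as 'deceased' or 'alive'.

    Hypothetical illustration: when the submitted record has no Social
    Security number, the NDI-suggested cutoff is lowered by the weight
    the NDI assigns to a seven- or eight-digit SSN match. When an SSN
    is present, the NDI cutoff is used unchanged.
    """
    cutoff = ndi_cutoff if has_ssn else ndi_cutoff - partial_ssn_weight
    return "deceased" if score >= cutoff else "alive"


# Made-up numbers: the same score passes only under the lowered cutoff.
without_ssn = vital_status(30.0, ndi_cutoff=35.0, partial_ssn_weight=10.0, has_ssn=False)
with_ssn = vital_status(30.0, ndi_cutoff=35.0, partial_ssn_weight=10.0, has_ssn=True)
```

The design intent is that a record lacking an SSN is not penalized for points it could never have earned, while records with an SSN are still held to the NDI's own standard.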
To identify the correct match, we first selected the match with a combination of the highest class and score provided by the NDI. Of the 22,062 GSS records that the NDI linked with one or more NDI records, we identified 9,285 GSS records (or about 42% of matched records) as having a vital status of deceased. Of these GSS records identified as deaths, the score plus class evaluation flagged 6,504 NDI records (about 70%) as an exact match, but 30.5% of GSS records were linked to more than one NDI record with precisely the same score and class combination. To resolve these multiple matches, we examined the degree to which GSS identifiers agreed with NDI identifiers, and flagged those with the highest number of agreeing identifiers.
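The selection rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the `Candidate` type and its field names are hypothetical, "highest class" is taken to mean the lowest class number (class 1 being the strongest), and class is assumed to take precedence over score, with the count of agreeing identifiers used only to break ties.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    ndi_id: str        # hypothetical identifier of the NDI death record
    ndi_class: int     # 1 (strongest) through 5 (weakest)
    score: float       # NDI probabilistic matching score
    n_agreeing: int    # number of identifiers agreeing with the GSS record


def select_best(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Pick the best class, then the highest score, then the most agreeing identifiers."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c.ndi_class, -c.score, -c.n_agreeing))


# Two candidates tie on class and score; identifier agreement decides.
candidates = [
    Candidate("A", ndi_class=3, score=40.0, n_agreeing=6),
    Candidate("B", ndi_class=3, score=40.0, n_agreeing=8),
    Candidate("C", ndi_class=4, score=45.0, n_agreeing=9),
]
best = select_best(candidates)
```

If two candidates remain tied on all three criteria, this sketch simply returns the first one encountered; a production procedure would presumably flag such cases for manual review.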
The final version of the GSS-NDI dataset contains 32,830 total records, of which 9,271 are deaths. Table 2 shows the sample size and number of subjects identified as deceased for each wave of the GSS-NDI, and Table 3 shows the sample sizes and number of deceased individuals broken down by gender, race, age category and whether the subject was born in a foreign country.

Cause of death linkage
Of the 9,271 GSS records determined to have a vital status of 'deceased', 99.84% were linked to underlying cause of death information. Until 1999, the NDI provided users with cause of death coded under the Ninth Revision of the International Classification of Diseases (ICD-9), and under the Tenth Revision (ICD-10) thereafter. To unify the different coding schemes, we collapsed cause of death into 285 mutually exclusive categories using the single-level Clinical Classification Software (CCS) system.
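The idea of collapsing two coding schemes into one set of categories can be sketched in Python. The crosswalk below is a toy with two illustrative categories; it is not the CCS mapping, the function names are hypothetical, and the sketch assumes that deaths in 1999 and later are coded in ICD-10 and that a three-character code prefix is enough to assign a category.

```python
from typing import Dict, Optional, Tuple

# Toy crosswalk for illustration only; the real CCS tables are far larger.
TOY_CROSSWALK: Dict[Tuple[str, str], str] = {
    ("ICD-9", "410"): "acute myocardial infarction",
    ("ICD-10", "I21"): "acute myocardial infarction",
    ("ICD-9", "162"): "cancer of trachea, bronchus, or lung",
    ("ICD-10", "C34"): "cancer of trachea, bronchus, or lung",
}


def icd_revision(year_of_death: int) -> str:
    """Assumed rule: ICD-9 before 1999, ICD-10 from 1999 onward."""
    return "ICD-9" if year_of_death < 1999 else "ICD-10"


def unified_category(year_of_death: int, code: str) -> Optional[str]:
    """Map a cause-of-death code to a revision-independent category, or None."""
    key = (icd_revision(year_of_death), code.strip().upper()[:3])
    return TOY_CROSSWALK.get(key)


# The same condition coded under each revision lands in one category.
early = unified_category(1995, "410.9")
late = unified_category(2003, "I219")
```

Keying the lookup on the revision as well as the code matters because the same character string can mean different things under ICD-9 and ICD-10.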

Robustness check
Whereas the NDI's method for evaluating matches relied heavily on the Social Security number, our algorithm places emphasis on the other identifiers used in the match. Our objective was to ensure that our technique was conservative enough to eliminate false positives and provide accurate record matching. We tested this using a sub-sample of our matches: the 927 records that were classified as 'class 2' matches by the NDI's algorithm and thus had a valid Social Security number.
To perform our comparison, we took these matches and modified them to appear as though these records were missing Social Security numbers. This was achieved by reducing the probabilistic score of each match by the Social Security number weight. Next, we re-ran our evaluation method and selected the best match according to the criteria described above. This method identified 784 deaths out of the 927 records that were initially categorized as class 2 matches, proving to be somewhat more stringent than the NDI's evaluation scheme, which designates 860 of the class 2 matches as deaths.

Table 1 Definition of Matching Classes from the NDI
Class 1: Exact match on SSN (all nine digits), first name, middle initial, last name, sex, state of birth, birth month and birth year.
Class 2: SSN matches on at least seven digits and one or more of the other items from Class 1 may not match. Note: Some matched cases are moved from Class 2 to Class 5 because of an indication that the reported SSN belongs to the spouse. This includes those cases for which the SSN is known and matches, but the first name and sex do not agree.
Class 3: SSN unknown but eight or more of first name, middle initial, last name, birth day, birth month, birth year, sex, race, marital status, or state of birth match.
Class 4: Same as Class 3 but fewer than eight items match.
Class 5: SSN is known but doesn't match. Note: Some matched cases are moved from Class 5 to Class 3 because of an indication that one of the SSNs (on the user record or on the death certificate) may have been reported incorrectly but a significant number of other data items are in agreement.
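The class definitions in Table 1 can be read as a simple decision rule, sketched here in Python. This is a simplified reading and not the NDI's implementation: the function and field names are hypothetical, the reassignments described in the table's notes (spouse SSNs, misreported SSNs) are ignored, and an SSN agreeing on fewer than seven digits is assumed to count as "doesn't match".

```python
from typing import Optional, Set

CLASS1_ITEMS = ("first_name", "middle_initial", "last_name", "sex",
                "state_of_birth", "birth_month", "birth_year")
CLASS3_ITEMS = ("first_name", "middle_initial", "last_name", "birth_day",
                "birth_month", "birth_year", "sex", "race",
                "marital_status", "state_of_birth")


def ndi_class(ssn_digits_matching: Optional[int], agreeing: Set[str]) -> int:
    """Assign a Table 1 class from SSN agreement and the set of agreeing items.

    ssn_digits_matching is None when the SSN is unknown on the user record.
    """
    if ssn_digits_matching is None:
        n_items = len(agreeing & set(CLASS3_ITEMS))
        return 3 if n_items >= 8 else 4
    if ssn_digits_matching == 9 and all(i in agreeing for i in CLASS1_ITEMS):
        return 1
    if ssn_digits_matching >= 7:
        return 2
    return 5


# A full nine-digit SSN match without state of birth cannot reach class 1.
no_state = ndi_class(9, set(CLASS1_ITEMS) - {"state_of_birth"})
```

The example shows why a survey that never collected state of birth has no class 1 matches: that item is required for class 1, and it also removes one of the ten items that can count toward the eight needed for class 3.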

Utility and Discussion
The GSS-NDI represents a new and essential tool for social epidemiologists. The GSS-NDI will have time trends available for roughly 1000 variables, which will allow researchers to examine relationships between social factors and mortality over time. Because the sample is nationally representative, repeated variables also open the door to the creation of synthetic cohorts. (For example, subjects who are age 20-24 in 1980 can be thought of as subjects who are 25-29 in 1985.) Several thousand other variables are available in the GSS-NDI, although many of these were asked only during a particular wave of the GSS, and therefore do not have potential for time trend analyses.
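The synthetic cohort idea can be made concrete with a small Python sketch. The function name, the five-year band width, and the 1956 anchor year are illustrative assumptions, and birth year is approximated as survey year minus age. Respondents from different waves who fall into the same birth-year band are treated as members of one cohort.

```python
from typing import Tuple


def birth_cohort(survey_year: int, age: int,
                 width: int = 5, anchor: int = 1956) -> Tuple[int, int]:
    """Return the (first, last) birth year of the band containing this respondent.

    Hypothetical illustration: bands are `width` years wide and aligned so
    that one band starts at `anchor`.
    """
    birth_year = survey_year - age
    start = anchor + ((birth_year - anchor) // width) * width
    return (start, start + width - 1)


# Ages 20-24 in 1980 and ages 25-29 in 1985 share a single band.
cohort_1980 = {birth_cohort(1980, a) for a in range(20, 25)}
cohort_1985 = {birth_cohort(1985, a) for a in range(25, 30)}
```

Grouping each wave's respondents by this band and comparing group-level summaries across waves follows the same cohort over time, even though no individual is interviewed twice.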
In addition, the primary sampling unit information (a de-identified indicator of the geographic location of the interview, which can be employed for participant clustering) can be used for hierarchical linear models. We are currently in the process of setting up a unit that can manage requests for spatial data that cannot be released due to concerns surrounding the possibility of identifying a subject.
One variable, a four-point self-rated health scale (excellent, good, fair, and poor), is available for all years of the GSS-NDI in the public release. This will allow researchers to better probe questions of causality. For example, if we wished to examine the relationship between wealth and mortality, we could limit the analysis to respondents who report excellent or good health at the time of survey. This way, we would be more certain that we were measuring the effect of wealth on illness and not illness on wealth.
We plan on releasing the GSS-NDI to the general public in October of 2011. The dataset will be de-identified, and available for use by researchers everywhere. It has been granted approval from the Institutional Review Board at Columbia University. To ensure that subjects cannot be identified, only the year of the subject's birth and the de-identified primary sampling unit (rather than the subject's city or state of residence) will be available in the public release dataset.
In the meantime, a team of inter-disciplinary researchers is testing the dataset and developing statistical code that can be used as a toolkit for future analyses. Already underway are studies surrounding how changes in discrimination against lesbians and gays may be influencing their health, how the falling academic performance of males is influencing their health, and various research questions surrounding social capital and mortality, variations in mortality with economic cycles, and the underlying nature of the religion-mortality association.

Conclusions
The newly constructed GSS-NDI dataset was designed to meet the needs of social epidemiologists investigating emerging areas of public health. Researchers who have exhausted the capabilities of our currently available national health datasets ought to find the dataset invaluable in furthering their studies.

Availability and requirements
The dataset is expected to be publicly available in October 2011 from NORC.