Evaluation of self-reported ethnicity in a case-control population: the stroke prevention in young women study
© Cole et al; licensee BioMed Central Ltd. 2009
Received: 13 May 2009
Accepted: 18 December 2009
Published: 18 December 2009
Population-based association studies are used to identify common susceptibility variants for complex genetic traits. These studies are susceptible to confounding from unknown population substructure. Here we apply a model-based clustering approach to our case-control study of stroke among young women to examine if self-reported ethnicity can serve as a proxy for genetic ancestry.
A population-based case-control study of stroke among women aged 15-49 identified 361 cases of first ischemic stroke and 401 age-comparable control subjects. Thirty single nucleotide polymorphisms (SNPs) throughout the genome unrelated to stroke risk and with established ancestry-based allele frequency differences were genotyped in all participants. The Structure program was used to iteratively evaluate for K = 1 to 5 potential genetic-based subpopulations. Evaluating the population as a whole, the Structure output plateaued at K = 2 clusters. 98% of self-reported Caucasians had an estimated probability ≥50% of belonging to Cluster 1, while 94% of self-reported African-Americans had an estimated probability ≥50% of belonging to Cluster 2. Stratifying the participants by self-reported ethnicity and repeating the analyses revealed the presence of two clusters among Caucasians, suggesting that potential substructure may exist.
Among our combined sample of African-American and Caucasian participants there is no large unknown subpopulation and self-reported ethnicity can serve as a proxy for genetic ancestry. Ethnicity-specific analyses indicate that population substructure may exist among the Caucasian participants indicating that further studies are warranted.
Population-based case-control studies are used to identify common susceptibility variants for complex genetic traits; however, population stratification may confound their results [1, 2]. Population stratification refers to differences in allele frequencies between cases and controls that arise from systematic differences in ancestry rather than from association of an allele with disease. To reduce the impact of population stratification, cases and controls are ascertained from the same population and matched on self-reported ethnicity. Some studies indicate that stratifying by self-reported ethnicity (i.e., race) may not adequately adjust for population stratification, specifically in outbred United States populations. A panel of genetic markers specific to ancestry and unlinked to the disease can be used to evaluate whether self-reported ethnicity can serve as a proxy for genetic ancestry or relatedness. The literature suggests that a panel of ~20-40 appropriately chosen markers (SNPs or microsatellites) is sufficient for group-based ancestry estimation, but not for individual ancestry estimation. In this report, we genotyped 30 markers selected because of their differing allele frequencies between European Caucasians and Nigerians (Yoruba). We used these markers to determine whether self-reported ethnicity can accurately approximate ancestry in a large biracial population of stroke cases and controls.
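The confounding described above can be illustrated with a small worked example. The sketch below is not drawn from the study data: the two subpopulations and all counts are hypothetical, chosen so that the allele has no effect on disease within either subpopulation while the pooled sample still shows an apparent association.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds ratio for allele carriage in cases versus controls (2x2 table)."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)

# Hypothetical counts per subpopulation:
# (case carriers, case non-carriers, control carriers, control non-carriers)
strata = {
    "subpop_A": (80, 20, 160, 40),  # allele common (80%), 100 cases : 200 controls
    "subpop_B": (10, 40, 40, 160),  # allele rare (20%), 50 cases : 200 controls
}

# Within each subpopulation the allele is unrelated to disease.
stratum_or = {name: odds_ratio(*counts) for name, counts in strata.items()}

# Pooling ignores ancestry: sum each cell across subpopulations.
pooled_counts = tuple(sum(cell) for cell in zip(*strata.values()))
pooled_or = odds_ratio(*pooled_counts)

print(stratum_or)  # odds ratio of 1.0 in both subpopulations
print(pooled_or)   # 1.5 in the pooled sample
```

The pooled odds ratio departs from 1 only because allele frequency and the case-to-control ratio both differ between the two hypothetical subpopulations, which is the situation that matching on ancestry is meant to prevent.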
Materials and methods
Table 1. Characteristics by case-control status
Characteristic: Case (N = 361) / Control (N = 401)
Mean age (years): 39.5 ± 0.4 / 37.8 ± 0.4
African American (%): 51.5 / 42.6
Diabetes mellitus (%)
Current smokers (%)
SNP selection and genotyping
Twenty ancestry-informative markers (i.e., SNPs) were chosen from a HapMap panel previously shown to differ (χ² > 10) in allele frequencies between individuals from Utah with European ancestry (CEU) and individuals from Nigeria (YRI). Ten additional SNPs were similarly selected from the Linkage IVb panel (Illumina, San Diego, CA).
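As an illustration of the χ² > 10 selection criterion, the sketch below computes a Pearson χ² statistic (1 degree of freedom, no continuity correction) on a 2x2 table of allele counts. The source does not state the exact form of the test used, so this form is an assumption, and the allele counts are hypothetical rather than HapMap values.

```python
def allele_chi_square(major_a, minor_a, major_b, minor_b):
    """Pearson chi-square (1 df, no continuity correction) for a 2x2 table
    of major/minor allele counts in two reference populations."""
    n = major_a + minor_a + major_b + minor_b
    numerator = n * (major_a * minor_b - minor_a * major_b) ** 2
    denominator = (
        (major_a + minor_a)      # chromosomes in population A
        * (major_b + minor_b)    # chromosomes in population B
        * (major_a + major_b)    # total major alleles
        * (minor_a + minor_b)    # total minor alleles
    )
    return numerator / denominator

# Hypothetical allele counts (120 chromosomes per population).
informative = allele_chi_square(100, 20, 30, 90)    # frequencies 0.83 vs 0.25
uninformative = allele_chi_square(60, 60, 58, 62)   # frequencies 0.50 vs 0.48

THRESHOLD = 10  # selection criterion quoted in the text
print(informative > THRESHOLD)    # True: the marker would be retained
print(uninformative > THRESHOLD)  # False: the marker would be rejected
```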
Genotyping was conducted using DNA isolated from whole blood with the QIAamp DNA Blood Maxi Kit (Qiagen, Valencia, CA). SNP genotyping was performed by either TaqMan (Applied Biosystems, Foster City, CA) or iPLEX (Sequenom, San Diego, CA) methodologies. For each SNP, genotyping for all cases and controls was performed on the same platform.
Table 2. Twenty-seven SNPs listed by chromosomal location, including genotype call rates, ethnicity-specific allele frequencies, and relative difference between major allele frequencies.
Columns: Major allele frequency, Caucasians; Major allele frequency, African Americans; Major allele frequency difference between Caucasians and African Americans, χ² (p-value)
SNPs: RS1824347 A/G Ψ; RS877826 A/C Ψ; RS1928533 C/T Ψ; RS1016461 C/T Ψ; RS1538956 G/T Ψ; RS1888952 C/T Ψ; RS898271 A/G Ψ
Model-based clustering for inferring population structure was performed using the Structure software. An admixture ancestry model was chosen to estimate the likelihood that the observed genotypic data corresponded to K = 1 to 5 underlying subpopulations. Per standard Structure procedures, missing genotypes were retained in the input data. The "burn-in period" and the number of Markov chain Monte Carlo repetitions after "burn-in" were each set to 10,000; summary statistics converged at these values. For each K, the estimated natural log of the probability of the data given K clusters (log Pr (X | K)) was generated. Similar analyses were also performed within each self-reported ethnicity.
The ANCESTDIST (Boolean) function of Structure was implemented to assess information about the distribution of Q, the estimated membership coefficients for each individual in each cluster. When this function is activated, the output file includes the left- and right-hand ends of the probability intervals for each q(i). (A probability interval is the Bayesian analog of a confidence interval.)
Demographic and risk factor characteristics by case-control status are described in Table 1. The mean age of the cases was 39.5 years and the mean age of control subjects was 37.8 years. Among cases, 51.5% were African American and among controls, 42.6% were African American. Cases were significantly more likely than controls to have a history of hypertension (p < .0001), diabetes (p < .0001), angina-MI (p < .0001), and to currently smoke cigarettes (p < .0001).
Table 2 lists the SNPs by chromosomal location, including genotype call rates, ethnicity-specific major allele frequencies and resultant χ2 comparison values.
Table 3. Structure inference algorithm output (log Pr (X | K), denoted Ln Prob) with the Dirichlet parameter (α), estimating the number of populations (K) in our sample, for K = 1 to 5.
1) log Pr (X | K) plateaus at K = 2.
2) Dirichlet parameter for amount of admixture (α) converges to a value < 0.2 once the Markov chain converges.
3) Most individuals are strongly assigned to one of the two populations.
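The first criterion above, a plateau in log Pr (X | K), can be made concrete with a simple heuristic. The sketch below is an assumption for illustration and not a procedure defined by Structure or reported by the authors: it returns the smallest K after which the gain in log Pr (X | K) falls below a chosen tolerance. The function name, the tolerance, and the Ln Prob values are all hypothetical.

```python
def plateau_k(ln_prob_by_k, tolerance):
    """Return the smallest K after which log Pr(X | K) improves by less
    than `tolerance`, or None if no such plateau is found."""
    ks = sorted(ln_prob_by_k)
    for k, next_k in zip(ks, ks[1:]):
        gain = ln_prob_by_k[next_k] - ln_prob_by_k[k]
        if gain < tolerance:
            return k
    return None

# Hypothetical Ln Prob values for K = 1 to 5 (not the study's Table 3 values).
with_plateau = {1: -21000.0, 2: -19500.0, 3: -19498.0, 4: -19499.0, 5: -19502.0}
still_rising = {1: -9000.0, 2: -8900.0, 3: -8800.0, 4: -8700.0, 5: -8600.0}

print(plateau_k(with_plateau, tolerance=10.0))  # 2
print(plateau_k(still_rising, tolerance=10.0))  # None
```

A heuristic of this kind addresses only the first criterion; the behavior of α and the strength of individual assignments still need to be inspected separately.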
The Structure ANCESTDIST option provided the 90% probability intervals for each individual. Of the 760 individuals, 130 (17%) have overlapping probability intervals. Hence, 83% of the study population demonstrated individual ancestry proportion estimates that had non-overlapping 90% probability intervals.
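The text does not define what makes an individual's probability intervals "overlapping". One possible reading, assumed here, is that with K = 2 the 90% interval for membership in Cluster 1 overlaps the interval for Cluster 2. Because the two membership coefficients sum to 1, this is equivalent to the Cluster 1 interval containing 0.5. The sketch below uses hypothetical interval values and hypothetical function names.

```python
def intervals_overlap(lo1, hi1, lo2, hi2):
    """True if the closed intervals [lo1, hi1] and [lo2, hi2] share a point."""
    return lo1 <= hi2 and lo2 <= hi1

def is_ambiguous(q1_interval):
    """For K = 2, derive the Cluster 2 interval from the Cluster 1 interval
    (q2 = 1 - q1) and test whether the two intervals overlap."""
    lo, hi = q1_interval
    return intervals_overlap(lo, hi, 1 - hi, 1 - lo)

# Hypothetical 90% probability intervals for Cluster 1 membership.
print(is_ambiguous((0.85, 1.00)))  # False: clearly assigned to Cluster 1
print(is_ambiguous((0.00, 0.10)))  # False: clearly assigned to Cluster 2
print(is_ambiguous((0.30, 0.70)))  # True: interval spans 0.5

# Proportion reported in the text: 130 of 760 individuals.
percent_overlapping = round(100 * 130 / 760)
print(percent_overlapping)  # 17
```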
Ethnicity-specific exploratory analyses (Table 3) indicate that some further substructure may be present among the self-reported Caucasians, as log Pr (X | K) plateaus at K = 2 and α converges to a value < 0.2. When K = 2 among Caucasians alone, individuals distribute unevenly between the two clusters, with 40% belonging to one cluster and 60% belonging to the other (data not shown). No further substructure was identified in our population of self-reported African-Americans, as log Pr (X | K) does not plateau for K = 1 to 5 and α diverges.
Our results indicate that among the combined sample of African-American and Caucasian participants, self-reported ethnicity can serve as a proxy for genetic ancestry or relatedness. Furthermore, no large unknown subpopulation was identified. The ethnicity-specific analyses demonstrate no clear substructure in self-reported African-American participants. This differs from the accepted idea that greater genetic diversity, as measured by linkage disequilibrium, is seen in populations of African origin. The lack of substructure in our African-American participants may be related to limitations of our panel. Interestingly, the ethnicity-specific analyses do demonstrate that some population substructure may exist among self-reported Caucasian participants. Evaluation of substructure in Americans of European descent has shown a coarse separation of European populations along a northeast to southwest axis. In this light, the heterogeneity of our urban-based Caucasian population may partially explain the substructure present in our Caucasian participants. Notably, there are plans for the SPYW population to be part of a genome-wide association study (GWAS) for ischemic stroke, thereby providing many more SNPs to better characterize the substructure of both the Caucasian and African-American participants. Another limitation of our study was the relatively low call rates, most notable for SNPs genotyped via the TaqMan platform. However, this should not have influenced our results because call rates did not differ significantly between cases and controls or between self-reported African Americans and Caucasians (data not shown).
In summary, among the combined population, a small number of individuals were genetically more consistent with the other ancestry. Specifically, with a 50% ancestry threshold, 22 self-reported African-Americans were more consistent with Caucasian ancestry, while 10 self-reported Caucasians were more consistent with African-American ancestry. This information may be incorporated into future association analyses in various ways. Individuals not satisfying an ethnicity-based ancestry threshold could simply be removed from the study. Alternatively, as mentioned above, more null markers could be genotyped to improve the ancestry classification. Lastly, a variable incorporating percentage of ancestry could be introduced into the association analyses.
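The 50% ancestry threshold described above amounts to a cross-tabulation of self-reported ethnicity against the cluster to which each individual is assigned. The following is a minimal sketch under stated assumptions: the records, the function names, and the convention that a Cluster 1 membership estimate of at least 0.5 means assignment to Cluster 1 are all illustrative, not taken from the study's analysis files.

```python
from collections import Counter

def assign_cluster(q_cluster1, threshold=0.5):
    """Assign Cluster 1 if the estimated Cluster 1 membership meets the
    threshold, otherwise Cluster 2 (K = 2 assumed)."""
    return 1 if q_cluster1 >= threshold else 2

def cross_tabulate(records, threshold=0.5):
    """Count individuals by (self-reported ethnicity, assigned cluster)."""
    table = Counter()
    for self_report, q_cluster1 in records:
        table[(self_report, assign_cluster(q_cluster1, threshold))] += 1
    return table

# Hypothetical (self-reported ethnicity, estimated Cluster 1 membership) pairs.
records = [
    ("Caucasian", 0.97),
    ("Caucasian", 0.91),
    ("Caucasian", 0.35),         # more consistent with the other cluster
    ("African-American", 0.04),
    ("African-American", 0.12),
    ("African-American", 0.62),  # more consistent with the other cluster
]

table = cross_tabulate(records)
for key in sorted(table):
    print(key, table[key])
```

The same table could be recomputed with a stricter threshold to identify individuals for exclusion, or the membership estimate itself could be carried forward as a covariate, in line with the options listed above.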
Among our combined sample of African-American and Caucasian participants there is no large unknown subpopulation and self-reported ethnicity can serve as a proxy for genetic ancestry or relatedness. Ethnicity-specific analyses indicate that population substructure may exist among the Caucasian participants indicating that further studies are warranted.
We are indebted to the following members of the Stroke Prevention in Young Women research team for their dedication: Esther Berrent, Kathleen Caubo, Julia Clark, Mohammed Huq, Ann Maher, Tamar Pair, Mary Simmons, Mary J. Sparks, Mark Waring, Mark Dobbins, Latasha Williams, and Nancy Zappala.
The authors would like to acknowledge the assistance of the following individuals who have sponsored the Stroke Prevention in Young Women Study at their institution: Clifford Andrew, MD; Merrill Ansher, MD; Brian Avin, MD; Harjit Bajaj, MD; Robert Baumann, MD; Nicholas Buendia, MD; Young Ja Cho, MD; Kevin Crutchfield, MD; Terry Detrich, MD; Mohammed Dughly, MD; Boyd Dwyer, MD; Jerold Fleishman, MD; Stuart Goodman, MD, PhD; Adrian Goldszmidt, MD; Kalpana Hari Hall, MD; Aleem Iqbal, MD; Walid Kamsheh, MD; Andrew Keenan, MD; John Kelly, MD; Harry Kerasidis, MD; Mehrullah Khan, MD; Ramesh Khurana, MD; Ruediger Kratz, MD; Somchai Laowattana, MD; William Leahy, MD; Alan Levitt, MD; Bruce Lobar, MD; Paul Melnick, MD; Harshad Mody, MD; Seth Morgan, MD; Howard Moses, MD; Francis Mwaisela, MD; Sivarama Nandipati, MD; Maciej Poltorak, MD; Thaddeus Pula, MD; Phillip Pulaski, MD; Neelupali Reddy, MD; Perry Richardson, MD; Solomon Robbins, MD; Michael Sellman, MD, PhD; Jack Syme, MD; Richard Taylor, MD; Dean Tippett, MD; Michael Weinrich, MD; Roger Weir, MD; Richard Weisman, MD; Laurence Whicker, MD; Robert Wityk, MD; James Yan, MD; and Manuel Yepes, MD.
In addition, the study could not have been completed without the support from the administration and medical records staff at the following institutions: In Maryland: Anne Arundel Medical Center; Bon Secours Hospital; Calvert Memorial Hospital; Carroll County General Hospital; Chester River Hospital; Civista Medical Center; Department of Veterans Affairs Medical Center in Baltimore; Doctors Community Hospital; Dorchester Hospital; Franklin Square Hospital Center; Frederick Memorial Hospital; Good Samaritan Hospital; Greater Baltimore Medical Center; Harbor Hospital Center; Hartford Memorial Hospital; Holy Cross Hospital; Howard County General Hospital; Johns Hopkins Bayview; The Johns Hopkins Hospital; Kernan Hospital; Laurel Regional Hospital; Maryland General Hospital; McCready Memorial Hospital; Memorial Hospital at Easton; Mercy Medical Center; Montgomery General Hospital; North Arundel Hospital; Northwest Hospital Center; Peninsula Regional Medical Center; Prince George's Hospital Center; Saint Agnes Hospital; Saint Joseph Medical Center; Saint Mary's Hospital; Shady Grove Adventist Hospital; Sinai Hospital of Baltimore; Southern Maryland Hospital Center; Suburban Hospital; Union Hospital Cecil County; The Union Memorial Hospital; University of Maryland Medical System; Upper Chesapeake Medical Center; Washington Adventist Hospital and Washington County Hospital; in Washington DC: The George Washington University Medical Center; Georgetown University Hospital; Hadley Memorial Hospital; Howard University Hospital; National Rehabilitation Hospital; Providence Hospital; Sibley Memorial Hospital; and the Washington Hospital Center; in Pennsylvania: Gettysburg Hospital.
Dr. Cole was supported in part by the Department of Veterans Affairs, Baltimore, Office of Research and Development, Medical Research Service; the Department of Veterans Affairs Stroke Research Enhancement Award Program; the University of Maryland General Clinical Research Center (Grant M01 RR 165001), General Clinical Research Centers Program, National Center for Research Resources, NIH; and an American Heart Association Beginning Grant-in-Aid (Grant 0665352U). Dr. Kittner was supported in part by the Department of Veterans Affairs, Baltimore, Office of Research and Development, Medical Research Service, and Geriatrics Research, Education and Clinical Center, and Stroke Research Enhancement Award Program; a Cooperative Agreement with the Division of Adult and Community Health, Centers for Disease Control and Prevention; the National Institute of Neurological Disorders and Stroke and the NIH Office of Research on Women's Health; the National Institute on Aging Pepper Center (Grant P60 12583); and the University of Maryland General Clinical Research Center (Grant M01 RR 165001), General Clinical Research Centers Program, National Center for Research Resources, NIH. Dr. Sorkin was supported by the Baltimore VA Medical Center, Office of Research and Development, Medical Research Service, and Geriatrics Research, Education, and Clinical Center; the University of Maryland Claude D. Pepper Older Americans Independence Center; the Clinical Nutrition Research Unit of the University of Maryland; and the Baltimore VA Medical Center, Center for Excellence in Robotics.
- Thomas DC, Witte JS: Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev. 2002, 11: 505-12.
- Burnett MS, Strain KJ, Lesnick TG, de Andrade M, Rocca WA, Maraganore DM: Reliability of self-reported ancestry among siblings: implications for genetic association studies. Am J Epidemiol. 2006, 163: 486-92. 10.1093/aje/kwj057.
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
- Pritchard JK, Rosenberg NA: Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999, 65: 220-228. 10.1086/302449.
- Kittner SJ, Stern BJ, Wozniak M, Buchholz DW, Earley CJ, Feeser BR, Johnson CJ, Macko RF, McCarter RJ, Price TR, Sherwin R, Sloan MA, Wityk RJ: Cerebral infarction in young adults: The Baltimore-Washington cooperative young stroke study. Neurology. 1998, 50: 890-894.
- Adams HP, Bendixen BH, Kappelle LJ, Biller J, Love BB, Gordon DL, Marsh EE: Classification of subtype of acute ischemic stroke. Definitions for use in a multicenter clinical trial. TOAST. Trial of Org 10172 in Acute Stroke Treatment. Stroke. 1993, 24: 35-41.
- Shriver MD, Parra EJ, Dios S, Bonilla C, Norton H, Jovel C, Pfaff C, Jones C, Massac A, Cameron N, Baron A, Jackson T, Argyropoulos G, Jin L, Hoggart CJ, McKeigue PM, Kittles RA: Skin pigmentation, biogeographical ancestry and admixture mapping. Hum Genet. 2003, 112: 387-399.
- Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P, Seligsohn U, Waliszewska A, Schirmer C, Ardlie K, Ramos A, Nemesh J, Arbeitman L, Goldstein DB, Reich D, Hirschhorn JN: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4: e236. 10.1371/journal.pgen.0030236.