Evaluation of self-reported ethnicity in a case-control population: the stroke prevention in young women study

Background
Population-based association studies are used to identify common susceptibility variants for complex genetic traits. These studies are susceptible to confounding from unknown population substructure. Here we apply a model-based clustering approach to our case-control study of stroke among young women to examine if self-reported ethnicity can serve as a proxy for genetic ancestry.

Findings
A population-based case-control study of stroke among women aged 15-49 identified 361 cases of first ischemic stroke and 401 age-comparable control subjects. Thirty single nucleotide polymorphisms (SNPs) throughout the genome unrelated to stroke risk and with established ancestry-based allele frequency differences were genotyped in all participants. The Structure program was used to iteratively evaluate for K = 1 to 5 potential genetic-based subpopulations. Evaluating the population as a whole, the Structure output plateaued at K = 2 clusters. 98% of self-reported Caucasians had an estimated probability ≥50% of belonging to Cluster 1, while 94% of self-reported African-Americans had an estimated probability ≥50% of belonging to Cluster 2. Stratifying the participants by self-reported ethnicity and repeating the analyses revealed the presence of two clusters among Caucasians, suggesting that potential substructure may exist.

Conclusions
Among our combined sample of African-American and Caucasian participants there is no large unknown subpopulation and self-reported ethnicity can serve as a proxy for genetic ancestry. Ethnicity-specific analyses indicate that population substructure may exist among the Caucasian participants, indicating that further studies are warranted.


Introduction
Population-based case-control studies are used to identify common susceptibility variants for complex genetic traits; however, population stratification may confound their results [1,2]. Population stratification refers to differences in allele frequencies between cases and controls due to systematic differences in ancestry, rather than association of an allele with disease. To reduce the impact of population stratification, cases and controls are ascertained from the same population and matched on self-reported ethnicity. Some studies indicate that stratifying by self-reported ethnicity (i.e. race) may not adequately adjust for population stratification, specifically in out-bred United States populations [2]. A panel of genetic markers specific to ancestry and unlinked to the disease can be used to evaluate whether self-reported ethnicity can serve as a proxy for genetic ancestry or relatedness [3]. The literature suggests that a panel composed of ~20-40 appropriately chosen markers (SNPs or microsatellites) is sufficient for group-based ancestry estimation [4], but not for individual ancestry estimation. In this report, we genotyped 30 markers selected because of their differing allele frequencies between European Caucasians and Nigerians (Yoruba). We used these markers to determine whether self-reported ethnicity can accurately approximate ancestry in a large biracial population of stroke cases and controls.

Study population
The Stroke Prevention in Young Women (SPYW) Study is a population-based case-control study initiated to examine risk factors for first ischemic stroke in women aged 15-49. All participants were identified from the same population, including all of Maryland (except the far Western panhandle), Washington DC, and the southern portions of both Pennsylvania and Delaware. The methods for discharge surveillance, chart abstraction, and case adjudication have been described previously [5]. We determined each subject's case-control status (i.e. whether the subject had a stroke) blinded to genetic information. Strokes were further classified by subtype according to TOAST (Trial of Org 10172 in Acute Stroke Treatment) [6], including thrombosis or embolism due to atherosclerosis of a large artery (N = 16), embolism of cardiac origin (N = 69), occlusion of a small blood vessel (N = 45), other determined cause (N = 43), and undetermined cause (two possible causes, no cause identified, or incomplete investigation) (N = 188). Control subjects (women without a history of stroke) were identified by random digit dialing and were frequency matched to the cases by age, race, and geographic region of residence. The present analysis includes 762 subjects (361 cases and 401 controls) from this study who self-identified as Caucasian (non-Hispanic) (N = 405) or African-American (N = 357) (see Table 1).

SNP selection and genotyping
Twenty ancestry informative markers (i.e. SNPs) were chosen from a HapMap panel previously shown to differ (χ 2 > 10) in allele frequencies between individuals from Utah with European ancestry (CEU) and individuals from Nigeria (YRI) [7]. Ten additional SNPs were similarly selected from the Linkage IVb panel (Illumina, San Diego, CA).
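To illustrate the kind of χ 2 > 10 criterion described above, the following minimal Python sketch computes a 1-df Pearson χ 2 statistic from allele counts in two reference populations. It is an illustration only, not the procedure used in [7]; the function name and the allele counts in the usage note are hypothetical.

```python
def allele_chi_square(a1, b1, a2, b2):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts.

    a1, b1: counts of allele A and allele B in population 1
    a2, b2: counts of allele A and allele B in population 2
    """
    n = a1 + b1 + a2 + b2
    row1, row2 = a1 + b1, a2 + b2
    col_a, col_b = a1 + a2, b1 + b2
    denominator = row1 * row2 * col_a * col_b
    if denominator == 0:
        # An empty row or column means the marker is uninformative here.
        raise ValueError("every row and column of the table must be non-empty")
    return n * (a1 * b2 - b1 * a2) ** 2 / denominator
```

With hypothetical counts, allele_chi_square(90, 10, 10, 90) gives 128.0, which would pass a χ 2 > 10 filter, whereas identical allele frequencies in both populations give 0.0.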
Genotyping was conducted using DNA isolated from whole blood using the QIAamp DNA Blood Maxi Kit (Qiagen, Valencia, CA). SNP genotyping was performed by either TaqMan (Applied Biosystems, Foster City, CA) or iPLEX (Sequenom, San Diego, CA) methodologies. For each SNP, genotyping for all cases and controls was performed on the same platform.
Following genotyping, four SNPs were excluded from the analyses: three SNPs (rs1021516, rs1648282, rs1011526) exhibited genotype call rates less than 80% and one SNP (rs2695) did not exhibit a difference in allele frequencies between our Caucasian and African-American populations. Hence, 26 SNPs distributed throughout the genome were included in the analyses (Table 2), with 7 of the SNPs genotyped via TaqMan and 19 via iPLEX. All SNPs were verified to be unassociated with stroke (additive model) in the total population and stratified by race. All SNPs were verified to be in Hardy-Weinberg equilibrium (χ 2 test). Major allele frequency differences between self-reported Caucasians and African-Americans were calculated (χ 2 test). Analyses were performed using SAS®, Version 9.1 (SAS Institute, Cary, NC) (Tables 1 and 2).
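The analyses in this study were run in SAS. For readers who want to see the logic of a Hardy-Weinberg check, the minimal Python sketch below performs a 1-df χ 2 goodness-of-fit test from genotype counts. It is an illustration under stated assumptions, not the authors' code; the function name and the genotype counts in the usage note are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi2, p_value) for a 1-df Hardy-Weinberg goodness-of-fit test.

    n_aa, n_ab, n_bb: observed counts of the AA, AB and BB genotypes.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p                      # estimated frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic marker).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, hypothetical counts of 25, 50 and 25 match Hardy-Weinberg expectations exactly (χ 2 = 0, p = 1), while 50, 0 and 50 (no heterozygotes) give χ 2 = 100 and a vanishingly small p-value.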

Analyses
Model-based clustering for inferring population structure was performed using Structure software [3]. An admixture ancestry model was chosen to estimate the likelihood that the observed genotypic data corresponded to K = 1 to 5 underlying subpopulations. Per standard Structure procedures, missing genotypes were still inputted. The "burn-in period" and the number of Markov Chain Monte Carlo repetitions after "burn-in" were each chosen to be 10,000. Summary statistics converged for these values. For each K, the estimated Ln of the probability of K clusters (log Pr (X | K)) was generated. Similar self-reported ethnicity-specific analyses were also performed.
The ANCESTDIST (Boolean) function of Structure was implemented to assess information about the distribution of Q, the estimated membership coefficients for each individual in each cluster. When this function is activated, the output file includes the left- and right-hand ends of the probability intervals for each q(i). (A probability interval is the Bayesian analog of a confidence interval.)

Findings
Demographic and risk factor characteristics by case-control status are described in Table 1. The mean age of the cases was 39.5 years and the mean age of control subjects was 37.8 years. Among cases, 51.5% were African-American and among controls, 42.6% were African-American. Cases were significantly more likely than controls to have a history of hypertension (p < .0001), diabetes (p < .0001), angina-MI (p < .0001), and to currently smoke cigarettes (p < .0001). Table 2 lists the SNPs by chromosomal location, including genotype call rates, ethnicity-specific major allele frequencies and resultant χ 2 comparison values. Table 3 details Structure output (log Pr (X | K) (denoted in Table 3 as Ln Prob) and Dirichlet parameter (α)) estimating the number of subpopulations (K) in our sample, K = 1 to 5. Results for the combined and ethnicity-specific analyses are presented. For the combined population, two subpopulations are likely because:
1) log Pr (X | K) plateaus at K = 2.
2) The Dirichlet parameter for the amount of admixture (α) converges to a value < 0.2 once the Markov chain converges.
3) Most individuals are strongly assigned to one of the two populations.
Figure 1 graphically demonstrates, for K = 2 clusters, the estimated probability of self-reported Caucasians and African-Americans belonging to each cluster. In summary, 98% of self-reported Caucasians had an estimated probability ≥50% of belonging to cluster 1, while 94% of self-reported African-Americans had an estimated probability ≥50% of belonging to cluster 2. Further, 81% of self-reported Caucasians and 68% of self-reported African-Americans had an estimated probability ≥90% of belonging to clusters 1 and 2, respectively.
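The percentages above are fractions of individuals whose estimated membership coefficient for the expected cluster meets a threshold (≥50% or ≥90%). A minimal Python sketch of that tabulation follows; the function name and the membership values in the usage note are hypothetical and are not taken from the study data.

```python
def fraction_assigned(q_values, threshold):
    """Fraction of individuals whose membership coefficient q for a given
    cluster is at least `threshold` (for example 0.5 or 0.9)."""
    if not q_values:
        raise ValueError("q_values must not be empty")
    return sum(q >= threshold for q in q_values) / len(q_values)
```

For four hypothetical individuals with q = 0.99, 0.95, 0.60 and 0.40 for cluster 1, fraction_assigned(q, 0.5) is 0.75 and fraction_assigned(q, 0.9) is 0.5.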
The Structure ANCESTDIST option provided the 90% probability intervals for each individual. Of the 760 individuals, 130 (17%) had overlapping probability intervals. Hence, 83% of the study population demonstrated individual ancestry proportion estimates that had non-overlapping 90% probability intervals.
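The text does not spell out the overlap criterion. The minimal Python sketch below assumes that, for K = 2, an individual is counted as "overlapping" when the 90% probability interval for membership in cluster 1 intersects the interval for membership in cluster 2. The function names and the intervals in the usage note are hypothetical, and reading the intervals from Structure output is not shown.

```python
def intervals_overlap(interval_a, interval_b):
    """True if two closed intervals (low, high) share at least one point."""
    low_a, high_a = interval_a
    low_b, high_b = interval_b
    return low_a <= high_b and low_b <= high_a


def count_overlapping(individuals):
    """Count individuals whose cluster-1 and cluster-2 intervals overlap.

    individuals: iterable of (cluster1_interval, cluster2_interval) pairs,
    each interval given as a (low, high) tuple.
    """
    return sum(intervals_overlap(c1, c2) for c1, c2 in individuals)
```

Under this assumption, a hypothetical individual with intervals (0.85, 1.00) and (0.00, 0.15) is clearly assigned, while one with (0.30, 0.70) for both clusters is counted as overlapping.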
Ethnicity-specific exploratory analyses (shown in Table 3) indicate that some further substructure may be present among the self-reported Caucasians, as log Pr (X | K) plateaus at K = 2 and α converges to a value < 0.2. When K = 2 among Caucasians alone, individuals distribute unevenly between the two clusters, with 40% belonging to one cluster and 60% belonging to the other (data not shown). No further substructure was identified in our population of self-reported African-Americans, as log Pr (X | K) does not plateau for K = 1 to 5 and α diverges.
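The plateau criterion was applied here by inspecting Table 3. As one way to make it explicit, the minimal Python sketch below returns the smallest K after which log Pr (X | K) stops improving by more than a chosen amount. The function name, the min_gain tolerance and the values in the usage note are hypothetical; they are not the Table 3 results, and this rule is not part of Structure itself.

```python
def plateau_k(ln_prob_by_k, min_gain=10.0):
    """Return the smallest K whose successor improves log Pr(X | K) by less
    than `min_gain`, or None if no plateau is reached in the tested range.

    ln_prob_by_k: dict mapping K (int) to the estimated log Pr(X | K).
    min_gain: hypothetical tolerance, in log-probability units.
    """
    ks = sorted(ln_prob_by_k)
    for k, k_next in zip(ks, ks[1:]):
        if ln_prob_by_k[k_next] - ln_prob_by_k[k] < min_gain:
            return k
    return None
```

With hypothetical values {1: -20000, 2: -15000, 3: -14995, 4: -14998, 5: -14996} the function returns 2. With values that keep rising by a large amount at every step it returns None, which corresponds to the "does not plateau" situation described for the African-American stratum.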

Discussion
Our results indicate that among the combined sample of African-American and Caucasian participants, self-reported ethnicity can serve as a proxy for genetic ancestry or relatedness. Furthermore, no large unknown subpopulation was identified. The ethnicity-specific analyses demonstrate no clear substructure in self-reported African-American participants. This differs from the accepted idea that greater genetic diversity, as measured by linkage disequilibrium, is seen in populations of African origin. The lack of substructure in our African-American participants may be related to limitations of our panel. Interestingly, the ethnicity-specific analyses do demonstrate that some population substructure may exist among self-reported Caucasian participants. Evaluation of substructure in Americans of European descent has shown a coarse separation of European populations along a northeast to southwest axis [8]. In this light, our heterogeneous urban-based Caucasian population may partially explain the substructure present in our Caucasian participants. Notably, there are plans for the SPYW population to be part of a genome-wide association study (GWAS) for ischemic stroke, thereby providing many more SNPs to better characterize the substructure of both the Caucasian and African-American participants. Another limitation of our study was the relatively low call rates, most notable for SNPs genotyped via the TaqMan platform. However, this should not have influenced our results because call rates did not differ significantly between cases and controls or between self-reported African-Americans and Caucasians (data not shown).
In summary, among the combined population, a small number of individuals were genetically more consistent with the other ancestry. Specifically, with a 50% ancestry threshold, 22 self-reported African-Americans were more consistent with Caucasian ancestry, while 10 self-reported Caucasians were more consistent with African-American ancestry. This information may be incorporated into future association analyses in various ways. Individuals not satisfying an ethnicity-based ancestry threshold could simply be removed from the study. Alternatively, as mentioned above, more null markers could be genotyped to improve the ancestry classification. Lastly, a variable incorporating percentage of ancestry could be introduced into the association analyses.

Conclusion
Among our combined sample of African-American and Caucasian participants there is no large unknown subpopulation and self-reported ethnicity can serve as a proxy for genetic ancestry or relatedness. Ethnicity-specific analyses indicate that population substructure may exist among the Caucasian participants, indicating that further studies are warranted.