Indirect calibration between clinical observers - application to the New York Heart Association functional classification system
© Severo et al; licensee BioMed Central Ltd. 2010
Received: 14 June 2011
Accepted: 3 August 2011
Published: 3 August 2011
Previous studies showed an inter-observer agreement for the NYHA classification of approximately 55%. The aim of this study was to calibrate the New York Heart Association (NYHA) classification system between observers, increasing its reliability.
Among 1136 community-dwellers in Porto, Portugal, aged ≥ 45 years, 265 reporting breathlessness answered a 4-item questionnaire to characterize symptom severity. The questionnaire was administered by 7 physicians who also classified the subject's functional capacity according to NYHA. Each subject was assessed by one physician. We calibrated NYHA classifications by the concurrent method, using a 1-parameter logistic graded response model. Discrepancies between observers were assessed by differences in ability thresholds between NYHA classes I-II and II-III. The ability estimated by the model was used to predict the NYHA classification for each observer.
Estimates of the first and second thresholds for each observer ranged from -1.92 to 0.46 and from 1.42 to 2.30, respectively. The agreement between estimated ability and the observers' NYHA classification was 88% (kappa = 0.61).
The study objectively indicates that the main reason why several studies have reported low inter-observer reliability is the existence of discrepant thresholds between observers in the definition of NYHA classes. The concurrent method can be used to minimize the reliability problem of NYHA classification.
The New York Heart Association (NYHA) functional classification was originally conceptualized and described in 1928 and most recently updated in 1994 as a method of assessing functional disability induced by cardiac diseases in patients encountered in clinical practice. The NYHA system was designed for clinical assessment of patients by physicians in 4 classes (I, II, III or IV) on the basis of the patient's limitations in physical activities caused by cardiac symptoms. The NYHA classification is derived largely by inference from history and/or observation of the patient in certain physical activities, and occasionally by direct or indirect measurement of cardiac function in response to standardized exercises. There was an attempt to increase the objectivity of the NYHA classification by adding an objective assessment, based on measurements such as electrocardiogram, stress test, X-ray and echocardiogram. Despite this attempt, the NYHA classification remains essentially subjective. The class a clinician assigns to a patient depends on the clinician's interpretation of what constitutes "ordinary" physical activity and "slight" or "marked" limitation. This results in high inter-observer variability. Previous studies showed an inter-observer agreement for the NYHA classification of approximately 55% [3, 4]. Consequently, the NYHA classification performs rather poorly as an outcome measure in clinical research. Nevertheless, this classification system has been widely used in clinical epidemiology studies as an inclusion criterion and also as an outcome measure. It is also used in routine clinical practice.
The aim of this study was to calibrate the NYHA classification system between different observers, aspiring to increase its reliability, by quantifying the discrepancy in thresholds in functional capacity that lead an observer to assign a NYHA class to a patient.
Participants were selected within the first follow-up of a cohort, representative at baseline of the non-institutionalized adult population of Porto, Portugal - the EPIPorto cohort study. At baseline, households were selected by random digit dialling. After the identification of a household, permanent residents were characterized according to age and gender, and one individual aged 18 years or older was randomly selected and invited to visit our department for an interview and physical examination. If there was a refusal, replacement was not allowed within the same household.
Trained interviewers collected information, using a standard protocol that comprised questions on social, demographic, clinical and behavioural characteristics. At baseline, 2485 participants were recruited. Between October 2006 and July 2008, all participants aged ≥ 45 years were eligible for a systematic evaluation, at our department, of measures of cardiac structure and function, which included a cardiovascular clinical history and physical examination, and a transthoracic echocardiogram.
Among the 2048 individuals eligible for this study, 134 (6.5%) had died, 198 (9.7%) refused to be re-evaluated and 580 (28.3%) were lost to follow-up (unreachable by telephone or post). Therefore 1136 (55.4%) individuals aged ≥ 45 years were assessed by 8 physicians experienced in the management of heart failure patients.
At the standardized clinical interview applied by these physicians, subjects who reported breathlessness (n = 265; 23.3%) were given a 4-item questionnaire on functional capacity to characterize the severity of symptoms: 1) whether breathlessness is felt when walking on steep plane, horizontal plane or at rest; 2) distance walked until perception of breathlessness; 3) sets of stairs (10-15 steps) climbed until perception of breathlessness; 4) whether mild, moderate or intense efforts are necessary to elicit breathlessness. These will hereafter be referred to as "anchor items".
The same physician administered the questionnaire and classified the subject's functional capacity using the NYHA classification. This classification, defined by each physician for each subject, will hereafter be referred to as "target items". The assessment of the NYHA classification was carried out after the administration of the 4 anchor items. NYHA class IV was aggregated to class NYHA III because only one individual was classified in NYHA IV.
The Medical Outcomes Study Short Form-36 (SF-36) was used to assess health-related quality of life. The scale had been previously translated, and the adapted Portuguese version was validated; each sub-domain of the SF-36 is scored from 0 to 100, with increasing values representing better health. Participants completed a physical activity questionnaire designed to estimate usual individual daily energy expenditure, focused on the activity in the past year. Time spent in a variety of activities per day, including work, transport to and from work, household chores, sports, sedentary leisure time and sleep, was self-reported and activity intensity categorized as very light, light, moderate and heavy with a corresponding average of 1.5, 2.5, 5.0 and 7.0 METs respectively, where one MET is equal to the energy expended at the basal metabolic rate or at rest. A severity scale was applied to measure fatigue, with increasing values representing higher severity.
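As a rough illustration of how intensity categories translate into an activity total, the sketch below computes MET-hours per day as a time-weighted sum. The MET values are those given in the text; the time-weighted formula, the function name and the example hours are assumptions made for illustration, since the exact scoring used in the study is not described here.

```python
# MET value assigned to each intensity category (from the text).
MET_BY_INTENSITY = {"very light": 1.5, "light": 2.5, "moderate": 5.0, "heavy": 7.0}


def daily_met_hours(hours_by_intensity):
    """Return the sum of (hours per day x MET) over intensity categories.

    Assumption: total activity is a simple time-weighted sum.
    """
    total = 0.0
    for intensity, hours in hours_by_intensity.items():
        total += hours * MET_BY_INTENSITY[intensity]
    return total


# Hypothetical participant: 16 waking hours split across categories.
example_day = {"very light": 8.0, "light": 6.0, "moderate": 1.5, "heavy": 0.5}
print(daily_met_hours(example_day))
```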
The local ethics committee (Hospital São João) approved the study and participants provided written informed consent.
Different correlation coefficients were used to evaluate the magnitude of the association between anchor items and the target items (NYHA classifications): correlations between two (artificial) ordinal variables were evaluated through polychoric correlations, and between interval and (artificial) ordinal variables through polyserial correlations.
Exploratory factor analyses (weighted least squares) on the 4 ordinal anchor items combined with each target item were used to evaluate the homogeneity of the items (i.e., to confirm there was a single latent variable), and Cronbach's alpha was used to measure reliability. The global goodness of fit of the underlying structure with 1 factor was evaluated using the comparative fit index (CFI), which is recommended when N < 250.
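For readers unfamiliar with Cronbach's alpha, a minimal sketch of the standard formula follows. It treats the ordinal item scores as numeric and uses complete cases only; the function name and the example scores (4 anchor items plus 1 target item for five hypothetical subjects) are illustrative and are not study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-subject item-score lists.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical scores: columns are 4 anchor items followed by 1 target item.
scores = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 2, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [2, 1, 2, 2, 1],
]
print(round(cronbach_alpha(scores), 3))
```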
The convergent and divergent validity of the 4 anchor items was assessed through the correlation between the questionnaire's raw score and the 4 physical dimensions of the health-related quality of life scale SF36 (physical function, role physical, bodily pain and general health perception), a scale for fatigue and daily physical activity. The raw score was estimated by the sum of all anchor items.
Each set of individuals assessed by each physician was considered as a group. Calibration of NYHA classification across different groups was performed by the concurrent method. Concurrent calibration involves estimating item and ability parameters in all groups simultaneously, i.e., by combining data from these distinct groups. Items not taken by one of the groups are treated as either not reached or missing. Given the ordinal nature of the items, this is a particular use of the 1-dimensional logistic graded response model (GRM) from item response theory (IRT). The model was fitted by approximate marginal maximum likelihood. The four patient items were used as anchor items and the 7 obtained NYHA classifications as target items (the NYHA classification of observer 3 was excluded from the GRM, and the dyspnea item was collapsed into two classes, 0 vs. 1 and 2, because of the small sample size).
Exploratory factor analysis (EFA) supported the presence of a single dimension underlying the ordinal items. Thus, 1-dimensional logistic graded response models (GRM) from item response theory (IRT) were used. These models assume that the performance of an individual on the items is explained by only one (standard normal) variable, commonly called "ability". "Ability" is the term that denotes the unobserved hypothetical variable (a latent trait) underlying graded response models. In our study, ability refers to the functional capacity of the subject that we are trying to characterize. Higher ability values represent worse functional capacity (more severe symptoms). In the graded response models, each item is described by a set of curves, item operation characteristic curves (IOCC). The item operation characteristic curves for category k represent the probability of endorsing categories higher than k conditional on subject's ability.
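The data layout implied by concurrent calibration can be sketched as follows: every subject contributes the 4 anchor items plus the NYHA class given by the one observer who saw them, and the target items of all other observers are left missing. The column names, the record structure and the observer numbering (1 to 8 with observer 3 excluded) are assumptions made for illustration.

```python
# Anchor item labels; "distance" is a hypothetical label for the walking item.
ANCHORS = ["dyspnea", "distance", "stairs", "effort"]


def build_concurrent_matrix(records, observers):
    """Stack all groups into one matrix with one target column per observer.

    Target items not rated for a subject are stored as None (missing).
    """
    columns = ANCHORS + [f"nyha_obs{o}" for o in observers]
    rows = []
    for rec in records:
        row = {c: None for c in columns}
        for anchor in ANCHORS:
            row[anchor] = rec["anchors"].get(anchor)
        row[f"nyha_obs{rec['observer']}"] = rec["nyha"]
        rows.append([row[c] for c in columns])
    return columns, rows


# Two hypothetical subjects seen by different observers.
records = [
    {"observer": 1, "nyha": 1, "anchors": {"dyspnea": 1, "distance": 0, "stairs": 1, "effort": 1}},
    {"observer": 5, "nyha": 2, "anchors": {"dyspnea": 1, "distance": 2, "stairs": 2, "effort": 2}},
]
columns, matrix = build_concurrent_matrix(records, observers=[1, 2, 4, 5, 6, 7, 8])
for line in matrix:
    print(line)
```

A matrix of this shape is what a GRM fitting routine would receive; the fitting step itself is not shown.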
The item operation characteristic curves of an item are characterized by several parameters: the slope (discrimination), which is the same for all categories, and the thresholds (difficulty), which are as many as the number of categories minus one. For example, one item with 3 categories has 3 category characteristic curves, one slope and two thresholds: t1 to define I versus II-IV and t2 to define I-II versus III-IV.
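In symbols, the curves described above can be written in the usual Samejima form. The notation is assumed for illustration (categories coded 0 to K_j - 1, a single slope shared by all items as in a 1-parameter GRM), and software packages may use a different but equivalent parameterisation.

```latex
% Assumed notation: \theta_i = ability of subject i; \beta = common slope;
% t_{jk} = k-th threshold of item j; K_j = number of categories of item j.
\[
P^{*}_{jk}(\theta_i) = P(Y_{ij} \ge k \mid \theta_i)
  = \frac{1}{1 + \exp\{-\beta(\theta_i - t_{jk})\}},
  \qquad k = 1, \dots, K_j - 1,
\]
\[
P(Y_{ij} = k \mid \theta_i) = P^{*}_{jk}(\theta_i) - P^{*}_{j,k+1}(\theta_i),
  \qquad P^{*}_{j0} \equiv 1, \quad P^{*}_{jK_j} \equiv 0 .
\]
```

Setting \theta_i = t_{jk} in the first expression gives a probability of exactly 0.5, which is why thresholds and ability share one scale.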
The threshold parameter between two categories represents the ability value at which the probability of indicating the highest of these two or higher is 50%. So, the threshold parameters are expressed in the same scale as the ability. The slope parameter indicates how well an item is able to discriminate individuals with ability values near the respective threshold. The slope parameter may also be interpreted as describing how an item may be related to the ability. The steeper the slope the higher is the item discrimination. We fitted a 1-parameter logistic (1-PL) GRM assuming a unique slope (discrimination parameter) for all items.
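A minimal numerical sketch of these definitions is given below. The slope is the common discrimination reported in the Results (2.27); the two thresholds are the lowest reported first and second thresholds, paired here purely for illustration (they need not belong to the same observer). Function names are hypothetical.

```python
import math


def cumulative_prob(theta, threshold, slope):
    """Probability of responding in the category above the threshold or higher."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - threshold)))


def category_probs(theta, thresholds, slope):
    """Probability of each category, from lowest to highest, at ability theta."""
    cum = [1.0] + [cumulative_prob(theta, t, slope) for t in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


SLOPE = 2.27                 # common discrimination reported in the Results
THRESHOLDS = [-1.92, 1.42]   # illustrative pairing of reported threshold values

probs = category_probs(0.0, THRESHOLDS, SLOPE)
print([round(p, 3) for p in probs])   # probabilities of NYHA I, II, III at ability 0
```

With a slope this steep, a subject of average ability is assigned almost entirely to the middle class by an observer with these thresholds.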
Quality of the calibration
The thresholds estimated for each observer were used as ability cut-off points to predict the observed NYHA classifications. This procedure made it possible to assess both the fit of the ability with the target items and the agreement between observers.
In the first case, NYHA predictions were sample-specific, i.e., the NYHA predictions were estimated separately for each sample assessed by each of the observers and compared with the observed NYHA classifications.
In the second case, NYHA predictions were not sample-specific, i.e., all individuals were classified using the thresholds estimated for each observer, regardless of the observer who assessed each individual, and the resulting classifications were compared with each other.
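The cut-off procedure can be sketched as a simple lookup. The two threshold pairs below combine the extremes of the reported ranges (-1.92 to 0.46 for the first threshold, 1.42 to 2.30 for the second) into two hypothetical observers; these pairings and the ability values are illustrative assumptions, not estimates for any actual observer.

```python
from bisect import bisect_right

NYHA_LABELS = ["I", "II", "III"]   # class IV was aggregated with class III


def predict_nyha(ability, thresholds):
    """Map an ability value to a NYHA class using an observer's two thresholds."""
    return NYHA_LABELS[bisect_right(thresholds, ability)]


# Hypothetical observers built from the extremes of the reported ranges.
low_threshold_observer = [-1.92, 1.42]
high_threshold_observer = [0.46, 2.30]

for ability in (-2.5, 0.0, 2.0, 3.0):
    print(
        ability,
        predict_nyha(ability, low_threshold_observer),
        predict_nyha(ability, high_threshold_observer),
    )
```

The same ability value can fall into different classes for the two observers, which is the discrepancy the calibration is meant to quantify.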
Agreement was assessed with both the absolute agreement and Cohen's weighted kappa coefficient. Guidelines for interpreting kappa statistics suggest that values of 0.81-1.00 indicate almost perfect agreement, 0.61-0.80 substantial agreement, 0.41-0.60 moderate agreement and 0.21-0.40 fair agreement, and that values less than 0.21 indicate poor or slight agreement.
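Both agreement measures can be computed from two lists of classifications, as in the sketch below. Linear disagreement weights are assumed, since the weighting scheme is not stated in the text; the function names are hypothetical, and the interpretation bands are those quoted above.

```python
def absolute_agreement(a, b):
    """Proportion of subjects given the same class in both lists."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_kappa(a, b, categories):
    """Cohen's weighted kappa with linear weights (an assumption).

    Undefined (division by zero) if both raters use one and the same class only.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    d_obs = sum(abs(i - j) / (k - 1) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(abs(i - j) / (k - 1) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1.0 - d_obs / d_exp


def interpret_kappa(kappa):
    """Interpretation bands as quoted in the text."""
    if kappa >= 0.81:
        return "almost perfect"
    if kappa >= 0.61:
        return "substantial"
    if kappa >= 0.41:
        return "moderate"
    if kappa >= 0.21:
        return "fair"
    return "poor or slight"


# Hypothetical classifications of six subjects by two raters.
rater_a = ["I", "I", "II", "II", "III", "III"]
rater_b = ["I", "II", "II", "II", "III", "II"]
kappa = weighted_kappa(rater_a, rater_b, ["I", "II", "III"])
print(absolute_agreement(rater_a, rater_b), round(kappa, 2), interpret_kappa(kappa))
```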
Table: Characteristics of the study sample by observers. Rows: history of myocardial infarction; history of angina; history of heart failure; left ventricular systolic dysfunction; NYHA class III and IV; systolic blood pressure (mmHg); diastolic blood pressure (mmHg); body mass index (kg/m2).
Table: Score of each anchor item, the distribution of the items and the polychoric correlation of each item with NYHA classification (N = 265).
1) Do you usually have breathlessness or difficulty breathing? ("dyspnea"), N = 263 (99.2%). Listed responses: yes, when walking on steep plane; yes, when walking on the horizontal plane; yes, even at rest.
2) If yes, how long can you walk before you have to stop? N = 232 (87.5%).
3) If yes, after how many sets of stairs (10-15 steps) do you have to stop? ("stairs")*, N = 258 (97.3%). Listed response: 3 or more sets.
4) If yes, in your view, what level of effort induces breathlessness? ("effort"), N = 257 (97.0%).
Raw score (0-8).
Exploratory factor analysis and internal consistency conducted separately for the 7 observers NYHA classification (target items) and combined with the 4 anchor items
Validity of the anchor items
Table: Correlation between the raw score (sum of 4 items) and NYHA with the fatigue scale, daily physical activity, the 4 physical sub-dimensions (physical function, role physical, bodily pain and general health perception) and the general physical health of the Short Form 36. Rows: total physical activity (METs); general physical health (SF-36); general health perception.
One-dimensional 2 parameter logistic graded response model with equal discrimination parameters across items
The calibration with the 1-PL graded response model showed that the observers' classifications and the patient anchor items had high discrimination (β = 2.27, standard error = 0.176).
Quality of the calibration
Agreement between the observers and between the observers and the ability estimated by the concurrent calibration
The agreement between the classifications predicted for all individuals from the ability thresholds estimated for each observer ranged from 30% to 97%, with a median of 65%; the weighted kappa ranged from 0.00 to 0.94, with a median of 0.21. This means that, when the discrepancies in thresholds between observers are not taken into account, the agreement between observers' NYHA classifications is only fair.
Several studies have shown that the NYHA classification is valid but not reproducible [2, 4], and associated with symptom burden, quality of life, exercise capacity, and increased risk of ischemic stroke [18–20]. Nevertheless, the NYHA classification was originally designed as a clinical, not a research tool. Although much has been written regarding the limitations of the NYHA classification as an outcome measure, investigators continue to use it in clinical research. The popularity of the NYHA classification system is based on its simplicity. Any system that might replace it should be more accurate without being more complex. So the aim of this study was not to build a new system but to improve the NYHA system. To do so, we used IRT models to equate and calibrate a large number of observers on the same scale; by doing so, we were able to identify observers with lower and higher thresholds for classification, as well as to understand the relations with anchor items across the ability continuum, and to improve the NYHA classification system.
The present study objectively indicates the main reason why several studies have reported low inter-observer reliability and, consequently, the limited usefulness of the NYHA classification as an outcome measure. The main reason is the existence of discrepant thresholds between observers in the definition of NYHA class I, II and III individuals. Although the observers in this study were experienced physicians well trained in the management of heart failure, there were still discrepancies between their (subjective) evaluations.
The focus should therefore be on the identification of differences between the evaluations of the observers and on the calibration of those classifications.
Although intra-observer reliability is more important to interpret changes in NYHA class in the individual patient who is assessed repeatedly by the same physician, inter-observer variability is of special concern when patients are assessed by different physicians. This is particularly important, in practice, in unscheduled visits to the clinic or the emergency department, where patients are not assessed by their usual attendant. These unscheduled visits are usually due to worsening symptoms and an increase in NYHA class, in comparison with the previous clinical state, is used as a criterion for clinical decisions such as hospital admission and intensity of therapy adjustment such as use of intravenous medication.
Therefore, in each setting where the NYHA classification is to be used, it would be useful to identify the differences between the assessments of the observers and calibrate their classifications. For the calibration with the IRT methodology to be possible, a set of anchor items is needed. These items should be reliable and valid. In this sample, the 4 anchor items combined with each target item showed good homogeneity (strong first factor) and reliability (alpha > 0.61). Furthermore, these items showed content validity on the basis of a previous study, which concluded that the self-reported distance (70%) and difficulty in climbing stairs (60%) were the items most commonly used by senior cardiologists and trainees in cardiology to classify patients in NYHA classes. Our study showed that these anchor items had a strong association with the NYHA classification and a similar association with scales that measure related constructs. These results therefore support the reliability of the anchor items and their validity for assessing the same construct as the NYHA classification.
The improvement in absolute agreement, from 65% between the observers' ability-scale predictions of the NYHA classification to 88% between the ability-scale predictions and the observers' own NYHA classifications, shows how the subjectivity of the thresholds can affect the reliability of the NYHA classification. At the same time, this improvement confirms the quality of the calibration obtained.
The calibration methodology can be useful to improve the reliability between observers in clinical practice and research settings. In clinical practice it is possible to use the anchor items' relations with ability to explain the differences between observers and give guidelines to improve the inter-observers reliability. For example, if we wanted to calibrate the threshold between NYHA I and II for all observers, we would advise all observers to use endorsement of the second category of the "Effort" item for the definition of class NYHA II. Similarly if we wanted to calibrate the threshold between NYHA II and III we would advise all observers to use endorsement of at least the third category of the "effort" item. In research settings the ability scale, defined using both the anchor items and an operator's classification, can be used as a refined NYHA classification, independently of the subjectivity of the observers.
The major limitation of this study is its small size. Whereas the minimum number of individuals required to properly fit a 1-PL model is 200, only slightly less than the 263 individuals assessed here, a proper 2-parameter logistic (2-PL) GRM allowing the slope to vary among the items would require a larger sample size. An inadequate sample size would be expected to yield unstable item parameters and higher standard errors, which was the case in our study.
In the present study, each individual was assessed by only one observer, as opposed to the ideal situation in which each individual would be assessed by all observers. We do not think of this as a limitation. When we compared the individuals assessed by each of the observers, there were no statistically significant differences in sex, clinical history, systolic blood pressure, education and left ventricular systolic dysfunction; only age, body mass index and diastolic blood pressure showed small differences. Consequently, overall, the individuals that each observer assessed were very similar. Moreover, the anchor items were related to each observer's NYHA classification, so even if the samples assessed by each observer had been very discrepant, the anchor items would guarantee a good calibration. Therefore we are confident that this design feature did not have a major impact on the results.
The anchor items proposed to calibrate the NYHA classifications are not assumed to be the gold standard and are not intended to replace the NYHA classification by themselves. The study only validated these anchor items against the NYHA classifications, supporting their use to calibrate different observers in applying the NYHA classification. We do not intend to question the validity of either the anchor items or the NYHA classification as measures of true functional capacity; to do so, we would need to compare each of them with quantitative measures of functional capacity such as the 6-minute walk test or a cardiopulmonary exercise test with measurement of oxygen consumption.
Self-reported distance is a subjective measure and many factors influence a patient's answer, including psychosocial factors and perceptions of distance. Patients' ability to estimate 100 m, 500 m and 2500 m distance was shown to be poor. However, the use of additional anchor items is expected to attenuate the impact of this potential error in each of them.
The physicians were aware of patients' responses to the 4 anchor items. It is therefore possible that this influenced their ratings and thus violated the assumption of local independence of the statistical model. Separate calibration with the mean/mean method was used as a sensitivity analysis (data not shown), and the results were similar to those of the concurrent analysis. In addition, there were no significant differences between the observed and expected frequencies of items in the 7 observers' models, and only one pair of anchor items in 1 of the 7 observers' graded response models (observer 2) showed local dependence.
The generalisation of the calibration method proposed is limited by the lack of individuals classified as NYHA class IV.
In conclusion, this study showed that the thresholds of the NYHA classification between observers were very discrepant and that concurrent calibration through IRT models can be used to calibrate a large number of observers on the same scale. It provides a way to minimize the reliability problem of NYHA classification. This type of approach can be useful to minimize the inter-observer variability in other classifications based on the patient's and/or physician's perception.
- Nomenclature and Criteria for Diagnosis of Diseases of the Heart and Great Vessels. 1994, Little, Brown, 9, revised.
- Bennett JA, Riegel B, Bittner V, Nichols J: Validity and reliability of the NYHA classes for measuring research outcomes in patients with cardiac disease. Heart Lung. 2002, 31: 262-270. 10.1067/mhl.2002.124554.
- Raphael C, Briscoe C, Davies J, Ian Whinnett Z, Manisty C, Sutton R, Mayet J, Francis DP: Limitations of the New York Heart Association functional classification system and self-reported walking distances in chronic heart failure. Heart. 2007, 93: 476-482. 10.1136/hrt.2006.089656.
- Goldman L, Hashimoto B, Cook EF, Loscalzo A: Comparative reproducibility and validity of systems for assessing cardiovascular functional class: advantages of a new specific activity scale. Circulation. 1981, 64: 1227-1234. 10.1161/01.CIR.64.6.1227.
- Ramos E, Lopes C, Barros H: Investigating the effect of nonparticipation using a population-based case-control study on myocardial infarction. Ann Epidemiol. 2004, 14: 437-441. 10.1016/j.annepidem.2003.09.013.
- McHorney CA, Ware JE, Raczek AE: The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care. 1993, 31: 247-263. 10.1097/00005650-199303000-00006.
- Severo M, Santos AC, Lopes C, Barros H: Reliability and validity in measuring physical and mental health construct of the Portuguese version of MOS SF-36. Acta Med Port. 2006, 19: 281-287.
- Ainsworth BE, Haskell WL, Leon AS, Jacobs DR, Montoye HJ, Sallis JF, Paffenbarger RS: Compendium of physical activities: classification of energy costs of human physical activities. Med Sci Sports Exerc. 1993, 25: 71-80. 10.1249/00005768-199301000-00011.
- Krupp LB, LaRocca NG, Muir-Nash J, Steinberg AD: The fatigue severity scale. Application to patients with multiple sclerosis and systemic lupus erythematosus. Arch Neurol. 1989, 46: 1121-1123.
- Cortina JM: What Is Coefficient Alpha? An Examination of Theory and Applications. Journal of Applied Psychology. 1993, 78: 98-98.
- Hu L, Bentler PM: Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal. 1999, 6: 1-55.
- McHorney CA, Cohen AS: Equating health status measures with item response theory: illustrations with functional status items. Med Care. 2000, 38: II43-59.
- Samejima F: Graded response model. Handbook of modern item response theory. 1997, 85-100.
- Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33: 159-174. 10.2307/2529310.
- R: A Language and Environment for Statistical Computing. 2008, Vienna, Austria: R Development Core Team.
- Rizopoulos D: ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software. 2006, 17: 1-25.
- Muthén L, Muthén B: Mplus User's Guide [Computer software and manual]. 5. Los Angeles: Muthén & Muthén. 2008.
- van den Broek SA, van Veldhuisen DJ, de Graeff PA, Landsman ML, Hillege H, Lie KI: Comparison between New York Heart Association classification and peak oxygen consumption in the assessment of functional status and prognosis in patients with mild to moderate chronic congestive heart failure secondary to either ischemic or idiopathic dilated cardiomyopathy. Am J Cardiol. 1992, 70: 359-363. 10.1016/0002-9149(92)90619-A.
- Ganiats TG, Browner DK, Dittrich HC: Comparison of Quality of Well-Being scale and NYHA functional status classification in patients with atrial fibrillation. New York Heart Association. Am Heart J. 1998, 135: 819-824. 10.1016/S0002-8703(98)70040-7.
- Koren-Morag N, Goldbourt U, Tanne D: Poor functional status based on the New York Heart Association classification exposes the coronary patient to an elevated risk of ischemic stroke. Am Heart J. 2008, 155: 515-520. 10.1016/j.ahj.2007.10.032.
- Tedesco C, Manning S, Lindsay R, Alexander C, Owen R, Smucker ML: Functional assessment of elderly patients after percutaneous aortic balloon valvuloplasty: New York Heart Association classification versus functional status questionnaire. Heart Lung. 1990, 19: 118-125.
- Downing SM: Item response theory: applications of modern test theory in medical education. Med Educ. 2003, 37: 739-745. 10.1046/j.1365-2923.2003.01587.x.
- Marco GL: Item Characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement. 1977, 14: 139-160. 10.1111/j.1745-3984.1977.tb00033.x.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.