Input data quality control for NDNQI national comparative statistics and quarterly reports: a contrast of three robust scale estimators for multiple outlier detection
© Hou et al.; licensee BioMed Central Ltd. 2012
Received: 19 March 2012
Accepted: 17 August 2012
Published: 25 August 2012
To evaluate institutional nursing care performance in the context of national comparative statistics (benchmarks), approximately one in every three major healthcare institutions across the United States (over 1,800 hospitals) has joined the National Database for Nursing Quality Indicators® (NDNQI®). With over 18,000 hospital units currently contributing data for nearly 200 quantitative measures, reliable and efficient screening of all quantitative input measures for data quality control is critical to the integrity, validity, and on-time delivery of NDNQI reports.
With Monte Carlo simulation and quantitative NDNQI indicator examples, we compared two ad-hoc methods using robust scale estimators, Inter Quartile Range (IQR) and Median Absolute Deviation from the Median (MAD), to the classic, theoretically-based Minimum Covariance Determinant (FAST-MCD) approach, for initial univariate outlier detection.
While the theoretically based FAST-MCD used in one dimension can be sensitive and is better suited for identifying groups of outliers because of its high breakdown point, the ad-hoc IQR and MAD approaches are fast, easy to implement, and could be more robust and efficient, depending on the distributional property of the underlying measure of interest.
With highly skewed distributions for most NDNQI indicators within a short data screening window, the FAST-MCD approach, when applied to one-dimensional raw data, could produce higher false alarm rates for potential outliers than the IQR and MAD approaches at the same preset critical value, and thus overburden data quality control at both the data entry and administrative ends in our setting.
To establish benchmarks and monitor nursing-sensitive quality indicators across the United States, the American Nurses Association (ANA) established the National Database for Nursing Quality Indicators® (NDNQI®) in 1998. With over 1,800 hospitals at present, NDNQI collects unit-level data online through a secured database and provides each member institution a quarterly report with 8-quarter trend data, along with national comparative statistics stratified by hospital staffed bed size, teaching or Magnet status, unit type, and various other characteristics of institutional preference. With dynamic input from over 18,000 hospital units, NDNQI compiles over 200 quantitative measures of nursing care structure, process, and outcomes. For input data quality control, NDNQI first conducts a one-dimensional data quality check on the various quantitative measures; potential outliers are flagged at the univariate level for correction or confirmation to ensure the quality and overall validity of the national comparative statistics under the various stratifications. Detecting and evaluating valid extreme observations, on the other hand, may be just as important to participating hospitals, since these observations identify what needs to be exemplified or improved to better their services. Besides multilevel validation rules and compatibility checks with online data entry through the secured NDNQI database, an interactive statistical data screening procedure with up to three rounds of overnight univariate data screening for potential outliers has been in place since the beginning of NDNQI. The statistical data screening starts immediately once a quarterly data entry deadline is reached and continues until all questionable inputs are resolved or confirmed through the hospital site coordinator, the institution's designated data manager.
At present, we rely on the theoretically based FAST-MCD approach because it is readily available in most commercial statistical packages and is applicable to one-dimensional outlier detection with a high breakdown point. With the continuous growth of NDNQI in both the number of facilities and new quantitative measures, we need to expand the initial statistical screening of input data and run the most efficient and reliable quality control possible to ensure on-time delivery of high-quality quarterly reports, one of the most frequent suggestions in the 2008 NDNQI customer satisfaction survey. Currently, the NDNQI quarterly report uses Bayesian hierarchical modeling and a Box-Cox transformation approach for hospital report cards and NDNQI national comparative statistics once the institutional data are deemed clean or reconfirmed after initial raw data screening. Robust regression methods with multivariate outlier detection techniques are also available and have been extensively reported in the literature[6–10], though we focus this work on univariate outlier detection, as guided by our application to NDNQI processes.
Outliers are abnormal observations that do not conform to the pattern (model) suggested by the majority of the cases in a data set, and they can arise for different reasons. Some reflect superior or deficient unit-level performance in measured quality, as in the case of NDNQI, and are true observed values; others may result from miscalculation, wrong definitions, or simple typos. Many methods are available for outlier detection[2, 12–19], and most of them are distance-based, relying on one kind or another of robust measure of location and scatter (scale estimator)[2, 17, 20–22]. Detection and examination of potential outliers are integral parts of data analysis[23–25], because the presence of outliers may alter statistics, reduce the power of a test, and even lead to incorrect conclusions. On the other hand, outliers are often of primary interest in searching for superiority, such as in biological breeding, geological exploration, and pharmaceutical research. In NDNQI, an outlier for a certain indicator could signal outstanding performance or inadequate service in nursing care, supply, and/or skill, which in turn could provide critical feedback to the hospital administration. Comparisons of different methods for detecting outliers have also been well reported by Kianifard and Swallow, Hadi and Simonoff, Sebert et al., and most recently, Billor and Kiral. Most previous work focused on residuals from a regression model in which the residuals are roughly normally distributed for the bulk of observations. The primary interest of this study, however, is to investigate the extent to which the detection capability and robustness of three different approaches, based on FAST-MCD, IQR, and MAD, are affected if the majority of the underlying population deviates from the normal assumption.
This is because a) most NDNQI indicators have skewed distributions, b) factors with structural effects are potentially large, unknown, and most likely differ from indicator to indicator, and c) we emphasize checking the validity of the raw input data.
The robust distance of observation $x_i$ from the MCD location estimate $T$, scaled by the MCD covariance estimate $C$, is

$$D_i = \sqrt{(x_i - T)^{\top} C^{-1} (x_i - T)} \qquad (1)$$

where the squared distance is Chi-square distributed, $D_i^2 \sim \chi^2_p$, with p representing the number of columns of the X matrix. The outlyingness of an observation is assessed by comparing its distance (D) from the location T in (1) with the square root of a critical value of the $\chi^2_p$ distribution. The distance is robust because all (n - h) observations that did not contribute to the covariance matrix with the lowest determinant have zero weight on T and C, and thus have no effect on the measure of D. Consequently, the robust distances for all n observations are not affected by the number (if less than (n + p + 1)/2) and magnitude of potential outliers. If a large proportion of the data are concentrated at a single lower end point, the FAST-MCD approach is more likely to fail because the robust distance cannot be calculated when C is zero. It is also possible that the remaining (n - h) observations are all declared outliers if they tend to be isolated in groups but are not necessarily separated by large distances from the h observations. As a result, the FAST-MCD approach could mislead, depending on the nature of the data distribution. In this paper, we focus on detecting outliers in the raw (also called pre-aggregated) data. The FAST-MCD, used in a one-dimensional setting, serves as a benchmark for comparison with the other two approaches, because the theoretically based MCD approach has a high breakdown point and is sensitive to groups of outliers. Thus, T, C, D, and the X (N×P) in matrix notation under the multivariate framework reduce to scalars for T, C, and D and to a column vector X (N×1) in the one-dimensional case.
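The one-dimensional reduction described above can be sketched in a few lines. This is a minimal illustration, not the FAST-MCD routine of any statistical package: it relies on the known property that, in one dimension, the h-subset with the smallest variance is contiguous in the sorted data, and it omits the consistency and small-sample correction factors that full implementations apply to C (so distances from this sketch run larger than a corrected implementation would report). The function names `univariate_mcd` and `robust_distances` are hypothetical.

```python
import math
import statistics


def univariate_mcd(x, h=None):
    """Raw (uncorrected) MCD location T and variance C for one-dimensional data."""
    xs = sorted(x)
    n = len(xs)
    if h is None:
        h = (n + 2) // 2  # floor((n + p + 1) / 2) with p = 1
    best_var, best_mean = math.inf, None
    # In one dimension the minimum-variance h-subset is contiguous in sorted order,
    # so it is enough to slide a window of length h over the sorted values.
    for start in range(n - h + 1):
        window = xs[start:start + h]
        var = statistics.pvariance(window)
        if var < best_var:
            best_var, best_mean = var, statistics.fmean(window)
    return best_mean, best_var


def robust_distances(x, h=None):
    """Distance D of every observation from T, in units of sqrt(C)."""
    t, c = univariate_mcd(x, h)
    if c == 0:
        # More than half of the data sit on a single value: D is undefined.
        raise ZeroDivisionError("C is zero; robust distance cannot be calculated")
    return [abs(v - t) / math.sqrt(c) for v in x]


if __name__ == "__main__":
    d = robust_distances([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
    flagged = [dist ** 2 >= 5.02 for dist in d]  # squared distance vs. critical value
    print(flagged)
```

Calling `robust_distances([0] * 8 + [1, 2])` raises the error, which mirrors the failure mode described above when a large share of the data is concentrated at a single end point.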
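$$\mathrm{IQR} = Q_3 - Q_1 \qquad (2)$$

$$\mathrm{MAD} = \operatorname{median}_i \left| x_i - \operatorname{median}_j (x_j) \right| \qquad (3)$$

where $x_i$ represents the observations, with $i$ ranging from 1 to $n$, and $Q_1$ and $Q_3$ denote the first and third quartiles.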
Through a simulation study on residuals from a regression model in which the predictor and the error term were generated as uniform U(0, 15) and standard normal N(0, 1) random variables, respectively, Swallow and Kianifard showed that both IQR and MAD asymptotically approach the standardized variance of 1.00 for $\hat{\sigma}$ through constant divisors of 1.369, 1.363, and 1.355 (IQR) and 0.639, 0.658, and 0.666 (MAD) for sample sizes of 25, 50, and 100, respectively. They suggested adjusting IQR or MAD by one of the constant divisors to obtain a robust estimate ($\hat{\sigma}$) of σ, and declaring an observation an outlier if $e_i/\hat{\sigma}$ is greater than or equal to a preselected critical value of the standard normal distribution N(0, 1) (1.96 for the 5% or 2.58 for the 1% significance level). They proposed a stepwise strategy for testing the null hypothesis that the j th ( j = p + 1, . . . , n ) observation is not an outlier. After fitting the regression model, the first p observations with the smallest absolute studentized residuals were used for computing the n - p recursive residuals as defined by Brown, Durbin, and Evans. The largest absolute recursive residual, scaled by $\hat{\sigma}$, is compared to a critical value, and the no-outliers hypothesis is rejected when the test statistic is greater than or equal to the preselected critical value. The procedure is repeated, removing the flagged observation from the computation, until the no-outliers hypothesis can no longer be rejected. Swallow and Kianifard concluded that using ordinary least squares residuals, studentized residuals, or recursive residuals has little effect on the critical values for testing the no-outliers hypothesis at the 0.1, 0.05, or 0.01 significance levels with either IQR or MAD as the scale estimate. We chose IQR/1.355 or MAD/0.666 as the robust estimate of scale since both the simulated and the NDNQI example data sets used in this study are substantially large.
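To make the scale adjustment concrete, the following sketch applies the divisors 1.355 and 0.666 to raw one-dimensional data. Several choices here are assumptions rather than the authors' procedure: distances are measured from the sample median, quartiles come from Python's `statistics.quantiles` (whose default interpolation may differ from the SAS UNIVARIATE definition), and the function names are hypothetical.

```python
import statistics

IQR_DIVISOR = 1.355  # large-sample divisor quoted in the text for IQR
MAD_DIVISOR = 0.666  # large-sample divisor quoted in the text for MAD


def iqr_scale(x):
    """Robust scale estimate: interquartile range divided by its constant."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return (q3 - q1) / IQR_DIVISOR


def mad_scale(x):
    """Robust scale estimate: median absolute deviation divided by its constant."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return mad / MAD_DIVISOR


def flag_outliers(x, scale, critical=2.24):
    """Flag observations whose scaled distance from the median reaches the critical value."""
    if scale == 0:
        raise ZeroDivisionError("scale estimate is zero; distances are undefined")
    med = statistics.median(x)
    return [abs(v - med) / scale >= critical for v in x]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(flag_outliers(data, iqr_scale(data)))
    print(flag_outliers(data, mad_scale(data)))
```

In this toy data set both estimators flag only the value 100; on skewed or zero-heavy data the two can disagree, which is the behavior the study examines.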
The cleaned NDNQI data for the 3rd quarter of 2007 were used to explore the distributional properties of the indicators and how the data distribution affects the robustness and false alarm rate of the three scale estimators. The study was approved by the IRB of the Human Subjects Committee at The Kansas University Medical Center. A total of 12,145 units contributed, at least partially, to the NDNQI database for the 3rd quarter of 2007. Based on the extract, 146 quantitative measures were computed for constructing nursing-sensitive quality indicators at the hospital-unit level. Among all indicators, we selected Total Falls Per 1,000 Patient Days, Injury Falls Per 1,000 Patient Days, Percent of PIV Sites with Vesicant Solution, Percent of Surveyed Patients with Pressure Ulcers, and Average Number of Pain Assessments per Patient Initiated in 24 Hours, because these measures represent the wide range of data distributions among all indicators. For each of the 7 selected measures, the critical value with FAST-MCD was set at 5.02 for the squared robust distance, corresponding to the 2.5% significance level for the χ2 distribution with 1 degree of freedom. The critical value for the IQR and MAD approaches was 2.24, corresponding to the 1.25% lower and upper percentiles for a two-sided test with the standard normal distribution. In each case, around 2.5% of the observations were targeted for recheck. We thought it necessary to keep the critical value at the 2.5% level considering NDNQI's commitment to data integrity and quality, the dimension of the data to be screened, the number of hospitals involved, and the available data management resources.
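The two critical values quoted above are two views of the same cutoff: a χ2 variable with 1 degree of freedom is the square of a standard normal variable, so the 97.5% point of χ2(1) equals the square of the upper 1.25% point of N(0, 1). A short check using only the Python standard library (the variable names are illustrative):

```python
from statistics import NormalDist

# Upper 1.25% point of the standard normal (two-sided 2.5% test).
z_critical = NormalDist().inv_cdf(1 - 0.0125)

# A chi-square variable with 1 df is a squared standard normal,
# so its 97.5% point is the square of the value above.
chi2_critical = z_critical ** 2

print(round(z_critical, 2), round(chi2_critical, 2))  # prints "2.24 5.02"
```

Because the thresholds are equivalent in one dimension, any difference in the share of flagged observations among the three approaches comes from their location and scale estimates, not from the choice of critical value.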
A close look at all indicators revealed that their distributions are highly skewed to the right, and that a Gamma distribution with different shape and scale parameters would provide the best goodness of fit for each. Therefore, we performed a simulation study by generating Gamma random variables X ~ Γ(α, β), using the SAS® RANGAM function with various scale (β) and shape (α) parameters. The pairs of β and α were selected such that the skewness of X ranged from around 0 (close to normal) to 4 (heavily skewed to the right), while the mean of X remained the same. The SAS MCD CALL routine was used for calculating the robust distance, while the interquartile range in (2) and the median absolute deviation from the median in (3), along with the skewness and other descriptive statistics, were obtained with the SAS UNIVARIATE procedure. A SAS macro program was written to identify potential outliers and to combine and compare the results of the three methods.
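A rough Python analogue of this simulation design is sketched below; it is not the authors' SAS macro. It uses the facts that a Gamma distribution with shape α and scale β has skewness 2/√α and mean αβ, so fixing the mean and the target skewness determines both parameters. The mean of 10, the sample size of 1,000, the seed, the use of `random.gammavariate` in place of RANGAM, and measuring distances from the median are all assumptions made for illustration.

```python
import random
import statistics


def gamma_params(mean, skewness):
    """Shape and scale of a Gamma distribution with the given mean and skewness (> 0)."""
    alpha = 4.0 / skewness ** 2  # skewness of Gamma(alpha, beta) is 2 / sqrt(alpha)
    beta = mean / alpha          # mean of Gamma(alpha, beta) is alpha * beta
    return alpha, beta


def flag_rate(x, scale, critical=2.24):
    """Share of observations whose scaled distance from the median reaches the cutoff."""
    med = statistics.median(x)
    return sum(abs(v - med) / scale >= critical for v in x) / len(x)


def simulate(skewness, mean=10.0, n=1000, seed=1):
    """Flag rates of the IQR and MAD approaches on one simulated Gamma sample."""
    rng = random.Random(seed)
    alpha, beta = gamma_params(mean, skewness)
    x = [rng.gammavariate(alpha, beta) for _ in range(n)]
    q1, _, q3 = statistics.quantiles(x, n=4)
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return {
        "IQR": flag_rate(x, (q3 - q1) / 1.355),
        "MAD": flag_rate(x, mad / 0.666),
    }


if __name__ == "__main__":
    for skew in (0.2, 1.0, 2.0, 4.0):
        print(skew, simulate(skew))
```

Since no true outliers are planted in these samples, every flagged observation counts as a false alarm, so the printed rates can be compared directly with the 2.5% target.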
To contrast the ability of each method to identify true outliers, we adjusted the Monte Carlo simulation such that 10 observations (1%) were planted at random as known outliers in each generated data set, along with the remaining 990 data points (99%), at the various levels of asymmetry described above.
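The bookkeeping for such a contaminated sample can be sketched as follows. The text does not say how the planted values were chosen, so the single outlying `value` passed in below is purely hypothetical, as are the function names; the sketch only shows how planted positions are tracked so that true detections and false alarms can be counted separately.

```python
import random


def plant_outliers(clean, n_out, value, rng):
    """Insert n_out copies of a hypothetical outlying value at random positions."""
    n = len(clean) + n_out
    planted = set(rng.sample(range(n), n_out))
    remaining = iter(clean)
    data = [value if i in planted else next(remaining) for i in range(n)]
    return data, planted


def detection_summary(flags, planted):
    """Split flagged observations into true detections and false alarms."""
    n = len(flags)
    hits = sum(flags[i] for i in planted)
    false_alarms = sum(f for i, f in enumerate(flags) if i not in planted)
    return {
        "true_outlier_rate": hits / len(planted),
        "false_alarm_rate": false_alarms / (n - len(planted)),
        "overall_rate": (hits + false_alarms) / n,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [rng.gammavariate(4.0, 2.5) for _ in range(990)]
    data, planted = plant_outliers(clean, 10, 500.0, rng)
    flags = [v > 100.0 for v in data]  # stand-in for any of the three detection rules
    print(detection_summary(flags, planted))
```

Any detection rule that returns one boolean per observation can replace the stand-in `flags` line.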
For real case application, we computed a few NDNQI indicators both before and after data cleaning, using 2007 NDNQI 4th quarter data, and then checked each indicator for potential outliers to compare the sensitivity and efficiency of the three approaches.
NDNQI quarterly report data in 2007
Table. Distributional skewness and false alarm rates for potential outlier check by IQR, MAD, and FAST-MCD approaches for selected NDNQI indicators. Column group: False alarm rates by different approach. Rows: Total Falls Per 1,000 Patient Days; Total Injury Falls Per 1,000 Patient Days; Percent of Total Nursing Hours Provided by RNs; Total Hospital Acquired Pressure Ulcer; Total Number of Ulcers; Average Pain Assessments in 24 Hours; Prior Risk Assessment for Pressure Ulcers; Total Nursing Hours per Patient Day; Percent Vesicant PIV.
Table. Total number of units reporting with data, required for reconfirm after screening, with outliers corrected, and false alarm rate by different approach. Rows: RN hours per patient day by unit type; Pediatric Critical Care; Pediatric Step Down; Injury Fall Rate; Fall Prior Risk Assmnt.
Monte Carlo simulation
False alarm rate as a function of skewness in data distribution for IQR, MAD, or FAST-MCD approach with simulation. Columns: Asymmetry in data distribution; Potential outlier rate by different methods.
Skewness in the data distribution inflates the overall false alarm rate in the presence of true outliers, but to a different degree depending on whether the IQR, MAD, or FAST-MCD approach is used. Columns: Asymmetry in data distribution; True and false outlier rates by different approach (Overall Outliers Detected / (10 + 990) for each approach).
Conclusion and discussion
When used for one-dimensional outlier detection in raw data, the robustness and efficiency of the ad-hoc, distance-based IQR and MAD approaches, as well as the classic, theoretically based FAST-MCD approach, depend on the skewness of the data distribution. Most previous studies focused on regression residuals with the majority of the observations being normally distributed or relatively symmetric, a precondition that makes the FAST-MCD robust (free from masking and swamping) and sensitive to the presence of multiple outliers. With Monte Carlo simulation and NDNQI examples, we demonstrated that, with skewed data and a preselected critical value, the FAST-MCD approach could be misleading by producing a false alarm rate above the targeted level. Consequently, it was less efficient than the IQR or MAD approaches, because more time and resources need to be committed to finding the true outliers among all flagged potential outliers at the same significance level. Note that a limitation of MAD and FAST-MCD lies in their application to zero-inflated data. As many NDNQI indicators reflect rare adverse events, a median value of 0 is not uncommon, causing both methods to fail. For certain indicator distributions, even the IQR method has limitations, as the 75th percentile is 0.
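The zero-inflation limitation is easy to reproduce. In the sketch below, the two data sets are invented for illustration (60% and 80% zeros), and quartiles use the default interpolation of Python's `statistics.quantiles`, which may differ from the definition used in the study's SAS procedures.

```python
import statistics


def scale_estimates(x):
    """Unadjusted IQR and MAD of a sample."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return {"IQR": q3 - q1, "MAD": mad}


# Hypothetical rare-event indicators: most units report zero events.
mostly_zero = [0.0] * 60 + [1.0, 2.0, 3.0, 4.0] * 10  # 60% zeros
nearly_all_zero = [0.0] * 80 + [1.0, 2.0] * 10         # 80% zeros

print(scale_estimates(mostly_zero))      # MAD is 0 while IQR is still positive
print(scale_estimates(nearly_all_zero))  # both IQR and MAD are 0
```

Once a scale estimate is zero, the scaled distance of every nonzero observation is undefined, so no critical value can be applied.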
The primary goal of initial input data screening for a large database is to achieve high data quality with less time and effort. It can be argued that, without constraints on time and effort, one can always achieve higher quality by duplicating data entries, double checking every observation, or relaxing the significance level for the false alarm rate with any method. Wisnowski et al. reported, for example, that the detection capability was increased by raising the significance level α from 0.05 to 0.20 without severe impact on false alarm probabilities for randomly scattered outliers in the interior of the X-space. While this may be true for small data sets with low contamination and feasible for a limited number of variables, a key question for research based on extensive data is how to balance data quality control against constraints on time and resources. At NDNQI, we strive to deliver quarterly reports to member hospitals within three weeks after quarterly data entry closes. Unlike residuals from regression analysis, on the other hand, the raw data handled in most statistical data screening for quality control may have a distribution that is anything but normal. Overestimating the false alarm rates for potential outliers could dramatically reduce efficiency and add extra burden for data entry at hospital sites and database management at NDNQI administration. Instead of FAST-MCD, the IQR or MAD approach can be used to maintain the targeted significance level for the potential outlier check without a substantial loss in sensitivity to true outliers or a dramatic increase in the false alarm rate.
Note that the critical-value-based approach we currently use may not be optimal considering the number of univariate measures checked for outliers, as recent literature suggests that a data-dependent choice of critical value for the FAST-MCD approach can achieve full efficiency and control the false alarm rates.
Real case application with 2007 NDNQI 4th quarter data indicated that as many as 20% more observations needed to be checked with FAST-MCD (6 times more) than with IQR or MAD to screen the same sets of true outliers (Table 2). However, erratic behavior can be expected with the MAD approach (Figure 2), in some cases worse than FAST-MCD (e.g., Assault Rate).
Most statistics for detecting outliers suffer from the masking effect, a result of inflation in scale estimates when multiple outliers are present. FAST-MCD avoids masking by assigning zero weight to every outlier, while IQR and MAD are generally robust to this effect because they use order statistics. However, neither the IQR nor the MAD approach should be regarded as free from distributional effects, because using order statistics for estimating scale does not change the fact that extreme observations still lead to biased estimates of location. As a result, neither the IQR nor the MAD approach can avoid masking and swamping effects for data with a high rate of contamination. For example, if m contaminated true outliers hide among n total observations, the properties of IQR and MAD may depend on the scale and proportion of the m outliers, since the order statistics may shift from the (n-m) uncontaminated observations to one of the m outliers if the target population is highly contaminated.
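A small numerical illustration of masking, under assumed data: twenty regular values and five identical outliers (20% contamination). The classical mean and standard deviation are inflated by the outliers to the point that none is flagged, whereas the MAD-based scale, built on order statistics, still flags all five. The data, the 2.24 cutoff applied to both rules, and the function names are choices made for this sketch only.

```python
import statistics


def classical_scores(x):
    """Distance from the mean in units of the sample standard deviation."""
    mean, sd = statistics.fmean(x), statistics.stdev(x)
    return [abs(v - mean) / sd for v in x]


def mad_scores(x, divisor=0.666):
    """Distance from the median in units of MAD divided by its constant."""
    med = statistics.median(x)
    scale = statistics.median([abs(v - med) for v in x]) / divisor
    return [abs(v - med) / scale for v in x]


data = [float(v) for v in range(1, 21)] + [100.0] * 5  # five planted outliers
critical = 2.24

masked = [s >= critical for s in classical_scores(data)]
robust = [s >= critical for s in mad_scores(data)]
print(sum(masked), sum(robust))  # prints "0 5"
```

If the planted share grew beyond half of the sample, the median and MAD would themselves be determined by the outliers, which is the high-contamination caveat raised above.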
Data transformation provides a powerful tool for developing a parsimonious model when the distribution of the variable of interest deviates from normal. Applying the FAST-MCD approach on a transformed scale can be useful for detecting potential outliers without inflating the false alarm rate, but it is beyond the scope of this paper. In multivariate analysis, the FAST-MCD approach remains the most popular and feasible choice for outlier checks on data in multiple dimensions, but how asymmetry in the data distribution affects robustness in the multivariate case needs further investigation.
This research was conducted under contract from the American Nurses Association (ANA). Dr. Nancy Dunton is the principal investigator.
- Dunton N, Gajewski B, Klaus S, Pierson B: The relationship of nursing workforce characteristics to patient outcomes. Online J Nursing Issues. 2007, 12 (3).
- Rousseeuw PJ, Van Driessen K: A fast algorithm for the minimum covariance determinant estimator. Technometrics. 1999, 41: 212-223.
- Dunton N, Miller P: Report on the 2008 NDNQI® customer satisfaction survey. Prepared for the American Nurses Association. National Database on Nursing Quality Indicators. 2008, University of Kansas: School of Nursing
- Gajewski BJ, Mahnken JD, Dunton N: Improving quality indicator report cards through Bayesian modelling. BMC Med Res Methodology. 2008, 8: 77. 10.1186/1471-2288-8-77.
- Hou Q, Mahnken JD, Gajewski BJ, Dunton N: The Box-Cox power transformation on nursing sensitive indicators: does it matter if structural effects are omitted during the estimation of the transformation parameter?. BMC Med Res Methodology. 2011, 11: 118. 10.1186/1471-2288-11-118.
- Hampel F, Rousseeuw P, Stahel W: Robust statistics: the approach based on influence curves. 1986, New York: Wiley
- Gajewski BJ: Robust multivariate estimation and variable selection in transportation and environmental engineering. 2000, Texas A & M University: Ph.D. dissertation
- Gervini D, Yohai VJ: A class of robust and fully efficient regression estimators. Ann Stat. 2002, 30 (2): 258-616.
- Gajewski BJ, Spiegelman HC: Correspondence estimation of the source profiles in receptor modeling. Environmetrics. 2004, 15: 613-634. 10.1002/env.654.
- She Y, Owen AB: Outlier detection using nonconvex penalized regression. J Am Stat Assoc. 2011, 106 (494): 626-639. 10.1198/jasa.2011.tm10390.
- Billor N, Kiral G: Comparison of multiple outlier detection methods for regression data. Commun Stat Simul Comput. 2008, 37 (3): 521-545.
- Larsen WA, McCleary SJ: The use of partial residual plots in regression analysis. Technometrics. 1972, 14: 781-790. 10.1080/00401706.1972.10488966.
- Cook RD: Influential observations in linear regression. J Am Stat Assoc. 1979, 74: 169-174. 10.1080/01621459.1979.10481634.
- Atkinson AC: Plots, Transformations, and Regression. 1985, New York: Oxford University Press
- Bacon-Shone J, Fung WK: A new graphical method for detecting single and multiple outliers in univariate and multivariate data. Appl Stat. 1987, 36 (2): 153-162. 10.2307/2347547.
- Garrett RG: The Chi-square plot: a tool for multivariate outlier recognition. J Geochem Explor. 1989, 84: 116-144.
- Swallow WH, Kianifard F: Using robust scale estimates in detecting multiple outliers in linear regression. Biometrics. 1996, 52: 545-556. 10.2307/2532894.
- Filzmoser P, Reimann C, Garrett RG: Multivariate outlier detection in exploration geochemistry. Technical report TS 03–5, Department of Statistics. 2003, Austria: Vienna University of Technology
- Billor N, Chatterjee S, Hadi AS: A re-weighted least squares method for robust regression estimation. Am J Math Manag Sci. 2007, 26: 229-252.
- Maronna RA: Robust M-estimators of multivariate location and scatter. Ann Stat. 1976, 4: 51-67. 10.1214/aos/1176343347.
- Davies PL: Asymptotic behavior of S-estimators of multivariate location parameters and dispersion matrices. Ann Stat. 1987, 15: 1269-1292. 10.1214/aos/1176350505.
- Woodruff DL, Rocke DM: Computable robust estimation of multivariate location and shape in high dimension using compound estimators. J Am Stat Assoc. 1994, 89: 888-896. 10.1080/01621459.1994.10476821.
- Barnett V, Lewis T: Outliers in statistical data. 1978, New York: John Wiley
- Ryan PT: Statistical methods for quality improvement. 1989, New York: John Wiley
- Draper NR, Smith H: Applied regression analysis. 1996, New York: John Wiley, 2
- Montalvo I, Dunton N: Transforming Nursing Data Into Quality Care: Profiles of Quality Improvement in U.S. Healthcare Facilities. 2000, USA: Healthcare Facilities
- Kianifard F, Swallow WH: A Monte Carlo comparison of five procedures for identifying outliers in linear regression. Commun Stat Theory Methods. 1990, 19: 1913-1938. 10.1080/03610929008830300.
- Hadi AS, Simonoff JS: A more robust outlier identifier for regression data. Bull Int Stat Inst. 1997, 281: 282-
- Sebert DM: Identifying multiple outliers and influential subsets in linear regression: A clustering approach. Department of Industrial Engineering. 1996, Arizona State University, AZ: Unpublished dissertation
- Rousseeuw PJ, van Zomeren BC: Unmasking multivariate outliers and leverage points. J Am Stat Assoc. 1990, 85: 633-639. 10.1080/01621459.1990.10474920.
- Brown DA, Durbin J, Evans JM: Techniques for testing the constancy of regression relationships over time. J R Stat Soc Ser B. 1975, 37: 149-192.
- Wisnowski JW, Montgomery DC, Simpson JR: A Comparative analysis of multiple outlier detection procedures in the linear regression model. Comput Stat Data Anal. 2001, 36 (3): 351-382. 10.1016/S0167-9473(00)00042-6.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.