Input data quality control for NDNQI national comparative statistics and quarterly reports: a contrast of three robust scale estimators for multiple outlier detection

Background To evaluate institutional nursing care performance in the context of national comparative statistics (benchmarks), approximately one in every three major healthcare institutions (over 1,800 hospitals) across the United States have joined the National Database for Nursing Quality Indicators® (NDNQI®). With over 18,000 hospital units contributing data for nearly 200 quantitative measures at present, reliable and efficient input data screening of all quantitative measures is critical to the integrity, validity, and on-time delivery of NDNQI reports. Methods Using Monte Carlo simulation and quantitative NDNQI indicator examples, we compared two ad-hoc methods based on robust scale estimators, the Inter Quartile Range (IQR) and the Median Absolute Deviation from the Median (MAD), with the classic, theoretically based Minimum Covariance Determinant (FAST-MCD) approach for initial univariate outlier detection. Results While the theoretically based FAST-MCD used in one dimension can be sensitive and is better suited for identifying groups of outliers because of its high breakdown point, the ad-hoc IQR and MAD approaches are fast, easy to implement, and could be more robust and efficient, depending on the distributional properties of the underlying measure of interest. Conclusion With highly skewed distributions for most NDNQI indicators within a short data screening window, the FAST-MCD approach, when used on one-dimensional raw data, could produce higher false alarm rates for potential outliers than the IQR and MAD approaches at the same preset critical value, and thus overburden data quality control at both the data entry and administrative ends in our setting.


Background
To establish benchmarks and monitor nursing-sensitive quality indicators across the United States, the American Nurses Association (ANA) established the National Database for Nursing Quality Indicators® (NDNQI®) in 1998 [1]. With over 1,800 hospitals at present, NDNQI collects unit-level data online through a secured database and provides each member institution with a quarterly report containing 8-quarter trend data, along with national comparative statistics stratified by hospital staffed bed size, teaching or Magnet status, unit type, and various other characteristics of institutional preference. With dynamic input from over 18,000 hospital units, NDNQI compiles over 200 quantitative measures of nursing care structure, process, and outcomes. For input data quality control, NDNQI first conducts one-dimensional data quality checks on the quantitative measures; potential outliers are flagged at the univariate level for correction or confirmation to ensure the quality and overall validity of the national comparative statistics under the various stratifications. Detecting and evaluating valid extreme observations, on the other hand, may be just as important to participating hospitals, since such observations identify what needs to be exemplified or improved to better their services. Besides multilevel validation rules and compatibility checks during online data entry through the secured NDNQI database, an interactive statistical data screening procedure with up to three rounds of overnight univariate screening for potential outliers has been in place since the beginning of NDNQI. The statistical data screening starts immediately once a quarterly data entry deadline is reached and continues until all questionable inputs are resolved or confirmed through the hospital site coordinator, the institution's designated data manager.
At present, we rely on the theoretically based FAST-MCD approach [2] because it is readily available in most commercial statistical packages and is applicable to one-dimensional outlier detection with a high breakdown point.
With the continuous growth of NDNQI in both the number of facilities and new quantitative measures, we need to expand the initial statistical screening of input data and run the most efficient and reliable quality control possible to ensure on-time delivery of high-quality quarterly reports, one of the most frequent suggestions in the 2008 NDNQI customer satisfaction survey [3]. Currently, the NDNQI quarterly report uses Bayesian hierarchical modeling [4] and a Box-Cox transformation approach [5] for hospital report cards and NDNQI national comparative statistics once the institutional data are deemed clean or reconfirmed after initial raw data screening. Robust regression methods with multivariate outlier detection techniques are also available and have been extensively reported in the literature [6-10], though we focus this work on univariate outlier detection, as guided by our application to NDNQI processes.
Outliers are abnormal observations that do not conform to the pattern (model) suggested by the majority of the cases in a data set [11], and they can arise for different reasons. Some of them reflect superior or deficient unit-level performance in measured quality, as in the case of NDNQI, but are true observed values; others may result from miscalculation, a wrong definition, or simply typos. Many methods are available for outlier detection [2,12-19], and most of them are distance-based, relying on one kind or another of robust measure of location and scatter (scale estimator) [2,17,20-22]. Detection and examination of potential outliers are integral parts of data analysis [23-25], because the presence of outliers may alter statistics, reduce the power of a test, and even lead to incorrect conclusions. On the other hand, outliers are often of primary interest in searching for superiority, such as in biological breeding, geological exploration, and pharmaceutical research. In NDNQI, an outlier for a certain indicator could signal outstanding performance or inadequate service in nursing care, supply, and/or skill [26], which in turn could provide critical feedback to the hospital administration. Comparisons of different methods for detecting outliers have also been well reported by Kianifard and Swallow [27], Hadi and Simonoff [28], Serbert et al. [29], and most recently, Billor and Kiral [11]. Most previous works focused on residuals from a regression model in which the residuals are roughly normally distributed for the bulk of observations. The primary interest of this study, however, is to investigate the extent to which the detection capability and robustness of three different approaches, based on FAST-MCD, IQR, and MAD, are affected if the majority of the underlying population deviates from the normal assumption.
This is because a) most NDNQI indicators have skewed distributions, b) factors with structural effects are potentially large, unknown, and most likely differ from indicator to indicator, and c) we emphasize checking the validity of the raw input data.
Among the commonly used methods, the FAST-MCD approach is the most popular because it is robust, sensitive, and applicable to both univariate and multivariate outliers. The FAST-MCD approach is based on iterative estimates of multivariate location (T) and scatter (C) obtained from the h observations (out of a total of n) whose covariance matrix has the lowest determinant, with h ≥ (n + p + 1)/2 and p representing the dimension of the data. In the extreme case, the robust estimates of location and scatter could be based on a simple majority (n/2 + 1) of all observations. Once the scatter C and location T are determined, they are used in the following equation, in matrix notation, for calculating the robust distance (D) for all n data points:

$$D(x_i) = \sqrt{(x_i - T)^{\top} C^{-1} (x_i - T)}, \quad i = 1, \ldots, n \qquad (1)$$

where the squared distance is Chi-square distributed, $D^2 \sim \chi^2_p$, with p representing the number of columns of the X matrix. The outlyingness of an observation is assessed by comparing its distance D from the location T in (1) to the square root of a critical value of the $\chi^2_p$ distribution [30]. The distance is robust because all (n − h) observations that did not contribute to the covariance matrix with the lowest determinant have zero weight on T and C, and thus have no effect on the measure of D. Consequently, the robust distances for all n observations are not affected by the number (if less than (n + p + 1)/2) and magnitude of potential outliers. If a large proportion of the data are concentrated at a single lower end point, the FAST-MCD approach is more likely to fail because the robust distance cannot be calculated when C is zero. It is also possible that the remaining (n − h) observations are all declared outliers if they tend to be isolated in groups but are not necessarily separated by large distances from the h observations. As a result, the FAST-MCD approach could mislead, depending on the nature of the data distribution. In this paper, we focus on detecting outliers in the raw (also called pre-aggregated) data.
The FAST-MCD, used in a one-dimensional setting alongside the other two approaches, serves as the benchmark for comparison because the theoretically based MCD approach is sensitive to groups of outliers and has a high breakdown point. Thus T, C, D, and X (N×P) in matrix notation under the multivariate framework reduce to the scalar point estimates T, C, and D and the vector X (N×1) in the one-dimensional case.
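The one-dimensional reduction described above can be illustrated with a minimal sketch. It assumes a raw (uncorrected, unreweighted) MCD, which in one dimension can be found exactly by scanning contiguous windows of h sorted values; it is not the SAS MCD routine, which also applies consistency corrections and reweighting. The function names and the example data are hypothetical.

```python
def univariate_mcd(values):
    """Raw one-dimensional MCD: mean (T) and variance (C) of the h sorted,
    contiguous observations with the smallest variance. Needs n >= 2."""
    xs = sorted(values)
    n = len(xs)
    h = (n + 3) // 2  # smallest integer >= (n + p + 1) / 2 with p = 1
    best_t, best_c = None, None
    for start in range(n - h + 1):
        window = xs[start:start + h]
        mean = sum(window) / h
        var = sum((v - mean) ** 2 for v in window) / (h - 1)
        if best_c is None or var < best_c:
            best_t, best_c = mean, var
    return best_t, best_c


def squared_robust_distances(values):
    """Return D^2 = (x - T)^2 / C for every observation, or None when C is 0."""
    t, c = univariate_mcd(values)
    if c == 0:
        return None  # scale cannot be estimated, so nothing can be flagged
    return [(v - t) ** 2 / c for v in values]


CRITICAL_D2 = 5.02  # chi-square, 1 df, 2.5% level (value quoted in Methods)

# Hypothetical unit-level rates with one gross error.
rates = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.3, 1.7, 2.1, 9.0]
d2 = squared_robust_distances(rates)
flagged = [v for v, d in zip(rates, d2) if d > CRITICAL_D2]

# More than half of the values tied at zero: C is exactly 0 and the method fails.
mostly_zero = [0.0] * 8 + [1.0, 5.0]
degenerate = squared_robust_distances(mostly_zero)
```

Because the variance of the tightest half underestimates the scale of the whole sample, this raw version tends to flag more points than a corrected implementation would; the sketch is meant only to show how T, C, and D collapse to scalars and why a zero C leaves no usable distance.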
Besides the FAST-MCD, two well-known and easily computed robust measures of scatter, the Inter Quartile Range (IQR) and the Median Absolute Deviation from the median (MAD), were reported to be effective for detecting multiple outliers [17]. They are defined as:

$$\mathrm{IQR} = Q_3 - Q_1 \qquad (2)$$

$$\mathrm{MAD} = \operatorname{median}_i \left| x_i - \operatorname{median}_j (x_j) \right| \qquad (3)$$

where $x_i$ represents the observations, with i ranging from 1 to n, and $Q_1$ and $Q_3$ are the 25th and 75th percentiles of the $x_i$. Through a simulation study on residuals from a regression model $y_i = x_i + \varepsilon_i$, where $x_i$ and $\varepsilon_i$ are generated as uniform U(0, 15) and standard normal N(0, 1) random variables, Swallow and Kianifard [17] showed that both IQR and MAD asymptotically approach the standardized variance of 1.00 for $\varepsilon_i$ through constant divisors of 1.369, 1.363, 1.355 and 0.639, 0.658, 0.666 with sample sizes of 25, 50, and 100, respectively. They suggested adjusting IQR or MAD by one of the constant divisors to obtain a robust estimate ($\hat{\sigma}$) of σ, and declaring an observation an outlier if $|e_i/\hat{\sigma}|$ is greater than or equal to a preselected critical value of the standard normal distribution N(0, 1) (1.96 for the 5% or 2.58 for the 1% significance level). They proposed a stepwise strategy for testing the null hypothesis that the jth (j = p + 1, ..., n) observation is not an outlier. After fitting the regression model, the p observations with the smallest absolute studentized residuals were used for computing the n − p recursive residuals ($w_j$) as defined by Brown, Durbin, and Evans [31]. The largest of the test statistics $|w_j/\hat{\sigma}|$ is compared to a critical value, and the no-outliers hypothesis is rejected when the test statistic is greater than or equal to the preselected critical value. The procedure is repeated, removing the flagged observation from the computation each time, until the no-outliers hypothesis can no longer be rejected.
Swallow and Kianifard concluded that using ordinary least squares residuals, studentized residuals, or recursive residuals has little effect on the critical values for testing the no-outliers hypothesis at the 0.1, 0.05, or 0.01 significance levels with either IQR or MAD as the scale estimate. We chose IQR/1.355 or MAD/0.666 as the robust estimate of scale because both the simulated and the NDNQI example data sets used in this study are large.
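As a rough illustration of how the two adjusted scale estimates could be applied to raw data, the sketch below computes IQR/1.355 and MAD/0.666 and flags values whose standardized distance meets a cut-off. Centering on the median, the cut-off default of 2.24, the quartile interpolation rule of Python's `statistics.quantiles`, and all function names and data are assumptions made for this example, not a description of the authors' SAS macro.

```python
import statistics

IQR_DIVISOR = 1.355  # large-sample divisor chosen in the text
MAD_DIVISOR = 0.666  # large-sample divisor chosen in the text
CRITICAL_Z = 2.24    # assumed two-sided cut-off for this sketch


def iqr_scale(values):
    """Robust scale estimate: interquartile range divided by 1.355."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / IQR_DIVISOR


def mad_scale(values):
    """Robust scale estimate: median absolute deviation divided by 0.666."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return mad / MAD_DIVISOR


def flag_outliers(values, scale, critical=CRITICAL_Z):
    """Return values whose distance from the median (an assumed center)
    is at least `critical` robust scale units."""
    if scale == 0:
        return []  # scale could not be estimated; nothing can be flagged
    med = statistics.median(values)
    return [v for v in values if abs(v - med) / scale >= critical]


# Hypothetical nursing hours with one implausible entry.
hours = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 60]
flagged_by_iqr = flag_outliers(hours, iqr_scale(hours))
flagged_by_mad = flag_outliers(hours, mad_scale(hours))
```

In this small example both estimators single out the value 60 and leave the remaining entries unflagged, since the quartiles and the median deviation are barely moved by one extreme value.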

Methods
The cleaned NDNQI 3rd quarter data from 2007 were used to explore the distributional properties of the indicators and how the data distribution affects the robustness and false alarm rate of the three scale estimators.  24 Hours, because these measures represent the wide range of data distributions among all indicators. For each of the 7 selected measures, the critical value for FAST-MCD was set at 5.02 for the squared robust distance, corresponding to the 2.5% significance level of the $\chi^2$ distribution with 1 degree of freedom. The critical value for the IQR and MAD approaches was 2.24, corresponding to the 1.25% lower and upper percentiles for a two-sided test with the standard normal distribution. In each case, around 2.5% of the observations were targeted for recheck. We considered it necessary to keep the critical value at the 2.5% level given the NDNQI commitment to data integrity and quality, the dimension of the data to be screened, the number of hospitals involved, and the available data management resources.
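The two critical values are consistent with each other, which can be checked with a short sketch using only the Python standard library. It relies on the fact that the square of a standard normal variable follows a Chi-square distribution with 1 degree of freedom, so the upper 2.5% point of that distribution is the square of the two-sided 2.5% normal cut-off. The variable names are illustrative.

```python
from statistics import NormalDist

TARGET = 0.025  # overall share of observations targeted for recheck

# Two-sided test: 1.25% in each tail of the standard normal distribution.
z_critical = NormalDist().inv_cdf(1 - TARGET / 2)

# The square of a standard normal variable is chi-square with 1 df, so the
# upper 2.5% chi-square cut-off equals the squared normal cut-off.
chi2_critical = z_critical ** 2

print(round(z_critical, 2), round(chi2_critical, 2))  # 2.24 5.02
```

Under exact normality the three methods would therefore apply the same nominal 2.5% threshold; the differences reported later arise from how each method estimates location and scale.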
A close look at all indicators revealed that their distributions are highly skewed to the right, and that a Gamma distribution with different shape and scale parameters would provide the best fit for each. Therefore, we performed a simulation study by generating Gamma random variables X ~ Γ(α, β), using the SAS® RANGAM function [33] with various scale (β) and shape (α) parameters. The pairs of β and α were selected such that the skewness (γ) of X ranged from around 0 (close to normal) to 4 (heavily skewed to the right), but the mean of X remained the same. The SAS MCD CALL routine was used for calculating the robust distance, while the interquartile range in (2) and the median absolute deviation from the median in (3), along with the skewness and other descriptive statistics, were obtained with the SAS UNIVARIATE procedure. A SAS macro program was written to identify potential outliers and to combine and compare results from the three methods.
To contrast the ability of each method to identify true outliers, we adjusted the Monte Carlo simulation such that 10 observations (1%) were planted at random as known outliers in each generated data set, along with the remaining 990 data points (99%) at the various levels of asymmetry described above.
For the real-case application, we computed a few NDNQI indicators both before and after data cleaning, using 2007 NDNQI 4th quarter data, and then checked each indicator for potential outliers to compare the sensitivity and efficiency of the three approaches.

NDNQI quarterly report data in 2007
If the FAST-MCD, IQR, and MAD approaches were equally robust and efficient for the NDNQI 2007 3rd quarter data, we should expect around 2.5% of reporting units for each indicator to be identified for recheck or validation by hospital site coordinators. In this case, the false alarm rate was 2.5%, since all questionable observations had been rechecked and deemed clean. Unfortunately, all three methods exceeded the target for Total Falls per 1,000 Patient Days, but their differences were within 2% when the indicator's distribution was neither too skewed (γ = 1.772) nor too concentrated at the lower end (Table 1). The rate of overestimation rose with increasing skewness, especially with the FAST-MCD approach, as shown by Injury Falls per 1,000 Patient Days, Percent of Surveyed Patients with Hospital Acquired Pressure Ulcers, and Total Nursing Hours per Patient Day. As the data skewed more to the right, as with Percent of PIV Sites with Vesicant Solution, FAST-MCD classified over 10% more units as potential outliers than the IQR and MAD methods. In extreme cases, the inflated false alarm rate of the FAST-MCD approach reached as high as 20% for Percent of Registered Nurses, and up to 30% for Percent of Surveyed Patients with Hospital Acquired Pressure Ulcers, compared to the rates identified by the IQR and MAD methods. Among the three approaches, the IQR was the most consistent in maintaining the preset 2.5% false alarm target across a wide range of asymmetry in data distribution, followed by the MAD approach as long as the data are not heavily skewed (γ < 3). Furthermore, both the MAD and FAST-MCD approaches are susceptible to failure when the data are heavily concentrated at the lower end of the distribution, even if the skewness is relatively low (γ < 2), as observed with NDNQI Prior Risk Assessment for Pressure Ulcer, Total Nursing Hours Per Patient Day, Assisted Patient Falls Rate, and Multiple Site PIVs.
In such cases, neither FAST-MCD nor MAD is able to estimate the scale, and both therefore fail to flag any observation as a potential outlier.
After the NDNQI open data entry period closed for the 2007 4th quarter, RN Hours Per Patient Day, Total Falls Per 1,000 Patient Days, and Injury Falls Per 1,000 Patient Days (Table 2) were chosen as examples for checking potential outliers at the 2.5% targeted significance level. Although a considerably larger percentage of reporting units was flagged for each indicator by all three methods, most of the flagged reporting units were confirmed as true values (false outliers) by the corresponding hospital site coordinators after rechecking. All three methods were able to pick up nearly the same set of observations as true outliers (equally sensitive), which were corrected by site coordinators in the cleaned database (Table 2). The few additional outliers picked by FAST-MCD in Table 2 may be attributable to its higher percentage of false outliers (identified for recheck but reconfirmed as true values) rather than to its robustness and sensitivity. As Figure 1 illustrates, the five reporting units picked up by FAST-MCD but ignored by IQR and MAD for Injury Falls Per 1,000 Patient Days are not significantly different from the bulk of the remaining units. The percentage of false outliers given by FAST-MCD, however, is considerably larger than those given by IQR or MAD, suggesting that time and effort could be saved on data cleaning, both at the hospital input and NDNQI administrative ends, by using the IQR or MAD approaches.

Monte Carlo simulation
The Gamma distribution is a general type of distribution ranging from nearly symmetric normal to extremely skewed exponential distributions. The skewness of a Gamma variable can be fully described with a shape parameter. We chose Gamma random variables to imitate the highly skewed NDNQI indicators, which were constructed in order to pinpoint rare but inadequate supply of Total Nursing Care Hours Per Patient Day ( Table 1).
The SAS Gamma random number generating function RANGAM was used with different sets of seed, shape (α), and scale (β) parameters to generate a data set of 1,000 observations at each α by β combination. We let β vary from 2 to 18, and correspondingly α from 4 to 4/9, in order to maintain the same mean ($\mu = \alpha \times \beta$) of 8.00 for all data sets but with varying degrees of skewness from 0 to 4. Potential outliers for each generated data set were identified at the 1.25th and 98.75th percentile levels by all three methods. For each shape parameter α, the skewness was calculated as $\gamma = 2/\sqrt{\alpha}$. We then calculated the proportion of potential outliers for each data set by the FAST-MCD, MAD, and IQR approaches, and summarized each method by the level of γ with the mean and standard deviation of the proportion of potential outliers. The estimated skewness for each data set was obtained with the SAS UNIVARIATE procedure (Table 3). With skewness increasing from 1.00 to around 3.00, all three approaches tend to exceed the targeted 2.5% false alarm rate, but the magnitude is quite different. With the IQR approach, the overestimate ranges from 0.1% at γ = 1.00 to 5.1% at γ = 3.00, in contrast to 0.3% to 9.7% for the MAD approach and 2.8% to 30.1% for the FAST-MCD approach. This indicates that 1) the FAST-MCD, IQR, and MAD methods are within the range of natural variation around the 2.5% target and approach each other only when the data are approximately normally distributed (γ = 0); and 2) FAST-MCD could inflate the false alarm rate to as high as 30%, in contrast to 5% for the IQR and 10% for the MAD approach, if the data are highly skewed toward the higher end (γ = 3.00). On average, the robustness to asymmetry in data distribution is ordered as IQR > MAD > FAST-MCD (Figure 2).
However, the behavior of the MAD approach is erratic when γ > 2, as reflected by quite a few cases with larger than usual proportions of potential outliers above the target (Figure 2) and by the large variation in those proportions (Table 3).
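The simulation design can be sketched in a few lines. The version below is only an approximation under stated assumptions: it uses Python's `random.gammavariate` in place of SAS RANGAM, covers only the IQR and MAD estimators, centers on the sample median, and draws a single data set per skewness level with an arbitrary seed. It is not expected to reproduce the figures in Table 3, and all names are hypothetical.

```python
import random
import statistics


def gamma_parameters(skewness, mean=8.0):
    """Shape and scale giving the requested skewness at a fixed mean."""
    alpha = 4.0 / skewness ** 2  # from skewness = 2 / sqrt(alpha)
    beta = mean / alpha          # from mean = alpha * beta
    return alpha, beta


def flag_rate(values, scale, critical=2.24):
    """Share of values at least `critical` scale units from the median."""
    med = statistics.median(values)
    return sum(abs(v - med) / scale >= critical for v in values) / len(values)


def simulate(skewness, n=1000, seed=2007):
    """Flag rates for one simulated Gamma data set (assumed design)."""
    rng = random.Random(seed)
    alpha, beta = gamma_parameters(skewness)
    data = [rng.gammavariate(alpha, beta) for _ in range(n)]
    q1, _, q3 = statistics.quantiles(data, n=4)
    med = statistics.median(data)
    mad = statistics.median([abs(v - med) for v in data])
    return {
        "IQR": flag_rate(data, (q3 - q1) / 1.355),
        "MAD": flag_rate(data, mad / 0.666),
    }


results = {g: simulate(g) for g in (1.0, 2.0, 3.0)}
for g, rates in results.items():
    print(f"skewness {g}: IQR {rates['IQR']:.3f}, MAD {rates['MAD']:.3f}")
```

Because no outliers are planted, every flagged point here is a false alarm, so the printed rates play the role of the false alarm rates discussed above. Repeating the draw over many seeds and averaging would be the natural extension toward a table of means and standard deviations.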
Table 3 False alarm rate as a function of skewness in data distribution for the IQR, MAD, or FAST-MCD approach with simulation. Mean rate of potential outliers, with standard deviation in parentheses, for 1,000 simulated data sets at each preset skewness level.

Figure 2
False alarm rate for potential outliers varies greatly among approaches if the data are highly skewed in distribution, but the rates remain close to each other if skewness is close to zero.

Outliers differ from extreme values of the same distribution. To examine the ability of each method to pick up true outliers, we inserted 1% of observations from N(60, 9/4), which differ from the remaining 99%. Again, the bulk of the data (990 out of 1,000) was generated from Gamma distributions with different shape and scale parameters (μ = 8.00). We set the skewness at 1.00, 2.00, and 3.00, corresponding to a variance ($\sigma^2$) of $16\sqrt{2}$, $8\sqrt{2}$, and $(16/3)\sqrt{2}$, respectively. At each skewness level, we compared the three methods to see whether the inserted true outliers were picked up and whether the overall proportion of potential outliers found by each method approached the targeted 5% level (Table 4). All three methods were able to identify 10 out of 10 (100%) of the planted outliers regardless of the severity of skewness in the data distribution. Extreme values, along with the planted known outliers, could be identified as false outliers at a much higher rate by the FAST-MCD or MAD approach than by the IQR approach when the bulk of the data is skewed to the right (γ > 1). The rates of false outliers reach as high as 30%, 15%, and 10% at γ = 3.00 for FAST-MCD, MAD, and IQR, respectively, and 20%, 8%, and 7% at γ = 2.00, but are barely distinguishable between MAD and IQR, and only slightly higher for the FAST-MCD approach, at γ = 1.00 (Table 4).
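The planted-outlier experiment can be mimicked with the following sketch. It assumes that the second parameter of N(60, 9/4) is a variance (standard deviation 1.5), uses Python's standard generators rather than SAS, centers on the median, and covers only the IQR and MAD estimators; the seed and all names are hypothetical.

```python
import random
import statistics


def make_sample(skewness, rng, n_bulk=990, n_planted=10, mean=8.0):
    """Gamma bulk with fixed mean plus planted outliers from N(60, 9/4)."""
    alpha = 4.0 / skewness ** 2
    beta = mean / alpha
    bulk = [rng.gammavariate(alpha, beta) for _ in range(n_bulk)]
    planted = [rng.gauss(60.0, 1.5) for _ in range(n_planted)]  # sd = sqrt(9/4)
    return bulk, planted


def robust_scales(values):
    """Median and the two adjusted scale estimates of the full sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, {"IQR": (q3 - q1) / 1.355, "MAD": mad / 0.666}


def detection_summary(skewness, seed=42, critical=2.24):
    """Per method: (planted outliers caught, share of bulk flagged)."""
    rng = random.Random(seed)
    bulk, planted = make_sample(skewness, rng)
    med, scales = robust_scales(bulk + planted)
    summary = {}
    for name, scale in scales.items():
        hits = sum(abs(v - med) / scale >= critical for v in planted)
        false_alarms = sum(abs(v - med) / scale >= critical for v in bulk)
        summary[name] = (hits, false_alarms / len(bulk))
    return summary


summaries = {g: detection_summary(g) for g in (1.0, 2.0, 3.0)}
```

Separating the count of planted outliers caught from the share of bulk observations flagged keeps sensitivity and false alarms apart, which is the distinction the comparison in Table 4 rests on.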

Conclusion and discussion
When used for one-dimensional outlier detection in raw data, the robustness and efficiency of the ad-hoc, distance-based IQR and MAD approaches, as well as of the classic, theoretically based FAST-MCD approach, depend on the skewness of the data distribution. Most previous studies focused on regression residuals with the majority of the observations being normally distributed or relatively symmetric, a precondition that makes the FAST-MCD robust (free from masking and swamping) and sensitive to the presence of multiple outliers. With Monte Carlo simulation and NDNQI examples, we demonstrated that, with skewed data and a preselected critical value, the FAST-MCD approach could be misleading by producing a false alarm rate above the targeted level. Consequently, it was less efficient than the IQR or MAD approaches, because more time and resources need to be committed to finding the true outliers among all flagged potential outliers at the same significance level. Note that a limitation of the MAD and FAST-MCD approaches lies in their application to zero-inflated data. As many NDNQI indicators reflect rare adverse events, a median value of 0 is not uncommon, causing both methods to fail. For certain indicator distributions, even the IQR method has limitations, as the 75th percentile is 0.
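The zero-inflation limitation is easy to reproduce with made-up data. In the sketch below, the two samples and the helper name are hypothetical, and the quartiles follow the default interpolation rule of Python's `statistics.quantiles`.

```python
import statistics


def iqr_and_mad(values):
    """Unadjusted IQR and MAD of a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return q3 - q1, mad


# Hypothetical rare-event rates: 60% and 80% of units report zero events.
sixty_pct_zero = [0.0] * 60 + [float(k) for k in range(1, 41)]
eighty_pct_zero = [0.0] * 80 + [float(k) for k in range(1, 21)]

iqr_60, mad_60 = iqr_and_mad(sixty_pct_zero)   # IQR > 0, MAD == 0
iqr_80, mad_80 = iqr_and_mad(eighty_pct_zero)  # IQR == 0, MAD == 0
```

With more than half of the values at zero, the median and the MAD are both zero, so no standardized distance can be formed; once three quarters or more of the values are zero, the IQR collapses as well.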
The primary goal of initial input data screening in a large database is to achieve high data quality with less time and effort. It can be argued that, without constraints on time and effort, one can always achieve higher quality by duplicating data entries, double checking every observation, or relaxing the significance level for the false alarm rate with any method. Winskowski et al. [34] reported, for example, that detection capability was increased by raising the significance level α from 0.05 to 0.20 without severe impact on false alarm probabilities for randomly scattered outliers in the interior of the X-space. While this may be true for small data sets with low contamination, and plausible for a limited number of variables, a key question for research based on extensive data is how to balance data quality control against constraints on time and resources. At NDNQI, we strive to deliver quarterly reports to member hospitals within three weeks after the quarterly data entry period closes. Unlike residuals from regression analysis, on the other hand, most statistical data screening for quality control deals with raw data whose distribution may be anything but normal. Overestimating the false alarm rate for potential outliers could dramatically reduce efficiency and add extra burden for data entry at hospital sites and for database management at NDNQI administration. Instead of FAST-MCD, the IQR or MAD approach can be used to maintain the targeted significance level for potential outlier checks without a substantial loss in sensitivity to true outliers or a dramatic increase in the false alarm rate.
Note that the critical-value based approach we currently use may not be optimal considering the number of univariate measures checked for outliers, as recent literature suggests that a data-dependent choice of critical value for the FAST-MCD approach can achieve full efficiency and control the false alarm rate [10]. The real-case application with 2007 NDNQI 4th quarter data indicated that as many as 20% more observations had to be checked with FAST-MCD (6 times more) than with IQR or MAD to screen the same sets of true outliers (Table 2). However, erratic behavior can be expected with the MAD approach (Figure 2), in some cases worse than FAST-MCD (e.g., Assault Rate).
Most statistics for detecting outliers suffer from a masking effect as a result of inflated scale estimates when multiple outliers are present. FAST-MCD avoids masking by assigning zero weight to every outlier, while IQR and MAD are generally robust to this effect because they use order statistics. However, neither the IQR nor the MAD approach should be regarded as free from distributional effects, because using order statistics for estimating scale does not change the fact that extreme observations still lead to biased estimates of location. As a result, neither the IQR nor the MAD approach can avoid masking and swamping effects for data with a high rate of contamination. For example, if m contaminated true outliers hide among n total observations, the properties of IQR and MAD may depend on the scale and proportion of the m outliers, since the order statistics may shift from one of the (n − m) uncontaminated observations to one of the m outliers if the target population is highly contaminated.
Data transformation provides a powerful tool for developing a parsimonious model when the variable of interest deviates from a normal distribution [5]. Applying the FAST-MCD approach on a transformed scale can be useful for detecting potential outliers without inflating the false alarm rate, but is beyond the scope of this paper. In multivariate analysis, the FAST-MCD approach remains the most popular and feasible method for outlier checks on data in multiple dimensions, but how asymmetry in the data distribution affects robustness in the multivariate case needs further investigation.