    Comparison of Multilevel Model Results to Randomization Test Results
CONCLUSIONS AND RECOMMENDATIONS
APPENDIX
    Further Adjustments to the Multilevel Models
    MEG: Medical Episode Group™
    Risk Adjusted Expected Episode Payments
    Summary of Episode Grouping Results
    Physician Attribution Methodology
REFERENCES

LIST OF TABLES

Table 1: Number of Medicare Claims, by Source and Year
Table 2: Five Monte Carlo Samples Matched to an MD Sample of 22 Episodes
Table 3: Overall p-values for a One-sided p-value = .0001 for Each Test
Table 4: Counts of Physicians and Episodes per Physician
Table 5: Counts of Urologists and Episodes per Urologist
Table 6: Counts of Cardiologists and Episodes per Cardiologist
Table 7: Correlation Between 2002 and 2003 Scores, Multilevel Models
Table 8: Correlation Between 2002 and 2003 Scores, Randomization Tests
Table 9: Comparison of Outlier Results: All Physicians, 2002
Table 10: Comparison of Outlier Results: All Physicians, 2003
Table 11: Summary of Grouped and Ungrouped Claims, 2002 and 2003
Table 12: Episode Exclusions, 2002
Table 13: Episode Exclusions, 2003
Table 14: Episodes Attributed to Physicians, 2002
Table 15: Episodes Attributed to Physicians, 2003

LIST OF FIGURES

Figure 1: Distribution of Payment Ratio per Episode
Figure 2: Distribution of Log(Payment Ratio) per Episode
Figure 3: Illustration of Episode-level Residuals and Physician-level Residuals
Figure 4: Example Physician Residual Plot
Figure 5: Distribution of Mean Payments from 10,000 Monte Carlo Samples for an MD
Figure 6: Distribution of Mean Log(Payments) from 10,000 Monte Carlo Samples for an MD
Figure 7: Urologist Physician Efficiency Estimates with 99.98% Confidence Limits, 2002
Figure 8: Cardiologist Physician Efficiency Estimates with 99.98% Confidence Limits, 2002
Figure 9: Physician Efficiency Scores (Residuals), 2002 versus 2003, Multilevel Models
Figure 10: Look Forward: 2002 Outliers and 2003 p-values, Multilevel Models
Figure 11: Look Backward: 2003 Outliers and 2002 p-values, Multilevel Models
Figure 12: Physician Efficiency Scores (Residuals), 2002 versus 2003, Randomization Tests
Figure 13: Look Forward: 2002 Outliers and 2003 p-values, Randomization Tests
Figure 14: Look Backward: 2002 Outliers and 2003 p-values, Randomization Tests
Figure 15: Randomization Test Efficiency vs Multilevel Model Efficiency, 2002
Figure 16: Randomization Test Efficiency vs Multilevel Model Efficiency, 2003
Figure 17: Distribution of Mean Payments from 10,000 Monte Carlo Samples for Dr. Smith

EXECUTIVE SUMMARY

This study assessed the feasibility of using an episode grouper to identify physicians with substantially higher than expected utilization in the treatment of Medicare patients. The episode grouper used in this study was Thomson's Medical Episode Grouper™ (MEG), which was commercially released in 1998. This tool combines medical claims found in administrative data into coherent and distinct episodes of treatment. These episodes describe a series of related health care services for treating each patient's spells of illness. Episodes comprise health care utilization from multiple sites of service [1]. Each episode was characterized by several factors, including patient demographics, as well as the patient's disease, stage of disease, and complexity.
Patient complexity was measured by the patient's Diagnostic Cost Group relative risk score, which is predictive of a patient's level of overall medical expenditures.

The main episode outcome was the sum of standardized payments for services contained in the episode. Payments were standardized to eliminate area wage variations and other local cost factors. For example, standardized diagnosis related group (DRG) payments were used for hospitalizations, and standardized payments based on relative value units (RVUs) were used for physician services.

The data employed in this study were provided by the Medicare Payment Advisory Commission (MedPAC). They comprised all 2002 and 2003 Medicare claims [2] for patients residing in six metropolitan statistical areas (MSAs): Boston, MA; Greenville, SC; Miami, FL; Minneapolis, MN; Orange County, CA; and Phoenix, AZ. In all, there were about 75 million claims per year, which were processed by MEG to produce episode groups. Approximately 85 percent of the claims could be grouped, representing over 96 percent of total claim payments. The final analysis file contained over 6 million Medicare episodes per year.

Each episode was attributed to a single physician based on that physician's share of the evaluation and management (E&M) payments for the episode. An episode was attributed to the physician who billed the highest percentage of total E&M payments, provided that percentage was at least 35 percent. Physicians were identified by the Unique Physician Identification Number (UPIN) provided on the claims data. Physician specialties were obtained from the Medicare physician file. In total, episodes were attributed to about 37,000 physicians per year. For our analyses we selected only physicians assigned at least 20 episodes, for a total of about 25,000 physicians per year.

Our objective was to identify physicians whose average episode payment could be considered an "outlier." For this study, we defined an outlier to be an average episode payment that exceeded the expected mean payment by at least 25 percent at the .0001 level of statistical significance. The expected mean payment was adjusted for episode and patient severity, and it corresponded to the mean for an "average" physician with the same specialty and in the same MSA as the physician under test. The choice of threshold, 25 percent above the expected mean, represented a substantial deviation from the expected mean, although it was somewhat arbitrary. The very low significance level, .0001, was selected because of the large number of tests that were conducted (one for each physician). Compared with more conventional significance levels, it reduced the overall chances that one of the many physicians under test would be erroneously labeled an outlier.

[1] Prescription drug claims can also be incorporated. However, they were not available for the present study.
[2] 2001 claims were also used to ensure that episodes beginning in 2001 and ending in 2002 would be completed.

We employed two statistical methodologies to determine outlier status. First, we employed multilevel regression models, which accounted for the correlation of episodes within physicians while estimating the physician's residual (the deviation of the physician's observed episode mean payment from his or her expected episode mean payment). Multilevel models have been widely used for provider profiling applications. Second, we employed randomization tests. Unlike multilevel models, these tests require no assumptions concerning statistical distributions. For the randomization tests, each physician's mean payment was compared to a distribution of mean payments estimated from a large number of random samples of episodes similar to those attributed to the physician under test.

Both methods yielded stable estimates of physician residuals.
Among physicians with at least 20 episodes in both years, the correlation between the physician's 2002 and 2003 residuals was 89 percent for the multilevel model and 87 percent for the randomization test [3]. The multilevel models identified a slightly higher percentage of physician outliers compared with the randomization tests (4.4% vs. 2.9% in 2002, and 4.7% vs. 3.4% in 2003).

Both methods produced "stable" outliers in the sense that outliers in 2002 tended to have small p-values (large residuals) in 2003 and, likewise, outliers in 2003 tended to have small p-values (large residuals) in 2002. The following table quantifies this stability:

Method               Year   # Outlier MDs   % of Outliers with Adjacent-Year p-value < .05
Multilevel model     2002             918   90.7
Multilevel model     2003             972   88.6
Randomization test   2002             611   93.6
Randomization test   2003             712   90.0

The multilevel method produced 918 outliers in 2002, of which 90.7 percent had a small p-value in 2003. In other words, over 90 percent of the 2002 outliers also showed evidence of being an outlier in 2003. We call this the "look forward" from 2002 to 2003. Similarly, we did a "look backward" from 2003 to 2002. The .05 significance level defining the "small" p-value for the adjacent year is justified on the grounds that we were only testing physicians for whom we already had strong evidence of being an outlier.

Approximately 10 percent of the outlier physicians in one year had p-values larger than .05 in the adjacent year, indicating that their residual in the adjacent year was not significantly 25 percent above the expected residual. These physicians could have been truly outliers in one year and truly not outliers in the adjacent year, or they could have been erroneously identified as an outlier.
In any event, the overall results are encouraging: a physician's outlier status appears to be highly persistent from year to year.

[3] Residuals for the randomization test were based on the difference between the physician's observed mean episode payment and the average of the mean payment distribution generated from random samples of similar episodes.

While we believe that this study establishes the feasibility of episode groupers for use in Medicare physician profiling, we note the following limitations:

1. Standardized payments were the basis for measuring episode resource intensity and physician "efficiency." For example, hospital payments were the same for every patient hospitalized with a given diagnosis related group. This standardization no doubt masked some true episode cost variation.
2. Each episode was attributed to the single physician who billed the highest percentage of E&M dollars (at least 35%) for that episode. For episodes involving multiple physicians, it is possible that less than full responsibility should have been accorded to that physician.
3. Risk adjustment was based on episode severity as measured by the episode's principal disease, the stage of the principal disease, and the relative risk score. Although these factors incorporated patient diagnoses and demographics, other factors might have provided further risk adjustment.
4. Physician comparisons were based only on episodes attributed to physicians within the same specialty group and within the same MSA. There might be an argument for comparing performance across a broader spectrum of specialties and geographic areas.
5. These analyses were strictly episode-based. They only compared physicians on their average episode-level resource intensity. They did not account for the frequency of episodes.
It is possible that some physicians broke up the treatment for a condition into several low-intensity episodes, while other physicians combined the treatment for a condition into a few high-intensity episodes. However, the several low-intensity episodes would need to have been widely spaced to create separate episodes using the MEG algorithms.
6. The episodes in this analysis were based on the MSA of the patient, not on the MSA of the physician. For example, all episodes for Boston physicians were based solely on patients residing in the Boston MSA. However, this excluded episodes for patients outside the Boston MSA who were treated by Boston physicians.

To partially address the second point, MedPAC has commissioned a study, currently under way, to test multiple physician attribution in place of single physician attribution. We also recommend that MedPAC repeat the analyses in the present study to address the sixth limitation. If some physicians treated a large number of patients outside their own MSA, then their estimated mean episode payment could have been biased if patients outside their MSA had different treatment patterns compared with patients in their own MSA. At a minimum, the larger sample of episodes could produce a more reliable estimate of their mean episode payments.

INTRODUCTION

The purpose of this study was to assess the usefulness of episode groupers for profiling physicians on their treatment of Medicare patients. In particular, we developed measures related to physician efficiency. We say "related to" because we could only approximately estimate efficiency with the available data.

The episode grouper used in this study was Thomson's Medical Episode Grouper™ (MEG), which was commercially released in 1998.
This tool combines medical claims found in administrative data into coherent and distinct episodes of treatment. These episodes describe a series of related health care services for treating each patient's spells of illness. Episodes can comprise outpatient, inpatient, skilled nursing facility, and home health agency utilization [4]. The grouper is described in more detail in the Appendix.

The data employed in this study were provided by the Medicare Payment Advisory Commission (MedPAC). They comprised all 2002 and 2003 Medicare claims for patients residing in six metropolitan statistical areas (MSAs): Boston, MA; Greenville, SC; Miami, FL; Minneapolis, MN; Orange County, CA; and Phoenix, AZ. The data are described in the Data section of this report. The detailed results of applying MEG to the data are contained in the Appendix.

A primary objective was to identify physicians whose average episode payment could be considered an "outlier." We employed two statistical methodologies to determine outlier status. These methods and the process for identifying outliers based on them are explained in the Methods section of this report.

First, we employed multilevel regression models, which accounted for the correlation of episodes within physicians while estimating the physician's residual (the deviation of the physician's observed episode mean payment from his or her expected episode mean payment). Multilevel models have been widely used for provider profiling applications (Dubois et al., 1987; Jencks et al., 1988; Thomas et al., 1994; Normand et al., 1995; Epstein, 1995; Schneider and Epstein, 1996; Morris and Christiansen, 1996; Goldstein and Spiegelhalter, 1996; Rice and Leyland, 1996; Normand et al., 1997; Leyland and Boddy, 1998; Marshall and Spiegelhalter, 2001).
Second, we employed randomization tests (Manly, 2007; Noreen, 1989). Unlike multilevel models, these tests require no assumptions concerning statistical distributions. For the randomization tests, each physician's mean payment was compared to a distribution of mean payments estimated from a large number of random samples of episodes similar to those attributed to the physician under test. We do not know of any previous applications of this methodology to provider profiling.

The Results section contains results for the two methods separately as well as comparisons between the methods. This section shows the distribution of the outliers and the underlying efficiency measures overall and for selected conditions. It also addresses the stability (the year-to-year persistence) of both the outliers and the efficiency measures.

The final section of this report, Conclusions and Recommendations, contains a broad assessment of the results, some important caveats to the study, and some considerations for future studies.

[4] Prescription drug claims can also be incorporated. However, they were not available for the present study.

DATA

MedPAC provided the study data, composed of all medical claims during the calendar years 2001 through 2004 for Medicare beneficiaries residing in the six study MSAs: Boston, Greenville, Miami, Minneapolis, Orange County, and Phoenix. Table 1 displays the number of claims, by year, for each claim source.
Table 1: Number of Medicare Claims, by Source and Year.

Claim Source          2001          2002          2003          2004         Total
HHA                147,523       159,901       178,903       197,674       684,001
MEDPAR             575,519       591,412       618,358       633,141     2,418,430
Physician       47,342,026    51,054,090    55,980,215    57,916,665   212,292,996
Outpatient      14,961,933    16,035,609    16,855,777    18,030,835    65,884,154
Total           63,027,001    67,841,012    73,633,253    76,778,315   281,279,581

Key Variables in the Raw Data

The following data elements, which are necessary for episode creation, were extracted from the raw data files and placed in a uniform format:

- Patient ID: a unique and encrypted patient identifier.
- UPIN: a unique physician identification number.
- Diagnosis Codes: the reconfigured claims records contained up to 11 diagnosis codes assigned using the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnosis coding system.
- Procedure Codes: Current Procedural Terminology (CPT) procedure codes, Healthcare Common Procedure Coding System (HCPCS) procedure codes, and ICD-9-CM procedure codes were extracted from the original data. Each claim record contained one procedure code.
- MSA: the patient's metropolitan statistical area.
- Standardized Payment: as described below, claims payment amounts were standardized to remove local market payment differences among episodes.
- Age: patient age, in years.
- Gender: patient gender.
- Date of Service: the date of outpatient service or the date of admission.
- Claim Number: a unique record identification number.
- Length of Stay: inpatient length of stay.

Standardized Payments

MedPAC established methods for standardizing payments for physician profiling applications with episode groupers (Medicare Payment Advisory Commission, 2006).
Briefly, for each type of claim, MedPAC standardized payments as follows:

- Hospital inpatient services: A standardized amount was created for each Diagnosis Related Group (DRG) for each year and applied to all records uniformly.
- Skilled Nursing Facility (SNF) services: SNF Medicare Provider Analysis and Review records were merged to the DataPro SNF Stay file. This information was combined with specific standardized amounts of resource utilization groups from CMS.
- Long-term care hospital services: For discharges that occurred on or after October 1, 2002, a standardized amount for each DRG was applied. For discharges prior to this date, local area wage-index adjustments were backed out of each hospital's payment, assuming local area wage indexes acted as a proxy for underlying costs.
- Rehabilitation/psychiatric hospital services: Total Medicare payments and total length of stay were calculated for each DRG; a DRG-level per diem amount was created and then multiplied by the length of stay for each record.
- Home health: The home health case-mix weight on each claim was multiplied by the base payment rate for the appropriate fiscal year.
- Physician services: The relative value unit (RVU) was determined for each record by matching the HCPCS code and modifier on the record to the physician fee schedule RVU file. The RVU was multiplied by the units of volume for each record and by the conversion factor for the appropriate year. The standardized payment was then reduced for multiple surgical procedures on the same claim and for services provided by physician assistants and assistants at surgery.
- Ambulatory Surgical Center (ASC) services: HCPCS codes were used to match records to ASC payment rate files.
Consistent with Medicare payment rules, the payment rate was reduced for multiple surgical procedures on the same claim.
- Clinical laboratory services: A record was classified as a clinical lab service if the HCPCS code for a record on the carrier file matched a HCPCS code on the clinical lab fee schedule. The standardized payment rate for each lab record is the national limitation amount (NLA) for the service.
- Anesthesia services: The base and time units were summed for each anesthesia record and multiplied by the anesthesia conversion factor for the appropriate year. Certified registered nurse anesthetists were assigned an amount that was half of the full amount, consistent with Medicare payment rules.
- Hospital outpatient services: HCPCS codes were used to match outpatient records to an outpatient prospective payment system payment rate file, and a standardized payment amount was assigned to each record.

In this study, the total payment for an episode is the total of the standardized payments for the claims contained in that episode. Throughout this report the term "payment" is shorthand for "standardized payment."

Berenson-Eggers Type of Service (BETOS) [5]

The BETOS coding system was developed primarily for analyzing the growth in Medicare expenditures. The coding system assigns every HCPCS code to a single BETOS code, which represents a readily understood clinical category. BETOS codes were added to professional and outpatient claims. BETOS codes are broadly classified under seven major categories:

1. Evaluation and Management
2. Procedures
3. Imaging
4. Tests
5. Durable Medical Equipment
6. Other
7. Exceptions/Unclassified

[5] See www.cms.hhs.gov/HCPCSReleaseCodeSets/20_BETOS.asp (last accessed 9/9/2007) for more information on BETOS categories.

The category of Evaluation and Management (E&M) played a special role in the assignment of episodes to physicians, as explained in the Appendix. We also used the BETOS categories as descriptive payment categories to "drill down" on total episode payments to better understand outpatient utilization patterns.

METHODS

The Appendix contains a description of the Medical Episode Grouper (MEG™), which was used to produce episodes for our analyses. It also explains the method we used to attribute episodes to physicians. Below, we discuss the statistical methods that were employed to identify physician outliers.

We used two approaches. First, we fit a multilevel model, which is a regression model suitable for nested data. Second, we used an approximate randomization test, which is a more transparent method that makes fewer statistical assumptions than the multilevel model. In both cases, we tested whether each physician is an outlier in terms of mean episode payments. The p-value threshold for these tests was set at a very small value to account for the large number of tests performed.

Multilevel Models

Multilevel models are often used and recommended for physician and hospital profiling applications (Dubois et al., 1987; Jencks et al., 1988; Thomas et al., 1994; Normand et al., 1995; Epstein, 1995; Schneider and Epstein, 1996; Morris and Christiansen, 1996; Goldstein and Spiegelhalter, 1996; Rice and Leyland, 1996; Normand et al., 1997; Leyland and Boddy, 1998; Marshall and Spiegelhalter, 2001). These regression models, also called hierarchical models or mixed effects models, are designed for nested or grouped data such as we have with episodes nested within physicians. Specifically, these models take into account the correlation of episodes within physicians, unlike standard regression methods that assume the observations are uncorrelated.

We analyzed the payment ratio = (observed payment) / (expected payment), calculated for each episode. This ratio is highly skewed (Figure 1), with a long right tail. Consequently, we modeled the logarithm of the payment ratio, which has a more nearly normal distribution (Figure 2), helping to satisfy one assumption of the multilevel regression models we fit.

Figure 1: Distribution of Payment Ratio per Episode. Data source: All episodes for Medicare patients from six MSAs during 2002.

Figure 2: Distribution of Log(Payment Ratio) per Episode. Data source: All episodes for Medicare patients from six MSAs during 2002.

We begin by considering a simple multilevel model. Define

$O_{ij}$ = observed payment for episode i attributed to physician j,
$E_{ij}$ = expected payment for episode i attributed to physician j.

Consider the regression model:

\[
\begin{aligned}
\ln\left(\frac{O_{ij}}{E_{ij}}\right) &= \beta_{0j} + e_{ij} \\
\beta_{0j} &= \beta_0 + u_j \\
u_j &\sim N(0, \sigma_u^2) \\
e_{ij} &\sim N(0, \sigma_e^2)
\end{aligned}
\qquad \text{(Model 1)}
\]

Model 1 assumes that the logarithm of the payment ratio is normally distributed with a specific mean for physician j, denoted by $\beta_{0j}$. The physician means are assumed to be normally distributed with an overall mean $\beta_0$, which is the mean for an "average" physician. We are interested in the value of $u_j$, the residual deviation of physician j from the overall average. If this residual is positive (negative), then the average payment per episode for physician j tends to be higher (lower) than that of the average physician. This residual forms the basis for each physician's estimated "efficiency" score. The episode-level residual for episode i treated by physician j is denoted by $e_{ij}$. In this simple version of the model, this episode-level residual is assumed to be normally distributed with a mean of zero and a constant variance, $\sigma_e^2$.
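To make Model 1 concrete, the following minimal Python sketch simulates log payment ratios for episodes nested within physicians and then estimates each physician residual by shrinking the physician's raw deviation toward zero. It is an illustration only, not the SAS PROC MIXED fit used in this study: the function names, the parameter values, the balanced design (equal episodes per physician), and the assumption that the variance components are known are all hypothetical simplifications.

```python
import math
import random
from statistics import mean

def simulate_model1(n_physicians, episodes_per_md, beta0, sigma_u, sigma_e, seed=0):
    """Simulate y_ij = ln(O_ij / E_ij) = beta0 + u_j + e_ij for each physician j."""
    rng = random.Random(seed)
    data = {}
    for j in range(n_physicians):
        u_j = rng.gauss(0.0, sigma_u)  # physician-level residual
        # Each episode adds its own episode-level residual e_ij.
        data[j] = [beta0 + u_j + rng.gauss(0.0, sigma_e)
                   for _ in range(episodes_per_md)]
    return data

def shrunken_residuals(data, sigma_u, sigma_e):
    """Estimate u_j as a shrunken deviation of the physician mean from the grand mean.

    Assumption: the variance components are treated as known; a mixed-model
    fit would estimate them from the data instead.
    """
    grand_mean = mean(y for ys in data.values() for y in ys)
    estimates = {}
    for j, ys in data.items():
        n_j = len(ys)
        # Shrinkage weight lies between 0 and 1 and grows with n_j.
        weight = sigma_u**2 / (sigma_u**2 + sigma_e**2 / n_j)
        estimates[j] = weight * (mean(ys) - grand_mean)
    return estimates

# Hypothetical parameter values chosen only for illustration.
log_ratios = simulate_model1(n_physicians=50, episodes_per_md=20,
                             beta0=0.0, sigma_u=0.2, sigma_e=1.0)
u_hat = shrunken_residuals(log_ratios, sigma_u=0.2, sigma_e=1.0)
# exp(u_hat[j]) expresses physician j's payments relative to the average physician.
relative_payment = {j: math.exp(u) for j, u in u_hat.items()}
```

Because the shrinkage weight is below one, physicians with few episodes or noisy episode payments are pulled toward the overall mean, which is one reason multilevel estimates tend to be more stable than raw physician averages.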
In Figure 3, this model is illustrated graphically for four physicians. The thin, short horizontal lines represent episode means for each of four physicians. Each physician-level residual is equal to the difference between the physician's mean and the overall mean. The overall mean is represented by the thick horizontal line through the center of the graph, labeled $\beta_0$. Each episode's residual is equal to the difference between the episode's observed response, represented by a dot, and the physician's mean. The total residual is the sum of the episode-level residual and the physician-level residual. The physician variance is represented by the spread of the physician-level means around the overall mean. The episode variance is represented by the spread of the episode-level responses around the physician means. In our regression, the response is equal to ln(Observed Payment).

The model description above is sufficient for understanding the analyses reported below. However, we actually fit a slight modification of the model designed to account for variance heterogeneity, as explained in the Appendix. SAS PROC MIXED was used to fit the multilevel models and to estimate the value of each physician residual, $u_j$, and its standard error. Subsequently, these estimates were used to identify outlier physicians: those with especially large positive residual values.

Figure 3: Illustration of Episode-level Residuals and Physician-level Residuals. (Labels in the figure: physician variance $\sigma_u^2$; MD-level residual $u_j$; episode variance $\sigma_e^2$; episode-level residual $e_{ij}$; total variance; MD mean $\beta_0 + u_j$; overall mean $\beta_0$.)

We tested whether each physician was significantly above the average by at least 25 percent. In so doing, we assumed that the physician residual variance was fairly small and that we were mainly identifying residuals that lie outside a narrow range (Ohlssen et al., 2006).
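One way to picture the test just described is as a one-sided test on the log scale, where "25 percent above average" corresponds to a residual of log(1.25). The Python sketch below assumes a normal approximation for the estimated residual and its standard error; the report does not spell out the exact computation, so the function name `outlier_test`, the example inputs, and the formulation of the null hypothesis as "residual no larger than log(1.25)" are assumptions made for illustration. The default significance level is the .0001 value used throughout this report.

```python
import math
from statistics import NormalDist

ALPHA = 0.0001               # one-sided significance level used in this study
THRESHOLD = math.log(1.25)   # 25 percent above the expected mean, on the log scale

def outlier_test(u_hat, se, threshold=THRESHOLD, alpha=ALPHA):
    """One-sided normal-approximation test that a physician residual exceeds `threshold`.

    Returns the p-value, the lower confidence limit, and the outlier flag.
    """
    z = (u_hat - threshold) / se
    p_value = 1.0 - NormalDist().cdf(z)
    # Equivalent view: lower limit of a two-sided (1 - 2*alpha) confidence
    # interval, i.e. 99.98 percent when alpha = .0001.
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    lower_limit = u_hat - z_crit * se
    return p_value, lower_limit, p_value < alpha

# Hypothetical residual estimates and standard errors, for illustration only.
p_high, lower_high, is_outlier_high = outlier_test(u_hat=0.60, se=0.05)
p_mid, lower_mid, is_outlier_mid = outlier_test(u_hat=0.30, se=0.05)
```

Under these assumptions, a physician is flagged exactly when the lower confidence limit lies above log(1.25). In the hypothetical inputs, the second physician's estimated residual of 0.30 is above log(1.25), roughly 0.223, yet is not flagged, because the evidence is not strong enough at so strict a significance level.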
Since we performed many hypothesis tests, we declared significance only when $p_j$ < 0.0001, where $p_j$ is a one-sided p-value for the null hypothesis that the residual for physician j is equal to zero versus the alternative hypothesis that the residual for physician j is greater than the mean by at least 25 percent. The threshold of 0.0001 was selected to reduce the probability of identifying a false outlier among the large number of physicians being tested.

An example residual plot is shown in Figure 4, which shows the estimated residuals and 99.98% confidence limits for 120 physicians, ranked from smallest residual (most efficient) to largest residual (least efficient). Each red confidence bar is completely above the dashed blue line and corresponds to an "outlier" physician whose residual is at least 25 percent higher than that of the average physician at the .0001 significance level.

Figure 4: Example Physician Residual Plot.

Approximate Randomization Tests

The multilevel model makes critical assumptions concerning statistical distributions and the form of the model. Using that approach, physician outliers are identified based on the physician-level residuals estimated from the model. In contrast, approximate randomization tests are nonparametric, making very few assumptions about the data (Manly, 2007; Noreen, 1989). The idea is to test whether the observed average episode payment for each physician's sample is consistent with the complete distribution of average episode payments for similar samples drawn at random from the collection of all physicians' episodes. Using this approach, physician outliers are identified based on how unlikely the physician's observed average episode payment is compared with the distribution of average episode payments for similar samples of randomly drawn episodes.
Perhaps the simplest way to understand the approximate randomization approach is through an example. In Table 2, the fifth column, labeled "MD Sample," contains the observed payment for 22 episodes attributed to an example physician, and we want to test whether the average episode payment of $1,521 makes this physician an outlier. Scanning down that column, the observed payments were $114 for the physician's first episode, $334 for the physician's second episode, and so on. The average observed payment for all 22 episodes was $1,521, shown at the bottom of column five. For each episode, the second column ("MEG"), third column ("Stage"), and fourth column ("RRS Group") indicate the episode's group number, stage of disease, and relative risk score group, respectively. For example, the first five episodes have MEG = 180, Stage = 1, and RRS Group = 1.

The columns labeled "Sample m" for m = 1, 2, ..., 5 contain observed payments for randomly drawn episodes from within each category of MEG, Stage, and RRS. The mean payment for each sample is shown at the bottom of the table. For example, the mean payment was $1,343 for sample 1, and it was $801 for sample 4. The mean payments for the five samples ranged from a low of $579 (sample 5) to a high of $1,375 (sample 2). These mean payments are for samples containing the same number of episodes and with the same case-mix as the subject physician's sample of episodes because the episodes in each of the five samples are matched on MEG, Stage, and RRS.

Based on these five sample means, the physician's observed mean payment of $1,521 appears high. Of course, five is too small a sample on which to judge whether the physician's mean payment is really an "outlier." Therefore, we drew 10,000 random samples of episodes and compared the physician's mean payment to the distribution of 10,000 sample mean payments.

Table 2: Five Monte Carlo Samples Matched to an MD Sample of 22 Episodes

Episode #  MEG  Stage  RRS Group  MD Sample  Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
        1  180      1          1       $114      $812       $55      $301      $655      $197
        2  180      1          1        334     1,003     1,221        66        51        84
        3  180      1          1         96     4,256        80     3,140        41        55
        4  180      1          1        151     2,135        44       704        51     1,544
        5  180      1          1         55     4,521        55        52       139        99
        6  181      1          1        141       600       120       235        87        55
        7  184      1          1      3,475       851       528     5,499     2,822       106
        8  184      1          1        623     3,562    15,141     3,363     1,123       681
        9  184      1          1      5,680       154       154     8,605     7,218     4,119
       10  192      1          1        625       168        58        51       336     2,129
       11  192      1          1        527     2,037        81       876        51     1,073
       12  192      1          1      1,188       577       168       454       279       183
       13  193      1          1        110        84        51        89        55        73
       14  331      1          1         96        55       416       263       641       799
       15  331      1          1         78       236       177       387       125       192
       16  331      1          1        171       111        66       243       456        70
       17  331      1          3        264     4,664       157        83     1,538       158
       18  331      1          3         68        92        80       171        69       225
       19  331      2          3      3,995     3,030     5,690     4,060       584       390
       20  336      1          1        625       218        41       169       172       195
       21  336      2          1     14,900       339     5,384        51       212       195
       22  339      1          1        151        51       487       139       907       120
     Mean                            $1,521    $1,343    $1,375    $1,318      $801      $579

Figure 5 shows the distribution of 10,000 sample mean payments for the example physician. About 5 percent of the 10,000 sample means exceeded this physician's observed mean payment. This percentage can be considered a p-value, the probability of observing a mean payment as large or larger from a random sample of payments for similar episodes [6]. A p-value of 0.05 is too large to reject the null hypothesis. Consequently, the physician's mean is not significantly different from the average.

Figure 5: Distribution of Mean Payments from 10,000 Monte Carlo Samples for an MD.

We actually conducted the randomization tests based on log(payments) rather than payments for two reasons. First, the log transformation reduces the influence of episode-level payment outliers.
Second, this approach is consistent with the multilevel models, which employed log(payment) as the dependent variable.<br><br> The log(payment) distribution is shown in Figure 6, corresponding to the payment distribution shown in Figure 5. The physician's observed mean log(payment) is 5.9 and the corresponding p-value is 0.14, indicating that this physician's mean log(payment) is not significantly different from the average log(payment).<br><br> [6] Technically, the p-value is calculated as (g + 1) / 10,001, where g is the number of sample means with a value greater than the physician's observed mean.<br><br> Figure 6: Distribution of Mean Log(Payments) from 10,000 Monte Carlo Samples for an MD.<br><br> Finally, rather than test whether each physician's mean was above average, we tested whether each physician's mean exceeded the expected mean by at least 25 percent. To accomplish this, we added log(1.25) to each of the 10,000 means on the log scale.<br><br> For example, in Figure 6 the entire distribution is shifted to the right by 0.223 (= log(1.25)), while the physician's average stays at 5.9. Using approximate randomization tests to identify outliers entails multiple comparisons. Therefore, just as for the multilevel modeling approach, we identified as outliers those physicians with p-values under .0001.<br><br> While we conducted the randomization tests on the logarithm of payments to identify outlier physicians, we recommend plotting the distribution of means on the dollar scale, as shown in Figure 5, for descriptive purposes. There are advantages to describing the outlier physician's payments on the original dollar scale.
In particular, it allows for an easy "drill down" on total payments by disaggregating episode payments into several service categories (e.g., inpatient payments, office payments, and imaging payments).<br><br> The 10,000 sample average payments can be calculated by type of service, and Monte Carlo distributions can be constructed for each payment category. The means for these payment categories will sum to the mean for total payments. An example will be shown later in this report.<br><br> For each physician, SAS PROC SURVEYSELECT was used to obtain 10,000 random samples stratified on MEG (disease), Stage of Disease, and Relative Risk Group, taken from the "population" of episodes in that physician's specialty and MSA. Therefore, each physician's "peer group" was considered to be physicians in the same specialty and in the same MSA. For example, for an endocrinologist in Boston, a random sample of diabetes episodes would be taken from diabetes episodes attributed to Boston endocrinologists, but not from diabetes episodes attributed to internists, general practitioners, and other specialties, and not from endocrinologists in MSAs other than Boston.<br><br> This is consistent with the multilevel model approach, wherein each regression was limited to episodes in a given MSA for physicians in a given specialty.<br><br> Identifying Physician Outliers<br><br> We are interested in identifying outlier physicians, and not merely physicians with mean payments above that of the average physician, a group that in reality probably includes about half of all physicians. Therefore, we tested whether each physician's mean payment was at least 25 percent above the average at the .0001 significance level.<br><br> For an analysis involving N physicians, the overall type I error rate (probability of identifying at least one false outlier) is approximately $1 - 0.9999^{N}$.
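As a quick arithmetic check, the overall error rate can be computed directly. The sketch below assumes the N tests are independent, which is what the approximation implies. The function name is hypothetical.

```python
def overall_type_i_error(n_physicians, alpha=0.0001):
    """Probability of at least one false outlier among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_physicians


# Representative values of N, matching those reported in Table 3.
for n in (25, 50, 100, 250, 500):
    print(n, round(overall_type_i_error(n), 4))
```

For small values of alpha × N the result is close to alpha × N itself, which is why 100 physicians give roughly a 1 percent overall error rate.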
Table 3 shows the overall p-values for representative values of N.

Table 3: Overall p-values for a One-sided p-value = .0001 for Each Test.

| Number of Physicians, N | Overall p-value |
| --- | --- |
| 25 | .0025 |
| 50 | .0050 |
| 100 | .0100 |
| 250 | .0247 |
| 500 | .0488 |

For example, in an analysis involving 100 physicians, if we identify outliers as physicians with values of p < .0001, then there is a 1 percent chance that at least one physician will be classified as an outlier who is not truly an outlier.<br><br> Likewise, in an analysis involving 500 physicians, there is a nearly 5 percent chance that at least one physician will be declared an outlier by mistake. On the other hand, setting such a low p-value increases the risk of failing to identify true outliers.<br><br> Year-to-year Stability<br><br> To measure the "stability" of the efficiency measures, we compare physicians' p-values between 2002 and 2003.<br><br> In the absence of efforts to change behavior, we expect most outlier physicians to remain outlier physicians during adjacent years. By changing practice patterns, it is certainly possible for a physician to be a true outlier during one year and not an outlier during either the previous year or the following year. It is also possible for a physician to be "unlucky" in the sense that his or her episodes tend to come from the high end of the "normal" payment distribution for one year, incorrectly causing him or her to be declared an outlier in that year.<br><br> In that case, it is unlikely that the same physician would be "unlucky" twice, because the physician's average episode payment would tend to "regress" to the mean in another year, yielding an unremarkable p-value. Still another reason for inconsistent results between years could be a larger sample of patients in one year than in another year.
A larger sample can turn a statistically insignificant difference into a statistically significant one.<br><br> To test stability, we identify physician outliers in 2002 and then look "forward" at their p-values in 2003. If the 2002 outlier physicians are "true" outliers, then most of them also should have low p-values in 2003. Likewise, we identify physician outliers in 2003 and then look "backward" at their p-values in 2002.<br><br> Again, if the 2003 outlier physicians are "true" outliers, then most of them should also have low p-values in 2002. For this purpose, we regard a p-value as "low" if it is less than .05. This threshold is somewhat arbitrary. However, it is a conventional threshold for statistical testing, and we are only considering physicians for whom we have reason to believe that their mean will substantially deviate from the overall mean because they had extremely low p-values (< .0001) in the adjacent year.<br><br> RESULTS<br><br> The Appendix contains descriptive results relating to the application of MEG to the entire database of Medicare claims provided by MedPAC. In what follows, we illustrate the outlier identification methodologies by presenting results in total (all physicians), and selected results for physicians in two specialties, urology and cardiology, for all six MSAs: Boston, Greenville, Miami, Minneapolis, Orange County, and Phoenix.<br><br> As discussed in the methods section, each physician is tested for having a high mean episode payment (one-sided test) relative to episode payments for physicians in the same MSA and in the same specialty. A physician with a p-value < .0001 is considered an outlier in all analyses.
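The look-forward and look-backward checks described under Year-to-year Stability can be sketched as follows. This is a minimal illustration rather than the code used for the report. The function name and the dictionary layout (physician identifier mapped to p-value) are assumptions, and only physicians present in both years are considered.

```python
def share_with_low_p(p_base_year, p_other_year,
                     outlier_threshold=0.0001, low_threshold=0.05):
    """Share of base-year outliers whose p-value is low in the other year.

    p_base_year, p_other_year: dicts mapping a physician id to a p-value.
    Returns None when the base year has no outliers among shared physicians.
    """
    shared = p_base_year.keys() & p_other_year.keys()
    outliers = [md for md in shared if p_base_year[md] < outlier_threshold]
    if not outliers:
        return None
    confirmed = sum(1 for md in outliers if p_other_year[md] < low_threshold)
    return confirmed / len(outliers)


# Hypothetical p-values for four physicians (illustrative numbers only).
p_2002 = {"A": 0.00001, "B": 0.00005, "C": 0.2, "D": 0.00002}
p_2003 = {"A": 0.001, "B": 0.3, "C": 0.00003}

look_forward = share_with_low_p(p_2002, p_2003)   # 2002 outliers, 2003 p-values
look_backward = share_with_low_p(p_2003, p_2002)  # 2003 outliers, 2002 p-values
```

Swapping the two arguments turns the look-forward check into the look-backward check, so one function covers both directions.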
Table 4, Table 5, and Table 6 show the total number of physicians, the number of urologists, and the number of cardiologists, respectively, for each MSA, along with information on the range of physician sample sizes (episodes per physician) for 2002 and 2003.<br><br> For our analyses, we include only physicians with at least 20 episodes. The numbers of physicians are lowest in Greenville and highest in Boston. Within each MSA, the number of physicians and the number of episodes per physician are similar between the two years. In each table, the episodes-per-physician statistics (mean, 10th percentile, median, and 90th percentile) are computed among physicians with at least 20 episodes.

Table 4: Counts of Physicians and Episodes per Physician.

| MSA | Total physicians, 2002 | Total physicians, 2003 | With at least 20 episodes, 2002 | With at least 20 episodes, 2003 | Mean, 2002 | Mean, 2003 | 10th percentile, 2002 | 10th percentile, 2003 | Median, 2002 | Median, 2003 | 90th percentile, 2002 | 90th percentile, 2003 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boston | 12,619 | 13,126 | 7,606 | 7,960 | 187 | 199 | 30 | 30 | 118 | 121 | 425 | 455 |
| Greenville | 1,958 | 2,054 | 1,558 | 1,642 | 374 | 365 | 54 | 47 | 292 | 280 | 805 | 819 |
| Miami | 4,870 | 5,104 | 3,511 | 3,653 | 231 | 229 | 32 | 32 | 140 | 138 | 538 | 521 |
| Minneapolis | 7,311 | 7,615 | 4,689 | 4,898 | 159 | 158 | 30 | 30 | 105 | 104 | 347 | 342 |
| Orange Co. | 4,763 | 4,922 | 3,216 | 3,451 | 205 | 205 | 30 | 31 | 128 | 128 | 484 | 481 |
| Phoenix | 5,943 | 6,356 | 4,027 | 4,287 | 208 | 212 | 30 | 31 | 124 | 126 | 486 | 490 |

Table 5: Counts of Urologists and Episodes per Urologist.

| MSA | Total urologists, 2002 | Total urologists, 2003 | With at least 20 episodes, 2002 | With at least 20 episodes, 2003 | Mean, 2002 | Mean, 2003 | 10th percentile, 2002 | 10th percentile, 2003 | Median, 2002 | Median, 2003 | 90th percentile, 2002 | 90th percentile, 2003 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boston | 140 | 142 | 120 | 126 | 280 | 306 | 82 | 72 | 273 | 296 | 481 | 549 |
| Greenville | 41 | 42 | 36 | 37 | 475 | 479 | 272 | 271 | 446 | 452 | 735 | 748 |
| Miami | 98 | 98 | 87 | 84 | 261 | 265 | 72 | 74 | 215 | 211 | 522 | 593 |
| Minneapolis | 78 | 80 | 65 | 63 | 266 | 264 | 50 | 103 | 256 | 244 | 490 | 486 |
| Orange Co. | 75 | 79 | 65 | 69 | 264 | 264 | 53 | 28 | 221 | 234 | 518 | 526 |
| Phoenix | 84 | 89 | 76 | 81 | 301 | 311 | 78 | 63 | 236 | 254 | 517 | 518 |

Table 6: Counts of Cardiologists and Episodes per Cardiologist.

| MSA | Total cardiologists, 2002 | Total cardiologists, 2003 | With at least 20 episodes, 2002 | With at least 20 episodes, 2003 | Mean, 2002 | Mean, 2003 | 10th percentile, 2002 | 10th percentile, 2003 | Median, 2002 | Median, 2003 | 90th percentile, 2002 | 90th percentile, 2003 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boston | 450 | 467 | 352 | 376 | 246 | 246 | 37 | 29 | 177 | 164 | 563 | 569 |
| Greenville | 61 | 63 | 53 | 54 | 398 | 399 | 142 | 201 | 413 | 400 | 567 | 555 |
| Miami | 212 | 219 | 197 | 197 | 324 | 324 | 81 | 62 | 270 | 268 | 626 | 632 |
| Minneapolis | 188 | 195 | 162 | 165 | 109 | 111 | 36 | 33 | 91 | 85 | 203 | 205 |
| Orange Co. | 165 | 168 | 140 | 149 | 335 | 326 | 112 | 55 | 294 | 281 | 578 | 581 |
| Phoenix | 222 | 245 | 186 | 200 | 280 | 286 | 59 | 50 | 234 | 232 | 538 | 615 |

The average and median numbers of episodes per physician included in the analyses are much higher in Greenville than they are in other MSAs. The variation across MSAs in episodes per physician might be an artifact of the data. Recall that the data contain only episodes from patients residing in the six MSAs and they exclude episodes from patients residing in other MSAs.<br><br> For example, Boston physicians might have fewer episodes in these data because Boston physicians might serve a larger proportion of patients outside the Boston MSA than, say, Greenville physicians serve outside the Greenville MSA. Overall and for cardiologists (but not urologists), Table 4 and Table 6 show that the average and median numbers of episodes per physician are dramatically lower in Minneapolis compared with other MSAs. This should be kept in mind for any comparisons between Minneapolis and other MSAs, especially for cardiologist episodes.<br><br> Results for Multilevel Models<br><br> We begin with some residual plots for urologists and cardiologists. Plots for other specialties are similar. We then describe results more generally.<br><br> Figure 7 displays the physician efficiency estimates for urologists in each of the six MSAs based on 2002 data.
For each plot, the horizontal axis contains the urologist ranks, ranging from 1 to the number of urologists in the MSA, where the urologists are ordered from lowest to highest residual (from most efficient to least efficient). In the body of each plot there is one vertical bar per urologist.<br><br> The middle of the bar represents the urologist's estimated residual, and the bar endpoints represent the endpoints of a 99.98% confidence interval for the urologist's residual. The blue horizontal line represents an efficiency level that is 25 percent above average. Consequently, if the urologist's bar is completely above the blue horizontal line, then the urologist's residual is significantly above the 25 percent threshold at the .0001 significance level (one-sided).<br><br> Such bars are drawn in red and belong to the "outlier" physicians. We note that we did not find any outlier urologists in either Greenville or Minneapolis. Notice that some black bars (non-outliers) are situated between red bars (outliers).<br><br> These represent physicians who have estimated residuals as large as or larger than the residuals for some outlier physicians, but with wider confidence intervals, either because they have smaller samples or because they have episodes with larger payment variances, or some combination of the two.<br><br> It is important to recognize that these graphs cannot be used to compare physicians to one another. Comparisons between physicians are invalid for at least two reasons.<br><br> First, it is not valid to conduct a hypothesis test by comparing the degree of overlap between confidence intervals shown in this version of the graph. Second, each physician has his or her own mix of episodes, which the model effectively compares against a standard based on that particular mix of episodes. As an extreme example, physician 1 might only have episodes from disease A, while physician 2 might only have episodes from disease B.<br><br> The two physicians cannot be compared directly.
However, they can each be compared to standards based on diseases A and B, respectively.<br><br> Figure 8 displays the physician efficiency estimates for cardiologists in each of the six MSAs based on 2002 data. Each MSA has more cardiologists than urologists. Thus, there are more confidence bars plotted in Figure 8 than in Figure 7. In Figure 8, the difference between the plots for Miami and Minneapolis is striking.<br><br> The percentage of cardiologist outliers (in red) is higher for Miami than for Minneapolis. This is largely due to the higher average number of episodes per cardiologist in Miami (324) compared with Minneapolis (109), shown earlier in Table 6. As a result, the Miami confidence intervals tend to be much shorter than the Minneapolis confidence intervals, in part leading to proportionately more significant residuals in Miami.<br><br> Figure 7: Urologist Physician Efficiency Estimates with 99.98% Confidence Limits, 2002.<br><br> Figure 8: Cardiologist Physician Efficiency Estimates with 99.98% Confidence Limits, 2002.<br><br> Figure 7 and Figure 8 are both based on 2002 data. They are meant only to illustrate the results of the underlying methodology. The plots for these two specialties based on 2003 data (not shown) are similar. We now turn to a comparison of the 2002 outliers with the 2003 outliers, based on physicians who were present in the data in both years.<br><br> Among physicians with at least 20 episodes in either year, about 85 percent had at least 20 episodes in both years. The percentage was higher, about 90 percent, for urologists and cardiologists. Correlations between 2002 and 2003 efficiency scores, weighted by each physician's average number of episodes per year, are shown for urologists, cardiologists, and all physicians in Table 7.<br><br> These correlations are quite high, indicating good year-to-year stability in the efficiency scores based on multilevel regressions.
Physicians with high (low) efficiency scores in 2002 also tended to have high (low) scores in 2003.

Table 7: Correlation Between 2002 and 2003 Scores, Multilevel Models.

| MSA | All MDs | Urologists | Cardiologists |
| --- | --- | --- | --- |
| Boston | 0.90 | 0.89 | 0.88 |
| Greenville | 0.91 | 0.95 | 0.88 |
| Miami | 0.88 | 0.93 | 0.88 |
| Minneapolis | 0.86 | 0.85 | 0.79 |
| Orange County | 0.89 | 0.88 | 0.87 |
| Phoenix | 0.90 | 0.86 | 0.87 |
| Total | 0.89 | 0.89 | 0.87 |

To illustrate this correlation, Figure 9 contains plots of 2002 versus 2003 efficiency scores (residuals) for all physicians who had at least 20 episodes in both years. Clearly, the 2002 and 2003 efficiency scores are highly correlated. There is one graph per MSA.<br><br> The upper right quadrant in each graph contains physicians whose efficiency score was positive in both years (higher than average payments). The lower left quadrant in each graph contains physicians whose efficiency score was negative in both years (lower than average payments). The other quadrants represent physicians whose efficiency score was positive in one year and negative in the other year.<br><br> In Figure 9, the light green circles represent physicians who were not declared outliers in either year. Although not completely evident from the graph, they represent the vast majority of physicians (the green circles are highly concentrated near the center of the graph and they are partially overwritten by red and blue circles). In both years these physicians' residuals were not significantly more than 25 percent above average at the .0001 significance level.<br><br> The red circles represent physicians who were labeled as an outlier in at least one year (p-value < .0001) and whose p-value was less than .05 in the other year. If a physician had a p-value under .0001 in either 2002 or 2003, and that same physician also had a p-value under .05 in the other year, then we have some confidence that the physician was a true outlier.
The dark blue circles represent physicians who were declared as an outlier in one year, but whose p-value was greater than .05 in the other year.<br><br> We regard these physicians as potentially "false outliers," although not necessarily. It is possible that they were truly an outlier in one year and truly not an outlier in the other year.<br><br> While Figure 9 clearly shows the high year-to-year correlation in the estimated physician residuals, it is difficult to judge the percentage of outliers because the point cloud is much more concentrated in the center than it is at the edges.<br><br> Figure 10 and Figure 11 do a better job of summarizing the frequency and stability of the outliers. Figure 10 is the "look forward" for 2002 outliers. Physicians are grouped by whether they were an outlier in 2002, and the bars in the plot show the percentage of physicians who had p-values < .05 in 2003.<br><br> The red bars correspond to physicians who were outliers in 2002. There were 918 outliers (4.4%) out of a total of 20,902 physicians with at least 20 episodes in both years. Of the 918 outliers, 833 (90.7%) had a p-value under .05 in 2003.<br><br> Figure 11 is the "look backward" for 2003 outliers. Physicians are grouped by whether they were an outlier in 2003, and the bars in the plot show the percentage of physicians who had p-values < .05 in 2002. The red bars correspond to physicians who were outliers in 2003.<br><br> There were 972 outliers (4.7%) out of a total of 20,902 physicians with at least 20 episodes in both years. Of the 972 outliers in 2003, 861 (88.6%) had a p-value under .05 in 2002. In both years, about 6.4 percent of non-outlier physicians have p-values under .05.<br><br> Nominally, we would expect about 5 percent of the non-outlier physicians to have p-values under .05.
However, it is possible that we failed to identify some outlier physicians because of the very low significance level that was required to attain outlier status, leading to somewhat more than 5 percent of physicians with p-values under .05 in the adjacent year. Figure 10 and Figure 11 indicate that approximately 90 percent of the outliers identified in one year had low p-values in the other year, demonstrating fairly strong consistency between years.<br><br> Figure 9: Physician Efficiency Scores (Residuals), 2002 versus 2003, Multilevel Models.<br><br> Figure 10: Look Forward: 2002 Outliers and 2003 p-values, Multilevel Models.<br><br> Figure 11: Look Backward: 2003 Outliers and 2002 p-values, Multilevel Models.<br><br> Results for Approximate Randomization Tests<br><br> For the sake of comparisons, we calculated residuals for the randomization tests that were on the same scale as residuals for the multilevel models. For each physician we calculated an "efficiency" score: $\text{efficiency score} = \text{observed mean log(payment)} - \text{expected mean log(payment)}$. The observed mean log(payment) is the physician's average episode log(payment).<br><br> The expected mean log(payment) is the average of the 10,000 Monte Carlo sample log(payment) means. This is the scale on which the multilevel models were based and on which the randomization tests were conducted. P-values were calculated as described in the methods section.<br><br> The 2002 and 2003 randomization test efficiency scores are highly correlated, as shown in Table 8. These correlations are nearly all within a couple of percentage points of the corresponding multilevel correlations shown earlier in Table 7. In these tables, the largest difference is between the multilevel correlations and the Monte Carlo correlations for Minneapolis cardiologists: 0.79 versus 0.73, respectively.<br><br> In fact, of the year-to-year correlations shown, Minneapolis cardiologists have the lowest for both methods.
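The efficiency score and the shifted p-value from the methods section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the report's implementation. The function names are hypothetical, natural logarithms are assumed, and payments are assumed to be strictly positive. The inputs are a physician's episode payments and the list of Monte Carlo sample means already computed on the log scale.

```python
import math


def efficiency_score(md_payments, sample_log_means):
    """Observed mean log(payment) minus expected mean log(payment)."""
    observed = sum(math.log(p) for p in md_payments) / len(md_payments)
    expected = sum(sample_log_means) / len(sample_log_means)
    return observed - expected


def shifted_p_value(observed_log_mean, sample_log_means, threshold_ratio=1.25):
    """One-sided p-value after adding log(threshold_ratio) to each sample mean.

    Uses (g + 1) / (n + 1), where g counts shifted sample means that exceed
    the physician's observed mean log(payment).
    """
    shift = math.log(threshold_ratio)
    g = sum(1 for m in sample_log_means if m + shift > observed_log_mean)
    return (g + 1) / (len(sample_log_means) + 1)
```

A score of log(1.25), about 0.223, corresponds to a geometric mean payment 25 percent above the expected level. Setting `threshold_ratio=1.0` gives the unshifted test of whether the physician is simply above average.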
Table 8: Correlation Between 2002 and 2003 Scores, Randomization Tests.

| MSA | All MDs | Urologists | Cardiologists |
| --- | --- | --- | --- |
| Boston | 0.87 | 0.87 | 0.84 |
| Greenville | 0.89 | 0.93 | 0.88 |
| Miami | 0.86 | 0.93 | 0.87 |
| Minneapolis | 0.84 | 0.81 | 0.73 |
| Orange County | 0.84 | 0.88 | 0.82 |
| Phoenix | 0.88 | 0.84 | 0.85 |
| Total | 0.87 | 0.88 | 0.84 |

Figure 12 plots 2002 versus 2003 efficiency scores for all physicians.<br><br> The symbols in this plot are defined the same way as they were for the multilevel regression results in Figure 9. The green circles represent physicians who were not declared outliers in either year. The red circles represent physicians who were declared as an outlier in at least one year (p-value < .0001) and whose p-value was less than .05 in the other year.<br><br> The blue circles represent physicians who were declared as an outlier in one year, but whose p-value was greater than .05 in the other year. Figure 13 and Figure 14 summarize the frequency and stability of the randomization test outliers. Figure 13 is the "look forward" for 2002 outliers.<br><br> Physicians are grouped by whether they were an outlier in 2002, and the bars in the plot show the percentage of physicians who had p-values < .05 in 2003. The red bars correspond to physicians who were outliers in 2002. There were 611 outliers (2.9%) out of a total of 20,911 physicians with at least 20 episodes in both years.<br><br> Of the 611 outliers, 572 (93.6%) had a p-value under .05 in 2003. Figure 14 is the "look backward" for 2003 outliers. Physicians are grouped by whether they were an outlier in 2003, and the bars in the plot show the percentage of physicians who had p-values < .05 in 2002.<br><br> The red bars correspond to physicians who were outliers in 2003. There were 712 outliers (3.4%) out of a total of 20,911 physicians with at least 20 episodes in both years.
Of the 712 outliers in 2003, 641 (90.0%) had a p-value under .05 in 2002.<br><br> In both years, about 7.9 percent of non-outlier physicians have p-values under .05. Nominally, we would expect about 5 percent of the non-outlier physicians to have p-values under .05. However, it is possible that we failed to identify some outlier physicians because of the very low significance level that was required to attain outlier status, leading to somewhat more than 5 percent of physicians with p-values under .05 in the adjacent year.<br><br> Figure 13 and Figure 14 indicate that at least 90 percent of the outliers identified in one year had low p-values in the other year, demonstrating appreciable year-to-year stability.<br><br> Figure 12: Physician Efficiency Scores (Residuals), 2002 versus 2003, Randomization Tests.<br><br> Figure 13: Look Forward: 2002 Outliers and 2003 p-values, Randomization Tests.<br><br> Figure 14: Look Backward: 2003 Outliers and 2002 p-values, Randomization Tests.<br><br> Comparison of Multilevel Model Results to Randomization Test Results<br><br> The correlation between the residuals (estimated efficiencies) for the two methods is quite high, at 0.926 for both 2002 and 2003. Figure 15 and Figure 16 show plots of the estimated efficiencies for 2002 and 2003, respectively.<br><br> The two methods are in substantial agreement: residuals that are high (low) for one method tend to be high (low) for the other method. Table 9 and Table 10 compare the outlier status of physicians between the two methods for 2002 and 2003, respectively. In both years, the randomization test identified a lower percentage of outliers (2.9% and 3.5%) compared with the multilevel model (4.3% and 4.9%).<br><br> In 2002, 550 physicians were identified as outliers by both methods, representing 86% of the randomization test outliers and 57% of the multilevel outliers.
In 2003, 679 physicians were identified as outliers by both methods, representing 84% of the randomization test outliers and 59% of the multilevel outliers. Therefore, for outlier identification, the multilevel model results corroborate the randomization test results more than the reverse: most randomization test outliers were also multilevel outliers, whereas a substantial share of multilevel outliers were not randomization test outliers.<br><br> A further strategy, which we did not explore, would be to identify as outliers only those physicians who were identified as outliers using both approaches in a given year.

Table 9: Comparison of Outlier Results: All Physicians, 2002.

| Randomization test result | Multilevel model: not outlier | Multilevel model: outlier | Total |
| --- | --- | --- | --- |
| Not outlier | 21,206 (95.3%) | 407 (1.8%) | 21,613 (97.1%) |
| Outlier | 90 (0.4%) | 550 (2.5%) | 640 (2.9%) |
| Total | 21,296 (95.7%) | 957 (4.3%) | 22,253 (100.0%) |

Table 10: Comparison of Outlier Results: All Physicians, 2003.

| Randomization test result | Multilevel model: not outlier | Multilevel model: outlier | Total |
| --- | --- | --- | --- |
| Not outlier | 22,088 (94.6%) | 466 (2.0%) | 22,554 (96.6%) |
| Outlier | 127 (0.5%) | 679 (2.9%) | 806 (3.5%) |
| Total | 22,215 (95.1%) | 1,145 (4.9%) | 23,360 (100.0%) |

Figure 15: Randomization Test Efficiency vs. Multilevel Model Efficiency, 2002.<br><br> Figure 16: Randomization Test Efficiency vs. Mult