Abstract
        This research studies 
          the endpoints of length of stay and predicted mortality status among 
          patients receiving open heart surgery at the University of Louisville 
          Hospitals in 2001. Specifically, this project uses statistical and 
          data mining techniques to investigate the relationship between patient 
          risk factors, care, and outcomes. Of 1240 records, 45 patients died 
          during their initial hospitalization, while 1195 were discharged with 
          an average length of stay of 10.5 days (standard deviation 9). Logistic 
          regression models can be shown to be insufficient to predict mortality 
          status in a sample of such disparate group sizes. Data mining techniques 
          are more effective than these commonly used models. Pharmaceutical 
          information, in addition to patient laboratory values, is critical to 
          predicting length of stay.
        Introduction
        The purpose of this 
          paper is to explore data collected from a hospital pharmacy on 1300 
          patients undergoing open heart surgery and to examine the statistical 
          techniques needed to make accurate estimates of both patient 
          mortality and length of hospital stay following surgery. Hospitals 
          are often ranked on quality using statistical models that predict 
          patient severity and mortality. Hospitals where actual mortality falls 
          below, or only slightly above, predicted mortality rank higher 
          than hospitals where actual mortality greatly exceeds the prediction. 
          Hospitals can improve their ranking either by improving quality or by 
          manipulating their outcomes in the statistical models.
        The goal of this research 
          is to use data-mining techniques on a dataset collected from a Louisville 
          area hospital both for pharmaceutical distribution and insurance reimbursement 
          to discover statistical methods that lead to more accurate predictions 
          for length of stay and patient mortality. This data set of approximately 
          1500 patients who received open-heart surgery in 2001 includes variables 
          measuring patient risk factors, pharmaceutical information, and patient 
          laboratory values.
        Background
        Mortality prediction 
          is used to evaluate a hospital's performance. Healthgrades.com, 
          for example, applies a logistic regression model to publicly available 
          Medicare billing data to determine the difference 
          between predicted and actual mortality. This difference is then used 
          to assign the hospital a grade. Such information is available for a 
          variety of procedures, including open-heart surgery. Unfortunately, 
          the billing data used to define the model are not collected and interpreted 
          uniformly among hospitals.
        Mortality status is 
          often predicted through a standard logistic regression model. Within 
          a given population of open-heart surgery patients, relatively few patients 
          die within the initial hospitalization for the procedure. Consequently, 
          disparate sample sizes result and logistic regression is a poor modeling 
          choice. For example, if group A contains 96% of the patients (all those 
          who survive) and group B contains only 4% of the patients (all those 
          who do not survive), then a function that predicts 100% survival will 
          misclassify only 4% of the sample. Any logistic regression function 
          defined will misclassify approximately the same 4%, so that while 
          the result may be statistically significant, it is practically unimportant. 
          Logistic regression is only effective when the group sizes are approximately 
          equal.
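The arithmetic behind this argument can be checked with a minimal sketch. It uses the counts reported in the abstract (1240 records, 45 deaths) and a trivial rule that predicts survival for everyone; the function names are illustrative and are not part of the original analysis.

```python
# Counts taken from the abstract: 1240 records, 45 in-hospital deaths.
n_total = 1240
n_died = 45

# Labels: 1 = died, 0 = survived.
labels = [1] * n_died + [0] * (n_total - n_died)

# A trivial "model" that predicts survival for every patient.
predictions = [0] * n_total


def accuracy(y_true, y_pred):
    """Fraction of patients whose outcome is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """Fraction of actual deaths that the model flags as deaths."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos


baseline_accuracy = accuracy(labels, predictions)
baseline_sensitivity = sensitivity(labels, predictions)
print(f"accuracy = {baseline_accuracy:.3f}, sensitivity = {baseline_sensitivity:.3f}")
```

The rule is right for about 96% of patients yet identifies none of the deaths, which is why overall accuracy alone says little about a mortality model on data this unbalanced.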
        Similarly, a patient's 
          length of hospital stay is closely related to the patient's overall health 
          status and to the severity of any complications that may result from 
          the procedure. Most potential severe complications resulting from open heart 
          surgery occur infrequently, thereby creating disparate sample sizes 
          for this information as well. Moreover, a patient's length of stay 
          is influenced by individual physician preferences, such as the method 
          of blood filtration used to prevent inflammatory injury or antibiotic 
          protocols. Such data are not commonly available in data used to rank 
          hospitals. An accurate estimate of a patient's length of 
          stay may ultimately improve the hospital's quality ranking 
          and benefit the patient and the patient's family.
        The combination of 
          a lack of uniformity and the infrequent occurrence of death leads a user of 
          such rankings to question the accuracy of the model. Improving such 
          a model will ultimately allow for more accurate comparisons between 
          hospitals and more accurate knowledge of a patient's condition during 
          hospitalization than is currently available.
        Method
        The dataset ultimately 
          used for analysis is a composite of two existing datasets. The first 
          was collected originally for insurance reimbursement by the Louisville 
          Hospitals. It contains all billing records for Medicare patients who 
          received open-heart surgery in 2001. The second was made available 
          by the Hospitals' pharmacy. It contains a listing of medications prescribed 
          during each patient's stay at the Hospitals, as well as clinical data 
          on a number of patient characteristics. Patient ID numbers were randomized 
          and all existing identifiable fields were purged so that the data were 
          HIPAA-compliant.
        The first step was 
          to merge the two datasets so that billing information could 
          be compared to clinical data. The merge did not complete 
          correctly and instead duplicated records for each patient. These duplicates 
          were removed according to patient ID when available. When patient ID 
          was not available, all other available fields were compared. The duplicate 
          information was deleted, resulting in a database of approximately 1500 
          patients.
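The de-duplication rule described above might be sketched as follows. This is only an illustration under stated assumptions: the field names (`patient_id`, `drug`, `los_days`) and the sample records are hypothetical, and the original work was not necessarily done this way.

```python
def deduplicate(records, id_field="patient_id"):
    """Keep the first record per patient ID; when the ID is missing,
    fall back to comparing every other field."""
    seen = set()
    unique = []
    for rec in records:
        pid = rec.get(id_field)
        if pid is not None:
            key = ("id", pid)
        else:
            # No ID available: two records match only if all other fields agree.
            others = sorted((k, v) for k, v in rec.items() if k != id_field)
            key = ("fields", tuple(others))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical records, invented purely to exercise the function.
records = [
    {"patient_id": 101, "drug": "insulin", "los_days": 9},
    {"patient_id": 101, "drug": "insulin", "los_days": 9},
    {"patient_id": None, "drug": "heparin", "los_days": 12},
    {"patient_id": None, "drug": "heparin", "los_days": 12},
    {"patient_id": None, "drug": "heparin", "los_days": 7},
]
unique_records = deduplicate(records)
print(len(unique_records))
```

Matching on all remaining fields is conservative: two records without an ID are treated as duplicates only when every other value agrees.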
        Available laboratory 
          values were analyzed using both linear and logistic regression to predict 
          the patient outcomes of length of hospital stay and mortality. Kernel density 
          estimation was also used to examine mortality status in more detail. 
          Finally, SAS Enterprise Miner software (SAS Institute, Inc.; Cary, NC) 
          was used to create a decision tree and neural network to predict length 
          of stay using these same laboratory values.
        Classification methods 
          such as decision trees and neural networks are not effective for classifying 
          mortality status because of the disparate sample sizes. The model shown 
          in Figure 1, constructed using SAS Enterprise Miner, gives further 
          information concerning the ability of patient laboratory values to predict 
          length of hospital stay. The diagram runs from 
          left to right, with each icon representing a unique procedure in the 
          analysis of the data. Data are entered into the model. Missing cells 
          are replaced by the variable mean. The data are then partitioned into 
          training, testing, and validation sets. Regression, decision tree, 
          and neural network models are then run. The results for all procedures 
          are given through the reporter icon.
        
        Figure 1.  Model 
          from SAS Enterprise Miner for length of stay
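The first two steps of the diagram, mean replacement followed by partitioning, can be sketched outside SAS as below. This is not the Enterprise Miner implementation. The column names, the sample values, and the 40/30/30 split are assumptions, since the source does not state the proportions used.

```python
import random


def impute_with_mean(rows, columns):
    """Replace None in each named column with that column's observed mean."""
    filled = [dict(r) for r in rows]
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        mean = sum(observed) / len(observed)
        for r in filled:
            if r[col] is None:
                r[col] = mean
    return filled


def partition(rows, fractions=(0.4, 0.3, 0.3), seed=0):
    """Shuffle rows and split them into training, validation, and test sets.
    The fractions are an assumed example, not values from the study."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_valid = int(fractions[1] * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical laboratory values; None marks a missing cell.
raw = [(110, 1.0), (None, 1.4), (150, None), (130, 0.8), (90, 1.2),
       (None, 2.1), (140, 1.1), (120, 0.9), (100, None), (160, 1.3)]
rows = [{"glucose": g, "creatinine": c} for g, c in raw]

filled = impute_with_mean(rows, ["glucose", "creatinine"])
train, valid, test = partition(filled)
print(len(train), len(valid), len(test))
```

The sketch follows the order given in the text, imputing before partitioning. Computing the means on the training portion only would be the stricter choice, because it keeps information from the held-out rows out of the training data.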
        Pharmaceutical information 
          was clustered using the SAS Text Miner tool. These clusters were ranked 
          by a pharmacist according to patient severity, by the complications 
          of congestive heart failure (CHF) and chronic obstructive pulmonary 
          disease (COPD), and finally by mortality rates. This information was 
          incorporated into a linear model to examine its effect on length of 
          stay.
        Results
        Patients undergoing 
          open heart surgery have pre-admission blood testing, and a number of 
          laboratory values are collected. Table 1 gives the relationship of patient 
          laboratory values to length of stay. It contains the coefficient estimates 
          for a linear regression model. The patient laboratory values included 
          in the model are hematocrit, white blood cell count, glucose, and creatinine. 
          Two-way interactions were included in the model. In particular, glucose 
          appears to have a significant effect (p=0.002). However, the low R² 
          value of .065 illustrates the need to include additional information 
          in the model. Adding pharmaceuticals does improve the model. Pharmaceuticals 
          the patient received while in the hospital were clustered using the 
          Text Miner feature of SAS.
        Table 1. Linear Model for 
          Length of Stay, Patient Laboratory Values

            | Variable | Estimate | Std. Error | P-Value |
            | Intercept | 21.6 | 4.33 | <.0001 |
            | Glucose | -.089 | .03 | .002 |
            | Hematocrit | -.24 | .11 | .03 |
            | Creatinine | -.51 | .13 | <.0001 |
            | White Blood Cell Count | -.33 | .15 | .03 |
            | Glucose*Hematocrit | .0014 | .0007 | .04 |
            | Glucose*Creatinine | .01 | .002 | <.0001 |
            | Glucose*White Blood Cell Count | .003 | .0009 | .002 |
            | Hematocrit*Creatinine | .006 | .003 | .04 |
            | Hematocrit*White Blood Cell Count | -.0002 | .004 | .94 |
            | Creatinine*White Blood Cell Count | .029 | .009 | .002 |
        
         
        Each cluster, shown 
          in Table 2, contains pharmaceuticals commonly prescribed together. 
          These prescription combinations were then taken to pharmacist J. D. 
          Cerrito, who created a listing of possible health conditions for which 
          the drugs in each cluster are commonly prescribed (personal communication, 
          August 2003). This pharmacist then ranked the clusters in order of 
          probable severity. Using information also contained in the dataset, 
          the clusters were arranged in order of severity for the complications 
          of CHF and COPD. Severity was determined through an ordinal ranking 
          of the percentage of patients in each cluster who suffered from the 
          given condition. Finally, the clusters were ranked according to frequency 
          of mortality. Again, the clusters were ordered, with the cluster having 
          the greatest frequency of mortality receiving the greatest severity ranking. 
          The results, shown in Table 3, illustrate the importance of this pharmaceutical 
          information in predicting a patient's length of stay. Because of its 
          influence on pharmaceutical prescriptions, diabetes is added as a control. 
          The R² value for this model is 0.12, nearly double the 0.065 of the 
          original model.
        Table 2. Description of 
          Pharmaceutical Clusters

            | Cluster ID | Frequency | Rank COPD | Rank CHF | Rank Mortality | Rank Pharmacist | Associated Diagnoses |
            | 1 | 88 | 2 | 2 | 9 | 8 | IDDM (insulin-dependent diabetes), CHF, COPD |
            | 2 | 306 | 7 | 5 | 3 | 10 | ASC (Vascular disease), ASCUD (Atherosclerotic cardiovascular disease) |
            | 3 | 42 | 3 | 1 | 7 | 7 | CHF, NDDM (non-insulin dependent diabetes) |
            | 4 | 8 | 1 | 9 | 1 | 2 | Allergy, Vertigo, COPD, HTN (Hypertension) |
            | 5 | 8 | 10 | 10 | 10 | 1 | Depression, HTN, Pain |
            | 6 | 23 | 9 | 8 | 6 | 6 | Angina, GERD (gastroesophageal reflux disease), Vertigo |
            | 7 | 34 | 8 | 7 | 8 | 4 | Hyperlipidemia, Blood clot, Depression, Smoking cessation, Infection, IDDM |
            | 8 | 158 | 6 | 4 | 5 | 3 | HTN, Angina, Blood clot |
            | 9 | 225 | 5 | 3 | 2 | 9 | Angina, Pain, Infection |
            | 10 | 348 | 4 | 6 | 4 | 5 | HTN, Pain, GERD |
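The ordinal rankings in Table 2 are derived from the percentage of patients in each cluster with a given condition. A minimal sketch of such a ranking follows. The cluster labels and rates are hypothetical, and treating rank 1 as the lowest rate is an assumption, since the source does not state the direction of the scale.

```python
def ordinal_rank(rates):
    """Map cluster id -> rank, where rank 1 is the lowest rate
    (assumed direction; the source does not specify it)."""
    ordered = sorted(rates, key=lambda cid: rates[cid])
    return {cid: position for position, cid in enumerate(ordered, start=1)}


# Hypothetical share of patients with CHF in three made-up clusters.
chf_rate = {"A": 0.10, "B": 0.35, "C": 0.22}
ranks = ordinal_rank(chf_rate)
print(ranks)
```

An ordinal rank keeps only the order of the rates, so two clusters with nearly identical rates still receive ranks a full step apart when they enter the linear model.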
        
        Table 3. Linear Model of 
          Length of Stay, Including Diabetes and Pharmaceutical Information

            | Variable | Estimate | T-Value | P-Value |
            | Intercept | 88.2 | 2.38 | .02 |
            | Diabetes | 5.1 | .93 | .35 |
            | Glucose | -.06 | -1.10 | .27 |
            | Hematocrit | -.54 | -1.68 | .09 |
            | Creatinine | 1.95 | 1.33 | .18 |
            | White Blood Cell Count | -.2 | -.31 | .76 |
            | Cluster by CHF | -7.5 | -2.24 | .03 |
            | Cluster by COPD | -7.54 | -1.69 | .09 |
            | Cluster by Mortality | -1.57 | -1.21 | .23 |
            | Cluster by Pharmacist | -6.68 | -1.67 | .09 |
            | Diabetes*Glucose | -.007 | -.55 | .58 |
            | Diabetes*Hematocrit | -.035 | -.46 | .65 |
            | Diabetes*Creatinine | -.0002 | 0 | .99 |
            | Diabetes*White Blood Cell Count | .133 | .86 | .39 |
            | Diabetes*Cluster by CHF | -.45 | -.95 | .34 |
            | Diabetes*Cluster by COPD | .11 | .28 | .78 |
            | Diabetes*Cluster by Mortality | -.56 | -1.59 | .11 |
            | Diabetes*Cluster by Pharmacist | -.03 | -.12 | .91 |
            | Glucose*Hematocrit | .001 | 1.61 | .11 |
            | Glucose*Creatinine | .013 | 5.17 | <.0001 |
            | Glucose*White Blood Cell Count | .003 | 2.63 | .009 |
            | Glucose*Cluster by CHF | .004 | .95 | .34 |
            | Glucose*Cluster by COPD | -.004 | -1.1 | .27 |
            | Glucose*Cluster by Mortality | -.002 | -.72 | .47 |
            | Glucose*Cluster by Pharmacist | -.002 | -.66 | .51 |
            | Hematocrit*Creatinine | -.0006 | -.07 | .95 |
            | Hematocrit*White Blood Cell Count | .0007 | .11 | .92 |
            | Hematocrit*Cluster by CHF | .036 | 1.3 | .19 |
            | Hematocrit*Cluster by COPD | -.01 | -.40 | .69 |
            | Hematocrit*Cluster by Mortality | .024 | 1.22 | .22 |
            | Hematocrit*Cluster by Pharmacist | .012 | .68 | .50 |
            | Creatinine*White Blood Cell Count | .0003 | .01 | .99 |
            | Creatinine*Cluster by CHF | -.08 | -1.32 | .19 |
            | Creatinine*Cluster by COPD | -.14 | -1.41 | .16 |
            | Creatinine*Cluster by Mortality | -.11 | -1.35 | .18 |
            | Creatinine*Cluster by Pharmacist | -.09 | -2.00 | .05 |
            | White Blood Cell Count*Cluster by CHF | -.02 | -.36 | .72 |
            | White Blood Cell Count*Cluster by COPD | .001 | .03 | .98 |
            | White Blood Cell Count*Cluster by Mortality | -.03 | -.69 | .49 |
            | White Blood Cell Count*Cluster by Pharmacist | -.0006 | -.02 | .99 |
            | Cluster by CHF*Cluster by COPD | .105 | .41 | .68 |
            | Cluster by CHF*Cluster by Mortality | .75 | 1.12 | .26 |
            | Cluster by CHF*Cluster by Pharmacist | .025 | .06 | .95 |
            | Cluster by COPD*Cluster by Mortality | .029 | .05 | .95 |
            | Cluster by COPD*Cluster by Pharmacist | 1.13 | 1.35 | .18 |
            | Cluster by Mortality*Cluster by Pharmacist | 0 | . | . |
        
        Figure 2 illustrates 
          the Receiver Operating Characteristic (ROC) curve generated by a logistic 
          model using the same terms listed for the linear model. An ROC curve 
          is a plot of the true positive rate (sensitivity) against the false 
          positive rate (1 - specificity). The larger the area under the curve (Az 
          value), the better the predictor. This model indicates that every increase 
          in specificity results in a corresponding decrease in sensitivity, making 
          it a poor modeling choice. Indeed, because of the disparate sample 
          sizes, logistic regression is often a poor modeling choice for this 
          type of data. A model that identifies all patients as low risk for 
          mortality will be 96% accurate because 96% of the patients survive. 
          None of the terms were significant at the 0.05 level.
        
        Figure 2. ROC 
          Curve, logistic estimate of mortality status 
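To make the ROC construction concrete, the sketch below computes the curve's points and the area under it by the trapezoidal rule. The labels and scores are invented for illustration and are unrelated to the study data; the function names are likewise illustrative.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.
    Labels are 1 for the positive class (death) and 0 otherwise."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented example: three deaths among eight patients.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
informative_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
constant_scores = [0.5] * len(labels)  # "everyone is low risk"

auc_informative = area_under_curve(roc_points(labels, informative_scores))
auc_constant = area_under_curve(roc_points(labels, constant_scores))
print(round(auc_informative, 3), round(auc_constant, 3))
```

A score that cannot separate the groups, such as the constant one here, yields the diagonal with area 0.5 however high its raw accuracy. This is why the area under the curve is more informative than accuracy on unbalanced data.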
        Kernel density estimation 
          approximates the shape of the distribution of a particular variable by taking the 
          proportion of data points that occur within a given interval and dividing 
          by the length of that interval. If a laboratory value is a good predictor 
          of mortality, there will be an observable difference in the peaks of 
          the distributions for patients who survived and patients who died. Figure 3 
          illustrates that patients who have a creatinine 
          level greater than 2 are at higher risk for death than those patients 
          with lower creatinine levels. The number 2 was chosen because it defines 
          renal failure prior to surgery. Glucose levels in Figure 4 do not appear 
          to differ between patients who survived and those who did not survive surgery. 
          The hospital has initiated a protocol to monitor and adjust glucose 
          levels for patients before, during, and after surgery. The kernel density 
          estimators indicate the success of the protocol. For other measures, 
          a slight shift can be observed. Patients with a white blood cell count 
          higher than 6 have a slightly greater mortality risk (Figure 5). 
          Patients with a hematocrit level lower than 30 also have a slightly 
          greater risk of mortality (Figure 6).
        
        Figure 3. Kernel 
          density estimation, creatinine

        Figure 4. Kernel 
          density estimation, glucose

        Figure 5. Kernel 
          density estimation, white blood cell count

        Figure 6. Kernel 
          density estimation, hematocrit
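The estimator described in the text (proportion of points in an interval, divided by the interval length) can be sketched as below, alongside the smoother Gaussian-kernel variant that statistical packages commonly plot. The creatinine values and the bandwidths are made up for illustration and are not taken from the study.

```python
import math


def naive_density(data, x, h):
    """Proportion of points within [x - h, x + h], divided by the interval length 2h."""
    count = sum(1 for v in data if abs(v - x) <= h)
    return count / (len(data) * 2 * h)


def gaussian_kde(data, x, h):
    """Smooth variant: average of Gaussian bumps of bandwidth h centred on each point."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in data) / norm


# Made-up creatinine values for two outcome groups (not study data).
survived = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.5]
died = [1.1, 1.8, 2.2, 2.4, 2.9]

density_at_1 = naive_density(survived, 1.0, 0.25)

# Compare the two groups on a common grid of creatinine values.
grid = [round(0.5 + 0.5 * i, 1) for i in range(6)]  # 0.5, 1.0, ..., 3.0
for x in grid:
    print(x, round(gaussian_kde(survived, x, 0.3), 3), round(gaussian_kde(died, x, 0.3), 3))
```

Plotting the two estimated curves on one axis gives the kind of comparison shown in Figures 3 through 6: a predictor is useful when the peaks of the two curves sit at visibly different values.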
         
        Conclusion
        Decision trees and 
          neural networks are used to classify groups. The two methods were 
          approximately equally successful in determining the patient's length 
          of stay. The average squared error of these methods in the training 
          phase is 55. Unfortunately, in the testing phase this error rose to 
          as much as 280, depending on the specific method used. As with linear regression, 
          these values could probably best be reduced by adding other measures 
          influencing patient care and complications of open heart surgery. Other 
          models must be defined to determine hospital quality. The standard linear 
          and logistic regression models fit too poorly.