
Determining the Causes of Diabetes Readmissions in Hospitals


Summary #

Unplanned readmissions are one of the most useful metrics for evaluating a hospital’s quality of care, since they often point to errors in the practitioners’ diagnosis or treatment. An unplanned readmission is one that occurs within 30 days of discharge from the previous hospital visit, and is therefore closely tied to how care was administered. Reducing unplanned readmissions thus directly improves patients’ health while also bringing substantial financial relief to health care centers. The primary goal of this data analysis is therefore to build a predictive model that identifies the factors most likely responsible for unplanned hospital readmissions.

Introduction #

Diabetes is quickly becoming one of the major causes of mortality in the developing world, driven by changing lifestyles and massive urbanization, and according to the CDC it currently affects over 10% of the US population alone. Moreover, recent studies predict that over 1.3 billion people worldwide will have diabetes by the year 2050. Millions of deaths could be prevented each year through better analytics, such as non-invasive screening, tailor-made treatment plans and the prediction of hospital readmissions.

The data set employed in this analysis is the Diabetes 130-US hospitals for years 1999-2008 data set from the UCI Machine Learning Repository, which represents 10 years of clinical care at 130 US hospitals and integrated delivery networks. It includes 101,766 entries and 50 features representing patient and hospital outcomes. The data contain attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, glycated hemoglobin (HbA1c) test result, diagnoses, number of medications, diabetic medications, and the number of outpatient, inpatient and emergency visits in the year before the hospitalization.
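The raw CSV can be loaded directly with pandas. A minimal sketch, assuming the file has been downloaded from the UCI repository as diabetic_data.csv (missing values in this data set are encoded as the literal string ?):

```python
import pandas as pd

# Missing values are stored as the literal string "?",
# so they are mapped to NaN at load time.
df = pd.read_csv("diabetic_data.csv", na_values="?")

print(df.shape)  # expected: (101766, 50)
print(df["readmitted"].value_counts(normalize=True))
```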

Exploratory Data Analysis #

To get a better feel for the data set, a few exploratory data analysis plots are displayed below. The three plots in Figure 1 highlight the percentage of null values in the data set, the percent breakdown of the medical specialty of the admitting primary physician and the percent breakdown of payer codes. Because of the high prevalence of null values in the weight, medical_specialty and payer_code features, these are dropped from further analysis.
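The left-hand panel of Figure 1 can be reproduced with a short snippet; a sketch, assuming the data frame df loaded above:

```python
# Percentage of missing values per feature, highest first.
null_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(null_pct.head())

# Features dominated by missing values are dropped outright.
df = df.drop(columns=["weight", "medical_specialty", "payer_code"])
```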

Percentage of null values in data set

Figure 1 – Percentage of null values in the data set, percent breakdown of the medical specialty of the primary physician, and percent breakdown of payer codes


As shown below (Figure 2, left), the analysis reveals a predominance of older patients above the age of 40. The middle plot shows that the majority of patients are of Caucasian origin, while the right plot displays a slightly greater presence of female than male patients.

Percentage presence for age, race and gender

Figure 2 – Percentage presence for age, race and gender


The primary research article used for this analysis suggests that the probability of readmission depends on the glycated hemoglobin, or HbA1c, measurement in the primary diagnosis (Figure 3, left), so both the HbA1c measurement and the primary diagnosis features are retained in the data analysis, even though the HbA1c measurement was performed in less than 17% of the inpatient cases.

Percentage presence for A1Cresult, primary patient diagnosis and cases for readmission

Figure 3 – Percentage presence for HbA1c, primary patient diagnosis and cases for readmission


Correlation Plot #

A correlation plot of the numerical features is shown in Figure 4. Unsurprisingly, num_medications is moderately correlated with time_in_hospital and num_procedures. However, no other notable correlation can be gleaned from the plot.
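A plot along the lines of Figure 4 can be drawn with pandas and seaborn; a minimal sketch, with the list of numeric columns spelled out by hand:

```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["time_in_hospital", "num_lab_procedures", "num_procedures",
                "num_medications", "number_diagnoses"]

# Pearson correlation matrix of the numeric features, drawn as a heat map.
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.tight_layout()
plt.show()
```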

Correlation plot of numeric features

Figure 4 – Correlation plot of the numeric features


Data Preparation and Feature Engineering #

A summary of the data preparation and feature engineering performed in the analysis is compiled in the bullet list below (a code sketch of two of these steps follows the list):

  • All of the object values in the data frame are converted to category values,
  • Rows with null values in the race column are removed, as are rows with the Unknown/Invalid value in the gender column,
  • As already mentioned, weight, medical_specialty and payer_code columns are removed due to the large presence of null values,
  • encounter_id column is removed since it isn’t relevant to the analysis,
  • Null, not admitted and not mapped values are removed from admission_type_id,
  • All variations of Expired at... or Expired in discharge_disposition_id are removed, since deceased patients cannot generate further readmissions; null, not admitted and unknown/invalid values in discharge_disposition_id are removed as well,
  • Null, not available, not mapped and unknown/invalid values are removed from admission_source_id,
  • Following the analysis conditions laid out in the main research article for this work, duplicate patient data are removed to maintain the statistical independence of the data, after which the patient_nbr column is dropped,
  • citoglipton and examide are removed since they don’t offer any discriminatory information,
  • glimepiride-pioglitazone and metformin-rosiglitazone are removed as well due to the lack of discriminatory information,
  • number_outpatient, number_emergency and number_inpatient are summed into a new column called service_use, after which the three original columns are removed,
  • The primary diag_1 values are encoded into nine major groups: circulatory, respiratory, digestive, diabetes, injury, musculoskeletal, genitourinary, neoplasms and others,
  • The secondary diag_2 and additional diag_3 are removed to simplify the data analysis,
  • The readmitted column is binarized into 0 and 1 categories, where the 0 category contains the Not readmitted and > 30 days cases and the 1 category consists of the < 30 days cases,
  • Categorical variables are encoded for all columns except for the six numerical columns: time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_diagnoses and service_use.
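Two of the steps above lend themselves to a short code sketch: the grouping of the primary diagnosis into nine categories and the binarization of the readmitted column. The ICD-9 ranges below follow the grouping described in the reference article; the helper function group_diagnosis is my own naming, and the sketch assumes the raw readmitted labels are NO, >30 and <30:

```python
import pandas as pd

def group_diagnosis(code):
    """Map a raw ICD-9 diag_1 code onto one of nine broad groups.

    Codes starting with "E" or "V" and anything unmatched fall into
    the catch-all "others" group.
    """
    if pd.isna(code) or str(code)[0] in ("E", "V"):
        return "others"
    value = float(code)
    if 390 <= value <= 459 or value == 785:
        return "circulatory"
    if 460 <= value <= 519 or value == 786:
        return "respiratory"
    if 520 <= value <= 579 or value == 787:
        return "digestive"
    if int(value) == 250:
        return "diabetes"
    if 800 <= value <= 999:
        return "injury"
    if 710 <= value <= 739:
        return "musculoskeletal"
    if 580 <= value <= 629 or value == 788:
        return "genitourinary"
    if 140 <= value <= 239:
        return "neoplasms"
    return "others"

df["primary_diag"] = df["diag_1"].apply(group_diagnosis)

# Binary target: 1 only for readmissions within 30 days.
df["readmitted"] = (df["readmitted"] == "<30").astype(int)
```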

Dealing With Unbalanced Data #

The data set is highly unbalanced with respect to readmitted versus non-readmitted (and > 30 days) cases, since readmissions make up just above 11% of all cases. To compensate for this scarcity of readmission cases, the minority class has been oversampled with replacement and added to the rest of the data set. This is accomplished with the imbalanced-learn package, part of the scikit-learn-contrib project; more about imbalanced-learn can be found at scikit-learn-contrib/imbalanced-learn. Due to the widespread presence of categorical features in the data set, the imblearn.over_sampling.RandomOverSampler class has been employed, since it is the only class in imblearn.over_sampling that can deal with categories.
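A minimal sketch of the oversampling step, assuming X and y are the prepared features and binary target; the random_state is illustrative:

```python
from imblearn.over_sampling import RandomOverSampler

X = df.drop(columns=["readmitted"])
y = df["readmitted"]

# Randomly duplicate minority-class (readmitted) rows with replacement
# until both classes contain the same number of samples.
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(y.value_counts(normalize=True))            # before: roughly 89% / 11%
print(y_resampled.value_counts(normalize=True))  # after: 50% / 50%
```

One design note: in a stricter setup the oversampler would be fitted on the training split only, so that duplicated minority rows cannot leak into the test set.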

Standardization of the Data #

The numeric features have been standardized feature by feature, by subtracting the mean of the feature and dividing by its standard deviation, using Scikit-Learn’s StandardScaler class.
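In code this is a one-liner over the column set; a sketch, assuming the resampled features from the previous section and the six numeric columns listed earlier:

```python
from sklearn.preprocessing import StandardScaler

numeric_cols = ["time_in_hospital", "num_lab_procedures", "num_procedures",
                "num_medications", "number_diagnoses", "service_use"]

# Standardize only the numeric columns: (x - mean) / std per feature.
scaler = StandardScaler()
X_resampled[numeric_cols] = scaler.fit_transform(X_resampled[numeric_cols])
```

Let’s now proceed with the modeling of the data.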

Data Modeling #

For the sake of interpretability, the data modeling makes use of three simple classification algorithms, all of which are available in Scikit-Learn: LogisticRegression, DecisionTreeClassifier and RandomForestClassifier. For each of these algorithms, Table 1 reports the accuracy, precision, recall, F-score and cross-validated average Brier score for readmitted cases.

|                           | Accuracy | Precision | Recall | F-score | Average Brier score |
|---------------------------|----------|-----------|--------|---------|---------------------|
| Random forest classifier  | 0.9085   | 0.8778    | 0.9475 | 0.9113  | 0.0821 +/- 0.0025   |
| Decision tree classifier  | 0.7874   | 0.7345    | 0.8953 | 0.8070  | 0.1631 +/- 0.0020   |
| Logistic regression       | 0.5895   | 0.5963    | 0.5351 | 0.5641  | 0.2420 +/- 0.0011   |
Table 1 – Accuracy, precision, recall, F-score and average Brier score for the random forest classifier, decision tree classifier and logistic regression algorithms
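The numbers in Table 1 come from a loop along these lines; a minimal sketch, assuming the resampled, standardized data from the previous sections and an illustrative 80/20 train/test split:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree classifier": DecisionTreeClassifier(random_state=42),
    "Random forest classifier": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary")
    # Brier score: mean squared difference between the predicted
    # probability and the actual outcome; lower is better.
    brier = -cross_val_score(model, X_train, y_train, cv=5,
                             scoring="neg_brier_score")
    print(f"{name}: acc={accuracy_score(y_test, y_pred):.4f} "
          f"prec={precision:.4f} rec={recall:.4f} f1={fscore:.4f} "
          f"brier={brier.mean():.4f} +/- {brier.std():.4f}")
```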

At first glance, the random forest classifier comes out well ahead of the other algorithms. The high F-score tells us how well the random forest classifier performs on the data set, as a high F-score reflects both high recall and high precision.

The hyperparameters of the random forest algorithm are fine-tuned using Scikit-Learn’s GridSearchCV class. The best values found for n_estimators and max_depth are 160 and 16, respectively. Given the overall good results with the random forest classifier, no further action was undertaken to improve the result.
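A sketch of the grid search; the grid values are illustrative, chosen to bracket the reported optimum:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [80, 120, 160, 200],
              "max_depth": [8, 12, 16, 20]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="f1", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)  # e.g. {'max_depth': 16, 'n_estimators': 160}
```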

Confusion Matrix Heat map Plots #

Following the initial analysis in the table and the hyperparameter fine-tuning, a visual representation of the models’ performance is displayed in the heat map plots of the confusion matrices (Figure 5). Each value in a plot is the number of predictions in that category divided by the sum of the values along its row, so the upper left-hand corner corresponds to the specificity for non-readmitted cases and the lower right-hand corner to the sensitivity for readmitted cases.

The decision tree model performs better than the logistic regression model, although quite a few cases still fall on the off-diagonal compared to the main diagonal. The random forest classifier, however, achieves the best separation between true positives, true negatives, false positives and false negatives, as shown in the rightmost plot below.
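A row-normalized confusion matrix heat map like those in Figure 5 can be produced as follows; a sketch for the tuned random forest, assuming the fitted grid object from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

best_rf = grid.best_estimator_
y_pred = best_rf.predict(X_test)

# normalize="true" divides each row by its sum, so the diagonal holds
# the specificity (top left) and the sensitivity (bottom right).
cm = confusion_matrix(y_test, y_pred, normalize="true")
sns.heatmap(cm, annot=True, fmt=".3f", cmap="Blues",
            xticklabels=["Not readmitted", "Readmitted"],
            yticklabels=["Not readmitted", "Readmitted"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()
```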

Logistic regression, decision tree classifier and random forest classifier confusion matrix plots

Figure 5 – Logistic regression, decision tree classifier and random forest classifier confusion matrix heat maps


ROC Curves And AUC #

The receiver operating characteristic (ROC) curve also shows the superior performance of the random forest classifier compared to the other two algorithms. The ROC curve plots the true positive rate against the false positive rate over a series of classification thresholds; the best curve is the one that produces the highest true positive rate at the smallest false positive rate across all thresholds. The plots also report the area under the curve (AUC) in the bottom right corner: the larger this value, the more closely the ROC curve hugs the left and top axes of the plot. By both the AUC value and the ROC curve, the random forest classifier achieves the best performance among the three algorithms used.
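ROC curves like those in Figure 6 can be drawn in a few lines; a sketch, assuming the models dictionary fitted earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots()
for name, model in models.items():
    # from_estimator calls predict_proba internally to sweep the thresholds.
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.show()
```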

Receiver operating characteristic curve for logistic regression, decision tree classifier and random forest classifier

Figure 6 – Receiver operating characteristic curve for logistic regression, decision tree classifier and random forest classifier


Learning Curves #

To check for possible model overfitting or underfitting, learning curves with one-standard-deviation error bands are calculated for all three models (Figure 7). For the classification problem at hand, accuracy is the metric used to measure model performance.
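Scikit-Learn’s learning_curve function computes both curves in one call; a sketch for the random forest, with illustrative training-set fractions:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X_train, y_train,
    cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=-1)

for scores, label in [(train_scores, "Training"), (val_scores, "Validation")]:
    mean, std = scores.mean(axis=1), scores.std(axis=1)
    plt.plot(train_sizes, mean, label=label)
    # One-standard-deviation band around the mean accuracy.
    plt.fill_between(train_sizes, mean - std, mean + std, alpha=0.2)
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```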

As can be seen in the plots below, a case can be made for underfitting in the logistic regression model, since both curves converge to similarly low accuracy values. For the decision tree model a case may be made for some overfitting, since the training curve maintains high accuracy while the validation curve stays lower, leaving a noticeable gap between the two. The same may be said of the random forest classifier model, although its accuracy is decidedly higher.

Learning curves

Figure 7 – Learning curves for logistic regression, decision tree classifier and random forest classifier


At this point the random forest model appears to be the best model, given the results of the confusion matrix heat maps, the ROC plot and AUC measurement, and the learning curves. Let’s now use this model to determine the features that are most likely to drive hospital readmission.

Feature Importances #

The normalized feature importance plot in Figure 8 highlights the features most influential for hospital readmission. The num_lab_procedures and num_medications features, and to a lesser extent discharge_disposition_id and time_in_hospital, appear to be the most helpful in determining readmission cases, which makes intuitive sense. As mentioned in the main reference article for this analysis, primary_diag also bears some relationship to the probability of readmission, though not as strong as that of the former features. The plot includes one-standard-deviation error bars, which further highlight the features related to readmission.
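A plot like Figure 8 can be built from the fitted forest’s impurity-based importances, with the spread across individual trees as the error bar; a sketch, assuming best_rf and X_train from the sections above:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
# One standard deviation of each feature's importance across the trees.
std = pd.Series(
    np.std([tree.feature_importances_ for tree in best_rf.estimators_], axis=0),
    index=X_train.columns)

order = importances.sort_values(ascending=False).index[:15]
importances[order].plot.bar(yerr=std[order])
plt.ylabel("Normalized importance")
plt.tight_layout()
plt.show()
```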

Feature importances

Figure 8 – Feature importances using the random forest classifier algorithm


Conclusions #

The challenge of this analysis was to confront the large number of categorical features and the overwhelming ratio of not-readmitted to readmitted cases in the data. The random forest classifier outperforms the other two models on all fronts. This may be due to a non-linear decision boundary in the data, which makes it difficult for the logistic regression model to perform satisfactorily but allows tree-based models such as decision trees and especially random forests to do well. The decision tree model already performs respectably, but thanks to the advantage of the ensemble technique the random forest algorithm performs even better. The model also performs very well on the validation data, which indicates that it is not overfitting the training data. Please feel free to take a look at the code in my GitHub repository.

Update (2022-12-23): For the latest data analysis, the following software packages were used: scikit-learn (version 1.2.0), pandas (version 1.5.2), matplotlib (version 3.6.2), seaborn (version 0.12.1) and imbalanced-learn (version 0.10.0).