Unplanned readmissions are a key metric when evaluating the quality of care of a hospital, as they can highlight errors in the practitioners' diagnosis or treatment. Unplanned readmissions are those that occur within 30 days of discharge from the previous hospital visit, and are closely correlated with the quality of health care administration. Consequently, decreasing unplanned readmissions directly improves patients' health and provides great financial relief to health care centers. The primary focus of this data analysis is therefore to build a predictive model that can identify the factors most likely responsible for unplanned hospital readmissions.
Diabetes is quickly becoming one of the major causes of mortality in the developing world, due to changing lifestyles and massive urbanization of the population, and currently affects over 10% of the US population according to the CDC. Moreover, recent studies predict that over 1.3 billion people worldwide will have diabetes by the year 2050. Millions of deaths could be prevented each year through better analytics, such as non-invasive screening, tailor-made treatment and the prediction of hospital readmissions.
The data set employed in this analysis is the Diabetes 130-US hospitals for years 1999-2008 data set from the UCI Machine Learning Repository, which represents 10 years of clinical care at 130 US hospitals and integrated delivery networks. It includes 101,766 entries and 50 features describing patient and hospital outcomes. The data contain attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, glycated hemoglobin (HbA1c) test result, diagnoses, number of medications, diabetic medications, and the number of outpatient, inpatient and emergency visits in the year preceding the hospitalization.
Exploratory Data Analysis #
To get a better feel for the data set, a few exploratory data analysis plots are displayed below. The three plots in Figure 1 highlight the percentage of null values in the data set, the percent breakdown of the medical specialty of the admitting primary physician, and the percent breakdown of payer codes. As a consequence of the high prevalence of null values, the `weight`, `medical_specialty` and `payer_code` features are dropped from further analysis.
As shown below (Figure 2, left), the analysis reveals a predominance of older patients above the age of 40. The middle plot shows that the majority of patients are of Caucasian origin, while the right plot displays a slightly greater presence of female than male patients.
The primary research article used for this analysis suggests that the probability of readmission is contingent on the glycated hemoglobin (HbA1c) measurement together with the primary diagnosis (Figure 3, left), so both the HbA1c measurement and the primary diagnosis features are retained in the analysis, even though the HbA1c measurement was performed in less than 17% of the inpatient cases.
Correlation Plot #
A correlation plot of the numerical features is shown in Figure 4. Unsurprisingly, one can notice that `num_medications` is moderately correlated with `num_procedures`. However, no other notable correlation can be evinced from the plot.
Data Preparation and Feature Engineering #
A summary of the data preparation and feature engineering steps performed in the analysis is compiled in the bullet list below:
- All of the `object` values in the data frame are converted to categorical values,
- All null values from the `race` category and the `Unknown/Invalid` gender subcategory are removed from the data set,
- As already mentioned, the `weight`, `medical_specialty` and `payer_code` columns are removed due to the large presence of null values,
- The `encounter_id` column is removed since it isn't relevant to the analysis,
- Null, not admitted and not mapped values are removed from the `admission_type_id` column,
- All variations of expired and hospice values of `discharge_disposition_id` are removed, since these patients cannot be responsible for further readmission cases. Null, not admitted and unknown/invalid values in `discharge_disposition_id` are removed as well,
- Null, not available, not mapped and unknown/invalid values are removed from the `admission_source_id` column,
- Following the analysis conditions laid out in the main research article for this work, duplicate patient data are removed to maintain the statistical independence of the data, after which the `patient_nbr` column is dropped,
- Columns such as `examide` are removed since they don't offer any discriminatory information,
- Columns such as `metformin-rosiglitazone` are removed as well due to the lack of discriminatory information,
- The `number_outpatient`, `number_emergency` and `number_inpatient` columns are summed into one column called `service_use` and then removed,
- The primary diagnosis (`diag_1`) values are encoded into nine major groups: circulatory, respiratory, digestive, diabetes, injury, musculoskeletal, genitourinary, neoplasms and other,
- The secondary diagnoses `diag_2` and `diag_3` are removed to simplify the data analysis,
- The `readmitted` column is divided into two categories, `0` and `1`, where the `0` category contains the `Not readmitted` and `> 30 days` cases, and the `1` category consists of the `< 30 days` cases,
- Categorical variables are encoded for all columns except for the six numerical columns.
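As a small illustration of the target-encoding step in the list above, here is a sketch on a tiny hypothetical data frame (not the real data set):

```python
import pandas as pd

# Hypothetical sample of the readmitted column
df = pd.DataFrame({"readmitted": ["Not readmitted", ">30", "<30", "Not readmitted", "<30"]})

# Collapse the three original labels into a binary target:
# 1 for readmissions within 30 days, 0 for everything else
df["readmitted"] = (df["readmitted"] == "<30").astype(int)
```

The same comparison-and-cast pattern extends to any binary recoding of a categorical column.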
Dealing With Unbalanced Data #
The data set is highly unbalanced: readmissions within 30 days make up just above 11% of all cases. To compensate for this lack of readmission cases, the minority class has been oversampled with replacement and added to the rest of the data set. This is accomplished with the imbalanced-learn package, part of the scikit-learn-contrib project; more about imbalanced-learn can be found at scikit-learn-contrib/imbalanced-learn. Due to the widespread presence of categorical features in the data set, the imblearn.over_sampling.RandomOverSampler class has been employed, since it is the only class in imblearn.over_sampling that can handle categorical data.
Standardization of the Data #
The numeric features have been standardized, by subtracting each feature's mean and dividing by its standard deviation, using Scikit-Learn's StandardScaler class. Let's now proceed with the modeling of the data.
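A minimal sketch of this standardization step with StandardScaler, on toy data rather than the actual features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy numeric features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Standardize each column: (x - mean) / standard deviation
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
```

Each standardized column then has zero mean and unit standard deviation, which keeps features with large raw scales from dominating the model.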
Data Modeling #
For the sake of interpretation, the data modeling makes use of three simple classification algorithms, all of which are available in Scikit-Learn: LogisticRegression, DecisionTreeClassifier and RandomForestClassifier. For each of these algorithms, Table 1 reports the accuracy, precision, recall, F-score and cross-validated average Brier score for the readmitted cases.
| Model | Average Brier score |
| --- | --- |
| Random forest classifier | 0.0821 +/- 0.0025 |
| Decision tree classifier | 0.1631 +/- 0.0020 |
| Logistic regression | 0.2420 +/- 0.0011 |
From a first view one can see that the random forest classifier easily comes out ahead of the other algorithms. Its high F-score tells us how well the random forest classifier performs on the data set, as a high F-score reflects both high recall and high precision.
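A sketch of how such a model comparison can be set up in Scikit-Learn. This runs on synthetic stand-in data, so the metric values will not match the table above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared readmission data
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    model.fit(X_tr, y_tr)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_te, model.predict(X_te), average="binary"
    )
    # Cross-validated average Brier score (lower is better)
    brier = -cross_val_score(model, X, y, cv=5, scoring="neg_brier_score").mean()
    scores[name] = {"precision": prec, "recall": rec, "f1": f1, "brier": brier}
```

The `neg_brier_score` scorer returns negated values (Scikit-Learn maximizes scores), so the sign is flipped back before averaging.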
Grid Search #
The hyperparameters of the random forest algorithm are fine-tuned using Scikit-Learn's GridSearchCV class. The `n_estimators` and `max_depth` hyperparameters are optimized at values of 160 and 16, respectively. No further action was undertaken to improve the result, given the overall good performance of the random forest classifier.
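A minimal sketch of such a grid search, with a small illustrative grid on synthetic data (the actual analysis searched different parameter ranges):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Exhaustively evaluate every parameter combination with 3-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 16]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```

After fitting, `best_params_` holds the combination with the highest mean cross-validated score, and `best_estimator_` is a model refit on the full data with those parameters.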
Confusion Matrix Heat map Plots #
Following the initial analysis shown in the table and the hyperparameter fine-tuning, a visual representation of the performance of the models is displayed in the heat map plot of the confusion matrices (Figure 5). The values in the plot are the number of predictions in each category divided by the sum of the values along each row. The value in the upper left-hand corner thus corresponds to the specificity (true negative rate for the non-readmitted cases) and the value in the lower right-hand corner to the sensitivity (true positive rate for the readmitted cases).
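This row-wise normalization can be obtained directly from Scikit-Learn, sketched here on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels (0 = not readmitted, 1 = readmitted) and predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

# normalize="true" divides each row by its sum, so the diagonal holds
# the specificity (row 0) and the sensitivity (row 1)
cm = confusion_matrix(y_true, y_pred, normalize="true")
```

Each row then sums to one, which is exactly the convention used in the heat map plots.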
The decision tree model performs better than the logistic regression model, although there are still quite a few misclassified cases on the off-diagonal compared to the main diagonal. However, the random forest classifier accomplishes the best balance between true positives, true negatives, false positives and false negatives, as shown in the right-most plot below.
ROC Curves And AUC #
The receiver operating characteristic (ROC) curve also shows the superior performance of the random forest classifier compared to the other two algorithms. The ROC curve plots the true positive rate against the false positive rate over a series of classification thresholds; the best curve is the one that produces the highest true positive rate at the smallest false positive rate across all thresholds. The plots also report the area under the curve (AUC) in the bottom right corner: the larger this value, the more snugly the ROC curve hugs the left and top axes of the plot. By both the AUC value and the ROC curve, the random forest classifier achieves the best performance among the three algorithms.
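A sketch of the ROC and AUC computation for one model, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# The predicted probability of the positive class drives the thresholds
probs = model.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)
roc_auc = auc(fpr, tpr)
```

Plotting `fpr` against `tpr` reproduces the curves in Figure 6, with `roc_auc` as the number reported in the corner of each panel.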
Learning Curves #
To check for possible model overfitting or underfitting, learning curves with one-standard-deviation error bands are calculated for all three models (Figure 7). For the classification problem at hand, accuracy is the metric used to assess model performance.
As can be seen in the plots below, a case can be made for underfitting in the logistic regression model, since both curves produce low accuracy values that tend to be similar. For the decision tree model a case may be made for some overfitting, since the training curve maintains high accuracy values while the validation curve stays lower, resulting in a persistent gap between the training and validation curves. The same may be said for the random forest classifier model; however, the accuracy is decidedly higher in the latter case.
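The curves above can be computed with Scikit-Learn's learning_curve helper, sketched here for one model on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data
X, y = make_classification(n_samples=600, random_state=0)

# Accuracy on growing training subsets versus held-out cross-validation folds
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy",
)

# Mean curve and one-standard-deviation error band across the CV folds
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
```

Plotting `train_mean` and the mean of `valid_scores` against `sizes`, shaded by one standard deviation, gives the error bands shown in Figure 7.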
At this point the random forest model appears to be the best model, following the results of the confusion matrix heat map plots, the ROC curves and AUC measurements, and the learning curves. Let's now use this model to identify the features that are most likely to determine hospital readmission.
Feature Importances #
The normalized feature importance plot shown in Figure 8 highlights the features that are most influential for hospital readmission. The `num_medications` and, to a lesser extent, the `time_in_hospital` features appear to be the most helpful in determining readmission cases, which does after all make sense. As mentioned in the main reference article for this analysis, `primary_diag` also bears some relationship with the probability of readmission, even though it isn't as strong as the former features. The plot shows one-standard-deviation error bars, which further highlight the features that are related to readmission.
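A sketch of how these importances and their error bars can be extracted from a fitted random forest, on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to one
importances = model.feature_importances_

# One-standard-deviation spread of the importances across the individual trees
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
```

Sorting `importances` and plotting them as a bar chart with `std` as error bars reproduces the layout of Figure 8.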
The challenge of this analysis was to confront the large number of categorical features and the overwhelming ratio of not-readmitted to readmitted cases in the data. The random forest classifier outperforms the other two models on all fronts. This may be due to a non-linear decision boundary in the data that makes it difficult for the logistic regression model to perform satisfactorily, but makes it easier for models such as decision trees and especially random forests to perform well. The decision tree model already performs well, but thanks to its ensemble technique the random forest algorithm performs even better. The model also performs very well on the validation data, which shows that it does not overfit the training data. Please feel free to take a look at the code in my GitHub repository.
Update (2022-12-23): For the latest data analysis, the following software packages were used: scikit-learn (version 1.2.0), pandas (version 1.5.2), matplotlib (version 3.6.2), seaborn (version 0.12.1) and imbalanced-learn (version 0.10.0).