
Determining the causes of diabetes readmissions in hospitals
30 Jan 2021

Diabetes is quickly becoming one of the major causes of mortality in the developing world, due to changing lifestyles and the massive urbanization of the population, and it currently affects over 10% of the US population according to the CDC. Millions of deaths could be prevented each year through better analytics, such as non-invasive screening, tailor-made treatments and the reduction of hospital readmissions.
Unplanned readmissions are a key metric for evaluating the quality of care of a hospital, as they can highlight practitioners’ diagnosis or treatment errors. Unplanned readmissions are those that occur within 30 days of discharge from the last hospital visit, and they are closely correlated with the quality of health care administration. Consequently, decreasing unplanned readmissions is a direct measure of the improvement of patients’ health, as well as a great financial relief to health care centers. The primary focus of this data analysis is therefore to build a predictive model that can identify the factors responsible for unplanned hospital readmissions.
Data summary
The data set employed in this analysis is the Diabetes 130-US hospitals for years 1999-2008 data set from the UCI Machine Learning Repository, which represents 10 years of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes, such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient and emergency visits in the year before the hospitalization.
Exploratory data analysis
To get a better feel for the data set, a few exploratory data analysis plots are displayed below. The first three plots highlight the percentage of null values in the data set, the feature composition for the medical specialty of the admitting primary physician and the percentage breakdown of payer codes. As a consequence of the high prevalence of null values in the `weight`, `medical_specialty` and `payer_code` features, these features are dropped from further analysis.
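As a minimal sketch, the null-value percentages can be computed and the three sparse columns dropped with pandas; the file name and the use of `?` as the missing-value marker in the raw csv are assumptions here:

```python
import pandas as pd

# Load the UCI data set; the file name and the "?" missing-value
# marker are assumptions about the raw csv.
df = pd.read_csv("diabetic_data.csv", na_values="?")

# Percentage of null values per feature, sorted for inspection.
null_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(null_pct.head(10))

# Drop the three features dominated by null values.
df = df.drop(columns=["weight", "medical_specialty", "payer_code"])
```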
As shown below, the analysis reveals a predominance of older patients, above the age of 40. Moreover, the middle plot shows that the majority of patients are of Caucasian origin, while the last plot displays a slightly greater presence of female than male patients.
The primary research article used for this analysis suggests that the probability of readmission is contingent on the `HbA1c` measurement together with the primary diagnosis, so both the `HbA1c` measurement and the primary diagnosis features are retained in the data analysis, even though the `HbA1c` measurement was performed in less than 19% of the inpatient cases.
Correlation plot
A correlation plot of the numerical features and the readmission cases is shown in the figure below. Without much surprise, one can notice that `num_medications` is moderately correlated with `time_in_hospital` and `num_procedures`. However, no substantive correlation with the readmitted cases can be gleaned from the plot.
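A correlation heat map along these lines can be drawn with seaborn; this sketch assumes the data frame `df` from above, with the readmission indicator already encoded as a numeric 0/1 column so that it appears among the numerical features:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numerical features only.
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
plt.tight_layout()
plt.show()
```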
Data preparation and feature engineering
A summary of the data preparation and feature engineering steps performed in the analysis is compiled in the bullet list below; a code sketch of a few of these steps follows the list:
- All of the `object` values in the data frame are converted to `category` values.
- All null values are removed from the `race` category, and the `Unknown/Invalid` subcategory is removed from the `gender` category.
- As already mentioned, the `weight`, `medical_specialty` and `payer_code` columns are removed due to the large presence of null values. The `encounter_id` column is removed since it isn’t relevant to the analysis.
- Null, not admitted and not mapped values are removed from `admission_type_id`.
- All variations of `Expired at...` or `Expired` in `discharge_disposition_id` are removed, since they won’t be responsible for further readmission cases. Null, not admitted and unknown/invalid values in `discharge_disposition_id` are removed as well.
- Null, not available, not mapped and unknown/invalid values are removed from `admission_source_id`.
- Following the analysis conditions laid out in the primary research article for this work, duplicate patient data are removed to maintain the statistical independence of the data, as required by logistic regression, after which the `patient_nbr` column is dropped.
- `citoglipton` and `examide` are removed since they don’t offer any discriminatory information. `glimepiride-pioglitazone` and `metformin-rosiglitazone` are removed as well, due to the same lack of discriminatory information.
- `number_outpatient`, `number_emergency` and `number_inpatient` are summed into one column called `service_use` and then removed.
- The primary `diag_1` values are encoded into nine major groups: `circulatory`, `respiratory`, `digestive`, `diabetes`, `injury`, `musculoskeletal`, `genitourinary`, `neoplasms` and `others`.
- The secondary `diag_2` and additional `diag_3` diagnoses are removed to simplify the data analysis.
- The `readmitted` column is divided into two categories, `0` and `1`, where the `0` category contains the `Not readmitted` and `> 30 days` cases, and the `1` category consists of the `< 30 days` cases.
- Categorical variables are encoded for all columns except for the six numerical columns: `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_diagnoses` and `service_use`.
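Below is a hypothetical sketch of a few of these steps in pandas. The ICD-9 ranges used for the diagnosis grouping follow the primary research article, but the exact filters of the original analysis may differ:

```python
import pandas as pd

# Binarize the readmission target: 1 for "<30" days, 0 for ">30" and "NO".
df["readmitted"] = (df["readmitted"] == "<30").astype(int)

# Combine the three service-use counters into a single feature.
df["service_use"] = (df["number_outpatient"] + df["number_emergency"]
                     + df["number_inpatient"])
df = df.drop(columns=["number_outpatient", "number_emergency",
                      "number_inpatient"])

# Group the primary ICD-9 diagnosis codes into the nine major categories.
def group_diag(code):
    try:
        icd = float(code)
    except (TypeError, ValueError):
        return "others"                  # V codes and E codes
    if pd.isna(icd):
        return "others"                  # missing diagnosis
    if 390 <= icd <= 459 or icd == 785:
        return "circulatory"
    if 460 <= icd <= 519 or icd == 786:
        return "respiratory"
    if 520 <= icd <= 579 or icd == 787:
        return "digestive"
    if 250 <= icd < 251:                 # all 250.xx diabetes codes
        return "diabetes"
    if 800 <= icd <= 999:
        return "injury"
    if 710 <= icd <= 739:
        return "musculoskeletal"
    if 580 <= icd <= 629 or icd == 788:
        return "genitourinary"
    if 140 <= icd <= 239:
        return "neoplasms"
    return "others"

df["diag_1"] = df["diag_1"].apply(group_diag)
```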
Dealing with unbalanced data
The data set is highly unbalanced with respect to readmitted versus non-readmitted and >30 days cases, due to the very small number of hospital readmissions (just above 11% of cases). To make up for this scarcity of readmission cases, the minority class has been oversampled with replacement and added to the rest of the data set. This is accomplished with the imbalanced-learn package, which is part of the scikit-learn-contrib project. More about imbalanced-learn can be found at scikit-learn-contrib/imbalanced-learn.

Due to the widespread presence of categorical features in the data set, the imblearn.over_sampling.RandomOverSampler class has been employed, since it is the only class in imblearn.over_sampling that can deal with categorical data. Moreover, the numerical features have been standardized, by subtracting the mean and dividing by the standard deviation, using Scikit-Learn’s StandardScaler class.
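A minimal sketch of this resampling and scaling step, assuming the categorical columns have already been encoded and using an arbitrary random seed:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler

numeric_cols = ["time_in_hospital", "num_lab_procedures", "num_procedures",
                "num_medications", "number_diagnoses", "service_use"]

X = df.drop(columns=["readmitted"])
y = df["readmitted"]

# Randomly duplicate minority-class rows until both classes are balanced.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

# Standardize only the numerical features (zero mean, unit variance).
scaler = StandardScaler()
X_res[numeric_cols] = scaler.fit_transform(X_res[numeric_cols])
```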
Let us now proceed to the modeling of our data!
Data modeling
For the sake of interpretability, the data modeling makes use of three simple classification algorithms, all of which are available in Scikit-Learn: LogisticRegression, DecisionTreeClassifier and RandomForestClassifier. For each of these algorithms, the table below lists the accuracy, precision, recall, F-score and cross-validated average Brier score for the readmitted cases.
| | Accuracy | Precision | Recall | F-score | Average Brier score |
| --- | --- | --- | --- | --- | --- |
| Logistic regression | 0.5838 | 0.5914 | 0.5223 | 0.5547 | 0.2430 +/- 0.0009 |
| Decision tree classifier | 0.6973 | 0.6685 | 0.7738 | 0.7173 | 0.2100 +/- 0.0022 |
| Random forest classifier | 0.9965 | 0.9935 | 0.9995 | 0.9965 | 0.0035 +/- 0.0010 |
At first glance, one can see that the random forest classifier easily comes out ahead of the other algorithms. The high F-score tells us how well the random forest classifier performs on the data set, as a high F-score reflects both high recall and high precision.
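These metrics can be reproduced along the following lines; this sketch assumes a standard train/test split and default hyperparameters, both of which may differ from the original analysis. Scikit-Learn exposes the Brier score through the `neg_brier_score` scorer:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree classifier": DecisionTreeClassifier(random_state=42),
    "Random forest classifier": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Cross-validated Brier score (Scikit-Learn returns its negative).
    brier = -cross_val_score(model, X_res, y_res,
                             scoring="neg_brier_score", cv=5)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.4f}, "
          f"precision={precision_score(y_test, y_pred):.4f}, "
          f"recall={recall_score(y_test, y_pred):.4f}, "
          f"F-score={f1_score(y_test, y_pred):.4f}, "
          f"Brier={brier.mean():.4f} +/- {brier.std():.4f}")
```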
Confusion matrix heat map plots
Following the initial analysis shown in the table, heat map plots of the confusion matrices are generated to give a visual representation of the performance of the models. The values in the plots are the number of predictions in each category divided by the sum of the values along the rows. With this normalization, the upper left-hand corner corresponds to the precision for the non-readmitted cases and the lower right-hand corner to the precision for the readmitted cases.

The decision tree model performs better than the logistic regression model, although there are still quite a few misclassified cases on the off-diagonal compared to the main diagonal. The random forest classifier, however, achieves the best separation between true positives, true negatives, false positives and false negatives, as shown in the right-most plot below.
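A sketch of how such heat maps might be drawn follows; note that in this version the rows hold the true labels, so the row-normalized diagonal reads as a per-class rate, and the axis arrangement of the original plots may differ:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    cm = confusion_matrix(y_test, model.predict(X_test))
    cm = cm / cm.sum(axis=1, keepdims=True)   # normalize along the rows
    sns.heatmap(cm, annot=True, fmt=".3f", cmap="Blues", ax=ax,
                xticklabels=["not readmitted", "readmitted"],
                yticklabels=["not readmitted", "readmitted"])
    ax.set_title(name)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
plt.tight_layout()
plt.show()
```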
ROC curves and AUC
The receiver operating characteristic curve, or ROC curve, also shows the superior performance of the random forest classifier compared to the other two algorithms. The ROC curve displays the true positive rate against the false positive rate across a series of thresholds, and the best curve is the one that produces the highest true positive rate at the smallest false positive rate. These plots also contain the area under the curve (AUC) calculations in the bottom right corner: the larger this value, the more snugly the ROC curve hugs the left and top axes of the plot. By the AUC measure as well, the random forest classifier achieves the best performance among the three algorithms used.
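The ROC curves and AUC values can be generated as follows, reusing the fitted models from the sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

fig, ax = plt.subplots(figsize=(6, 6))
for name, model in models.items():
    # Predicted probability of the positive (readmitted) class.
    y_score = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
ax.plot([0, 1], [0, 1], "k--", label="Chance level")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
plt.show()
```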
Learning curves
To check for possible model overfitting, the learning curves with one-standard-deviation error bands are calculated for all three models. As can be seen in the plots below, while a case can be made for overfitting in the logistic regression and decision tree models, there doesn’t seem to be any sign of overfitting in the random forest classifier model.
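A learning curve of this kind can be computed with Scikit-Learn’s `learning_curve` helper; the random forest case is sketched here with an assumed five-point grid of training sizes:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X_res, y_res,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for scores, label in [(train_scores, "Training score"),
                      (val_scores, "Cross-validation score")]:
    mean, std = scores.mean(axis=1), scores.std(axis=1)
    plt.plot(train_sizes, mean, "o-", label=label)
    plt.fill_between(train_sizes, mean - std, mean + std, alpha=0.2)
plt.xlabel("Training set size")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()
```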
Feature importances
Finally, the normalized feature importance plot is shown below, which highlights the features that appear to be most influential for readmission. The `num_lab_procedures`, `num_medications` and, to a lesser extent, `time_in_hospital` features appear to be the most helpful in determining readmission cases, which does after all make some sense. As mentioned in the main reference article for this analysis, `primary_diag` also bears some relationship with the probability of readmission, even though it isn’t as strong as the former features. The plot shows one-standard-deviation error bars, which further help to single out the features that are related to readmission.
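A sketch of such a plot, reusing the fitted random forest from the modeling sketch above; the error bars are the standard deviations of the importances across the individual trees:

```python
import matplotlib.pyplot as plt
import numpy as np

rf = models["Random forest classifier"]
importances = rf.feature_importances_
# One-standard-deviation spread of the importances across the trees.
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

order = np.argsort(importances)[::-1][:15]   # top 15 features
plt.figure(figsize=(8, 5))
plt.bar(range(len(order)), importances[order], yerr=std[order])
plt.xticks(range(len(order)), X_res.columns[order], rotation=90)
plt.ylabel("Normalized importance")
plt.tight_layout()
plt.show()
```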
Grid search
A brief attempt was made to fine-tune the hyperparameters of the algorithm using Scikit-Learn’s GridSearchCV class, but it didn’t change the final outcome in any significant manner. The chosen hyperparameters (`n_estimators` and `max_depth`) already yield close to the best scores at the chosen seed, there appears to be little room for improvement by changing their values, and the results for the random forest classifier model are impressive enough not to warrant further analysis.
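A grid search along these lines might look as follows; the parameter grid shown is hypothetical, not the one used in the original analysis:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],   # hypothetical grid
    "max_depth": [None, 10, 20],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    cv=5, scoring="f1", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```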
Conclusions
The random forest classifier shines above the other two models. By default, the model uses bagging, i.e. sampling with replacement: each individual decision tree of the forest is trained on a bootstrap sample of the training data. In addition, at each split only a random subset of the features, by default the square root of the total number of features, is considered. The final prediction of the classifier aggregates the votes of the individual trees (Scikit-Learn, in fact, averages their predicted class probabilities rather than taking a strict majority vote).
The better performance of the random forest classifier is likely due to the ensemble technique inherent in the random forest algorithm. Thanks to the random selection of features, no individual feature predominates over the others, and averaging the outcomes of the decision trees noticeably reduces spurious noise effects. The model also performs very well on the validation data, which suggests that it is not overfitting the training data. It is also possible that the random forest algorithm benefits the most, compared to the other two models, from the duplication with replacement of the data performed by the imblearn.over_sampling.RandomOverSampler class.
The challenge in this analysis was to handle the large number of categorical features and the overwhelming ratio of not-readmitted to readmitted cases in the data, and the complete data analysis procedure seems to have tackled the problem satisfactorily. If you have any comments or suggestions, please feel free to leave a remark in the section below. You are more than welcome to take a look at the code in my GitHub repository. The following software packages were used for the analysis: scikit-learn (version 0.24.1), pandas (version 1.2.1), matplotlib (version 3.3.3), seaborn (version 0.11.1) and imbalanced-learn (version 0.7.0).