# Determining the Causes of Diabetes Readmissions in Hospitals

## Introduction

Diabetes is quickly becoming one of the major causes of mortality in the developing world, due to changing lifestyles and massive urbanization, and according to the CDC it currently affects over 10% of the US population. Millions of deaths could be prevented each year through better analytics, such as non-invasive screening, tailor-made treatments and reduced hospital readmissions.

Unplanned readmissions are a key metric for evaluating a hospital's quality of care, as they can highlight errors in the practitioners' diagnosis or treatment. Unplanned readmissions are those that occur within 30 days of discharge from the previous hospital visit, and are closely correlated with the quality of health care administration. Consequently, decreasing unplanned readmissions is a direct measure of the improvement of patients' health, as well as a great financial relief to health care centers. Therefore, the primary focus of this data analysis is to build a predictive model that can identify the factors that may be responsible for unplanned hospital readmissions.

## Data summary

The data set employed in this analysis is the Diabetes 130-US hospitals for years 1999-2008 data set from the UCI Machine Learning Repository, which represents 10 years of clinical care at 130 US hospitals and integrated delivery networks. It includes 104985 entries and over 50 features representing patient and hospital outcomes. The data contain such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, glycated hemoglobin (HbA1c) test result, diagnoses, number of medications, diabetic medications, and the number of outpatient, inpatient and emergency visits in the year preceding the hospitalization.
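For reference, the raw data can be loaded with pandas. The sketch below uses a two-row, six-column stand-in for the real CSV (the rows are invented; the `?`-for-missing convention matches the UCI distribution, and the file is commonly distributed as `diabetic_data.csv`):

```python
import io
import pandas as pd

# Tiny stand-in for the UCI "Diabetes 130-US hospitals" CSV; the real file
# has ~100k rows and 50 columns, with missing values encoded as "?"
csv_text = (
    "encounter_id,race,gender,age,weight,time_in_hospital\n"
    "1,Caucasian,Female,[70-80),?,5\n"
    "2,AfricanAmerican,Male,[50-60),?,3\n"
)
df = pd.read_csv(io.StringIO(csv_text), na_values="?")  # map "?" to NaN

print(df.shape)                    # (2, 6)
print(df["weight"].isna().mean())  # 1.0 -> every weight value is missing here
```

Passing `na_values="?"` up front saves a separate replacement step later.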

## Exploratory data analysis

To get a better feel for the data set, a few exploratory data analysis plots are displayed below. The first three plots highlight the percentage of null values in the data set, the feature composition for the medical specialty of the admitting primary physician, and the percent breakdown of payer codes (Figure 1). As a consequence of the high prevalence of null values in the `weight`, `medical_specialty` and `payer_code` features, these are dropped from further analysis.
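The null-value screening can be sketched in a few lines of pandas. The frame and the 40% cutoff below are illustrative, not the exact values used in the analysis:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sparse columns; in the real data "?" marks missing
df = pd.DataFrame({
    "weight": ["?", "?", "[75-100)", "?"],
    "medical_specialty": ["?", "Cardiology", "?", "?"],
    "payer_code": ["MC", "?", "HM", "SP"],
}).replace("?", np.nan)

# Percentage of nulls per column
null_pct = df.isna().mean().mul(100).round(1)
print(null_pct)

# Drop columns whose null share exceeds a chosen threshold (here 40%)
df = df.drop(columns=null_pct[null_pct > 40].index)
print(df.columns.tolist())  # ['payer_code']
```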

As shown below (Figure 2, left), the analysis reveals a predominance of older patients above the age of 40. Moreover, the middle plot shows that the majority of patients are of Caucasian origin, while the right plot displays a slightly greater presence of female than male patients.

The primary research article used for this analysis suggests that the probability of readmission is contingent on the `HbA1c` measurement in combination with the primary diagnosis (Figure 3, left), so both the `HbA1c` measurement and the primary diagnosis features are retained in the data analysis, even though the `HbA1c` measurement was performed in less than 19% of the inpatient cases.

### Correlation plot

A correlation plot of the numerical features and the readmission cases is shown in Figure 4. Unsurprisingly, one can notice that `num_medications` is moderately correlated with `time_in_hospital` and `num_procedures`. However, no substantive correlation with readmitted cases can be gleaned from the plot.

## Data preparation and feature engineering

A summary of the data preparation and feature engineering steps performed in the analysis is compiled in the bullet list below:

- All of the `object` values in the data frame are converted to `category` values,
- All null values are removed from the `race` category, and the `Unknown/Invalid` subcategory is removed from the `gender` category,
- As already mentioned, the `weight`, `medical_specialty` and `payer_code` columns are removed due to the large presence of null values,
- The `encounter_id` column is removed, since it isn't relevant to the analysis,
- Null, not admitted and not mapped values are removed from `admission_type_id`,
- All variations of `Expired at...` or `Expired` in `discharge_disposition_id` are removed, since they cannot be responsible for further readmission cases. Null, not admitted and unknown/invalid values in `discharge_disposition_id` are removed as well,
- Null, not available, not mapped and unknown/invalid values are removed from `admission_source_id`,
- Following the analysis conditions laid out in the main research article for this work, duplicate patient data are removed to maintain the statistical independence of the data, after which the `patient_nbr` column is dropped,
- `citoglipton` and `examide` are removed, since they don't offer any discriminatory information; `glimepiride-pioglitazone` and `metformin-rosiglitazone` are removed as well for the same reason,
- `number_outpatient`, `number_emergency` and `number_inpatient` are summed into one column called `service_use` and then removed,
- The primary `diag_1` values are encoded into nine major groups: `circulatory`, `respiratory`, `digestive`, `diabetes`, `injury`, `musculoskeletal`, `genitourinary`, `neoplasms` and `others`,
- The secondary `diag_2` and additional `diag_3` diagnoses are removed to simplify the data analysis,
- The `readmitted` column is divided into `0` and `1` categories, where the `0` category contains the `Not readmitted` and `> 30 days` cases, and the `1` category consists of the `< 30 days` cases,
- Categorical variables are encoded for all columns except for the six numerical columns: `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_diagnoses` and `service_use`.
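A few of the steps above can be sketched in pandas. The frame below is a toy example (column names follow the UCI data set, values are invented), and the diagnosis-grouping rules are abridged to two of the nine groups:

```python
import pandas as pd

df = pd.DataFrame({
    "number_outpatient": [0, 2, 1],
    "number_emergency":  [1, 0, 0],
    "number_inpatient":  [0, 1, 3],
    "diag_1":            ["428", "250.83", "V57"],   # ICD-9 codes
    "readmitted":        ["NO", ">30", "<30"],
})

# Sum the three visit counts into a single service_use feature
df["service_use"] = (df["number_outpatient"] + df["number_emergency"]
                     + df["number_inpatient"])
df = df.drop(columns=["number_outpatient", "number_emergency",
                      "number_inpatient"])

# Group primary diagnoses: ICD-9 390-459 is circulatory, codes starting with
# "250" are diabetes; V/E codes and everything else fall into "others"
def group_diag(code):
    if code.startswith(("V", "E")):
        return "others"
    if code.startswith("250"):
        return "diabetes"
    if 390 <= float(code) <= 459:
        return "circulatory"
    return "others"

df["diag_1"] = df["diag_1"].map(group_diag)

# Binarize the target: 1 only for readmission within 30 days
df["readmitted"] = (df["readmitted"] == "<30").astype(int)
print(df)
```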

### Dealing with unbalanced data

The data set is highly unbalanced with respect to readmission versus non-readmission and >30 days cases, due to the very small share of hospital readmissions (just above 11% of all cases). To make up for this lack of readmission cases, the minority class has been oversampled with replacement and added to the rest of the data set. This is accomplished with the imbalanced-learn package, which is part of the scikit-learn-contrib project; more about imbalanced-learn can be found at scikit-learn-contrib/imbalanced-learn. Due to the widespread presence of categorical features in the data set, the **imblearn.over_sampling.RandomOverSampler** class has been employed, since it handles categorical features directly by duplicating existing rows rather than interpolating new ones.

### Standardization of the data

The numeric features have been standardized by subtracting each feature's mean and dividing by its standard deviation, using Scikit-Learn's **StandardScaler** class. After the data preparation phase, 101767 entries remain for analysis. Let's now proceed with the modeling of the data.
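The standardization step, on a small numeric array for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Each column is rescaled to zero mean and unit variance
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # [1. 1.]
```

The fitted `scaler` stores the training means and standard deviations, so the same transformation can later be applied to validation data with `scaler.transform`.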

## Data modeling

For the sake of interpretability, the data modeling makes use of three simple classification algorithms, all of which are available in Scikit-Learn: **LogisticRegression**, **DecisionTreeClassifier** and **RandomForestClassifier**. For each of these algorithms, the analysis calculates the accuracy, precision, recall, F-score and cross-validated average Brier score for the readmitted cases (Table 1).

| | Accuracy | Precision | Recall | F-score | Average Brier score |
|---|---|---|---|---|---|
| Random forest classifier | 0.9085 | 0.8778 | 0.9475 | 0.9113 | 0.0821 +/- 0.0025 |
| Decision tree classifier | 0.7874 | 0.7345 | 0.8953 | 0.8070 | 0.1631 +/- 0.0020 |
| Logistic regression | 0.5895 | 0.5963 | 0.5351 | 0.5641 | 0.2420 +/- 0.0011 |

At first glance one can see that the random forest classifier comes out easily ahead of the other algorithms. The high F-score tells us how well the random forest classifier performs on the data set, as a high F-score reflects both high recall and high precision.
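The evaluation loop can be sketched as follows, here on a synthetic stand-in for the prepared data set (so the printed numbers will not match Table 1):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the real features
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_te, y_pred, average="binary")
    # Cross-validated Brier score (scikit-learn reports it negated)
    brier = -cross_val_score(clf, X, y, cv=5, scoring="neg_brier_score")
    print(f"{name}: acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} "
          f"f1={f1:.3f} brier={brier.mean():.3f} +/- {brier.std():.3f}")
```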

### Grid search

The hyperparameters of the random forest algorithm are fine-tuned using Scikit-Learn's **GridSearchCV** class. The `n_estimators` and `max_depth` hyperparameters reach their optimal values at 160 and 16, respectively. No further action was undertaken to improve the result, given the overall good results with the random forest classifier.
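A compact sketch of the grid search, with a small hypothetical grid around the reported optimum and synthetic data so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid; the full analysis would scan a wider range of values
param_grid = {"n_estimators": [80, 160], "max_depth": [8, 16]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)   # best (n_estimators, max_depth) combination
print(search.best_score_)    # mean cross-validated accuracy of that combination
```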

### Confusion matrix heat map plots

Following the initial analysis shown in the table and the hyperparameter fine-tuning, a visual representation of the performance of the models is displayed in the heat map plots of the confusion matrices (Figure 5). The values in the plots are the number of predictions in each category divided by the sum of the values along the rows: the upper left-hand corner then corresponds exactly to the specificity for the non-readmitted cases, and the lower right-hand corner to the sensitivity for the readmitted cases.

The decision tree model performs better than the logistic regression model, although there are still quite a few misclassified cases on the off-diagonal as compared to the main diagonal. However, the random forest classifier achieves the best balance between true positives, true negatives, false positives and false negatives, as shown in the rightmost plot below.
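The row normalization described above corresponds to `normalize="true"` in Scikit-Learn's confusion matrix, shown here on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# normalize="true" divides each row by its sum, as in the heat maps: the
# diagonal then reads as specificity (row 0) and sensitivity (row 1)
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)  # [[0.75 0.25]
           #  [0.25 0.75]]
```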

### ROC curves and AUC

The *receiver operating characteristic* curve, or ROC curve, also shows the superior performance of the random forest classifier compared to the other two algorithms (Figure 6). The ROC curve displays the true positive rate against the false positive rate for a series of thresholds that produce these rates, and the best curve is the one that produces the highest true positive rate at the smallest false positive rate. These plots also contain the *area under the curve* (AUC) calculations in the bottom right corner: the bigger this value is, the more snugly the ROC curve hugs the left and top axes of the plot. Judging by both the AUC values and the ROC curves, the random forest classifier achieves the best performance among the three algorithms used.
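The quantities behind these plots come from `roc_curve` and `roc_auc_score`, illustrated here on four hypothetical predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probabilities of class 1

# One (fpr, tpr) point per threshold; plotting tpr vs fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```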

### Learning curves

To check for possible model overfitting or underfitting, the learning curves with one-standard-deviation error bands are calculated for all three models (Figure 7). For the classification problem at hand, accuracy is the metric used to measure model performance.

As can be seen in the plots below, a case can be made for underfitting of the logistic regression model, since both curves produce low accuracy values that tend to be similar. For the decision tree model a case may be made for some overfitting, since increasing the amount of data produces high accuracy values for the training curve but lower values for the validation curve, leaving a persistent gap between the training and validation curves. The same may be said for the random forest classifier model; however, the accuracy is decidedly higher in the latter case.
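The curves themselves come from Scikit-Learn's `learning_curve`, sketched here on synthetic data (the figure additionally plots the one-standard-deviation bands from the per-fold scores):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Accuracy on the training and validation folds for growing training sizes
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=3, scoring="accuracy")

print(train_scores.mean(axis=1))  # training accuracy per training size
print(val_scores.mean(axis=1))    # a persistent gap to the line above
                                  # signals overfitting
```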

At this point the random forest model appears to be the best model, in light of the confusion matrix heat maps, the ROC/AUC results and the learning curves. Let's now use this model to determine the features that are most likely to drive hospital readmission.

### Feature importances

The normalized feature importance plot is shown below in Figure 8, and highlights the features that are most influential for hospital readmission. The `num_lab_procedures` and `num_medications` features, and to a lesser extent `discharge_disposition_id` and `time_in_hospital`, appear to be the most helpful in determining readmission cases, which does after all make sense. As mentioned in the main reference article for this analysis, `primary_diag` also bears some relationship with the possibility of readmission, even though it isn't as strong as the former features. The plot shows one-standard-deviation error bars, which further highlight the features that are related to readmission.
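A random forest's importances and their error bars can be extracted as below; this is a generic sketch on synthetic data, not the fitted model from the analysis:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity, normalized to sum to 1; the error bars in the
# plot come from the spread of importances across the individual trees
importances = clf.feature_importances_
std = np.std([t.feature_importances_ for t in clf.estimators_], axis=0)

for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f} +/- {std[i]:.3f}")
```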

## Conclusion

The challenge of this analysis was to confront the large number of categorical features and the overwhelming ratio of not-readmitted to readmitted cases in the data. The random forest classifier outperforms the other two models on all fronts. This may be due to a non-linear decision boundary in the data that makes it difficult for the logistic regression model to perform in any satisfactory manner, but allows classifier models such as decision trees and especially random forests to perform well. The decision tree model already does a good job, but thanks to its ensemble technique the random forest algorithm performs even better. The model also performs very well on the validation data, which indicates that it is not overfitting the training data.

If you have any comments or suggestions, please feel free to make remarks in the section below. You are more than welcome to take a look at the code in my GitHub repository.

**Update (2022-12-23)**: For the latest data analysis, the following software packages were used: scikit-learn (version 1.2.0), pandas (version 1.5.2), matplotlib (version 3.6.2), seaborn (version 0.12.1) and imbalanced-learn (version 0.10.0).