How to Improve Credit Card Fraud Detection

Introduction

Fraudulent payment transactions are a serious concern for banking institutions because of the financial losses and customer distrust they can cause. Using one year’s worth of card payment transactions, with fraudulent transactions flagged, the goal of this project is to produce a machine learning model that can predict fraudulent payment transactions. Because the bank’s investigative capacity is limited, only 400 transactions per month can be investigated for fraud. After model generation and bootstrap simulation, the selected model achieved an average fraud detection rate of 23.89%, compared with 3.75% for random transaction selection, a 6.37x improvement. By checking just 400 transactions, corresponding to about 4% of monthly transactions, the bank can uncover with the model almost 24% of frauds on average.

Data preparation

Several changes were made to the data set. The merchantZip column contains 3260 unique categories. Missing values account for 19.4% of its entries, rising to 31.6% when entries marked as ‘0’ are included. Both are relabeled as Unknown and kept in the data set. Each of the remaining merchantZip codes accounts for less than 1% of entries. The most common values in posEntryMode are ‘5’, ‘81’ and ‘1’, present in 59.25%, 30.2% and 8.9% of cases respectively, with each remaining value below 1%.
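The relabeling step can be sketched with pandas. This is a minimal illustration on made-up values, assuming the zeros are stored as the string ‘0’; the variable names and toy data are hypothetical and not taken from the project code.

```python
import pandas as pd

# Toy stand-in for the merchantZip column (values are invented).
zips = pd.Series(
    ["SW1A", None, "0", "M1", None, "0", "SW1A", "E1"], name="merchantZip"
)

# Treat missing values and the literal '0' as one "Unknown" category.
cleaned = zips.fillna("Unknown").replace({"0": "Unknown"})

# Relative frequency of each category, largest first.
shares = cleaned.value_counts(normalize=True)
unknown_share = shares["Unknown"]
print(shares)
```

Running `value_counts(normalize=True)` after the relabeling is a quick way to confirm how much of the column ends up in the Unknown category.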

In the analysis I converted eventId, accountNumber, merchantId, mcc, merchantCountry, merchantZip and posEntryMode to the string data type, because Scikit-Learn’s DictVectorizer only one-hot encodes features whose values are strings. However, eventId and merchantId are dropped from the data set because their very high number of unique values offers no discrimination. On the string features, I calculated the mutual information score from Scikit-Learn (also called information gain) against the fraud target. The values are low for every feature, suggesting that each one on its own shares little information with the fraud label. transactionTime is set as a datetime type but is also dropped from the data set. transactionAmount and availableCash are the only two numerical features and are kept. The Pearson correlation between them is very low, suggesting little linear relationship between the two.
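The two Scikit-Learn behaviours mentioned above can be seen on a small example. This is a sketch on invented records (the mcc codes and the `informative`/`uninformative` features are hypothetical), not the project’s actual pipeline.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import mutual_info_score

# Toy records; the mcc codes below are made up for illustration.
numeric_records = [{"mcc": 5411}, {"mcc": 5812}, {"mcc": 5411}, {"mcc": 5812}]
string_records = [{"mcc": str(r["mcc"])} for r in numeric_records]

# Numeric values pass through as a single numeric column.
dv_numeric = DictVectorizer(sparse=False)
X_numeric = dv_numeric.fit_transform(numeric_records)

# String values are expanded into one binary column per category.
dv_string = DictVectorizer(sparse=False)
X_string = dv_string.fit_transform(string_records)

print(dv_numeric.feature_names_)
print(dv_string.feature_names_)

# Mutual information between a categorical feature and the fraud flag.
is_fraud = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # splits the target perfectly
uninformative = ["a", "b", "a", "b"]  # independent of the target
mi_informative = mutual_info_score(is_fraud, informative)
mi_uninformative = mutual_info_score(is_fraud, uninformative)
print(mi_informative, mi_uninformative)
```

A feature that separates the classes perfectly reaches the maximum score for a balanced binary target (ln 2), while an independent feature scores zero; scores close to zero are what "low values" means in the paragraph above.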

Exploratory data analysis

The exploratory data analysis plots in Figure 1 show some interesting patterns, for example that many frauds occur in the summer months. Some of the observations are summarized in Table 1.

Figure 1 – Transactions per account, frauds per month, number of fraudulent cases, fraud amount per account and number of frauds per account.

Description	Value
Total number of transactions	118621
Number of non-fraudulent transactions	117746 (99.26% of total)
Number of fraudulent transactions	875 (0.74% of total)
Total number of accounts	766
Number of accounts with fraud	167
Percentage of fraud per transaction	0.74%
Percentage of accounts with fraud	21.8%
Percentage of accounts with less than £1000 of fraud	83.23%

Table 1 – Summary statistics for the data set.

The data set consists of card payment transactions from January 1, 2017 to January 3, 2018. It contains 118621 transactions, of which 117746 are non-fraudulent (99.26% of total) and 875 are fraudulent (0.74% of total). The data set covers 766 accounts, 167 of which were subject to fraud. Even though the percentage of fraud per transaction is small, fraud cases affect 21.8% of accounts. Among the accounts with fraud, 83.23% lost less than £1000.
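The percentages in Table 1 follow directly from the raw counts. A short check, using only the numbers quoted above:

```python
# Counts taken from Table 1.
total_transactions = 118621
fraud_transactions = 875
total_accounts = 766
fraud_accounts = 167

fraud_rate = 100 * fraud_transactions / total_transactions
non_fraud_rate = 100 - fraud_rate
account_fraud_rate = 100 * fraud_accounts / total_accounts

print(f"Fraudulent transactions: {fraud_rate:.2f}%")
print(f"Non-fraudulent transactions: {non_fraud_rate:.2f}%")
print(f"Accounts with fraud: {account_fraud_rate:.1f}%")
```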

Modeling

I used the RandomUnderSampler class from imbalanced-learn to under-sample the majority class. Since 875 cases in my data set are fraudulent, the class randomly selects 875 non-fraud cases without replacement to generate a balanced data set. I used the balanced data set over ten months’ worth of data to build four simple machine learning classifiers: logistic regression, decision tree, random forest and histogram gradient boosting. Of the four, the random forest obtained the best balanced accuracy on the unbalanced validation data set, with the highest mean and the lowest standard deviation from 5-fold cross-validation, at 0.819 ± 0.009. The balanced accuracy score is defined as the mean of the recall of the two target classes.
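To make the under-sampling idea concrete, here is a plain-Python sketch of what random under-sampling without replacement does. It is not imbalanced-learn’s implementation (in practice `RandomUnderSampler.fit_resample` handles this); the function name `undersample_majority` and the toy labels are hypothetical, and the sketch assumes fraud (label 1) is the minority class.

```python
import random


def undersample_majority(labels, seed=0):
    """Return sorted row indices of a balanced subset.

    Keeps every fraud row (label 1) and an equally sized random sample
    of non-fraud rows (label 0), drawn without replacement.
    """
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    kept_legit = rng.sample(legit_idx, k=len(fraud_idx))
    return sorted(fraud_idx + kept_legit)


# Toy data: 5 frauds among 100 transactions.
labels = [1] * 5 + [0] * 95
balanced_idx = undersample_majority(labels)
balanced_labels = [labels[i] for i in balanced_idx]
print(len(balanced_labels), sum(balanced_labels))
```

The balanced subset is used only for fitting; evaluation, as described above, is done on unbalanced data so that the scores reflect the real class proportions.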

Balanced accuracy (\(\mu \pm \sigma\))	Training set values	Validation set values
Logistic regression	0.818 ± 0.099	0.815 ± 0.026
Decision tree	0.796 ± 0.073	0.78 ± 0.027
Random forest	0.814 ± 0.05	0.819 ± 0.009
Histogram gradient boosting	0.835 ± 0.08	0.812 ± 0.02

Table 2 – Balanced accuracy scores for four classifiers.
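The balanced accuracy reported in Table 2 is the mean of the per-class recalls. The sketch below computes it by hand on invented labels to show why it is preferred over plain accuracy on imbalanced data; Scikit-Learn provides the same metric as `balanced_accuracy_score`.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the fraud class (1) and the non-fraud class (0)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    recall_fraud = true_pos / positives
    recall_legit = true_neg / negatives
    return (recall_fraud + recall_legit) / 2


# Toy labels: 2 frauds among 10 transactions.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that never flags fraud is 80% accurate but only 50% balanced-accurate.
always_legit = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, always_legit)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, always_legit))

# One fraud caught and one false alarm.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))
```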

The result of undersampling the majority class and using the random forest classifier can be observed in the receiver operating characteristic curve in Figure 2. The curve tends toward the upper left-hand corner of the plot, and a high area under the curve (AUC) of 0.9088 indicates that this classifier separates the two classes well.

Figure 2 – Receiver operating characteristic (ROC) curve for random forest classifier.
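The area under a ROC curve can be read as the probability that a randomly chosen fraud receives a higher score than a randomly chosen non-fraud. A minimal sketch of that interpretation on invented scores follows; the function name `roc_auc` is hypothetical, and in practice Scikit-Learn’s `roc_auc_score` computes the same quantity.

```python
def roc_auc(y_true, scores):
    """Fraction of (fraud, non-fraud) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: three of the four fraud/non-fraud pairs are ordered correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.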

The random forest classifier was then tested against two left-out data sets, each containing one month’s worth of unseen data, which correspond in the original data set to the 9th and 10th months. The recall score shows the model’s ability to detect fraudulent payment transactions.
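Because only 400 transactions can be investigated each month, the recall of interest is the share of frauds that fall among the 400 highest-scoring transactions. The sketch below shows one way to compute such a recall at k, assuming the model outputs a fraud score per transaction; the function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(y_true, fraud_scores, k):
    """Share of all frauds found among the k highest-scoring transactions."""
    ranked = sorted(
        range(len(y_true)), key=lambda i: fraud_scores[i], reverse=True
    )
    flagged = ranked[:k]
    caught = sum(y_true[i] for i in flagged)
    return caught / sum(y_true)


# Toy month: 8 transactions, 3 frauds, capacity to check only 3.
y_true = [0, 1, 0, 0, 1, 0, 1, 0]
fraud_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.05, 0.7, 0.4]
recall_top3 = recall_at_k(y_true, fraud_scores, k=3)
print(recall_top3)
```

In the toy month two of the three frauds are among the three flagged transactions, so the recall at k is two thirds.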

Description	Value
Recall score using the 400 most-likely detections	27.78%
Balanced accuracy score	61.92%
Precision score	3.73%
Average random fraud detection rate on 30 bootstrapped sets	4.07%
Improvement of model detection over average random detection	6.8x

Table 3 – Results from first left-out data set.

Description	Value
Recall score using the 400 most-likely detections	20.0%
Balanced accuracy score	58.01%
Precision score	2.0%
Average random fraud detection rate on 30 bootstrapped sets	3.42%
Improvement of model detection over average random detection	5.9x

Table 4 – Results from second left-out data set.
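The random baseline in Tables 3 and 4 can be approximated by repeatedly picking 400 transactions at random and measuring the share of frauds caught. The sketch below is one plausible reading of that procedure, not necessarily the exact bootstrap used in the project; the function name `random_detection_rate` and the synthetic month of 10,000 transactions with 50 frauds are assumptions.

```python
import random
import statistics


def random_detection_rate(y_true, k, n_rounds=30, seed=0):
    """Average share of frauds caught when k transactions are picked at random."""
    rng = random.Random(seed)
    total_fraud = sum(y_true)
    rates = []
    for _ in range(n_rounds):
        picked = rng.sample(range(len(y_true)), k)
        rates.append(sum(y_true[i] for i in picked) / total_fraud)
    return statistics.mean(rates)


# Synthetic month: 10,000 transactions, 50 of them fraudulent.
y_true = [1] * 50 + [0] * 9950
baseline = random_detection_rate(y_true, k=400)
print(f"{100 * baseline:.2f}%")
```

On average a random pick of k out of N transactions catches k/N of the frauds, which is why the baseline sits near 4% when 400 transactions are checked.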

Description	Value
Recall average	23.89%
Random detection rate average	3.75%
Average improvement	6.37x

Table 5 – Average improvement of model detection over random selection.

Since 400 checks correspond to about 4% of the average number of transactions in a month, the bank can uncover on average almost 24% of frauds by checking only 4% of all monthly transactions.
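The summary figures can be reproduced from Tables 1, 3, 4 and 5. Dividing the total number of transactions by 12 to get a monthly average is an assumption here, since the data set spans slightly more than twelve months.

```python
# Values from Tables 3 and 4 (percentages).
recall_month_1, recall_month_2 = 27.78, 20.0
random_month_1, random_month_2 = 4.07, 3.42

avg_recall = (recall_month_1 + recall_month_2) / 2
avg_random = (random_month_1 + random_month_2) / 2  # reported as 3.75 in Table 5

# Improvement computed from the rounded averages in Table 5.
improvement = 23.89 / 3.75

# Share of monthly transactions covered by 400 checks
# (assumes 12 roughly equal months).
avg_monthly_transactions = 118621 / 12
checked_share = 100 * 400 / avg_monthly_transactions

print(f"Average recall: {avg_recall:.2f}%")
print(f"Improvement: {improvement:.2f}x")
print(f"Share of transactions checked: {checked_share:.2f}%")
```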

If you want to know more about this work, the code can be found on my GitHub repository. Have fun!

Author

Angelo Varlotta

If you can’t explain it simply, you don’t understand it well enough – Albert Einstein
