Credit Card Fraud
I. Introduction:
As demand for consumer goods continues to grow, so does the need for more convenient and efficient payment methods, and the credit card is now one of the most widely used. However efficient it is, card holders cannot ignore the risk of having their card or card information stolen and being charged for fraudulent payments.
Our project aims to analyze trends in credit card fraud using data found online and to build machine learning models that predict whether a credit card transaction is authentic.
II. How the prediction works:
Firstly, a specific target has to be selected from the columns of the dataset. In this case, the prediction should be whether the credit card transaction is a fraud or not. Secondly, a set of features, excluding the target, needs to be created, which will be used to train the machine learning model and ultimately to make the prediction. The features should show some correlation with the target, as this improves the result of the training. Thoroughly analyzing the data and constructing a correlation matrix are two common ways to check the relationships between features in the data frame. After the required inputs have been chosen, the data frame is split with the function “train_test_split” into a fitting dataset and a validation dataset, each with its own features and target. Next, the model is trained on the fitting data with different algorithms and then validated on the validation data. Finally, the trained model can be used to make predictions.
Each step of this overview is described below.
1. Information about dataset and choosing prediction target:
Because bank transaction data often contains a lot of private personal information, this is one of the few datasets we could find online that contains real transaction data. We found it on DataCamp. It consists of approximately 330,000 credit card transactions from the western United States. For each transaction it includes customer details, the merchant and category of purchase, and whether or not the transaction was fraudulent.
In the dataset for the credit card fraud detection project, the target is “is_fraud”, a categorical variable where “0” means valid and “1” means fraudulent.
2. Analyzing data
Here is the list of columns of the data frame:
Since some of the columns are non-numeric and the machine learning models only work with numeric values, the non-numeric features need to be encoded before fitting.
- Encoding column “trans_date_trans_time”: From this feature, the time of the transaction can be used to train the model. In our opinion, the most logical approach is to group the time of day into four categories and one-hot encode them: “morning” from 5 am to 11 am, “noon/afternoon” from 11 am to 5 pm, “evening” from 5 pm to 11 pm, and “night/after midnight” from 11 pm to 5 am. One-hot encoding creates a new binary feature in the data frame for each time of day. For instance, if a transaction happens in the morning, its “morning” feature has the value “1” and the other time-of-day features have the value “0”. Additionally, the fraud count for each time of day can be plotted in a diagram.
- Encoding column “merchant”: Since this variable has too many distinct values, one-hot encoding isn’t a reasonable solution. Instead, we use the frequency with which a merchant appears in the data frame as its numeric value. In the code, the new frequency column is called “merchant encoded”.
After that, the number of fraudulent transactions for each merchant can be counted and tested for correlation with the merchant frequency.
- Encoding column “job”: Similar to “merchant”.
- Encoding column “category”: Similar to “merchant”.
- Encoding columns “city” and “state”: These features can be treated like “merchant”, “category” and “job”, since they also have a wide range of values. However, because this dataset originates from the western United States, the cities and states can also be grouped into three regions, “West Coast”, “Mountain States” and “Midwest”, which are then one-hot encoded.
- Encoding column “dob”: This feature stands for “date of birth”. From the birthdate of the card holders, their age can be calculated, which is a numerical variable that can be utilized for the training process.
For analysis purposes, the “age” column can also be divided into four groups: “Group 20–40”, “Group 40–60”, “Group 60–80”, and “Group 80 and over”. The number of frauds in each age group can be counted and plotted as follows.
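The encodings described in this section could be sketched roughly as follows. This is a minimal illustration on a tiny made-up data frame, not our original notebook code: the column names come from the dataset, while the sample values, the helper `time_of_day`, the shortened bucket names, the new column name “age group”, and the choice to compute the age at the time of the transaction are assumptions.

```python
import pandas as pd

# Tiny synthetic frame: column names follow the dataset, values are made up.
df = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime([
        "2020-01-01 06:30:00", "2020-01-01 12:15:00",
        "2020-01-01 19:45:00", "2020-01-02 02:10:00",
    ]),
    "merchant": ["shop_a", "shop_b", "shop_a", "shop_c"],
    "dob": pd.to_datetime(["1980-05-20", "1995-11-02", "1950-01-15", "1972-08-30"]),
})

TIME_BUCKETS = ["morning", "afternoon", "evening", "night"]

def time_of_day(hour: int) -> str:
    """Map an hour (0-23) to one of the four time-of-day buckets."""
    if 5 <= hour < 11:
        return "morning"
    if 11 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 23:
        return "evening"
    return "night"

# One-hot encoding: one 0/1 column per bucket (reindex keeps unseen buckets).
buckets = df["trans_date_trans_time"].dt.hour.map(time_of_day)
one_hot = (
    pd.get_dummies(buckets)
    .reindex(columns=TIME_BUCKETS, fill_value=0)
    .astype(int)
)
df = pd.concat([df, one_hot], axis=1)

# Frequency encoding: replace each merchant by how often it occurs.
df["merchant encoded"] = df["merchant"].map(df["merchant"].value_counts())

# Age in whole years at transaction time (assumption: 365.25 days per year).
age_days = (df["trans_date_trans_time"] - df["dob"]).dt.days
df["age"] = (age_days / 365.25).astype(int)

# Age groups for analysis; ages below 20 would fall outside the bins (NaN).
df["age group"] = pd.cut(
    df["age"],
    bins=[20, 40, 60, 80, float("inf")],
    right=False,
    labels=["Group 20–40", "Group 40–60", "Group 60–80", "Group 80 and over"],
)

print(df[TIME_BUCKETS + ["merchant encoded", "age", "age group"]])
```

The same frequency encoding can be applied to “job”, “category”, “city” and “state” by swapping the column name.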
3. Creating correlation matrix
Another way to check whether a column is related to the target feature is to construct a correlation matrix. Each element of the matrix ranges from “-1” to “1”. A value of “0” indicates no linear correlation, whereas “1” and “-1” indicate the strongest positive and negative correlation, respectively.
For better visualization, we can display only the correlations between “is_fraud” and other columns.
Since this diagram shows little relationship between the variables, we decided to recompute the correlation matrix on a smaller dataset containing one percent of the original data. Here’s the result:
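The correlation check in this section could be sketched as follows. The data here is randomly generated for illustration only, so the printed numbers say nothing about the real dataset; the two feature columns, the labelling rule and the seeds are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 50_000
amt = rng.exponential(scale=70.0, size=n)
age = rng.integers(20, 90, size=n)
# Toy rule: fraud is somewhat more likely for large amounts.
is_fraud = (rng.random(n) < np.where(amt > 200, 0.06, 0.01)).astype(int)
df = pd.DataFrame({"amt": amt, "age": age, "is_fraud": is_fraud})

# Full correlation matrix (Pearson by default); every value lies in [-1, 1].
corr = df.corr()

# Keep only the correlations between "is_fraud" and the other columns.
target_corr = (
    corr["is_fraud"]
    .drop("is_fraud")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)

# Repeat the check on a random one-percent subsample of the rows.
sample = df.sample(frac=0.01, random_state=0)
sample_corr = sample.corr()["is_fraud"].drop("is_fraud")
print(sample_corr)
```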
4. Building the model
- Selecting features and target: After analyzing the data and checking correlations, the following columns can be selected as features for the model: ‘amt’, ‘lat’, ‘long’, ‘merch_lat’, ‘merch_long’, ‘age’, ‘state encoded’, ‘job encoded’. Additionally, the target feature has to be specified.
- Splitting dataset: Using “train_test_split”, the data frame can be separated into two sets, one for fitting and the other for validating.
- Importing machine learning models: For this project, we chose to work with “Decision Tree Classifier”, “Random Forest Classifier”, “Logistic Regression” and “Support Vector Machine Classifier”, since these are common models that are easy to work with, which is especially helpful for our first data science project.
- Training the model: With the values of the features and the target from the fitting dataset, the models can be trained. Each of the machine learning models listed above has its own training algorithm, so they may perform differently, although they can sometimes deliver the same prediction.
This cell of code is for all of the steps above.
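As a rough sketch of these steps using scikit-learn, the snippet below selects the features and target, splits the data, and trains the four models. It runs on randomly generated stand-in data; the labelling rule, the 80/20 split ratio, the seeds, and the model settings (`n_estimators=50`, `max_iter=5000`) are assumptions made for the example and are not taken from our notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded data frame (random values, not real data).
rng = np.random.default_rng(42)
n = 2_000
df = pd.DataFrame({
    "amt": rng.exponential(70.0, n),
    "lat": rng.uniform(30.0, 49.0, n),
    "long": rng.uniform(-125.0, -95.0, n),
    "merch_lat": rng.uniform(30.0, 49.0, n),
    "merch_long": rng.uniform(-125.0, -95.0, n),
    "age": rng.integers(20, 90, n),
    "state encoded": rng.integers(1, 500, n),
    "job encoded": rng.integers(1, 300, n),
})
# Toy labelling rule so that the models have something to learn.
df["is_fraud"] = (df["amt"] > 250).astype(int)

# Selecting features and target.
features = ["amt", "lat", "long", "merch_lat", "merch_long",
            "age", "state encoded", "job encoded"]
X = df[features]
y = df["is_fraud"]

# Splitting into a fitting set and a validation set (assumed 80/20 ratio).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Importing and training the four models.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Support Vector Machine": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"trained {name}")
```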
- Making the first prediction: After the models are trained, the first prediction can be made. We can take a random row from the validation dataset, which the models have not seen so far, to see how they perform. The code below can be used in the same way for further predictions on new data.
As depicted, all models predict the transaction as non-fraud. To check whether the prediction is correct, let’s reveal the actual target value of this random row:
The transaction is in fact valid, which means the models have predicted correctly.
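A single-row prediction of this kind might look like the sketch below. To keep it short and self-contained, it uses only two of the feature columns on synthetic data; the labelling rule and the seeds are assumptions, so the printed values do not reproduce the result described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a toy labelling rule.
rng = np.random.default_rng(7)
n = 1_000
X = pd.DataFrame({
    "amt": rng.exponential(70.0, n),
    "age": rng.integers(20, 90, n),
})
y = (X["amt"] > 200).astype(int).rename("is_fraud")

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Support Vector Machine": SVC(),
}
for model in models.values():
    model.fit(X_train, y_train)

# Pick one random transaction that the models have not seen during fitting.
row = X_valid.sample(n=1, random_state=1)
predictions = {name: int(model.predict(row)[0]) for name, model in models.items()}
print(predictions)

# Reveal the actual label of that transaction (0 = valid, 1 = fraudulent).
actual = int(y_valid.loc[row.index[0]])
print("actual:", actual)
```

For new data, the same `predict` call works on any data frame that has the same feature columns as the fitting data.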
5. Validating the performance:
A single random test is not enough to judge the accuracy of the models. Therefore, a validation needs to be carried out using the validation dataset. The process is similar to the random test: the models receive the validation data as input and make a prediction for every transaction in the set. In the end, the predictions are compared to the actual values of the target feature, and the “mean absolute error” and the “accuracy score” of each model are calculated.
The table shows that every model has a very high accuracy score and a low mean absolute error.
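The validation step can be illustrated with scikit-learn’s `accuracy_score` and `mean_absolute_error`. The sketch below again uses synthetic data with an assumed labelling rule and only two of the models, so its numbers are not the ones from our table. Note that for 0/1 labels the mean absolute error is simply one minus the accuracy, so the two metrics carry the same information here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a toy labelling rule.
rng = np.random.default_rng(3)
n = 1_000
X = pd.DataFrame({
    "amt": rng.exponential(70.0, n),
    "age": rng.integers(20, 90, n),
})
y = (X["amt"] > 200).astype(int).rename("is_fraud")

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=5000),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    # Predict every transaction in the validation set and compare to the truth.
    pred = model.predict(X_valid)
    rows.append({
        "model": name,
        "accuracy score": accuracy_score(y_valid, pred),
        "mean absolute error": mean_absolute_error(y_valid, pred),
    })

results = pd.DataFrame(rows)
print(results)
```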
6. Improving model performance:
The first option to improve the prediction accuracy of a model is to make sure the right features are chosen for training. It is important to identify the features that have a noticeable correlation with the target and to select the right number of features. Using too many features for training can result in overfitting, and using too few can lead to underfitting.
Another solution is hyperparameter tuning. Hyperparameters are parameters that are not learned during training but are set, or tuned, before the fitting process. They affect the learning and behavior of the model in different ways. Each machine learning model has its own set of hyperparameters, and with different combinations of these parameters the model performs differently and may deliver different predictions. By training the same machine learning model with different sets of hyperparameters and comparing their prediction accuracy using “cross validation”, the best combination for the model can be selected and used. To tune each model, we picked four of its most common hyperparameters and at most four values for each.
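One common way to carry out such a search in scikit-learn is `GridSearchCV`, which fits the model for every combination in a grid and scores each combination with k-fold cross validation on the fitting data. The sketch below tunes a decision tree on synthetic data; the four hyperparameters, their candidate values, the five folds and the labelling rule are assumptions chosen for illustration and are not necessarily the grid we used.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a toy labelling rule.
rng = np.random.default_rng(11)
n = 2_000
X = pd.DataFrame({
    "amt": rng.exponential(70.0, n),
    "age": rng.integers(20, 90, n),
})
y = (X["amt"] > 200).astype(int).rename("is_fraud")

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Four hyperparameters with at most four candidate values each (assumed grid).
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination is scored by 5-fold cross validation on the fitting data.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best combination:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)

# The best model is refit on the whole fitting set and checked on unseen data.
test_accuracy = accuracy_score(y_valid, search.predict(X_valid))
print("test accuracy:", test_accuracy)
```

The other three models can be tuned the same way by swapping the estimator and its parameter grid.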
The following cells of code present the list of hyperparameters, cross validation, and the best combination for each machine learning model.
- Decision Tree Classifier:
For the decision tree classifier, the test accuracy is 0.995210996731545.
- Random Forest Classifier
Because our dataset contains about 330,000 transactions, it is harder to obtain a result on the full data. We therefore created a smaller dataset of about 3,300 transactions drawn from the original dataset. The following hyperparameters are based on this smaller dataset.
The test accuracy is 0.99250374812937
- Logistic Regression
The test accuracy is 0.992503748125937
- Support Vector Machine
The test accuracy is 0.9940029985007496
In the end, the decision tree classifier has the highest test accuracy, so we conclude that it has the best prediction performance.
III. Challenges:
The phase in which most problems occurred was choosing the data for the project. The topic “Credit card fraud” seems simple to comprehend; nonetheless, finding a dataset to work with is difficult, since this information is confidential not only for the bank but also for the card holders. We decided to work with the dataset from the western United States because it has the most interesting and diverse features and a reasonable amount of information.
Additionally, even though the term “Credit card fraud” is familiar to us all, the number of fraudulent transactions in the dataset, and possibly in reality as well, is minuscule compared to the number of non-fraudulent transactions. Frauds make up less than ten percent of the complete data, which results in an imbalanced dataset. It is preferable for a machine learning model to work with an equal or roughly equal number of frauds and non-frauds; however, since the imbalance reflects the real situation, we decided to use the original imbalanced dataset.
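The class balance of a target column can be checked in a couple of lines with pandas. The labels below are made up (95 valid and 5 fraudulent transactions) and do not reflect the real proportions in our dataset.

```python
import pandas as pd

# Made-up target column: 95 valid (0) and 5 fraudulent (1) transactions.
y = pd.Series([0] * 95 + [1] * 5, name="is_fraud")

# Absolute counts and relative shares of each class.
counts = y.value_counts()
shares = y.value_counts(normalize=True)
fraud_share = float(shares.get(1, 0.0))

print(counts)
print(f"share of fraudulent transactions: {fraud_share:.1%}")
```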
IV. Future opportunities:
There are many areas of the project that we would like to look into if we had more time. Firstly, the models could be tested not only on one dataset but on multiple datasets from different regions and countries around the world, with different features, to analyze new trends and correlations between the features and the authenticity of the transaction. Secondly, we would like to implement other machine learning algorithms, not only to broaden our knowledge of data science but also to look for models that may make better predictions. Furthermore, the process of selecting training features for the model could be carried out more precisely, which would hopefully increase the accuracy of the prediction.
V. Conclusion:
By analyzing the data and examining correlations between features, we were able to build models using machine learning algorithms to detect fraudulent credit card transactions. One challenge was finding reasonable data to work with, since most payment information is confidential. We hope to expand our project by looking for more datasets from other regions with new features and by implementing different learning and prediction methods, not only to discover trends related to credit card fraud but also to deliver more precise predictions.
Team members
1. Anh Duy Pham 2. Wentao Xie 3. Assel Tursunbekova
Mentor
Aravinda Kumaran