Analyzing credit card transactions using machine learning techniques – 2

Principal Component Analysis – Introduction and Data Preparation

Principal Component Analysis [PCA] is a widely used unsupervised algorithm for reducing the dimensionality of a data set. A good visual explanation can be found here:

http://setosa.io/ev/principal-component-analysis/

As mentioned in our previous article, Correspondence Analysis  works exclusively on categorical data. In contrast, PCA accepts only numerical data. This means our data set needs to be pre-processed before being fed into PCA. We perform two pre-processing steps:

  • We turn the categorical columns into numerical data using a one-hot representation. This creates one new column per category, each populated with “0” or “1” depending on the original value. The categorical data becomes numerical at the cost of more dimensions, which is not always a good thing (see “the curse of dimensionality”).
  • We normalize the “total” column so that its values range from 0 to 1, with 0 corresponding to the original minimum value and 1 to the original maximum value. This puts the “total” column on the same scale as the other columns, which also range between 0 and 1. Since PCA is sensitive to the scale of its inputs, this keeps the raw magnitude of “total” from swamping the one-hot columns.
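The two pre-processing steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Orange widget itself: the helper names `one_hot` and `min_max` and the sample rows are made up for this example.

```python
def one_hot(values):
    """Return (column_names, rows) where each row is a list of 0/1 flags."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def min_max(values):
    """Rescale numbers so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample rows: (service type, total)
transactions = [
    ("Legal and Court Fees", 500.0),
    ("Vehicle Running Costs", 20.0),
    ("Legal and Court Fees", 260.0),
    ("Grant Payments", 140.0),
]

service_columns, service_flags = one_hot([t[0] for t in transactions])
scaled_totals = min_max([t[1] for t in transactions])

# One numeric row per transaction: one-hot flags followed by the scaled total
numeric_rows = [flags + [total] for flags, total in zip(service_flags, scaled_totals)]
```

Each resulting row contains only numbers between 0 and 1, which is the form PCA expects.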

Results

PCA_1

As with Correspondence Analysis, most of the observations are grouped in the bottom left corner, with outliers that may represent anomalies trailing off towards the top right. Knowing what we do about our data set, a good guess is that the roughly linear arrangement of these outliers is due to the “total” column, which is genuinely numerical (as opposed to categorical data converted to a one-hot representation) and so has far more variance than the other columns.
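For readers who want to see what PCA does with such data, here is a minimal sketch that finds the first principal component by power iteration on the covariance matrix and projects each row onto it. Orange uses its own implementation; the sample rows (two one-hot columns plus a scaled “total”) and all function names here are assumptions made for illustration.

```python
import math


def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def first_component(cov, iterations=500):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    d = len(cov)
    # Asymmetric start vector, so it is unlikely to be orthogonal to the answer
    v = [1.0 + j for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero covariance matrix: nothing to find
            break
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue


# Hypothetical rows: [is_service_a, is_service_b, scaled_total]
rows = [
    [1, 0, 0.02], [1, 0, 0.05], [1, 0, 0.03],
    [0, 1, 0.04], [0, 1, 0.90], [0, 1, 1.00],
]

cov, centered = covariance_matrix(rows)
pc1, variance = first_component(cov)

# Position of each transaction along the first principal component
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

Plotting `scores` (together with the scores for a second component) gives the kind of scatter plot discussed in this section, where rows that sit far from the bulk of the data are candidates for closer inspection.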

Starting with the most obvious top-rightmost outlier:

PCA_2

The outlier corresponds to an entry with an extraordinarily large “total” of money transferred; in fact, the amount is the maximum value in the data set:

PCA_3

In fact, the more pronounced outliers all follow the same pattern: they are payments to “HMCOURTS-SERVICE”, each with an abnormally high “total” entry:

PCA_4

The next interesting outlier lies slightly below this linear streak:

PCA_5

This outlier again corresponds to a rather high “total”, but one paid to a different service type (“Furniture-Purchase-Repair”). This would explain why the outlier sits on a lower plane than the others.

Conclusions

Compared to Correspondence Analysis, PCA was more heavily influenced by the “total” column, so its results are most helpful for identifying outliers with higher-than-average “total” values. This is very useful given that this data set revolves around the “total” column. However, we would have missed some important relations in the data set had we not also analysed it using Correspondence Analysis.

In the next and last article of this series we discuss a clustering algorithm which makes it easier to identify outliers in our data without having to manually process the results of PCA as we did above.



Analyzing credit card transactions using machine learning techniques

Introduction

In this 3-part series we’ll explore how three machine learning algorithms can help a hypothetical financial analyst explore a real data set of credit card transactions to quickly and easily infer relationships, spot anomalies and extract useful data.

Data Set

The data set we’ll use in this hypothetical scenario is a real data set released by the UK government, specifically the London Borough of Barnet. The full data set can be downloaded from here:

https://www.europeandataportal.eu/data/en/dataset/corporate-credit-card-transactions-2015-16

In a nutshell, the dataset contains raw transactions (one per row) which contain the following information:

  • Transaction ID and Date
  • Type of account making the payment
  • Type of service paid for
  • Total amount paid

Algorithms & Workflow

For the purposes of this article, we will process the data set using Orange, an excellent Python toolset from the University of Ljubljana. For more details please have a look at their site: https://orange.biolab.si/. I won’t delve into Orange’s details in this post; it’s enough to display the workflow used to obtain our results:

Figure: Orange data pipeline and workflow for the financial anomalies analysis

Note how we first load our data from the CSV file, and then do one of two things:

  • Discretize the data. Correspondence analysis only accepts categorical data, so we split the “total” column – which is numeric – into categories, for example:
    • <10
    • 11-21
    • 22-31
    • >200
  • Continuize the data. Conversely, PCA and clustering accept only numerical values, so we use this function to perform “one-hot encoding” on categorical data, and we also normalize our data. We then feed this data into PCA and view the results there, before feeding the results into our clustering algorithm to make it easier to spot anomalies.
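The discretization step can be sketched with the standard library’s `bisect` module. The bin edges and labels below are hypothetical and chosen only to show the mechanics; they are not the cut points Orange picked for this data set.

```python
from bisect import bisect_right

# Hypothetical bin edges; the actual cut points used in the workflow may differ
BIN_EDGES = [10, 50, 200]
BIN_LABELS = ["<10", "10-50", "50-200", ">=200"]


def discretize(total):
    """Map a numeric total to the label of the bin it falls into.

    Each bin includes its lower edge and excludes its upper edge.
    """
    return BIN_LABELS[bisect_right(BIN_EDGES, total)]


totals = [4.5, 10, 75.0, 199.99, 217.99, 5000]
labels = [discretize(t) for t in totals]
```

After this step the “total” column holds category labels instead of numbers, so it can be fed to Correspondence Analysis alongside the account and service columns.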

Correspondence Analysis

We first explore “correspondence analysis”, an unsupervised method akin to a “PCA for categorical data”. It is extremely useful for exploring a data set and quickly discovering the relationships within it. For example, in our financial data set we get the following result:

correspondence analysis

  • The red dots are the “type of service paid for”
  • The blue dots are the “type of account”
  • The green dots are the amount of money, or “total”

Overall, most observations are clustered in the bottom left of the graph; however, a few observations stand out.

In the top right corner of the graph, we see three points which are set apart:

CA_1

This suggests that “Fees and Charges” is closely related to the “Customer Support Group”, as is “Legal and Court Fees”. This is indeed the case, since:

  • All “Fees and Charges” payments were made by “Customer Support Group”

CA_1a

  • All “Legal and Court Fees” payments were made by either “Customer Support Group” or “Children’s Family Services”. It is interesting to note that “Legal and Court Fees” is pulled closer to the other points, since it is connected to “Children’s Family Services”, which in turn is responsible for many different transactions within the data set.

CA_1b

Towards the bottom of the graph we see the following cluster:

CA_2

So we expect to find some sort of “special” relationship between these observations. In fact, upon verification we find that:

  • All “operating leases – transport” transactions were made by “Streetscene”:

CA_2a

  • 95% of all “Vehicle Running costs” were made by “Streetscene”
  • Over 53% of all “Streetscene” transactions were of type “operating leases – transport” or “vehicle running costs”, which explains why the blue “Streetscene” observation is “pulled” towards these two red dots.
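Percentages like these can be checked by counting (account, service) pairs. Below is a minimal sketch using `collections.Counter`; the rows are invented stand-ins for the real transactions, so the shares they produce do not match the figures quoted above.

```python
from collections import Counter

# Hypothetical (account, service) pairs standing in for the real rows
rows = [
    ("Streetscene", "Operating Leases - Transport"),
    ("Streetscene", "Operating Leases - Transport"),
    ("Streetscene", "Vehicle Running Costs"),
    ("Streetscene", "Equipment"),
    ("Customer Support Group", "Vehicle Running Costs"),
    ("Customer Support Group", "Fees and Charges"),
]

pair_counts = Counter(rows)
service_counts = Counter(service for _, service in rows)
account_counts = Counter(account for account, _ in rows)


def share_of_service(account, service):
    """Fraction of a service type's transactions made by the given account."""
    return pair_counts[(account, service)] / service_counts[service]


def share_of_account(account, service):
    """Fraction of an account's transactions that are of the given service type."""
    return pair_counts[(account, service)] / account_counts[account]
```

`share_of_service` answers questions such as “what fraction of vehicle running costs came from Streetscene?”, while `share_of_account` answers “what fraction of Streetscene’s transactions were vehicle running costs?”.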

Towards the middle of the graph we see two particular observations grouped together:

CA_3

Upon investigation we find that the above clustering is justified since:

  • All “grant payments” were above 217.99:

CA_3a

As can be seen, Correspondence Analysis allows us to quickly and easily draw informed conclusions about our data. In this particular case, it allows our hypothetical financial analyst to spot relationships between accounts, the types of services they use, and the amounts of money they spend. In turn, this allows the analyst to determine whether any of these relationships are “fishy” and warrant further investigation.
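For the curious, the starting point of Correspondence Analysis can be sketched as follows: build a contingency table of counts, compare each cell with what independence between rows and columns would predict, and scale the difference. The sum of the squared residuals is the “total inertia” (the chi-square statistic divided by the number of observations). The full method then applies a singular value decomposition to these residuals to get the plotted coordinates, which is omitted here. The function names and sample tables are assumptions for illustration, and the sketch assumes no row or column of the table is entirely zero.

```python
import math


def standardized_residuals(table):
    """Return S with S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j).

    p is the table divided by its grand total; r and c are the row and
    column sums of p. Assumes every row and column has a non-zero sum.
    """
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    r = [sum(row) for row in p]
    c = [sum(col) for col in zip(*p)]
    return [[(p[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j])
             for j in range(len(c))] for i in range(len(r))]


def total_inertia(table):
    """Sum of squared standardized residuals (chi-square statistic / n)."""
    return sum(s * s for row in standardized_residuals(table) for s in row)


# Hypothetical counts: rows are accounts, columns are service types
strongly_linked = [[5, 0], [0, 5]]   # each account uses exactly one service
unrelated = [[2, 4], [3, 6]]         # both accounts use services in the same ratio

linked_inertia = total_inertia(strongly_linked)
unrelated_inertia = total_inertia(unrelated)
```

A table where accounts and services are unrelated has an inertia close to zero, while strong links such as “all fees and charges were made by one account” push it up; those links are what show up as points set apart on the graph.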

Principal Component Analysis

Our next article explores this data set using PCA!

Hierarchical Clustering

Watch this space for a link to our next article exploring this data set using Hierarchical Clustering!