Analyzing credit card transactions using machine learning techniques – 3

Introduction

In a previous article, we explored how PCA can be used to plot credit card transactions into a 2D space, and we proceeded to visually analyse the results. In this article, we take this process one step further and use hierarchical clustering to automate parts of our analysis, making it even easier for our hypothetical financial analyst to find anomalies within their data set (for a review of the data set being used, make sure to check out the first article in our series).

Clustering

Clustering is an unsupervised machine learning method where data points are grouped together according to a given “distance metric”. Usually this defaults to Euclidean distance, though other distance functions exist (such as cosine distance) depending on your application. There are a large number of clustering algorithms, such as DBSCAN, hierarchical clustering, and even a hybrid of the two – HDBSCAN. The Orange library implements hierarchical clustering, so we will use it for demonstration purposes in the rest of this article.
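To make the idea of a distance metric concrete, here is a minimal sketch using SciPy rather than Orange. The two "transactions" and their two features are made up purely for illustration. Cosine distance is one minus cosine similarity, so it ignores magnitude, while Euclidean distance does not.

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Two made-up transactions described by (total, number_of_items).
# The values are illustrative only and do not come from the data set.
a = np.array([10.0, 1.0])
b = np.array([100.0, 10.0])  # same "direction" as a, but ten times larger

euclidean_distance = euclidean(a, b)  # sensitive to magnitude
cosine_distance = cosine(a, b)        # 1 - cosine similarity, ignores magnitude

print(f"Euclidean distance: {euclidean_distance:.2f}")
print(f"Cosine distance:    {abs(cosine_distance):.2f}")
```

Under Euclidean distance these two points are far apart, while under cosine distance they are practically identical, which is why the choice of metric depends on the application.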

Results

After running the algorithm, and selecting the top 5 clusters, we see the following results:

HC_1

The data points are organised hierarchically into a tree in which each leaf is a single data point and each branch groups similar points into a cluster. The Orange library helpfully gives each cluster a separate color (on the right hand side), so one can easily discern which data points belong to which cluster.
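For readers who prefer code to a GUI, the same idea can be sketched with SciPy. This is not the Orange workflow used in this article: the data is synthetic, and the Ward linkage is our assumption, since we have not stated which linkage Orange was configured with.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic stand-in for the 2D PCA output: five well-separated blobs
centres = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [20, 20]], dtype=float)
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centres])

# Build the tree bottom-up (Ward linkage on Euclidean distance is an assumption)
tree = linkage(points, method="ward")

# Cut the tree so that at most five clusters remain
labels = fcluster(tree, t=5, criterion="maxclust")

# Cluster labels start at 1, so drop the empty zero bin
cluster_sizes = np.bincount(labels)[1:]
print(cluster_sizes)
```

Selecting the "top 5 clusters" corresponds to cutting the tree at the height where five branches remain.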

We will plot these cluster colors onto our original PCA scatter plot in order to better compare the two and more easily absorb the extra information that the clustering algorithm gives us:

HC_3

As expected, the clustering algorithm grouped most of the points in the lower left corner together in bright yellow. We discussed in the previous article how the PCA algorithm is heavily influenced by the “total” data attribute, since it introduced the most variance in our data set, so we expect that the clustering algorithm will also be highly dependent on the “total” attribute.
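The effect of a high-variance attribute on PCA can be reproduced on synthetic data. The sketch below is only an illustration: the column names are hypothetical, the values are random, and the columns are deliberately not rescaled to equal variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical columns: a wide-ranging "total" and two narrow 0/1 indicators
total = rng.normal(loc=100.0, scale=50.0, size=n)
flag_a = rng.integers(0, 2, size=n).astype(float)
flag_b = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([total, flag_a, flag_b])

# PCA via SVD of the mean-centred data (no rescaling: that is the assumption here)
X_centred = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centred, full_matrices=False)

explained = singular_values**2 / np.sum(singular_values**2)
print("First component loadings:", np.round(components[0], 3))
print("Explained variance ratio:", np.round(explained, 3))
```

The first component points almost entirely along the "total" axis, so any distance-based clustering run on the projected points will mostly be grouping by "total".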

In the previous article, once we processed our data set using PCA, we noticed one extreme outlier, due to the “total” value of this outlier being far higher than that of the other data points. This is reflected in the clustering results by a cluster containing a single point:

HC_2

The first point is the only point in the light blue cluster, while the entries below it make up another small, light red cluster. These clusters contain the entries in our data set where the “total” value of the credit card transaction was higher than is usual for other transactions in the data set.
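Spotting very small clusters like these does not have to be done by eye. A minimal sketch, assuming we already have one cluster label per transaction (the labels and sizes below are made up, and `small_clusters` is a hypothetical helper):

```python
from collections import Counter

# Hypothetical cluster labels, one per transaction
labels = ["C1"] * 40 + ["C2"] * 12 + ["C3"] * 6 + ["C4"] * 3 + ["C5"]


def small_clusters(labels, max_size):
    """Return {cluster: size} for clusters with at most max_size members."""
    sizes = Counter(labels)
    return {cluster: size for cluster, size in sizes.items() if size <= max_size}


flagged = small_clusters(labels, max_size=3)
print(flagged)  # {'C4': 3, 'C5': 1}
```

A cluster containing a single point, such as `C5` here, is a natural first candidate for an analyst to review.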

Looking at the next (light green) cluster, we spot an anomaly that managed to make it past our manual analysis in our previous article:

HC_4

The first data point in the cluster is labelled “Harold Benjamin Solicitors”, while all the rest are from the expected “HMCOURTS”. If we check the raw data set, we see that this entry has a total of “2124”, which is well above the average total seen in our transactions. Therefore we can now add quite a specific anomaly to our already discovered ones: an abnormally high payment has been made to “Harold Benjamin Solicitors”, whereas usually such high payments go to “HMCOURTS”.

However, there is a high probability that this observation is a false positive (i.e. not really an anomaly). We know that solicitors do in fact tend to charge large sums of money, and because this data point is the only transaction in our data set going to the solicitors, we have no frame of reference to decide whether this is an abnormally high payment in the context of solicitors. Incidentally, this highlights why it is difficult to work with “imbalanced data sets”, that is, data sets where different classes of data do not have adequate representation.
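One way to make the "no frame of reference" problem explicit is to count how many transactions each payee has before judging any one of them. The sketch below is hypothetical: only the 2124 figure comes from our data set, the HMCOURTS totals are invented, and the `min_count` threshold is an arbitrary assumption.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows of (payee, total). Only the 2124 solicitors figure is real;
# the HMCOURTS totals are made up for the example.
transactions = [
    ("HMCOURTS", 1800.0),
    ("HMCOURTS", 2050.0),
    ("HMCOURTS", 2300.0),
    ("Harold Benjamin Solicitors", 2124.0),
]


def payee_summary(rows, min_count=3):
    """Summarise totals per payee and flag payees with too little history."""
    totals = defaultdict(list)
    for payee, total in rows:
        totals[payee].append(total)
    return {
        payee: {
            "count": len(values),
            "mean": mean(values),
            "enough_history": len(values) >= min_count,
        }
        for payee, values in totals.items()
    }


summary = payee_summary(transactions)
for payee, stats in summary.items():
    print(payee, stats)
```

A payee with `enough_history` set to `False` cannot be judged against its own past payments, so a flag on it should be treated as a prompt for manual review rather than as a confirmed anomaly.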

Conclusion

We’re at the conclusion of our three-part series, and hopefully you now have a better understanding of how correspondence analysis, PCA, and clustering work.

We’ve seen that since correspondence analysis works directly on tabular categorical data, it is easy to apply to data which is already in the form of a table, and it highlights relationships between the categories in your data. PCA, on the other hand, is more heavily influenced by those attributes which introduce the highest variation in your data. It is very good at highlighting anomalies within such a highly variant attribute, but you may miss some of the relationships within the data – so it’s always better to investigate your data with more than one algorithm and merge the results in your reporting.

While clustering was also highly dependent on “total” – so it did not give us the more subtle insights which correspondence analysis did – it was a very useful addition to plain PCA, since it helps an analyst to quickly split data into more manageable chunks, which makes anomaly detection easier, as was the case with our solicitors payment in the above example.

 

Previous Articles in the Series

Analyzing credit card transactions using machine learning techniques – 1
Analyzing credit card transactions using machine learning techniques – 2

Analyzing credit card transactions using machine learning techniques

Introduction

In this 3-part series we’ll explore how three machine learning algorithms can help a hypothetical financial analyst explore a real data set of credit card transactions to quickly and easily infer relationships, spot anomalies and extract useful data.

Data Set

The data set we’ll use in this hypothetical scenario is a real data set released by a UK local authority, specifically the London Borough of Barnet. The full data set can be downloaded from here:

https://www.europeandataportal.eu/data/en/dataset/corporate-credit-card-transactions-2015-16

In a nutshell, the dataset contains raw transactions (one per row) which contain the following information:

  • Transaction ID and Date
  • Type of account making the payment
  • Type of service paid for
  • Total amount paid

Algorithms & Workflow

For the purposes of this article, we will process the data set using an excellent Python toolset from the University of Ljubljana, called Orange. For more details please have a look at their site: https://orange.biolab.si/. I won’t delve into Orange’s details in this post; it’s enough to display the workflow used to obtain our results:

financial anomalies workflow
Orange Data Pipeline and Workflow

Note how we first load our data from the CSV file, and do one of two things:

  • Discretize the data. Correspondence analysis only accepts categorical data, so we split the “total” column – which is numeric – into categories, for example:
    • <10
    • 11-21
    • 22-31
    • >200
  • Continuize the data. Conversely, PCA and clustering accept only numerical values, so we use this function to perform “one-hot encoding” on categorical data, and we also normalize our data. We then feed this data into PCA and view the results there, before feeding the results into our clustering algorithm to make it easier to spot anomalies.
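Outside Orange, the two preprocessing branches can be approximated with pandas. This is only a sketch of the idea, not what the Orange widgets do internally: the column names, the rows, the bin edges and the choice of min-max scaling are all assumptions made for illustration.

```python
import pandas as pd

# Made-up rows; the column names are hypothetical, not the data set's real headers
df = pd.DataFrame({
    "account": ["Streetscene", "Customer Support Group", "Streetscene", "Family Services"],
    "total": [5.0, 15.0, 250.0, 25.0],
})

# Discretize: bin the numeric "total" column into labelled ranges
# (bin edges are assumptions chosen for this example)
bins = [0, 10, 21, 31, 200, float("inf")]
bin_labels = ["<=10", "11-21", "22-31", "32-200", ">200"]
df["total_bin"] = pd.cut(df["total"], bins=bins, labels=bin_labels)

# Continuize: one-hot encode the categorical column and min-max normalize the numeric one
one_hot = pd.get_dummies(df["account"], prefix="account", dtype=float)
total_range = df["total"].max() - df["total"].min()
total_scaled = (df["total"] - df["total"].min()) / total_range
continuized = pd.concat([one_hot, total_scaled.rename("total_scaled")], axis=1)

print(df[["total", "total_bin"]])
print(continuized)
```

The discretized table is what a correspondence analysis would consume, while the continuized table is the all-numeric form that PCA and clustering expect.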

Correspondence Analysis

We first explore “correspondence analysis”, which is an unsupervised method akin to a “PCA for categorical data”. It is extremely useful for exploring a data set and quickly discovering the relationships within it. For example, in our financial data set we get the following result:

correspondence analysis

  • The red dots are the “type of service paid for”
  • The blue dots are the “type of account”
  • The green dots are the amount of money, or “total”

Overall, most observations are clustered in the bottom left of the graph; however, a few observations stand out.

On the top right corner of the graph, we see three points which are set apart:

CA_1

This suggests that “Fees and Charges” is closely related to the “Customer Support Group”, as is “Legal and Court Fees”. This is indeed the case, since:

  • All “Fees and Charges” were made by “Customer Support Group”

CA_1a

  • All “Legal and Court Fees” were made either by “Customer Support Group” or “Children’s Family Services”. It is interesting to note that “Legal and Court Fees” is pulled closer to the other points since it is connected to “Children’s Family Services”, which in turn is responsible for many different transactions within the data set.

CA_1b

Towards the bottom of the graph we see the following cluster:

CA_2

So we expect to find some sort of “special” relationship between these observations. In fact, upon verification we find that:

  • All “operating leases – transport” transactions were made by “Streetscene”:

CA_2a

  • 95% of all “Vehicle Running costs” were made by “Streetscene”
  • Over 53% of all “Streetscene” transactions were of type “operating leases – transport” and “vehicle running costs” – explaining why the blue “Streetscene” observation is “pulled” towards these two red dots.
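Percentages like these are easy to verify once the transactions are available as (account, service) pairs. The rows below are invented for illustration and do not reproduce the figures above, and `share_of_service_by_account` is a hypothetical helper.

```python
from collections import Counter

# Made-up (account, service) pairs; the real data set has many more rows
rows = [
    ("Streetscene", "Operating Leases - Transport"),
    ("Streetscene", "Vehicle Running Costs"),
    ("Streetscene", "Vehicle Running Costs"),
    ("Streetscene", "Stationery"),
    ("Customer Support Group", "Vehicle Running Costs"),
    ("Customer Support Group", "Fees and Charges"),
]


def share_of_service_by_account(rows, service, account):
    """Fraction of all transactions of `service` that were made by `account`."""
    per_service = Counter(s for _, s in rows)
    per_pair = Counter(rows)
    return per_pair[(account, service)] / per_service[service]


leases = share_of_service_by_account(rows, "Operating Leases - Transport", "Streetscene")
vehicles = share_of_service_by_account(rows, "Vehicle Running Costs", "Streetscene")
print(f"Leases made by Streetscene:        {leases:.0%}")
print(f"Vehicle costs made by Streetscene: {vehicles:.0%}")
```

A share close to 100% is the kind of strong association that pulls two points together on the correspondence analysis plot.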

Towards the middle of the graph we see two particular observations grouped together:

CA_3

Upon investigation we find that the above clustering is justified since:

  • All “grant payments” were above 217.99:

CA_3a

As can be seen, correspondence analysis allows us to quickly and easily draw educated conclusions about our data. In this particular case, it allows our hypothetical financial analyst to spot relationships between accounts, the type of services they use, and the amount of money they spend. In turn, this allows the financial auditor to determine if any of these relationships are “fishy” and warrant further investigation.
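For the curious, the textbook formulation of correspondence analysis is an SVD of the standardized residuals of a contingency table. The sketch below follows that formulation with NumPy on an invented 3x3 table. We make no claim that Orange computes it in exactly this way.

```python
import numpy as np

# Made-up contingency table: rows = account types, columns = service types
N = np.array([
    [20.0, 5.0, 1.0],
    [3.0, 15.0, 2.0],
    [1.0, 4.0, 25.0],
])

P = N / N.sum()    # correspondence matrix (relative frequencies)
r = P.sum(axis=1)  # row masses
c = P.sum(axis=0)  # column masses

# Standardized residuals from the independence model r * c^T
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates: rows and columns can be drawn on the same plot
row_coords = (U * sigma) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]

# Total inertia equals the chi-square statistic divided by the grand total
total_inertia = np.sum(sigma**2)

print("Row coordinates (first two axes):")
print(np.round(row_coords[:, :2], 3))
print("Column coordinates (first two axes):")
print(np.round(col_coords[:, :2], 3))
```

Rows and columns that occur together more often than independence would predict end up close to each other in the first two axes, which is what the plots above rely on.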

Principal Component Analysis

Our next article explores this data set using PCA!

Hierarchical Clustering

Watch this space for a link to our next article exploring this data set using Hierarchical Clustering!