The Data and Analytics team at Kainos recently worked on an interesting project that used machine learning to identify anomalous users of a service based on their activity history. The solution produced a score for each user, indicating how anomalous they were compared to other users. This gave the customer the potential to monitor their service better and allocate their resources more effectively. The proof of concept introduced the customer to data-driven decisions and highlighted the potential value that machine learning could bring to their business.

What is a proof of concept?

A proof of concept (POC) is evidence from a pilot project to demonstrate that a design concept or business proposal is feasible. It shows potential for success in a relatively short period of time. For advanced analytics, the POC generally consists of eight stages from understanding the problem through to a final solution. It involves finding where available data resides, data analysis, data cleaning, data modelling and model validation. The modelling stage is an iterative process, with each iteration improving the accuracy and performance of the model.

What problems did the customer face?

The customer was particularly interested in machine learning and how it could be used to improve their service. The service is used by thousands of users based at different locations across Great Britain. Each location is monitored and given a risk score; however, this score does not accurately check the work practice standards set by the customer. Rather, it focuses on aesthetics and does not consider the actions carried out by the user. The customer wanted the ability to determine whether any unusual or potentially fraudulent activity was carried out by their users. They wanted a ‘risk’ score that highlighted potentially risky individuals, whether the risk arose from fraud, a lack of experience in the required work, or a drop in work standards. Furthermore, they required the ability to compare users who are expected to have similar activity profiles.

What is Machine Learning?

Machine learning is talked about a lot within the technology industry, but many businesses are still unclear about what it is and how it can help them. Machine learning promises the ability to automatically detect unknown patterns, uncover deep insights and build high-performing predictive models. It is commonly split into two main categories: supervised and unsupervised learning. In supervised learning, labelled data is used, meaning the outcome we wish to predict is known for a given dataset. We can train the model on these known examples and then predict the outcome for new, unseen instances. In unsupervised learning, we do not know the outcome, so we instead look for relationships between the observations within the data.

What was our solution?

The data available for this project did not contain any labels for risky users or unusual activity. It was unknown what a profile for a “good” user should look like, so an unsupervised approach was required. As the challenge was essentially to detect unusual observations within the data, it was a classic anomaly detection problem. An anomaly is an observation within the data which has characteristics that are unusual or unexpected. These data points deviate from what is “normal”. Anomaly detection models score each point based on its degree of abnormality. There are two types of anomalies: global anomalies are data points that differ from all other points, whereas local anomalies are data points that are significantly different from their neighbours. In this case, we were interested in detecting local anomalies, where similar users were compared to one another. The chosen algorithm was Local Outlier Factor, which I will explain in more detail soon, but first let me highlight some details about the data and the features of the model.

Data Overview

The data consisted of 3.6 million records focusing on one month of user activity data. It contained a range of features for each activity record, such as who the user was, what activity was carried out, date, time and location of activity as well as other relevant details.

Data Quality

When working with real customer data, we must assess its quality before attempting any analysis or modelling. Good quality data is key to producing actionable and reliable insights, and failing to remove or correct poor-quality records can significantly affect the accuracy of the model. Data that is manually input by a user can be inaccurate or incomplete. Assessing quality also surfaces the current data quality issues, which allows better procedures to be implemented to improve quality. Examples of data quality issues include missing or implausible values, e.g. a person aged 200. After cleansing the data and focusing on the required service area, a subset of 2.3 million records remained.
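As a rough illustration of this kind of check, the sketch below flags records with missing or implausible values. The field names, sample records and plausible ranges are assumptions made up for the example and are not taken from the customer's data.

```python
from typing import Optional

# Hypothetical activity records; all field names and values are illustrative.
records = [
    {"user_id": "u1", "age": 34, "activity": "check", "duration_min": 12.0},
    {"user_id": "u2", "age": 200, "activity": "check", "duration_min": 9.5},   # implausible age
    {"user_id": "u3", "age": 41, "activity": None, "duration_min": 11.0},      # missing activity
    {"user_id": "u4", "age": 29, "activity": "repair", "duration_min": -3.0},  # negative duration
]


def quality_issue(record: dict) -> Optional[str]:
    """Return the first issue found in a record, or None if it looks clean."""
    if any(value is None for value in record.values()):
        return "missing value"
    if not 0 <= record["age"] <= 120:  # assumed plausible age range
        return "implausible age"
    if record["duration_min"] < 0:
        return "negative duration"
    return None


# Keep clean records for modelling and keep the reasons for the rejected ones.
clean = [r for r in records if quality_issue(r) is None]
rejected = {r["user_id"]: quality_issue(r) for r in records if quality_issue(r) is not None}
```

Keeping the rejection reasons alongside the clean subset makes it easier to report quality issues back, rather than silently dropping records.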

Feature Engineering

Feature engineering involves crafting new features from existing ones, with the aim of improving the accuracy of the model. Machine learning techniques can be overly sensitive to raw data, making it difficult to generalise to new observations. A user profile was created for each user based on activity counts, activity length and other associated rates and averages.
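A minimal sketch of building such a profile is shown below. It aggregates raw activity rows into one set of features per user. The activity types, the chosen features and the `build_profiles` helper are hypothetical and only indicate the general shape of the step.

```python
from collections import defaultdict

# Hypothetical cleaned activity rows: (user_id, activity_type, duration in minutes).
activities = [
    ("u1", "check", 10.0),
    ("u1", "check", 14.0),
    ("u1", "repair", 30.0),
    ("u2", "check", 5.0),
    ("u2", "check", 7.0),
]


def build_profiles(rows):
    """Aggregate raw activity rows into one feature dictionary per user."""
    grouped = defaultdict(list)
    for user_id, activity_type, duration in rows:
        grouped[user_id].append((activity_type, duration))

    profiles = {}
    for user_id, items in grouped.items():
        durations = [duration for _, duration in items]
        repairs = sum(1 for activity_type, _ in items if activity_type == "repair")
        profiles[user_id] = {
            "activity_count": len(items),                      # a count feature
            "mean_duration": sum(durations) / len(durations),  # an average feature
            "repair_rate": repairs / len(items),               # a rate feature
        }
    return profiles


profiles = build_profiles(activities)
```

Each user ends up as a single row of numeric features, which is the form a distance-based or density-based model needs.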

Anomaly Detection Method: Local Outlier Factor

Local Outlier Factor (LOF) is a density-based anomaly detection algorithm which seeks to find local outliers within the data. To explain how this works, consider the figure below, where each point could represent a user. If we consider only the distance between points, point o1 is far away from any other point and so we can detect it as an outlier. However, the distance between o2 and its nearest points in C2 is similar to the distances between neighbouring points in group C1. A purely distance-based approach will therefore not pick up o2 as an outlier, or else lots of points in C1 would also be detected as unusual.

Figure from: Breunig M., Kriegel H.-P., Ng R. T. & Sander J. “LOF: Identifying Density-Based Local Outliers” Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000

The difference in a density-based approach is that it compares how dense the points within a cluster are when seeking outliers. Points within C2 are tightly packed whereas C1 is much less dense, and hence we expect distances between points in C1 to be larger. If we compare o2 to its closest neighbours, we find that its distance from them is much larger than we would expect for a point in C2, and so it can be detected as an outlier. Notice that o1 will still be detected because its distance from its neighbours is so large. We want to find local outliers because we want to compare similar users.

At a glance, the algorithm works by calculating the local reachability density (LRD) of each point. It then compares the LRD of each point to the LRD values of its neighbours to produce a LOF score. Points o1 and o2 would have high LOF values because their LRD values would be much lower than those of their closest neighbours.
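To make the LRD and LOF steps concrete, here is a small from-scratch sketch that follows the standard definitions from the LOF paper. It is not the implementation used in the project: the `lof_scores` function, the choice of k = 3 and the toy points are all assumptions for illustration, and a production solution would normally rely on an established library.

```python
import math


def lof_scores(points, k):
    """Return one LOF score per point, following the standard LOF definitions.

    Assumes there are no exact duplicate points, so densities stay finite.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_distance = []  # distance from each point to its k-th nearest neighbour
    neighbours = []  # indices of all points within that distance (ties included)
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][others[k - 1]]
        k_distance.append(kd)
        neighbours.append([j for j in others if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean LRD of the neighbours divided by the point's own LRD.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data in the spirit of the figure: a tight cluster, a loose cluster
# and one point sitting just outside the tight cluster.
tight = [(x, y) for x in range(3) for y in range(3)]
loose = [(100 + 20 * x, 100 + 20 * y) for x in range(3) for y in range(3)]
outlier = (10, 1)
points = tight + loose + [outlier]

scores = lof_scores(points, k=3)
```

On this toy data the last point receives by far the highest score, while points in both clusters stay near 1 even though the clusters have very different densities. That is the behaviour that makes LOF suitable for comparing similar users.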

Local Outlier Factor was also chosen because the output is an anomaly score rather than a binary result stating whether a point is unusual or not. A score is much more flexible: it can easily be combined with the current risk scores, and users can be ranked for investigation, resulting in better allocation of resources.


The algorithm produced a risk score for each user, highlighting how “potentially risky” they were. The term “potentially risky” is used because users may be unaware that they are not following the required procedures, which results in an unusual profile. Nevertheless, they have characteristics which should be investigated. The scores ranged from 0.96 to 9.18, where a score of around 1 indicated normal or non-anomalous behaviour. A score greater than 1.5 suggested abnormality in the work practices of a user. The scores were compared to centre scores, which highlighted the possible added value of scoring users when identifying risk.
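One way such scores might be turned into an investigation queue is sketched below. The user identifiers and score values are invented for illustration and are not project results. The `rank_for_investigation` helper is hypothetical, and only the 1.5 cut-off comes from the text above.

```python
# Hypothetical LOF scores per user; the values are illustrative only.
lof_by_user = {"u1": 0.98, "u2": 1.07, "u3": 2.40, "u4": 1.52, "u5": 1.12}

THRESHOLD = 1.5  # scores above this suggest abnormal work practices


def rank_for_investigation(scores, threshold=THRESHOLD):
    """Return (user, score) pairs above the threshold, highest score first."""
    flagged = [(user, score) for user, score in scores.items() if score > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


queue = rank_for_investigation(lof_by_user)
```

Because the output is a ranked list rather than a yes/no flag, the threshold can be raised or lowered to match the resources available for investigation.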

Validation of results

After building a model, it is vital to validate how well the model fits. The output from an unsupervised machine learning model can be assessed by comparison to a labelled dataset, comparison to manual data analysis or SME validation. A labelled dataset was not available and, due to the large volume of users and the lack of understanding around what “risk” looks like, creating one was not a possibility. Manual data analysis, particularly comparing the top and bottom 20 users, was carried out and supported the output of the model: the user with the highest LOF score was the most statistically significant observation in the dataset. Manual data analysis is useful for data understanding, visualisation and model validation, but it is too time-consuming to compare every feature for every user. The machine learning algorithm can compare multiple features at once for multiple users and produce a score in a much shorter period of time, and hence is more reliable and useful for the customer. The user with the highest risk score was passed to the customer to validate, and it was confirmed that this user was a threat to the current service.

Machine learning provides companies with the potential to make data-driven decisions, resulting in better allocation of their resources. It can save a company a significant amount of time and money. This project demonstrates the ability to detect fraud, which could benefit a variety of businesses and their services.