Have you ever been in a position where you didn’t have enough data? Would that chatbot, recommendation system or fraud detector become possible if more data were available? If so, keep reading — this blog is for you!
The Text Augmenter (henceforth referred to as TA) is a program which augments text records (generates new records from existing ones) in order to supply Machine Learning projects with additional training data. I used research by Jake Young as a jumping-off point for TA: I wanted to take the methods he had come up with and package them in a slightly more user-friendly way for use within the team.
Now you know why I made this, so what does this actually look like?
TA is based on the two augmentation methods set out in Jake’s article: augmenting the data using synonyms, and augmenting it using translations.
This works by breaking down each of the rows of text into individual words, then finding the keywords and replacing them with synonyms to create a new data record. This process is explained using the following flow chart:
Due to the nature of Synonym Augmentation, the number of results can vary wildly depending on the length of the original values. If longer sentences are used, then there will be more words to augment, and thus a larger return of data.
Validation: To ensure that the results of Synonym Augmentation are usable for any given project, the user can specify a ‘blacklist’: a user-defined list of words that the Augmentation function will ignore.
For example, the word ‘one’ alone generates five different data records. Adding ‘one’ to the blacklist means that none of these appear in the augmented dataset. This is useful for terms that should never be altered, such as the words in the name of a law.
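As a sketch of the idea, the snippet below swaps single words for synonyms while honouring a blacklist. The synonym table is a hard-coded stand-in for a real thesaurus lookup, and the function name is illustrative; TA’s actual implementation may differ:

```python
# Minimal sketch of synonym augmentation with a blacklist.
# SYNONYMS is a hard-coded stand-in for a real thesaurus
# (e.g. WordNet); TA's actual synonym source may differ.

SYNONYMS = {
    "quick": ["fast", "rapid"],
    "error": ["fault", "mistake"],
}

def synonym_augment(sentence, blacklist=()):
    """Yield new sentences with one word swapped for a synonym,
    skipping any word on the user-supplied blacklist."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.lower() in blacklist:
            continue  # blacklisted words are never augmented
        for syn in SYNONYMS.get(word.lower(), []):
            yield " ".join(words[:i] + [syn] + words[i + 1:])

# "quick" is augmented; "error" is blacklisted, so its synonyms are skipped
records = list(synonym_augment("quick error report", blacklist={"error"}))
# records == ["fast error report", "rapid error report"]
```

Note how blacklisting ‘error’ removes two candidate records, which is exactly why longer, unrestricted sentences yield so many more augmentations.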
This takes the data records, translates them into a different language, and then translates them back into English using Microsoft Azure’s translation API. It does this for five languages, chosen to mangle the syntax of the sentence. This process is explained using the following flow chart:
As each data value is translated into five different languages, this function always returns five augmented records per original (a 500% return).
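The round trip can be sketched as follows. Here `translate` is a placeholder for a real service call (TA uses Microsoft Azure’s translation API), and the pivot languages listed are illustrative, not necessarily the ones TA uses:

```python
# Sketch of round-trip ("back") translation. `translate` is a
# placeholder for a real translation-service call; the pivot
# language list below is an assumption for illustration.

LANGUAGES = ["fr", "de", "ja", "ar", "fi"]  # assumed pivot languages

def back_translate(text, translate, languages=LANGUAGES):
    """Return one augmented record per pivot language by
    translating English -> pivot language -> English."""
    augmented = []
    for lang in languages:
        pivot = translate(text, source="en", target=lang)
        augmented.append(translate(pivot, source=lang, target="en"))
    return augmented

# With five pivot languages, every input yields five records: the 500% return.
```

Passing the translation function in as a parameter also makes the pipeline easy to test with a stub before wiring up real API credentials.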
Validation: Universal Sentence Encoder is used to ensure that the augmented data records stay close enough to the original values. This works by finding the ‘cosine similarity’ of each augmented value to its original value. The user is able to set a threshold for this similarity, and any augmented values that fall outside of the threshold are discarded.
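The check itself reduces to a cosine similarity between embedding vectors. The sketch below uses plain NumPy vectors in place of real Universal Sentence Encoder embeddings, and the 0.8 threshold is illustrative rather than TA’s actual default:

```python
import numpy as np

# Sketch of the similarity filter. In TA the vectors come from the
# Universal Sentence Encoder; plain NumPy vectors stand in here,
# and the 0.8 threshold is an assumed example value.

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_threshold(original_vec, augmented_vec, threshold=0.8):
    """Keep an augmented record only if it stays close to the original."""
    return cosine_similarity(original_vec, augmented_vec) >= threshold
```

Identical embeddings score 1.0 and unrelated (orthogonal) embeddings score 0.0, so the threshold is a dial between keeping more data and keeping faithful data.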
TA has a built-in feature for benchmarking the original dataset against the augmented datasets using a variety of common Machine Learning algorithms. It uses the following:
Stochastic Gradient Descent [SGD]
Gaussian Naïve-Bayes [GNB] *
Complement Naïve-Bayes [CNB]
Linear Regression
Decision Tree Classifier [DTC] *
Multi-Layer Perceptron [MLP] *

* this classifier is used in the Flask application
The benchmarking function splits each of the datasets into training and test values, then trains the classifier and makes predictions. It then gives the user feedback about the classification: an accuracy value, a classification report and a confusion matrix. It does this for all of the datasets — the original and the datasets created by the two methods of augmentation — and displays comparisons between them.
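A minimal version of that loop, assuming scikit-learn, looks like this; the synthetic dataset and the single Gaussian Naïve-Bayes classifier are for illustration, where TA would iterate over each dataset and each of its classifiers:

```python
# Sketch of one benchmarking pass, assuming scikit-learn.
# A synthetic dataset stands in for the original/augmented text data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Synthetic features/labels in place of vectorised text records
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = GaussianNB().fit(X_train, y_train)   # train the classifier
pred = clf.predict(X_test)                 # predict on held-out data

# The three pieces of feedback TA reports per dataset:
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
```

Running the same pass over the original and each augmented dataset, then comparing the three reports side by side, gives the comparison TA displays.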
There are three different ways to use it, to allow for flexibility for users with differing levels of technical knowledge:
Python script

Currently, the main way we use TA is via the main Python script running in the Terminal. This method allows for the use of all of the selected benchmarking classifiers, and is the fastest way of using TA.
Flask application

This project gave me my first ever opportunity to use Flask, a web framework written in Python. Using it, I built a webpage to house the functions for TA. Doing this gave me a better understanding of endpoints and requests, having never really worked with anything web-related before. I’m currently unsure whether this will be hosted as a usable web page, or whether one of the other methods will remain the primary way to use TA. The downside of this approach is that the machine learning classifiers used for benchmarking bulk up the tool, so I have restricted the benchmarking to only three classifiers.
Jupyter Notebook

An implementation of the Python script has been created as a Jupyter Notebook. This is separated into cells containing each of the functions, along with an explanation of what they do and how to use them.
Having recently completed TA, we have begun using it in the team’s projects to create more testing data and to prove its effectiveness.
The Effectiveness of TA

TA took in a dataset of 11,000 values and returned an augmented dataset of 500,000 values (a 4,545% return). This meant the training for the algorithm was more efficient and returned better results.
We are planning to use TA in the future to assist in the development of any of our projects that would benefit from it.