Types of Machine Learning exemplified by spam analysis: Part 2
Part 1 of this article introduced machine learning using the example of spam analysis and described the first three types of machine learning: supervised, reinforcement and unsupervised learning. Part 2 discusses the remaining two types, semi-supervised and active learning.
Types of Machine Learning: Semi-supervised learning
Semi-supervised learning combines elements of supervised and unsupervised learning by using both marked and unmarked data in the training process. Spam detection typically starts with a small dataset of marked emails that are classified as spam or non-spam, which gives the model a basic understanding of the distinction between these categories. This dataset is then expanded to include unmarked emails, which are classified based on the previously established differentiation criteria.
The next step is self-training: unmarked emails that the model classifies with high probability are assumed to be classified correctly and are added to the marked data. This process is repeated until the model has achieved a satisfactory level of accuracy.
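The self-training loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by any particular product: the classifier is a tiny naive Bayes model written with the standard library only, and the names `TinyNaiveBayes`, `self_train`, the `threshold` of 0.8 and the sample emails are all hypothetical.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Minimal multinomial naive Bayes over lowercase word tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict_proba(self, text):
        """Return {class: probability}, using Laplace (add-one) smoothing."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)
            for word in text.lower().split():
                score += math.log((self.word_counts[c][word] + 1) / denom)
            log_scores[c] = score
        # Convert log scores to normalized probabilities.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}


def self_train(marked, unmarked, threshold=0.8, max_rounds=10):
    """Repeatedly move confidently classified emails from the unmarked pool to the marked data."""
    marked, pool = list(marked), list(unmarked)
    model = TinyNaiveBayes()
    for _ in range(max_rounds):
        model.fit(*zip(*marked))
        confident, remaining = [], []
        for text in pool:
            proba = model.predict_proba(text)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                confident.append((text, label))  # pseudo-label, assumed correct
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new to learn from
        marked.extend(confident)
        pool = remaining
    model.fit(*zip(*marked))  # final model on all marked and pseudo-marked data
    return model, marked, pool


seed = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
unmarked = ["claim your free money", "monday project meeting", "lunch tomorrow"]
model, marked, still_unmarked = self_train(seed, unmarked)
print(len(marked), still_unmarked)
```

Emails that never reach the threshold simply stay in the pool. Lowering the threshold grows the marked data faster but raises the risk of adding wrongly classified emails.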
The spam functionality of iQ.Suite, the email solution for security and productivity by GBS, uses the machine-learning-driven CORE (Content Recognition Engine) for superior results. It combines different analysis methods to classify emails, which helps to improve business processes such as response management, customer support and communication.
The challenge with this approach lies in carefully monitoring the learning process and model performance, since wrongly classified emails that are added to the marked data can reinforce errors in later iterations. Semi-supervised learning is often used when collecting large marked datasets is expensive or time-consuming, but large amounts of unmarked data are readily available. Figure 1 illustrates semi-supervised learning.
Figure 1: Semi-supervised learning (Based on Trabold, D., LAMARR Institute for Machine Learning and Artificial Intelligence, 2021)
Types of Machine Learning: Active learning
Another type of machine learning is active learning, in which the algorithm itself selects, from a pool of previously unclassified data, the examples it wants to have marked next. The process starts with a small initial set of marked emails, which is used to train a basic model. This model is then applied to a large pool of unmarked emails to identify those whose classification is most uncertain. These emails are classified as spam or non-spam by an expert (e.g. a human). The newly marked emails are added to the training dataset, giving the model further information to learn from. This cycle of prediction, selection, marking and adding repeats until satisfactory model performance is achieved. Active learning is particularly useful when only a small amount of marked data exists or when marked data is expensive to obtain, because it specifically selects those unmarked data points that are most likely to improve the performance of the model.
Figure 2: Active learning (Based on Beckh, K., LAMARR Institute for Machine Learning and Artificial Intelligence, 2021)
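The cycle of prediction, selection, marking and adding can be sketched as follows. This is a simplified illustration using uncertainty sampling, one common selection strategy, and only the Python standard library. The functions `train`, `spam_probability` and `active_learning`, the `oracle` callback standing in for the human expert, and the sample emails are hypothetical names chosen for this sketch.

```python
import math
from collections import Counter


def train(marked):
    """Collect word counts per class; this is the whole 'model' in this sketch."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in marked:
        counts[label].update(text.lower().split())
        docs[label] += 1
    return counts, docs


def spam_probability(model, text):
    """Naive Bayes spam probability with add-one smoothing."""
    counts, docs = model
    vocab = len(set(counts["spam"]) | set(counts["ham"]))
    spam_total = sum(counts["spam"].values()) + vocab
    ham_total = sum(counts["ham"].values()) + vocab
    log_odds = math.log((docs["spam"] + 1) / (docs["ham"] + 1))
    for word in text.lower().split():
        p_spam = (counts["spam"][word] + 1) / spam_total
        p_ham = (counts["ham"][word] + 1) / ham_total
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


def active_learning(marked, pool, oracle, batch_size=1, rounds=3):
    """Ask the oracle to mark the emails the current model is least sure about."""
    marked, pool = list(marked), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = train(marked)
        # Uncertainty sampling: a probability close to 0.5 means the model is unsure.
        pool.sort(key=lambda text: abs(spam_probability(model, text) - 0.5))
        queried, pool = pool[:batch_size], pool[batch_size:]
        marked.extend((text, oracle(text)) for text in queried)
    return train(marked), marked, pool


# The expert is simulated here by a lookup table of known answers.
expert_answers = {
    "free money prize": "spam",
    "monday meeting notes": "ham",
    "quarterly invoice attached": "ham",
}
seed = [("win free money now", "spam"), ("meeting agenda for monday", "ham")]
model, marked, pool = active_learning(
    seed, list(expert_answers), oracle=expert_answers.get, rounds=1
)
print(marked[-1], pool)
```

In this toy run the model has seen none of the words in "quarterly invoice attached", so that email is the most uncertain one and is sent to the expert first.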
Suitability of machine learning for spam detection
Which machine learning method should be used for spam detection depends on various factors. These include the availability of marked data, the dynamics of the spam patterns and the resources for training and maintaining the models. A large amount of marked data is generally available for spam detection. However, if a system has to be rebuilt and existing datasets cannot be drawn on, only a few classified emails may be available. The spam patterns are largely similar, even though new patterns are constantly being added. The resource requirements for training tend to be high, as many new emails that need to be evaluated arrive every day. In summary, the suitability of the methods for spam detection can be outlined as follows:
- Supervised learning: very high suitability
Supervised learning is the most suitable method for detecting spam emails. It relies on a large dataset of emails that have already been marked as spam or non-spam. One of the advantages of supervised learning is that it can recognize complex patterns and relationships in the data. This makes it ideal for spam detection, where features and patterns are constantly changing.
- Semi-supervised learning: high suitability
Semi-supervised learning proves particularly useful when not enough marked data is available. This is often the case with spam detection, as manually marking emails as spam or non-spam is time-consuming. By including unmarked data in the training process, semi-supervised learning can be more effective than supervised learning, especially when spam patterns change rapidly.
- Active learning: moderate to high suitability
Active learning is particularly useful when marking data is expensive and time-consuming. It makes efficient use of resources by specifically selecting for marking the data that offers the greatest benefit to the model. Active learning is therefore suitable for dynamic environments where new types of spam are constantly appearing.
- Unsupervised learning: moderate suitability
Unsupervised learning can be used to recognize unknown patterns in email data, especially when no marked data is available. New types of spam techniques can be identified with the help of clustering. However, assessing the accuracy and reliability of the results can be more difficult than with supervised methods.
- Reinforcement learning: low to moderate suitability
Reinforcement learning proves more complex to implement and optimize as it builds on a reward system that is geared towards achieving specific goals. A typical example of this is to maximize accurate spam detection while minimizing false positives.
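The reward system mentioned for reinforcement learning can be made concrete with a small sketch. It only illustrates one possible reward design: the function names and the penalty weights are hypothetical, chosen so that blocking a legitimate email (a false positive) costs more than letting a spam email through.

```python
def reward(predicted, actual, fp_penalty=5.0, fn_penalty=1.0):
    """Reward for one classification decision (weights are illustrative assumptions)."""
    if predicted == actual:
        return 1.0              # correct decision
    if predicted == "spam":
        return -fp_penalty      # false positive: legitimate email blocked
    return -fn_penalty          # false negative: spam email delivered


def episode_reward(predictions, actuals):
    """Total reward over a batch of decisions."""
    return sum(reward(p, a) for p, a in zip(predictions, actuals))


predictions = ["spam", "spam", "ham", "ham"]
actuals = ["spam", "ham", "spam", "ham"]
print(episode_reward(predictions, actuals))
```

An agent trained against such a reward is pushed towards accurate detection while being discouraged from false positives in particular. How the penalties are weighted is a design decision that depends on the cost of each error type.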
In the next part of the blog series on the use of artificial intelligence in email management, the GBS experts will take a closer look at Deep Learning. Deep Learning is a method of machine learning that aims to learn and model complex patterns in large datasets. It can be applied to all types of machine learning presented here.
Authors: Dr. Rolf Kremer & Dirk Nolte