Multi-Task Learning, a Sentiment Analysis Example
Imagine you work at an autonomous driving company and are tasked with building an object detection solution to identify pedestrians, vehicles, traffic lights, and stop signs in images. How would you tackle this problem?
One simple approach is to frame this as a multi-label classification problem, which can be solved by building 4 binary classifiers. Each classifier is then in charge of identifying one of the aspects of interest, such as traffic lights. However, building multiple independent models can be tedious, especially if there are dozens of aspects to cover. Given that we aim to solve multiple problems of the same nature (image classification), is it possible to develop an all-in-one solution that addresses all of them simultaneously? This leads to the topic of this article: multi-task learning (MTL).
The general idea of MTL is to learn multiple tasks in parallel while allowing the lower-level features to be shared across the tasks. The training signals learned from one task can help improve the generalization of the others. Let's take a look at an example of using MTL for aspect-based sentiment analysis.
Problem Statement & Dataset
Aspect-based sentiment analysis (ABSA) requires us to identify the polarity of text data with respect to multiple aspects relevant to the business use case. In e-commerce, extracting sentiment for different quality specs from user reviews allows us to generate product ratings by feature, such as appearance, battery life, and cost efficiency. With feature-wise product ratings, customers can gain a comprehensive understanding of a product's pros and cons, and therefore make better purchase decisions.
Amazon product review
In this article, we are going to use a restaurant review dataset, which is well studied in ABSA-related research. The dataset was taken from SemEval 2014, an international NLP research workshop. The training and test datasets contain 3041 and 800 restaurant reviews, respectively, with annotations for 5 common aspects: service, food, anecdotes/miscellaneous, price, and ambience. A snippet of the raw data is given as follows.
The first review sentence carries a sentiment towards service, whereas the second review has sentiments for both food and anecdotes/miscellaneous. The raw data is in the .xml format. To make our life easier, we can parse the review sentences and sentiment labels into a standard table format using the xml.dom API from Python.
Although the sentiment labels were created for 5 aspects, a review does not necessarily have labels for all of them. In fact, each review in the training data has labels for ~1.2 aspects on average. Here I impute the missing labels with "absent". Our task is to predict the sentiment for all 5 aspects. To tackle this multi-task problem, we are going to build a multi-head deep learning model. Let's go!
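As a sketch of that parsing step, the helper below reads the XML with xml.dom.minidom and flattens it into a pandas table, imputing unlabeled aspects with "absent". The element and attribute names (sentence, text, aspectCategory, category, polarity) follow the SemEval-2014 restaurant schema; adjust them if your file differs.

```python
import pandas as pd
from xml.dom import minidom

ASPECTS = ["service", "food", "anecdotes/miscellaneous", "price", "ambience"]

def parse_reviews(xml_path):
    """Parse SemEval-2014 style XML into a table of reviews and aspect labels.

    Aspects without an annotation are imputed with "absent".
    """
    doc = minidom.parse(xml_path)
    rows = []
    for sent in doc.getElementsByTagName("sentence"):
        text = sent.getElementsByTagName("text")[0].firstChild.data
        labels = {aspect: "absent" for aspect in ASPECTS}
        for cat in sent.getElementsByTagName("aspectCategory"):
            labels[cat.getAttribute("category")] = cat.getAttribute("polarity")
        rows.append({"text": text, **labels})
    return pd.DataFrame(rows)
```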
Data Splitting
It can be observed that the review data contain multiple classes within each aspect. With pandas' value_counts function, we can get a quick overview of the label distribution for the different aspects. There are more than just positive and negative labels.
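For instance, with the reviews parsed into a DataFrame that has one column per aspect (the column values below are made up for illustration), the distribution can be inspected like this:

```python
import pandas as pd

# Toy stand-in for the parsed review table: one column per aspect.
df = pd.DataFrame({
    "service": ["positive", "negative", "absent", "conflict", "positive"],
    "food": ["positive", "absent", "neutral", "positive", "negative"],
})

for aspect in ["service", "food"]:
    print(f"--- {aspect} ---")
    print(df[aspect].value_counts())  # counts of each label within the aspect
```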
Interestingly, in addition to the typical positive, negative, and neutral labels, we also see some reviews with a "conflict" label. Conflict indicates that both positive and negative sentiment were detected in a review. Here's one example of such a review.
It took half an hour to get our check, which was perfect since we could sit, have drinks and talk!
To split the given training data for model training and validation, we'd like to maintain the proportion of different classes across all 5 aspects as much as possible. This can be achieved with the IterativeStratification module from the skmultilearn library. The idea is to consider the combination of labels from different aspects and assign examples according to a predefined training/validation split ratio. I used a train_size of 0.8 in practice. The following function performs the stratified split of the data.
Dataset Creation
To create the datasets for training our neural network, I first define a vectorizer class to tokenize the review words and convert them into pre-trained word vectors using a medium-sized English model provided by spaCy. This model could be replaced with a transformer-based model, which can potentially improve classification accuracy. Alternatively, we could train an embedding layer within the neural network. For proof-of-concept purposes, the spaCy model I chose is sufficient and lets us finish training the model quickly.
Note that the spaCy model needs to be downloaded before use.
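A minimal version of such a vectorizer might look like the following. The class name, max_len, and the option to inject a tokenizer are my own choices for illustration; the class pads or truncates each review to a fixed length so batches can be stacked.

```python
import numpy as np

class Vectorizer:
    """Turn a review into a fixed-length (max_len, dim) array of word vectors.

    By default this loads spaCy's medium English model (en_core_web_md,
    300-d vectors), which must be downloaded first:
        python -m spacy download en_core_web_md
    Any callable returning tokens that expose .vector can be injected as
    `nlp` instead (handy for testing without the model).
    """
    def __init__(self, nlp=None, max_len=40):
        if nlp is None:
            import spacy
            nlp = spacy.load("en_core_web_md", disable=["parser", "ner"])
        self.nlp = nlp
        self.max_len = max_len
        self.dim = self.nlp.vocab.vectors_length

    def __call__(self, text):
        # Shorter reviews are zero-padded; longer ones are truncated.
        vecs = np.zeros((self.max_len, self.dim), dtype=np.float32)
        for i, token in enumerate(self.nlp(text)):
            if i >= self.max_len:
                break
            vecs[i] = token.vector
        return vecs
```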
Then we can create the dataset class as follows. Since we need to make predictions for 5 aspects, the sentiment labels need to be fed as a vector for each review.
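A possible shape for that dataset class is sketched below, assuming an illustrative label encoding with "absent" as id 0 (the encoding and class names are my own choices, not the original code):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Illustrative label encoding; "absent" marks aspects without a label.
LABEL_TO_ID = {"absent": 0, "positive": 1, "negative": 2, "neutral": 3, "conflict": 4}

class ReviewDataset(Dataset):
    """Pairs each vectorized review with a vector of 5 aspect labels."""
    def __init__(self, texts, labels, vectorizer):
        self.texts = texts            # list of raw review strings
        self.labels = labels          # list of 5-element label lists
        self.vectorizer = vectorizer  # callable: text -> (max_len, dim) array

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        x = torch.from_numpy(self.vectorizer(self.texts[idx]))
        y = torch.tensor([LABEL_TO_ID[l] for l in self.labels[idx]],
                         dtype=torch.long)
        return x, y
```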
Model Training
Now it’s time to build our neural network. How can we enable our network to handle multiple tasks at the same time? The trick is to create multiple “heads” so that each “head” can undertake a specific task. Since all tasks need to be tackled based on the same set of features created out of the reviews, we need to declare a “backbone” to connect the source features with the different “heads”. That’s how we arrive at the following network architecture.
The Sequential API provided by PyTorch can help us declare the "heads" with a loop. Within each "head", a vanilla attention layer is introduced. The implementation of the attention layer can be found here. Based on my experiments, adding the attention layer improved the model accuracy by ~10%.
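Putting the pieces together, a simplified version of such an architecture could look like this. The layer sizes and the exact form of the attention layer are assumptions for illustration, not the original implementation:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Vanilla additive attention: score each time step, then return the
    attention-weighted sum over the sequence."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        weights = torch.softmax(self.score(x), dim=1)  # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)        # (batch, dim)

class MultiTaskNet(nn.Module):
    """A shared backbone feeding one classification head per aspect."""
    def __init__(self, in_dim=300, hidden=128, n_aspects=5, n_classes=5):
        super().__init__()
        # Shared "backbone": per-token features reused by every head.
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # One "head" per aspect, declared with a loop.
        self.heads = nn.ModuleList([
            nn.Sequential(Attention(hidden), nn.Linear(hidden, n_classes))
            for _ in range(n_aspects)
        ])

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        feats = self.backbone(x)
        return [head(feats) for head in self.heads]  # one logit tensor per aspect
```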
The total loss of the multi-task network can be calculated by aggregating the cross-entropy losses from all the heads. We can skip calculating the loss for aspects with no label, since we are not interested in making predictions for them. This is also aligned with the evaluation methodology specified by SemEval-2014. The accuracy is likewise computed with the "absent" aspects skipped.
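One way to implement this masked aggregation is shown below (with "absent" encoded as label id 0, an assumption of this sketch):

```python
import torch
import torch.nn.functional as F

ABSENT_ID = 0  # assumed id for aspects without an annotation

def multitask_loss(logits_per_aspect, targets):
    """Sum cross-entropy over the heads, skipping "absent" aspects.

    logits_per_aspect: list of (batch, n_classes) tensors, one per aspect.
    targets: (batch, n_aspects) long tensor of label ids.
    """
    total = torch.tensor(0.0)
    for i, logits in enumerate(logits_per_aspect):
        mask = targets[:, i] != ABSENT_ID   # keep only labeled examples
        if mask.any():
            total = total + F.cross_entropy(logits[mask], targets[mask, i])
    return total
```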
Across all 5 aspects, the majority of the review data fall under the "positive" and "negative" categories. To tackle this class imbalance, we can assign higher weights to the examples under the "conflict" and "negative" categories. This technique can lead to a slight increase in accuracy.
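With PyTorch's cross-entropy loss this amounts to passing a weight tensor; the specific weight values below are illustrative, not tuned:

```python
import torch
import torch.nn.functional as F

# Per-class weights, ordered as [absent, positive, negative, neutral, conflict].
# Values are illustrative; "absent" is skipped in the loss anyway.
class_weights = torch.tensor([1.0, 1.0, 2.0, 1.5, 3.0])

def weighted_loss(logits, target_ids):
    # Up-weights minority classes such as "negative" and "conflict".
    return F.cross_entropy(logits, target_ids, weight=class_weights)
```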
Results and Conclusion
After training the multi-task network for 3 epochs, we evaluate its performance with the unseen test dataset. An overall accuracy of 0.7434 can be achieved. This is ~10% higher than the baseline test accuracy obtained with the naive majority class classifier. Let’s also look at the breakdown of accuracy for different aspects.
| Aspect | Majority class accuracy | MTL accuracy |
| --- | --- | --- |
| food | 0.7225 | 0.7823 |
| service | 0.5872 | 0.8430 |
| price | 0.6145 | 0.7590 |
| ambience | 0.6441 | 0.7542 |
| anecdotes/miscellaneous | 0.5427 | 0.5897 |
| overall | 0.6410 | 0.7434 |
Our multi-task network outperforms the baseline on all 5 aspects, although the accuracy for anecdotes/miscellaneous is below 0.6. This is potentially due to the high proportion of neutral examples for this aspect.
One piece of future work for further validating the MTL approach is to train 5 independent multi-class classification models for the individual aspects and compare their accuracy. It's highly likely that we would end up with poor accuracy for the price and ambience aspects due to their small volume of data. In this problem, the superior performance of MTL comes from the effect of data enrichment across different tasks. MTL typically works well under the following conditions.
- There are multiple highly similar problems to solve.
- There is sufficient amount of data relevant for solving all the problems.
I hope this example gives you a basic understanding of multi-task learning and how it can be applied to tackle a practical problem. The full code for this blog can be found on GitHub.