Mining opinions on the web
Sentiment analysis is one of the most successful and widespread applications in natural language processing. However, for all the hype it has generated since its inception, there are still many issues associated with it.
In my work with Brandtix and other startups, I have had the opportunity to work a lot with sentiment analysis, especially in the context of social media analytics. Doing sentiment analysis can be very easy and cheap, as there are many free libraries for it. Some examples are Syuzhet (for R), NLTK (Python) and spaCy (Python). However, sentiment analysis can sometimes be tricky, and this is what I want to talk about here.
Specifically, sentiment analysis suffers from one major drawback: it is context- and language-specific. In this article, I discuss some issues that might arise when you try to apply sentiment analysis to a particular domain.
Sentiment analysis issues
Issue 1: Words have the opposite meaning within your domain
Being “aggressive” is not considered a very nice trait in most situations. However, being aggressive when you are a forward in football can be a very good thing. The contrast is even more prominent with words such as “killer”. An attacker in football who is a “killer”, or has a “killer instinct”, is probably a good athlete. However, not many people will think that a “killer” in real life is a good thing. General-purpose sentiment analysis engines will get very confused in this context.
Issue 2: Emoticons and their usage
Sometimes people can get very creative with emoticon usage. For example, “😉” could be interpreted as negative or positive. The actual meaning depends on context. If you go to emojitracker, you will find a huge number of emojis. The meaning of many of these (e.g. fish) can be challenging for a machine learning model to learn. This can be a very important issue for topics such as sarcasm, which is the next item on our list.
Issue 3: Irony and sarcasm
Understanding irony and sarcasm is among the holy grails of NLP. Algorithms still struggle with both, although there has been some recent progress through the use of deep neural networks. A recent paper concluded that sarcasm is topic-dependent and contextual. This means that an algorithm needs additional information in order to classify sarcasm correctly.
This makes it a considerably more difficult problem than just understanding whether some words convey positive or negative meaning, and it might require pre-trained word embeddings and personality models. A solution, according to this paper, is to use convolutional neural networks. Convolutions are a very interesting method in deep learning. If you are interested in the subject, this article and this one do a good job of explaining how they can be used on text.
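To make the idea of convolutions over text a bit more concrete, here is a minimal pure-Python sketch. It is not the model from the paper: the two-dimensional embeddings, the single bigram filter and the example sentence are all made-up values for illustration. A real network learns many filters over much larger embeddings. The sketch only shows the mechanics, which are sliding a filter over windows of consecutive word vectors and keeping the strongest response.

```python
# Toy 2-dimensional word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "great": [1.0, 0.0],
    "job": [0.5, 0.5],
    "oh": [0.0, 0.2],
    "breaking": [-0.5, 0.5],
    "it": [0.0, 0.0],
}
UNKNOWN = [0.0, 0.0]


def embed(tokens):
    """Map each token to its vector; unknown words get a zero vector."""
    return [EMBEDDINGS.get(tok, UNKNOWN) for tok in tokens]


def conv1d(vectors, kernel):
    """Slide a kernel spanning len(kernel) words over the sentence.

    Each output value is the dot product between the kernel and one
    window of consecutive word vectors (no padding, stride 1).
    """
    width = len(kernel)
    outputs = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        total = sum(
            w * x
            for k_vec, x_vec in zip(kernel, window)
            for w, x in zip(k_vec, x_vec)
        )
        outputs.append(total)
    return outputs


def max_over_time(feature_map):
    """Keep the strongest response of the filter anywhere in the sentence."""
    return max(feature_map)


# A single hypothetical bigram filter; a real model learns many of these.
bigram_filter = [[1.0, 0.0], [1.0, 1.0]]

tokens = "oh great job breaking it".split()
feature_map = conv1d(embed(tokens), bigram_filter)
feature = max_over_time(feature_map)
```

Here the filter responds most strongly to the window “great job”. In a trained model, the pooled features from many such filters would be fed to a classifier.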
Issue 4: Real-world knowledge
One of the most challenging aspects of natural language processing is understanding the actual context of what is being said. There is a lot of research taking place in this field. Memory networks are models that can use contextual information in order to answer questions. Attention networks can focus on individual parts of a sentence, similar to how a human pays attention to different words. Recurrent neural networks, such as bidirectional LSTMs, can be used to understand the wider context of a sentence. Recently, researchers have started applying these models to sentiment analysis.
This paper, for example, discusses the use of memory networks in sentiment analysis, and this paper discusses the possible use of bidirectional LSTMs with attention. Both are possible solutions, but these models are fairly complicated for the average person, plus they require big datasets.
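As a rough illustration of what “paying attention to different words” means, below is a minimal sketch of dot-product attention in plain Python. It is not the architecture from either paper. The word vectors and the query vector are hypothetical numbers chosen by hand, whereas a real model learns them from data.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, word_vectors):
    """Dot-product attention: weight each word by its similarity to the query."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(query)
    # The context vector is the attention-weighted average of the word vectors.
    context = [
        sum(w * vec[i] for w, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical 2-d vectors for the tokens of "the striker was aggressive".
tokens = ["the", "striker", "was", "aggressive"]
vectors = [[0.1, 0.0], [0.2, 0.9], [0.0, 0.1], [2.0, 0.3]]
query = [1.0, 0.0]  # hypothetical "how opinionated is this word?" query

weights, context = attend(query, vectors)
most_attended = tokens[weights.index(max(weights))]
```

With these toy numbers most of the weight lands on “aggressive”, so the context vector passed on to a classifier is dominated by that word.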
Tackling the issues of sentiment analysis
So, what can be done to solve these issues? Here are a few possible solutions.
One option is to train your own model for your own particular domain, ideally using a more advanced method (such as deep neural networks). This is the most expensive and time-consuming solution. However, if you actually have the resources to pull this off, you might find yourself holding a very valuable and unique piece of intellectual property.
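As a minimal sketch of what training on your own domain looks like, the example below fits a tiny Naive Bayes classifier on a handful of football posts. Note that this swaps in a much simpler technique than the deep neural networks mentioned above, purely to keep the example short and dependency-free. The training posts and labels are invented, and a real project would need far more labelled data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labelled football posts.
training_data = [
    ("what a killer finish from the striker", "positive"),
    ("aggressive pressing won us the game", "positive"),
    ("killer instinct in front of goal", "positive"),
    ("lazy defending cost us again", "negative"),
    ("terrible pass and a lazy midfield", "negative"),
    ("awful game from the keeper", "negative"),
]

model = train(training_data)
prediction = predict(model, "aggressive striker with a killer instinct")
```

Because the model has only ever seen “killer” and “aggressive” in positive football posts, it learns the domain meaning of these words rather than the everyday one.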
Another solution is to investigate how serious the problem is in the grand scheme of things. It is likely that the problem shows up only in a subset of the data. If so, you can build more focused machine learning models for that subset, or use some simple heuristics. For instance, in the example above regarding the word ‘aggressive’ in the context of sports, a possible solution is to just assign it a neutral meaning. This way you won’t risk a large misclassification (e.g. a positive post being flagged as negative, or a post about someone who is literally aggressive being flagged as positive).
So, keep these points in mind next time you are planning to use sentiment analysis in your project! If you are interested in hearing more about these topics and learning how to use these technologies in your business, make sure to check out my book, as well as my courses.
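The neutral-score heuristic for words like ‘aggressive’ can be sketched as follows. The lexicon, its scores and the example post are all hypothetical. Real libraries ship much larger lexicons and have their own ways of customising them, so treat this as an illustration of the idea rather than of any particular tool.

```python
# Hypothetical general-purpose polarity lexicon (word -> score).
GENERAL_LEXICON = {
    "aggressive": -1.0,
    "killer": -1.0,
    "great": 1.0,
    "terrible": -1.0,
}

# Domain overrides for football posts: ambiguous words become neutral.
FOOTBALL_OVERRIDES = {
    "aggressive": 0.0,
    "killer": 0.0,
}


def score(text, lexicon):
    """Sum the polarity of every known word; unknown words count as 0."""
    return sum(lexicon.get(tok, 0.0) for tok in text.lower().split())


def to_label(total):
    """Convert a numeric score into a coarse sentiment label."""
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Later keys win, so the overrides replace the general scores.
football_lexicon = {**GENERAL_LEXICON, **FOOTBALL_OVERRIDES}

post = "great game from our aggressive killer striker"
general_label = to_label(score(post, GENERAL_LEXICON))
football_label = to_label(score(post, football_lexicon))
```

With the general lexicon the post comes out as negative, while with the overrides the two ambiguous words no longer drag the score down and the post is labelled positive.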