There are natural events, and there are social events. We have satellites and sensors to detect natural events such as floods and earthquakes. But is there a way to get an early sign of a social event such as a new fashion, a movement like #MeToo or a political event like the Arab Spring?
If you are a marketer, spotting an emerging social trend or celebrity early can go a long way towards strengthening your brand presence.
Where to look for the data?
If you are hunting for social events, then you need to analyse the data that appears on social media sites. The usual suspects such as Twitter, Instagram and Facebook are the places to look.
Data from these sites has some peculiarities, and understanding them is crucial to avoiding false detections. Some of these peculiarities are listed below:
- Social media data is massive in scale. Twitter gets over 500 million tweets per day!
- Not every spike in the data is useful. For example, at the beginning of the new year, you would expect a lot of new year greetings, but we do not need AI to learn that.
- The text in social media is so informal that an NLP program without suitable adaptation would fail.
This means that raw computing power alone is not sufficient to get real-time insights about social events. We must be smart about the algorithms.
Can a supervised machine learning classifier solve this problem?
If you are tempted to throw a deep neural network at this problem, pause for a moment and look at the image below.
A supervised model will work only if the outcome categories are fixed up front. But the whole idea of social event detection is to discover novel social ideas that impact a large population, and these cannot be labelled in advance. So, we will rule out supervised learning.
If it is not supervised learning, then is it unsupervised learning?
Yes, in theory, but not in practice. There are two problems here. First, we need to store data for a period of time to create clusters, and given the scale of social media data, that is not practical. Second, if something is far from a cluster centre, is it an outlier or an emerging trend? Clustering alone cannot tell the two apart.
Let us try a simpler method
This is a classic case where a complex problem deserves a simple solution. We are trying to capture spikes in discussions without storing all the data. The diagram below is a representation of such a scenario.
There are quite a few methods that can be applied to detect social events. At Centelon, we have found n-gram hashing to be the most effective one. Here is a brief outline of the approach:
- Create n-grams for a sample of text
- Apply a hash map to them, which means each word or n-gram has a hash bucket assigned to it
- Store the exponential moving average and exponential moving variance of the n-gram counts in each hash bucket
- If the ratio of the two crosses a set threshold, then the words or n-grams represent a potential social event
- To reduce the impact of hash collisions, use multiple hash maps, each with a different hash function
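The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions, not a production implementation: the names `ngrams`, `SpikeDetector` and `process_window`, the parameters `alpha`, `threshold` and `warmup`, and the choice of salted BLAKE2 hashing are all hypothetical. In particular, we assume the "ratio" is the deviation of the current count from the moving average, divided by the moving standard deviation, and we take the smallest score across the hash maps so that a collision in one map cannot trigger an alert on its own.

```python
import hashlib
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from lower-cased text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


class SpikeDetector:
    """Hypothetical detector: hashed n-gram counts with moving statistics."""

    def __init__(self, num_tables=3, num_buckets=4096, alpha=0.2,
                 threshold=3.0, warmup=5):
        self.num_tables = num_tables    # number of independent hash maps
        self.num_buckets = num_buckets  # buckets per hash map
        self.alpha = alpha              # smoothing factor of the moving stats
        self.threshold = threshold      # score above which we flag an n-gram
        self.warmup = warmup            # windows to observe before flagging
        self.windows_seen = 0
        self.mean = [[0.0] * num_buckets for _ in range(num_tables)]
        self.var = [[0.0] * num_buckets for _ in range(num_tables)]

    def _bucket(self, gram, table):
        # A different salt per table gives each hash map its own hash function.
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8,
                                 salt=bytes([table])).digest()
        return int.from_bytes(digest, "big") % self.num_buckets

    def process_window(self, texts):
        """Score one time window of texts, then update the moving statistics."""
        # Only the current window is counted; older raw text is never stored.
        window = Counter(g for t in texts for g in ngrams(t))
        counts = [Counter() for _ in range(self.num_tables)]
        for gram, c in window.items():
            for t in range(self.num_tables):
                counts[t][self._bucket(gram, t)] += c  # collisions add up

        flagged = []
        if self.windows_seen >= self.warmup:
            for gram in window:
                scores = []
                for t in range(self.num_tables):
                    b = self._bucket(gram, t)
                    # Assumed ratio: deviation over moving standard deviation.
                    # The +1.0 keeps the score finite for unseen buckets.
                    scores.append((counts[t][b] - self.mean[t][b])
                                  / math.sqrt(self.var[t][b] + 1.0))
                if min(scores) > self.threshold:
                    flagged.append(gram)

        # Update every bucket, so n-grams that went quiet decay towards zero.
        for t in range(self.num_tables):
            for b in range(self.num_buckets):
                delta = counts[t][b] - self.mean[t][b]
                self.mean[t][b] += self.alpha * delta
                self.var[t][b] = (1 - self.alpha) * (
                    self.var[t][b] + self.alpha * delta * delta)
        self.windows_seen += 1
        return flagged
```

To use it, call `process_window` once per time slice (say, every minute) with the texts collected in that slice; it returns the n-grams whose counts spiked relative to their recent history. Memory stays fixed at `num_tables * num_buckets` pairs of numbers, no matter how many distinct n-grams flow through.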
Reach out to us if you would like to know more about social event detection.