Many people say we are in the information era, but it seems we are already past it. Nowadays, information about almost anything is within everyone's reach, in whatever quantity we want.
Data is not the issue anymore, at least most of the time; the real issue is how to analyze it. The problem is no longer getting information but coping with too much of it. One of the places where we find too much data is social networks. Their richness is that they are a continuous flow of interesting data, what I like to call social data.
Social data is rich because you can extract information from it in so many ways. One way is to analyze what people express about a specific topic on social media. To this end, I developed a way to identify the most important ideas found in a stream of user comments: essentially, an algorithmic summary tool.
With a data set of a few dozen user comments, it is easy to grasp the general feelings and thoughts that people have about a specific topic just by reading the comments, but with a larger set it becomes difficult to extract this kind of information.
One way to attack this semantic summary is to identify groups of words that exhibit a certain relation. We start with an array of texts related to a topic, and the goal is to produce a set of ideas or concepts that summarize what people say about it. My approach was to generate clusters of strongly related words, so that each cluster can be read as an idea or concept.
Given the array of texts, each word becomes a point in space whose coordinates are given by different measure functions. I then applied cluster analysis to identify word clusters, which represent ideas or concepts that recur in the sequence. The challenge, then, is to extract numerical information from the social data, in this case words: to somehow measure the importance, or information content, of each word relative to the topic.
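The pipeline can be summarized in a few lines. The sketch below assumes Python with scikit-learn and a list of coordinate functions `measures` (one plausible set is defined in the next section); the post does not name a specific clustering algorithm, so k-means stands in here as an illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def word_point(word, texts, measures):
    # One coordinate per measure function; this mapping of a word to a
    # point in space is the core idea described above.
    return np.array([m(word, texts) for m in measures])

def cluster_words(words, texts, measures, n_clusters=5):
    # Embed every word, then group the points; each resulting group of
    # words is read as one idea or concept.
    points = np.array([word_point(w, texts, measures) for w in words])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(points)
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(label, []).append(word)
    return list(clusters.values())
```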
One of these semantic distances is given by the frequency count of a word: measuring its probability of appearance gives a measure of its information content. Another is the average probability of appearance of the word across the elements of the array. One can also use as a coordinate the number of words between the word and the topic, or the average of that distance over the array. Finally, another semantic distance can be taken to be the entropy of a word relative to the array.
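Here is one plausible reading of these measures in code. The details — whitespace tokenization, how texts missing a word are handled, the base of the logarithm — are my assumptions, not something the method above pins down.

```python
import math

def tokens(text):
    # Simple whitespace tokenization; the method does not specify a tokenizer.
    return text.lower().split()

def frequency(word, texts):
    # Probability of appearance over all words in the array.
    all_tokens = [t for text in texts for t in tokens(text)]
    return all_tokens.count(word) / max(len(all_tokens), 1)

def avg_probability(word, texts):
    # Mean per-text probability of appearance.
    probs = [tokens(t).count(word) / max(len(tokens(t)), 1) for t in texts]
    return sum(probs) / len(probs)

def avg_topic_distance(word, texts, topic):
    # Average number of tokens between the word and the topic word, taken
    # over texts containing both; texts missing either are skipped (an
    # assumption, since that case is left open above).
    distances = []
    for text in texts:
        toks = tokens(text)
        if word in toks and topic in toks:
            distances.append(abs(toks.index(word) - toks.index(topic)))
    return sum(distances) / len(distances) if distances else 0.0

def entropy(word, texts):
    # Entropy of the word's occurrence distribution over the array elements:
    # p_i is the share of the word's occurrences that fall in text i.
    counts = [tokens(t).count(word) for t in texts]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```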
Depending on the data set, it is possible to analyze these distance functions and discard correlated ones to avoid redundant information. A way to enrich the information content is to adjoin extra dimensions using spectral clustering. From the set of words we construct an adjacency matrix whose entry for a pair of words is the number of array elements in which both words appear. From this matrix we compute the eigenvectors with the largest eigenvalues; each component of such an eigenvector gives an information content for the corresponding word. Taking the top few eigenvectors adds extra dimensions to the original information space, in which cluster analysis can then identify groups of correlated words.
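A minimal sketch of this augmentation, under the same assumptions as before (whitespace tokenization, k-means as the clustering step); the number of eigenvectors kept, `k`, and `n_clusters` are illustrative parameters, not values taken from the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans

def cooccurrence_matrix(words, texts):
    # adjacency[i, j] = number of array elements in which both words appear.
    presence = np.array([[1.0 if w in t.lower().split() else 0.0
                          for t in texts] for w in words])
    return presence @ presence.T

def spectral_coordinates(words, texts, k=3):
    # Eigenvectors for the k largest eigenvalues of the symmetric adjacency
    # matrix; component i of each eigenvector becomes an extra coordinate
    # for word i.
    eigenvalues, eigenvectors = np.linalg.eigh(cooccurrence_matrix(words, texts))
    return eigenvectors[:, -k:]  # eigh returns eigenvalues in ascending order

def augmented_clusters(words, texts, base_points, k=3, n_clusters=5):
    # Adjoin the spectral dimensions to the original measure space, then
    # run the cluster analysis on the augmented points.
    X = np.hstack([np.asarray(base_points),
                   spectral_coordinates(words, texts, k)])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    groups = {}
    for word, label in zip(words, labels):
        groups.setdefault(label, []).append(word)
    return list(groups.values())
```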
With this analysis, some of the clusters generated for the topic "Trump" on Twitter are:
For the topic "Obama" on Twitter:
It is worth noting that these results were taken from a live Twitter feed and hence reflect what people were tweeting about on the afternoon of Tuesday, Jan 12th.