COMMUNITY DETECTION ON TWITTER

SUMMARY

We scraped a large amount of community data on Twitter, starting from a seed account, and used it to see what communities we could find.

INTRODUCTION

The size and utility of networks have grown rapidly in the last few decades. Social media and other internet applications have made analyzing the connections between nodes (including people) a highly productive way to predict what they have in common, such as political affiliation, geographic location, or interest in a celebrity, to name a few examples.

Just as the graphs themselves have grown, so has the cost of finding communities in them. The number of possible edges grows quadratically with the number of nodes, so many algorithms that work at small scale become impractical on large-scale data. We used an algorithm known as the Louvain Method to predict communities based on common edges between nodes in our dataset. After predicting the communities, we took the largest ones and manually analyzed their members to describe what real-world community the algorithm had detected.

GATHERING DATA

We gathered data by scraping Twitter, starting at our seed account, @DessaDarling. For each user, beginning with Dessa's account, we scraped the first three thousand followers as well as the first three thousand accounts the user follows. From this data, we built a list of the accounts that appear on both lists for a given user (mutual followers). These mutual relationships are the edges we used to construct our graph.
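The mutual-follower step can be sketched as a set intersection. This is a minimal illustration, not the project's actual code: the function names and the dictionary-of-sets graph representation are assumptions made for the example.

```python
def mutual_followers(followers, following):
    """Return the accounts that appear in both lists (mutual followers)."""
    return set(followers) & set(following)


def add_mutual_edges(graph, user, followers, following):
    """Record an undirected edge between `user` and each of their mutuals.

    `graph` is a hypothetical adjacency map: account name -> set of neighbours.
    """
    for other in mutual_followers(followers, following):
        graph.setdefault(user, set()).add(other)
        graph.setdefault(other, set()).add(user)
    return graph


# Toy data: "b" and "c" both follow the seed and are followed back.
graph = add_mutual_edges({}, "seed", ["a", "b", "c"], ["b", "c", "d"])
```

Only accounts present in both lists become neighbours, so one-way follows ("a" and "d" in the toy data) never produce an edge.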

For each mutual, we checked whether we had already scraped their account and, if we had not, added them to a queue of users to scrape. We then scraped each of the users in the queue in order, which amounts to a breadth-first-search traversal of the graph.
All in all, we ended up with about 140 MB of data (~55 MB compressed) to analyze. From there, we proceeded to our community detection.
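The queue-based crawl described above might look roughly like the sketch below. The `fetch_mutuals` callable is a hypothetical stand-in for the scraping step and `max_users` is an illustrative stopping rule; neither comes from the project. Because the queue is first-in, first-out, this sketch visits accounts level by level outward from the seed.

```python
from collections import deque


def crawl(seed, fetch_mutuals, max_users=None):
    """Crawl outward from `seed`, scraping each account at most once.

    `fetch_mutuals(user)` is an assumed callable that returns the
    mutual followers of `user` as an iterable of account names.
    """
    queue = deque([seed])
    seen = {seed}          # accounts already queued or scraped
    graph = {}             # account -> set of mutuals
    while queue:
        if max_users is not None and len(graph) >= max_users:
            break
        user = queue.popleft()
        mutuals = list(fetch_mutuals(user))
        graph[user] = set(mutuals)
        for other in mutuals:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return graph


# Toy network standing in for Twitter.
network = {"seed": ["a", "b"], "a": ["seed", "c"], "b": ["seed"], "c": ["a"]}
crawled = crawl("seed", lambda user: network.get(user, []))
```

The `seen` set is what prevents an account from being queued twice when it is a mutual of several users.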

DETECTING COMMUNITIES

The Louvain Method we used to detect graph communities uses modularity to determine whether a set of nodes is a community. Modularity compares the actual number of edges within a set of nodes to the number that would be expected if the graph's edges were distributed at random. A group that is more internally connected than the graph as a whole has high modularity.

Figure 1: Each community has a higher modularity than the graph as a whole.
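To make the modularity comparison concrete, the sketch below computes the standard modularity score of a partition: for each community, the fraction of edges that fall inside it minus the fraction expected from its members' degrees. It only scores a given partition and does not implement the Louvain search itself. The graph representation and function names are assumptions for the example, and the text does not state which exact variant the project used.

```python
def build_graph(edges):
    """Build an undirected adjacency map (node -> set of neighbours)."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def modularity(graph, communities):
    """Modularity of a partition of an undirected graph without self-loops."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # total edge count
    if m == 0:
        return 0.0
    score = 0.0
    for community in communities:
        members = set(community)
        # Each internal edge is seen from both endpoints, hence the / 2.
        internal = sum(1 for u in members for v in graph[u] if v in members) / 2
        degree = sum(len(graph[u]) for u in members)
        score += internal / m - (degree / (2 * m)) ** 2
    return score


# Two triangles joined by a single edge.
toy = build_graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
split_score = modularity(toy, [{0, 1, 2}, {3, 4, 5}])
whole_score = modularity(toy, [{0, 1, 2, 3, 4, 5}])
```

On this toy graph, splitting into the two triangles scores 5/14 (about 0.36), while treating the whole graph as one community scores 0, which is the sense in which a densely connected group has higher modularity than the graph as a whole.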

The communities we chose to analyze manually were the ones with the most nodes. Within each community, we chose the ten most connected members within the group.
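One way to pick the largest communities and their most connected members is sketched below. It assumes "most connected" means the number of neighbours inside the same community, which the text does not spell out. The function name, parameters, and alphabetical tie-breaking are illustrative choices.

```python
def top_members(graph, communities, n_communities=10, n_members=10):
    """Return the most connected members of each of the largest communities.

    `graph` maps each node to a set of neighbours; `communities` is a list
    of node sets. Connectivity is counted inside the community only
    (an assumption made for this sketch).
    """
    largest = sorted(communities, key=len, reverse=True)[:n_communities]
    result = []
    for community in largest:
        members = set(community)
        internal_degree = {
            u: sum(1 for v in graph.get(u, ()) if v in members) for u in members
        }
        # Highest internal degree first; ties broken by name for stable output.
        ranked = sorted(members, key=lambda u: (-internal_degree[u], str(u)))
        result.append(ranked[:n_members])
    return result


# Toy data: a star around "a", plus a separate pair.
toy_graph = {
    "a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"},
    "e": {"f"}, "f": {"e"},
}
toy_communities = [{"e", "f"}, {"a", "b", "c", "d"}]
picked = top_members(toy_graph, toy_communities, n_communities=1, n_members=2)
```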

RESULTS

Below is a visualization to view the top ten most-connected members of the top ten largest groups that we found in the Twitter dataset. Use the forward and backward buttons to switch communities and tap on a circle to view information about an individual user. Click the center circle to cycle through members of the community in order.

CONCLUSION

After using the Louvain algorithm to identify communities in the Twitter dataset, we identified 10 interesting communities. By characterizing these communities and analyzing what their members had in common, we found two distinct features that connected the accounts in each community. The first was the geographical location of the accounts; for example, community 10 consisted solely of accounts based in the county of Essex in the United Kingdom. The second was the interests and professions of the account holders; for example, community 2 consisted of individuals in medical professions and medical PhD students. Throughout our analysis we also identified communities connected by a combination of geographical location and interest. One distinct example is community 5, which consisted of accounts that were all based in the UK and all focused on scriptwriting, ranging from TV shows to theatrical productions.

Overall, these findings suggest that undetected communities exist within social media networks and can be uncovered with high accuracy by the Louvain algorithm. It is important to test the performance of the Louvain algorithm individually on each social media platform to determine whether these communities can be extracted. More specifically, it is important to use sampling and random testing to manually assess extracted communities, since they may contain unrelated accounts. An example can be seen in community 7, where 4 of the 10 accounts we examined had no identifiable connection to the community of producers and rappers, a possible case of low-accuracy community detection. Further research is required to determine how well the Louvain algorithm applies to community detection across social media networks that structure their accounts and connections differently.

IMPLICATIONS

Looking to future research, it is essential to apply our findings and processes to other social networks to identify where they are applicable. We also recommend making careful use of the API access that social media networks provide, as we learned throughout our research. Twitter's API allowed each request to scrape only 1 account or 1000 results (whichever is less), at a rate of 15 requests per 15 minutes per API key. With 3 API keys, we could scrape a minimum of 15 accounts and a maximum of 45 accounts per 15 minutes, depending on each user's number of followers and following. This limitation constrained our data gathering on a platform that houses millions of users. Through analysis, we discovered that by sampling only the first 3000 followers and following of each user, we were still able to extract meaningful results with our model. It is therefore worthwhile to analyze a social network's API capabilities and identify creative ways to collect meaningful data within them.

Building on our findings, it would be interesting to research the application of community detection to advertisement recommendations. Specifically, by identifying community interests and locations from less sensitive data, social networks could allow advertisers to target accounts accurately without releasing the highly sensitive account data that is currently used. This benefits the consumer or account owner by keeping their personal data safe, while giving advertisers comparable targeting accuracy. This is becoming particularly important in a world where more of our private data is shared on a daily basis.
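As a rough check of the rate-limit figures quoted in this section, the sketch below works out how many accounts fit into one 15-minute window. It reproduces the 15 and 45 figures only under the assumption that the request budget is counted per list (followers or following) rather than shared between the two; the function name and parameters are illustrative, not part of any Twitter API.

```python
import math


def accounts_per_window(api_keys, requests_per_key, results_per_account,
                        results_per_request=1000):
    """Accounts that can be scraped in one rate-limit window.

    Assumes each request returns at most `results_per_request` entries
    for a single account, so larger lists need several requests.
    """
    requests_per_account = max(1, math.ceil(results_per_account / results_per_request))
    total_requests = api_keys * requests_per_key
    return total_requests // requests_per_account


# Worst case: every account hits the 3000-entry cap (3 requests each).
worst_case = accounts_per_window(api_keys=3, requests_per_key=15, results_per_account=3000)
# Best case: every account fits in a single request.
best_case = accounts_per_window(api_keys=3, requests_per_key=15, results_per_account=1000)
```

With 3 keys and 15 requests per key, the window holds 45 requests, which covers 15 accounts at the 3000-entry cap and 45 accounts when each needs only one request.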

Furthermore, it would be interesting to apply the Louvain algorithm's community detection capabilities to group and account recommendations on social media platforms. Currently, on many platforms such as Instagram or Facebook, account recommendations are based on the people you follow and their connections. By uncovering hidden communities of common interest, platforms may be able to recommend accounts based on interests shared among individuals.

We have only scratched the surface of identifying communities within social media platforms. It is now in the hands of the social media giants and the scientific community to research this space further and put what they learn into practice.

Marwan Kudsi (mkudsi@ucsd.edu) | Freddy Wang (f5wang@ucsd.edu) | Ari Stassinopoulos (aristotle@ucsd.edu); Mentor: Arya Mazumdar