Why We Need More Visibility Into The Social Media Analytics Algorithms We Use
Social media has increasingly become the lens through which we observe the modern human world. In turn, the vast industry of social media analytics platforms has become the set of instruments through which we use that lens to make sense of society. Yet, despite real-world business and governmental decisions being made on their results, we know surprisingly little about the algorithms powering most of these platforms, especially their edge cases and interpretive nuances. How can we make meaningful decisions about social media based on the results of algorithms we know nothing about?
Nearly every social media analytics platform today offers some form of sentiment analysis, typically a simple positive-neutral-negative score, though occasionally with a few more nuanced categories like joy or fear.
These sentiment scores have become a go-to resource for understanding social trends, classifying reactions to topics, and guiding business and marketing decisions.
Calculating social media sentiment is an extremely difficult task, filled with linguistic complexities like sarcasm, slang, and domain-specific vocabulary.
Unfortunately, few social media companies offer any substantive technical detail on how their sentiment algorithms work, treating them as proprietary business secrets.
Some disclose that their systems are traditional “bag of words” word counters: they maintain two lists of words, one of “positive” words and one of “negative” words, and simply count how many of a tweet’s words appear in each list. Some assign each word a score recording just how “positive” or “negative” it is, to differentiate between “love” and “like” or between “loathe” and “dislike.” Some use more sophisticated statistical or even neural algorithms. However, almost none share their actual word lists or algorithms.
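A minimal sketch of the weighted word-list approach described above (the lexicon words and weights here are illustrative assumptions, not any vendor’s actual dictionary):

```python
from typing import Dict

# Toy weighted sentiment lexicon -- the words and weights are
# illustrative assumptions, not a real product's dictionary.
LEXICON: Dict[str, float] = {
    "love": 2.0, "like": 1.0, "great": 1.5,
    "dislike": -1.0, "loathe": -2.0, "terrible": -1.5,
}

def bag_of_words_sentiment(tweet: str) -> float:
    """Sum the lexicon weights of every word in the tweet.

    This is the simplest "bag of words" approach: word order, negation
    ("not great"), sarcasm and context are all ignored.
    """
    return sum(LEXICON.get(word, 0.0) for word in tweet.lower().split())

print(bag_of_words_sentiment("I love this but dislike the ending"))  # 1.0
```

Note that the final score says nothing about *why* the tweet was scored that way, which is precisely the opacity problem when the lexicon itself is secret.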
Rudimentary visibility into sentiment algorithms can be found by filtering to just negative or positive tweets and then using the word cloud histogram feature offered by most platforms to see which words seem to dominate those tweets. This can help identify particularly glaring vocabulary mismatches.
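This diagnostic can be approximated in a few lines: take the tweets the platform scored as negative and build the same word histogram a word cloud is based on. The sample tweets below are hypothetical:

```python
from collections import Counter

def top_words(tweets, n=3):
    """Count word frequencies across a set of tweets, mimicking the
    word-cloud histogram view many platforms offer."""
    counts = Counter(word for t in tweets for word in t.lower().split())
    return counts.most_common(n)

# Hypothetical tweets the platform scored as "negative":
negative_tweets = [
    "great block on the court tonight",
    "back on the court for game two",
    "the court was packed for the final",
]
print(top_words(negative_tweets))
```

A non-evaluative domain word dominating the “negative” histogram is exactly the kind of glaring vocabulary mismatch this technique surfaces.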
For example, one tool consistently ranked basketball tweets as substantially more negative than football tweets. The culprit turned out to be the fact that many basketball tweets in the sample contained the word “court” to refer to the basketball court, while the tool’s sentiment dictionary labeled “court” as a highly negative word, assuming it always referred to a legal court.
Similarly, another analysis showed tweets about Republicans as far more positive than those about Democrats because, during the sample period, Democrats were often referred to simply as “democrats,” whereas Republicans were consistently referred to as “the republican party.” The algorithm was incorrectly labeling the word “party” as a very positive word.
Reverse engineering sentiment algorithms in this fashion can help identify misalignments between an algorithm’s dictionaries and the specific domain being examined. Some social analytics platforms permit their users to manually adjust the sentiment dictionaries applied to a given analysis, supporting domain adaptation, though not all do. Such analysis can also yield clues as to the provenance of a platform’s dictionary; some platforms use lightly modified versions of well-known open source sentiment dictionaries.
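A sketch of what such a manual dictionary adjustment might look like, assuming a simple weighted lexicon (the words and weights are illustrative, not drawn from any real product):

```python
# Base lexicon -- illustrative weights, not a real product's dictionary.
base_lexicon = {"court": -1.5, "party": 1.5, "win": 1.0, "lose": -1.0}

def score(tweet, lexicon):
    """Score a tweet by summing the lexicon weights of its words."""
    return sum(lexicon.get(w, 0.0) for w in tweet.lower().split())

tweet = "huge win on the court tonight"
print(score(tweet, base_lexicon))  # -0.5: "court" drags the score negative

# Domain adaptation: neutralize terms that are not evaluative in
# basketball coverage, then rescore.
basketball_lexicon = {**base_lexicon, "court": 0.0}
print(score(tweet, basketball_lexicon))  # 1.0
```

The design point is that the correction is trivial once the dictionary is visible; without visibility, the mismatch can only be inferred indirectly.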
Some provide clues buried in their internal user documentation, such as noting that their sentiment system was initially trained on a few hundred thousand or a few million tweets when the company was first founded and has not been updated since. Few update their dictionaries in real time, hour by hour, to capture the latest linguistic nuances of Twitter.
Using a dictionary based on a few million tweets sampled from a decade ago raises grave concerns about just what those results are actually measuring.
Most social media analytics platform users are not data scientists, meaning they likely aren’t thinking critically about these sorts of questions or performing systematic evaluations of the results they receive.
Language detection is another opaque but critically important algorithm when searching for words that have different meanings in different languages or which represent a brand’s name in one language but a common unrelated word in another.
At first glance, determining the language of a given tweet might seem relatively trivial. However, the small amount of text and the prevalence of acronyms and slang make social media content especially difficult for traditional language detection algorithms. Tools like Google’s Compact Language Detector (CLD2) library, open-sourced from the Chromium codebase, can be readily applied to tweets with quite usable results, but many analytics platforms deploy their own custom algorithms that have been optimized for social media content, especially Twitter.
Widely used libraries like CLD2 have well-understood performance characteristics and extensive documentation on their edge cases. Some, like CLD2, are completely open source, allowing advanced users to fully understand precisely how the algorithm arrives at its determinations and to proactively identify environments where it may struggle.
In contrast, few social media analytics firms provide much in the way of documentation of their proprietary language detection systems. Many decline to answer specific technical questions, including the type of algorithm used and the size and origin of its training data, treating such information as proprietary business information.
Similar to sentiment, it is sometimes possible to reverse engineer that a given company’s “proprietary” algorithm is in reality just a standard library like CLD2 with a few basic commonsense preprocessing steps like removing hyperlinks and @username references.
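Those commonsense preprocessing steps are easy to sketch. The regular expressions below are an assumed illustration, not any vendor’s confirmed pipeline:

```python
import re

def preprocess_for_langid(tweet: str) -> str:
    """Strip the tweet features that most confuse language detectors
    before handing the text to a library like CLD2: hyperlinks,
    @username references, and hashtag markers (the '#' only, keeping
    the tag word itself)."""
    tweet = re.sub(r"https?://\S+", " ", tweet)  # hyperlinks
    tweet = re.sub(r"@\w+", " ", tweet)          # @username references
    tweet = re.sub(r"#", " ", tweet)             # hashtag markers
    return " ".join(tweet.split())

print(preprocess_for_langid("@nasa geweldig nieuws! #ruimte https://t.co/abc"))
# geweldig nieuws! ruimte
```

Stripping these tokens matters because URLs and handles are language-neutral noise that dilutes the already tiny amount of text a detector has to work with.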
Most of the time, however, it is simply impossible to know how a company’s language detection algorithm functions.
Relying on a third-party algorithm without any understanding of its nuances and edge cases is extremely dangerous when it comes to interpreting the results it yields. If an analytics platform reports that Dutch tweets about a topic have decreased five-fold over a 24-month period, down to almost zero, does that really mean the Dutch have simply stopped talking about that topic? Or could it instead be that Dutch Twitter practices, from slang usage to abbreviations, have evolved in such a way that the company’s language detection algorithm has become less and less accurate at detecting the language?
Without additional information there is no way to know whether the linguistic trends being observed are real or merely algorithmic artifacts.
Comparing results across multiple social analytics companies can lend confidence to observed trends, but the lack of technical detail on the underlying algorithms used by each platform makes it impossible to know whether they are all in reality using the same shared algorithms under the hood.
Many platforms offer vague measures of the “importance” or “impressions” or “influence” of those tweeting about a given topic. Some provide at least basic definitions of those terms, such as summing the total number of followers of all users who tweeted about the topic. Yet none of these approaches is truly satisfying or meaningful across all queries.
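The follower-summing definition mentioned above can be sketched in a few lines (the user records are hypothetical):

```python
def total_reach(tweeters):
    """One common "influence" definition: sum the follower counts of
    everyone who tweeted about the topic. Note everything this ignores:
    overlapping audiences, bots, and whether any follower actually
    saw the tweet."""
    return sum(user["followers"] for user in tweeters)

# Hypothetical users who tweeted about a topic:
tweeters = [
    {"name": "a", "followers": 1_200_000},
    {"name": "b", "followers": 350},
    {"name": "c", "followers": 4_100},
]
print(total_reach(tweeters))  # 1204450
```

A single celebrity account dominates the total, which is exactly why one scalar “influence” number tells you so little about who was actually reached.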
If Donald Trump tweets his support of a new book, his endorsement is likely to lead one half of the US population to embrace the book and the other half to demonize it. Similarly, if Barack Obama tweets his support of a book, the reaction will likely be precisely the inverse of the reaction to Trump’s endorsement. In short, each individual has a specific demographic and ideological base among which he is highly influential.
A marketer wishing to pitch a new liberal-leaning book can’t just rank all Twitter users by one magical score of “influence” and pick Donald Trump from the top of the list to ask for an endorsement, nor could they pick Barack Obama to pitch a conservative-leaning book. They have to look at the “demographics of influence” of each user.
Few platforms offer such demographic-level influencer scores as part of their routine summary displays.
In fact, few platforms divulge how they compute the demographic information they offer for Twitter users, from age to geography to income and education levels. Estimating the location of non-geotagged tweets is an extremely difficult task and the majority of the most obvious approaches don’t actually work.
Filtering tweets by country of origin is thus an incredibly error-prone process at best, with uncertain results.
More broadly, there is often a sharp divide between the marketing materials many social media analytics platforms tout and the limited realities of how those platforms actually work. For example, platforms may aggressively market themselves as deep learning companies that harness the full power of neural networks to make sense of Twitter. In reality, some caveat those claims upon further scrutiny, acknowledging that they limit their use of deep learning to just a few minor specialty tools and that those tools are in turn limited to small random samples of data, with the overwhelming majority of their results being based on non-neural approaches.
In short, don’t trust a company’s marketing brochures – ask the hard questions about whether each algorithm you will be using is neural-based, naïve Bayesian, or simple word counting.
To reduce the computational complexity of tools like word clouds, attribute histograms, maps, customized sentiment analysis, clustering and other higher order analyses, some companies limit their analytics tools to small samples of the total data. While the original query might match more than a quarter billion tweets, the resulting word cloud might be based on just the most recent 1,000 tweets or a random sample of 10,000 tweets, vastly reducing its coverage. Some platforms place prominent warnings in their user interfaces about this sampling, while others bury these caveats deep in their technical documentation.
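A toy illustration of how little of a large result set such samples cover (the result-set size and sample sizes here are illustrative, not from any specific platform):

```python
import random

# Hypothetical matched result set: a real platform's query might match
# hundreds of millions of tweets; here we simulate with integers.
matched = list(range(250_000))

# "Most recent 1,000": just the tail of the result set.
recent_sample = matched[-1000:]

# "Random sample of 10,000": a tiny, hopefully representative slice.
random.seed(42)  # fixed seed so the sketch is reproducible
random_sample = random.sample(matched, 10_000)

coverage = len(random_sample) / len(matched)
print(f"random sample covers {coverage:.2%} of matches")  # 4.00%
```

Even in this small simulation the word cloud would be built from 4% of the matches; against a quarter-billion-tweet result set, a 10,000-tweet sample covers a vanishingly small fraction.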
Putting this all together, the social media analytics revolution reflects a broader trend of the big data world: as we assemble ever greater archives of human behavior, we explore that data through ever more opaque algorithms. We shovel petabytes of data through these black boxes and report what came out the other end without even the faintest understanding of whether those reported results are in any way accurate or meaningful or how their accuracy might vary from query to query. In turn, businesses and governments make very real economic and policy decisions based on numbers that could for all intents and purposes have simply been produced by a random number generator.
In the end, how can we make meaningful decisions about social media based on the results of algorithms we know nothing about?