Data Extraction Limitations of Radian6 and Sysomos That You Need to Know!
Application of Social Media Listening Tools
Social media data is vast – we all agree. As per this blog, in a single minute of 2018, 12.9M texts were sent, 473k tweets were posted on Twitter, and 49k photos were shared on Instagram, to call out a few sources!
Text analytics projects, notably social media analytics, involve extracting huge volumes of data relevant to a particular industry/context. Some of the business objectives could be
(a) to identify the emerging trends from these conversations,
(b) to understand the sentiment around a specific brand, event, etc.
A quick idea to extract data from, say, Twitter or Instagram could be to register for the API of each individual source and pull the data we’re looking for. For selected blogs and forums, we may have to write web scraping scripts using Python.
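As a minimal, standard-library-only sketch of such a scraping script (the forum URL and the assumption that posts live in `<p>` tags are both hypothetical – real forums need selectors tuned to their markup):

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # used only when fetching a live page


class PostExtractor(HTMLParser):
    """Collects the text content of every <p> element on a page."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.posts[-1] += data


def extract_posts(html_text):
    parser = PostExtractor()
    parser.feed(html_text)
    return [p.strip() for p in parser.posts if p.strip()]


# Example on an inline snippet; a real run would first fetch the page, e.g.
# html_text = urlopen("https://example-forum.com/thread/123").read().decode()
sample = "<html><body><p>Great product!</p><p>Not worth it.</p></body></html>"
print(extract_posts(sample))  # ['Great product!', 'Not worth it.']
```

For heavier scraping, third-party libraries such as BeautifulSoup or Scrapy are the usual choice; the stdlib version above just illustrates the idea.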
What if there were an aggregator that pulls massive data across sources, including historical data from past years? In the market, there are popular social media listening tools such as Radian6 and Sysomos that cater to this requirement. These tools index the data at a defined frequency and allow us to extract it.
How to extract the data from Radian6 or Sysomos would be a topic for another day. In this article, I would like to list the data limitations or constraints that I have come across so far. Knowing these key constraints, you can plan the extraction phase of your project accordingly.
By the way, Radian6 was acquired by Salesforce and the product was then renamed & released as “Social Studio”.
Data Extraction Limitations of Salesforce Social Studio (formerly, Radian6)
1) In a single day, we can extract either 500k records or a 3-month timeline in one go, whichever limit is hit first. If you want to extract one year of data on the topic “Indian Premier League”, for instance, you can add the keywords and extract quarter by quarter – at the end, you would have four files, one per quarter.
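The quarter-by-quarter splitting above can be sketched as a small helper (a sketch, not Social Studio functionality – it assumes the window start day exists in the month three months later, which always holds when you extract from the 1st of a month):

```python
from datetime import date, timedelta


def quarter_windows(start, end):
    """Split [start, end] into consecutive windows of at most 3 months,
    so each pull stays within the per-export timeline limit."""
    windows = []
    win_start = start
    while win_start <= end:
        # Move forward 3 calendar months, then step back one day.
        month = win_start.month + 3
        year = win_start.year + (month - 1) // 12
        month = (month - 1) % 12 + 1
        win_end = min(date(year, month, win_start.day) - timedelta(days=1), end)
        windows.append((win_start, win_end))
        win_start = win_end + timedelta(days=1)
    return windows


# One year of "Indian Premier League" data -> four quarterly export windows.
for w in quarter_windows(date(2015, 1, 1), date(2015, 12, 31)):
    print(w)
```

Each returned `(start, end)` pair then becomes one export run in the tool.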
2) For Twitter, we can download only 50k tweets in a single day. Beyond that limit, the tool exports only the external IDs; we then need a library such as Tweepy to pass those IDs to the Twitter API and fetch the corresponding tweets.
There’s a good possibility that, say, you have 10k tweets for the period Jan – Dec 2015 on a selected topic/keywords extracted via Social Studio today, and when you fetch those external IDs using Tweepy, don’t be surprised if the data volume has reduced significantly. Time and again, Twitter removes spam tweets and blocks the offending users! Hence, you might see a mismatch in those numbers, which is fine – we don’t really want the spam messages, after all.
Since Social Studio had historically indexed those spam tweets/blocked users, we can’t do anything about it. I wish the tool had a built-in feature to check back with Twitter, at least twice a year, on whether those users have been blocked, so that a major portion of the junk data could be removed 🙂
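That re-fetch step can be sketched as follows (assuming Tweepy v4, where the call is `lookup_statuses`; earlier versions name it `statuses_lookup`, and the credential names here are placeholders). Twitter’s lookup endpoint accepts at most 100 IDs per call, so the exported external IDs must be chunked first:

```python
def batch_ids(ids, size=100):
    """Twitter's statuses-lookup endpoint accepts at most 100 IDs per call,
    so chunk the exported external IDs accordingly."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_tweets(external_ids, consumer_key, consumer_secret,
                 access_token, access_secret):
    # Import here so the chunking helper stays usable without tweepy installed.
    import tweepy  # assumes Tweepy v4

    auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret,
                                    access_token, access_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    tweets = []
    for batch in batch_ids(external_ids):
        # Deleted/spam tweets and blocked users are silently dropped here,
        # which explains the volume mismatch described above.
        tweets.extend(api.lookup_statuses(batch))
    return tweets
```

The returned list will typically be shorter than the ID list for exactly the spam/blocked-user reasons discussed above.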
Data Extraction Limitations of Sysomos
1) For Twitter, Forums, Blogs, News and Tumblr, the historical data only goes back up to 13 months.
2) For YouTube, the download limit is 500 mentions per export, while for all other sources it is 50,000 mentions per export. So, we need to shorten our date range and download the data in batches if it exceeds the 50k limit.
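One way to automate that date-range shortening is to halve a window until each piece fits under the export limit. This is a sketch under the assumption that you can check a window’s mention count before exporting (Sysomos shows counts in its UI; the `count_mentions` callback below is a stand-in for that check):

```python
from datetime import date, timedelta


def export_windows(start, end, count_mentions, limit=50_000):
    """Recursively halve [start, end] until every window's mention
    count fits within the per-export limit."""
    if count_mentions(start, end) <= limit or start >= end:
        return [(start, end)]
    mid = start + timedelta(days=(end - start).days // 2)
    return (export_windows(start, mid, count_mentions, limit)
            + export_windows(mid + timedelta(days=1), end, count_mentions, limit))


# Toy example: a topic producing a steady 1,000 mentions/day.
per_day = lambda s, e: ((e - s).days + 1) * 1000
windows = export_windows(date(2019, 1, 1), date(2019, 4, 10), per_day)
print(windows)  # two ~50-day windows, each at or under 50k mentions
```

With real data the counts are uneven, so some halves split further than others – the recursion handles that automatically.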
3) For Facebook and YouTube, how far back we can go is governed by what the respective APIs allow, so we cannot give an exact cut-off date.
Social Studio vs Sysomos: Key Differences and Common Limitations
1) Social Studio can extract a rolling 3 years of data, whereas Sysomos gives us only the last 13 months.
2) Sysomos covers sources such as Instagram, but neither of these listening tools covers Pinterest yet.
3) We can’t add our own data source, be it a new blog or forum; hence we end up writing web scrapers for custom requirements or websites.
4) The more generic your input keyword is, the more spam/irrelevant data your output would contain! That’s the key challenge here. A case in point: for one of the products we’re tracking – “Kitchen Sink” – lots of idioms/phrases get pulled out, e.g. “Let it sink for a minute”. There’s an album called Kitchen Sink as well :). All this spam has to be cleaned prior to subsequent analyses.
Based on your requirements, you can choose the tool and extract the desired data.
P.S.: These limitations keep changing as the respective tools are updated. I’ve written these based on my past 6 months of usage.