Puzzling through Social Data Sources

Social listening may seem simple, especially if you’re getting a pitch from a software company. Just enter your search terms, and ta-da: social data. Unfortunately, that’s not how it works. Each software product collects data differently, and the data that’s returned may come with nuances that aren’t obvious to the casual user. Since social listening is all that we do, I’m sharing what we’ve learned working with millions of social mentions from over a hundred institutions. Hopefully it will help you make an informed decision about how to approach social listening for your campus.

Approaching the Social Data Puzzle

As social listening researchers, you need to know all about the data you’re collecting: where is it from, how did it get to you, and what does it cover? Once you answer these questions, you can see how the pieces fit together. For example, Twitter data is typically a larger puzzle piece (it tends to be a larger data source). How does that affect how you look at smaller, non-Twitter data as part of the puzzle? At Campus Sonar, we know that the size of the puzzle piece doesn’t indicate its importance. Instead, it provides context for how the pieces fit into the larger whole for analysis purposes.

When collecting social data, the first aspect to understand is the breakdown between how data is collected from social media sites and non-social media sites.

Piecing Together Data from Social Media

If you’re using enterprise-level social listening software, like we do at Campus Sonar, your software provider likely has a few ways they work with social media sites to collect data.

If data comes directly from social media sites, we get more of it. This inevitably affects our dataset and analysis.

At times, sites like Twitter may agree to partner with a social listening software provider and supply data that matches a social listening query directly from their site. This type of relationship may mean that the site grants near-complete access to publicly available data on their site.

When our dataset is dominated by one source, our analysts make observations from the dataset by asking questions like:

  • If this happens on that site, will it happen anywhere?
  • If this doesn’t happen on that site, will it not happen anywhere? 
  • If that group is succeeding or having problems on that site, can we be sure that all groups are succeeding or having problems on all sites? 

We can’t use observations from social data to make statistical generalizations, but they can support logical generalizations (or inferences), provided we know where our data comes from.

If data is limited from certain social media sites, it doesn’t match the rest of our dataset. This imbalance affects our analysis.

Some social media sites choose to limit what data social listening software can collect from their site via an Application Programming Interface (API). An API is the part of a site’s server that receives requests and sends responses. 

A social site has terms and conditions detailing how much of its site can be accessed through its API by social listening software, and how often. Because of these terms, social listening software may have limited access to the site’s data in volume (you don’t get all matching data), time frame (you only get data from a certain period), or type of data (you can only access data from certain users, groups, or areas of the site). In other instances, social listening software uses third-party data providers and collects data via their unique data stream from the social media site. Data from third parties is usually subject to the same access limitations as APIs.

Instagram’s access rules are nuanced, limiting social data by account type, hashtag, and ownership. YouTube primarily limits collection of social data by volume: you won’t be able to access all data available during a time period.

When we collect data from social media sites that limit access, our analysts know this data doesn’t match the rest of the data collected in volume, time period, or account type. This informs how we conduct our analysis: we couldn’t compare Twitter data directly to Instagram data because one completely dwarfs the other in volume and completeness. However, simple tweaks like analyzing Instagram data separately solve this problem.
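A minimal sketch of analyzing sources separately, written in Python. It assumes each mention is a record with hypothetical `source` and `sentiment` fields; the field names and the sample counts are made up for illustration and are not an export format from any particular tool. Sentiment shares are computed within each source rather than across the pooled dataset, so a high-volume source doesn’t drown out a low-volume one.

```python
from collections import Counter, defaultdict


def sentiment_share_by_source(mentions):
    """Return {source: {sentiment: share}}, with shares computed within each source."""
    counts = defaultdict(Counter)
    for mention in mentions:
        counts[mention["source"]][mention["sentiment"]] += 1

    shares = {}
    for source, sentiment_counts in counts.items():
        total = sum(sentiment_counts.values())
        shares[source] = {s: n / total for s, n in sentiment_counts.items()}
    return shares


# Illustrative data only: 900 Twitter mentions and 10 Instagram mentions.
mentions = (
    [{"source": "twitter", "sentiment": "positive"}] * 600
    + [{"source": "twitter", "sentiment": "negative"}] * 300
    + [{"source": "instagram", "sentiment": "positive"}] * 2
    + [{"source": "instagram", "sentiment": "negative"}] * 8
)

shares = sentiment_share_by_source(mentions)
print(shares["twitter"]["negative"])    # negative share within Twitter only
print(shares["instagram"]["negative"])  # negative share within Instagram only
```

In this made-up dataset, pooling everything would put the negative share at roughly a third (308 of 910 mentions), which hides the fact that 8 of the 10 Instagram mentions are negative.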

If the data is primarily owned (i.e., created by a campus) from social media sites, it changes the nature of our reporting.

Social media sites like Facebook, Instagram, and LinkedIn limit the social data that’s collected from their sites. As of right now, Facebook and LinkedIn only allow data sharing for pages you own (e.g., if you’re an admin of a page and able to grant social listening software access to the page). Our analysts can collect nearly real-time data from owned Facebook, Instagram, and LinkedIn pages, and monitor engagement such as post volume and frequency, comments, likes, and follows. Owned data collected this way is typically used to measure, monitor, and strategize a campus’s social media approach.

Privacy is also top of mind for both social media sites and social listening software providers. As sites like Twitter identify deleted content, newly private user accounts, or deleted users, social listening software providers must also update the data they’ve collected to stay compliant and respect user privacy.

Adding the Non-Social Media Pieces of the Puzzle

That is—the rest of the internet! “Social listening” is a little bit of a misnomer—as a method, it covers so much more than just social media sites. Social listening software helps you collect non-social media mentions from sites like news, blogs, and forums, giving you a more robust picture of your campus’s reputation online than just social media sites.

Comparing social media data to non-social data is just one way to add value to social listening analysis. Analysts can answer questions like: does your news coverage match the coverage you see on social media or forums in sentiment, topics, or authors?

When data is collected from non-social media sites, social listening software relies on a strong database and consistently constructed web pages.

The database of non-social media data built by social listening software is only as good as the software’s crawlers.

What’s a crawler? Why does it crawl? Cloudflare has a great analogy.

A web crawler bot is like someone who goes through all the books in a disorganized library and puts together a card catalog so that anyone who visits the library can quickly and easily find the information they need. To help categorize and sort the library's books by topic, the organizer will read the title, summary, and some of the internal text of each book to figure out what it's about.

Social listening software uses proprietary or third-party crawlers to collect non-social media data to add to its database, where indexed information from each web page crawled is stored. Some third-party data providers specialize in certain site types, like news, blogs, or review sites. The number of relationships a social listening software provider has with quality third-party data providers can dramatically increase how much relevant data you’re able to collect with your social listening queries.
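To make the card catalog analogy concrete, here is a toy sketch in Python of indexing crawled pages. It assumes a hypothetical `pages` dictionary of already-fetched pages (the URLs and text are made up) and builds a simple word-to-URL lookup. Real providers’ crawlers and databases are proprietary and far more elaborate, so treat this only as an illustration of the idea.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose title or body contains it."""
    index = defaultdict(set)
    for url, page in pages.items():
        text = page["title"] + " " + page["body"]
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical crawled pages; a real crawler would fetch these over HTTP.
pages = {
    "https://news.example.com/a": {
        "title": "Campus opens new library",
        "body": "Students toured the library today.",
    },
    "https://blog.example.com/b": {
        "title": "Game day recap",
        "body": "The Badgers won at home.",
    },
}

index = build_index(pages)
print(sorted(index["library"]))  # URLs of pages that mention "library"
print(sorted(index["badgers"]))  # URLs of pages that mention "badgers"
```

A page that was never crawled never enters the index, which is why crawl coverage and frequency shape what a query can return.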

If you’re ever considering an upgrade to a larger social listening software provider, it’s important to ask about their database, number of crawlers, and frequency of crawling. How do they determine which sites to crawl more often? By site visitors, or some other metric? How large is their historical database of crawled web pages? How fast do they add content to their database?

Once an analyst understands how non-social media data is crawled and indexed, they’re able to understand how their dataset is affected and draw reasonable conclusions.

Collecting non-social media data requires a variety of operators and a good analyst.

Whether you’re searching natively on Twitter or using social listening software, there are a variety of Boolean operators you can use to specify what kind of non-social media data you want to collect.

Perhaps you’re just looking for a key term or phrase, like “wisconsin badgers.” Querying for the phrase “wisconsin badgers” returns all pages that have that phrase in the title, headers, body text, or comments. Good social listening software also lets the analyst use additional operators to search, for example, the URL and tags of certain page types. To find mentions that match in the text, URL, or tags of pages indexed by social listening software, our query would look like this:

“wisconsin badgers” OR url:(Wisconsin AND badgers) OR tags:(wisconsinbadgers)

Then, when a social listening software user queries the database, the web pages that match the query are returned.
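As a rough illustration of how a query like the one above could be evaluated against indexed pages, the Python sketch below hard-codes its three clauses. The page fields (`url`, `title`, `body`, `tags`), the sample pages, and the matching rules are assumptions for illustration, not any vendor’s actual query engine, and the text search here covers only the title and body.

```python
def matches_query(page):
    """Hand-coded equivalent of:
    "wisconsin badgers" OR url:(wisconsin AND badgers) OR tags:(wisconsinbadgers)
    """
    text = " ".join([page.get("title", ""), page.get("body", "")]).lower()
    url = page.get("url", "").lower()
    tags = {tag.lower() for tag in page.get("tags", [])}

    phrase_in_text = "wisconsin badgers" in text               # exact phrase
    both_in_url = "wisconsin" in url and "badgers" in url      # url:(... AND ...)
    tag_match = "wisconsinbadgers" in tags                     # tags:(...)
    return phrase_in_text or both_in_url or tag_match


# Hypothetical indexed pages.
sample_pages = [
    {"url": "https://news.example.com/sports/recap",
     "title": "Wisconsin Badgers win opener", "body": "A strong start.", "tags": []},
    {"url": "https://forum.example.com/wisconsin/badgers-thread",
     "title": "Game thread", "body": "Who is watching?", "tags": []},
    {"url": "https://blog.example.com/post/1",
     "title": "Saturday photos", "body": "Great crowd.", "tags": ["WisconsinBadgers"]},
    {"url": "https://blog.example.com/post/2",
     "title": "Badgers in Wisconsin wildlife",
     "body": "Badgers are common in Wisconsin.", "tags": ["wildlife"]},
]

results = [matches_query(page) for page in sample_pages]
print(results)
```

The last page mentions both words but never the exact phrase, and has no matching URL or tag, so it is the only one not returned. That is the kind of case where an analyst decides whether the query is too strict or doing its job.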

Putting the Puzzle Together

My team of analysts cares about how we collect our social listening data—a lot. We investigate coverage for our social listening software to inform each stage of our research process.

As we build our social listening query, we can select the correct operators based on the sites we’re focusing on for our project. We can also use hashtags, terms, and authors that are specific to certain sites. This helps us cast a wider net across the internet to collect more relevant data (depending on project goals).

When we analyze the data collected, knowing how the data got to us is critical. If there are fewer YouTube mentions than Tumblr mentions in our dataset, is that a result of site use? Or of crawling restrictions for social listening software? Or something specific to the topic of the dataset? Knowing the answer to those questions is the difference between two vastly different conclusions.

  • With more mentions collected from Tumblr, more people must be using that site, so our client should be active on Tumblr.
  • We’re collecting more mentions from Tumblr because of social listening restrictions on YouTube data. YouTube has over two billion logged-in user visits each month, and while our dataset may not accurately represent the scope of mentions for this topic on the site, we may be able to glean qualitative insight from the data we collected.

The second option is the more nuanced (and more accurate) conclusion, and it shows why knowing where your data comes from matters. That context is a key piece in your social listening puzzle.

This post originally appeared on Campus Sonar's Brain Waves blog.