Query Me This
To Campus Sonar and our clients, data is life. Campus Sonar analysts eat, sleep, and breathe data because it’s the cornerstone of the work we do—the actionable insights our strategist suggests are grounded in tidbits of information that analysts ferret out of the most obscure areas of the interwebz.
The Importance of Operators and Humans in Maximizing Volume and Minimizing Irrelevances in Data
With great power comes great responsibility. Campus Sonar’s duty to our clients is tied up in analysts’ ability to cast a wide-but-not-too-wide-and-specific-but-not-too-specific net to capture as many (if not all) relevant mentions (data) pertaining to a client and/or their areas of interest. This net is our query—our call for information from a social listening database or software. It defines the scope of the conversations we collect through social listening.
It’s as hefty as it sounds, but analysts can figure out early on if they wrote a good query or not. Figuring that out is one part technical and procedural, one part knowledge and past experience, and one part emotion and intuition. Oh, to be human working with software!
A big part of determining if a query is the best it can possibly be early on means checking our Boolean in our software's query editor before we dispatch those crawlers out to retrieve our data (think: test run). Then it's writing the query and sending the crawlers on their merry way. For us, volume and relevance are two key measures of how well a query and subsequent rules are written.
- Low volume may indicate that there isn’t a lot on the topic (which, honestly, may be the case for a small school with little online presence or an incredibly niche topic such as corgis with tails named Sugar in Madison, Wisconsin). Queries with low volumes of mentions mean that findings are less likely to be generalizable and/or that additional data points would help build a stronger case. Don’t get us wrong, we can still work with lower volumes of data, but analysts will always call for moar data!
- High volumes of irrelevant mentions signal one of two things: (1) we misunderstand our target, or (2) we didn’t write enough context to really home in on our topic. Irrelevant mentions are pretty problematic because they cloud the view of the institution or the topic at hand.
In this post, we'll demonstrate how weak query writing (such as using restricted operators or relying on software and not manually validating data) affects the volume and relevance of data captured. I’ll highlight the differences in volume, relevance, and other key metrics when searching for the same topic using basic operators (e.g., AND, OR, NOT) versus complex operators, and by leveraging human data validation versus no human interference in the final dataset.
Operators and Humans: Volume and Relevance
For each of the following examples, we’ll work through three query approaches a person could use to gather a high volume of relevant data, and we’ll compare the approaches to see which one is best and why. First, the three query approaches.
- Approach 1: Key Terms Only: This approach searches for the topic using only the basic operators of AND, OR, and NOT, and assumes we’re not able to search for compound phrases. These operators are almost universally available on platforms themselves or in most software options.
- Approach 2: Key Terms and Phrases: We’re getting fancy now! This approach assumes we can search for compound phrases and use the AND, OR, and NOT operators. A compound phrase would be something like “Yukon potatoes” instead of Yukon AND Potatoes. Quotation marks are typically available in most software packages and are used to link multiple words together.
- Approach 3: Key Terms, Phrases, and Other Digital Properties: This approach includes everything but the kitchen sink. We write a segment of Boolean that harnesses a series of over 40 operators available within our social listening software.
Next, we’ll compare the volume of irrelevant versus relevant mentions for each example. We’ll indicate the value of manual data validation, completed by a human. So for each approach, we provide data on how much was actually relevant after data cleaning.
Meta Example (Campus Sonar)
The first example simply introduces the basics before we move to a more specific higher education example. Let’s look at a topic near and dear to us: Campus Sonar. “Campus Sonar” is a pretty specific phrase: it’s a unique company name, so pulling out relevant mentions should be pretty easy, right?
Approach 1: Key Terms Only
Let’s search for mentions of Campus Sonar using the basic operators AND, OR, and NOT. We’ll use AND to separate Campus and Sonar, because using basic operators means we have to provide a link to hook Campus to Sonar: that comes in the form of AND—we want both words to occur in the same mention.
Using this approach, the following word combinations could be captured.
- The word campus occurring immediately before the word sonar
- The word sonar occurring immediately before the word campus
- The words campus and sonar appearing anywhere within the mention, no matter the positioning
In our social listening software, between June 1, 2020 and June 30, 2020, a search for Campus AND Sonar results in 75 mentions … but only four of those mentions (5.3 percent—yikes!) were relevant after manual data validation. Why? Because using the AND operator pulled in any mention that included both campus and sonar anywhere in the text—they didn’t necessarily have to be right next to one another.
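To see why AND is so permissive, here is a minimal Python sketch of AND-style matching. It is only an illustration of the logic, not how any particular social listening tool is implemented; the function names and the sample mentions are made up for this example.

```python
import re

def tokens(text):
    """Lowercase a mention and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def matches_all(text, terms):
    """Approach 1 (AND): every term must appear somewhere in the mention."""
    words = tokens(text)
    return all(term.lower() in words for term in terms)

# Invented mentions for illustration only; they are not from the real dataset.
mentions = [
    "Great webinar from Campus Sonar on social listening!",
    "The navy tested new sonar equipment near the campus pier.",
    "Sonar class moved across campus this week.",
    "Campus tours resume in July.",
]

hits = [m for m in mentions if matches_all(m, ["campus", "sonar"])]
print(len(hits))  # 3
```

Three of the four sample mentions match, but only the first is about the company. Word order and distance never enter the check, which is how irrelevant mentions slip in.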
Approach 2: Key Terms and Phrases
Let’s assume you can use the all-encompassing quotation operator. Searching for “Campus Sonar” pulls in fewer mentions, only four, but all four are relevant (100 percent).
While the initial volume is smaller, all of the mentions are relevant because the quotation operator pulled in only mentions where campus occurred right next to sonar. As we mentioned, “Campus Sonar” is a unique company name so any time those two words appear next to each other, it’s likely they’re referring to Campus Sonar, the company.
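A phrase search can be sketched in Python as an adjacency check on the word sequence. This is a simplified illustration under our own assumptions (punctuation between words is ignored, and the function name and sample mentions are hypothetical); real tools may tokenize differently.

```python
import re

def matches_phrase(text, phrase):
    """Approach 2 (quotation operator): phrase words must be adjacent and in order."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

# Invented mentions for illustration only.
print(matches_phrase("Great webinar from Campus Sonar on social listening!", "Campus Sonar"))       # True
print(matches_phrase("The navy tested new sonar equipment near the campus pier.", "Campus Sonar"))  # False
```

The second mention contains both words, so an AND search would capture it, but the words are neither adjacent nor in order, so the phrase search leaves it out.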
Approach 3: Key Terms, Phrases, and Other Digital Properties
What would a Campus Sonar analyst write? Hmmm... “Campus Sonar” OR title:(“Campus Sonar”) OR author:(CampusSonar) OR @CampusSonar OR links:(campus AND sonar) OR url:(campus AND sonar) OR #CampusSonar.
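Long queries like this are easier to review when built from parts. The Python sketch below assembles the same string from individual clauses. The helpers any_of and field are hypothetical names of our own, and the operator syntax is copied from the query above, so check it against your own tool before reusing it.

```python
def any_of(*clauses):
    """Join clauses with OR, so a mention matching any one clause is captured."""
    return " OR ".join(clauses)

def field(name, expr):
    """Wrap an expression in a field operator such as title:(...)."""
    return f"{name}:({expr})"

brand = '"Campus Sonar"'
query = any_of(
    brand,                               # exact phrase in the mention text
    field("title", brand),               # phrase in the page or post title
    field("author", "CampusSonar"),      # content posted by the account
    "@CampusSonar",                      # mentions of the handle
    field("links", "campus AND sonar"),  # both words in a shared link
    field("url", "campus AND sonar"),    # both words in the mention's URL
    "#CampusSonar",                      # the hashtag
)
print(query)
```

Each clause is a different place where the brand can show up, which is why the combined query captures so much more than the phrase alone.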
This approach surfaces 200 mentions, and guess what? Those mentions come in at 100 percent relevance.
The whole hog approach pulled in 50 times as many relevant mentions as either of the first two approaches (200 versus 4). Wow!!!
Note that 100 percent relevancy is not always promised with this approach. By including more terms and properties, we’ll likely pull in more mentions, but they may not be 100 percent relevant. This is why we clean data!
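For anyone who wants to check the math, the comparison for the Campus Sonar example can be reproduced in a few lines of Python using the figures reported above.

```python
# Figures from the Campus Sonar example: (captured, relevant after validation)
results = {
    "Approach 1": (75, 4),
    "Approach 2": (4, 4),
    "Approach 3": (200, 200),
}

for name, (captured, relevant) in results.items():
    print(f"{name}: {relevant}/{captured} relevant = {relevant / captured:.1%}")

# Relevant mentions from Approach 3 compared with Approach 2
gain = results["Approach 3"][1] / results["Approach 2"][1]
print(gain)  # 50.0
```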
Sentiment: Comparing the Three Approaches
It’s clear that different approaches pull in differing mention volumes and different volumes of relevant mentions. Let’s look at how advanced operators and data validation affect the measure of sentiment in our example.
Error-prone Approach 1 with no data validation grossly underestimated positive sentiment (11 percent) compared to Approach 3, which suggested a higher positive sentiment (30 percent) within more relevant mentions and a higher volume of data that is likely telling more of the story. Approach 2, with a smaller volume of data, but all relevant results, overestimated positive sentiment (50 percent) compared to Approach 3 (30 percent).
When you use different approaches you may not realize you don’t have the whole story.
Dragon Day at Cornell
Like the Primal Scream at Harvard, the Krispy Kreme Challenge at North Carolina State University, or Dooley Day at Emory University, Dragon Day at Cornell joins an intriguing list of campus traditions beloved by most campus inhabitants. A form of bonding, these traditions become pillars of brand recognition for the student body.
Dragon Day is a century-old Cornell tradition that pits a dragon constructed by first-year architecture students against a phoenix born of the rival engineering students. Occurring every March, the event begins as the dragon parades across campus and ends with the battle in the Arts Quad.
Let’s use our three approaches to identify online mentions about Dragon Day at Cornell between March 17, 2019 and April 14, 2019. We picked those dates because the event was actually held on March 29, 2019.
Approach 1: Key Terms
Let’s try our basic search: Dragon AND Day AND Cornell.
This search initially results in 217 mentions from 113 authors, but after validation, we’re left with 59 relevant mentions (27.2 percent) from 41 authors. The relevance is low because any time dragon, day, and Cornell appeared in the same mention, it was pulled into our dataset. Those mentions don’t always represent full coverage of the boisterous event online.
Approach 2: Key Terms and Phrases
Let’s try “Dragon Day” AND Cornell using our quotations and the AND operator.
This gives us 57 mentions, and guess what? All 57 are relevant (100 percent)! That’s because “Dragon Day” had to appear with Cornell in the same mention—given that “Dragon Day” is pretty unique when strung together and with the additional context of requiring Cornell to be present, we’re more likely to get relevant mentions.
Bonus points if you noticed we had fewer relevant mentions in this approach (57) versus Approach 1 (59). If you’re wondering, that’s likely because someone referred to the Dragon Day event at Cornell, but didn’t link them together (“Dragon Day”). For example, if someone tweeted about “Day of the Dragon at Cornell!” Approach 1 would’ve picked it up, but not Approach 2.
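The “Day of the Dragon” case can be checked with a small Python sketch that runs both styles of matching on the same tweet. The helper names are hypothetical and the matching is simplified (lowercased words, punctuation ignored), so treat it as an illustration of the logic rather than the behavior of any specific tool.

```python
import re

def words_of(text):
    """Lowercase a mention and return its words in order."""
    return re.findall(r"[a-z0-9']+", text.lower())

def has_all(text, terms):
    """AND: every term appears somewhere in the mention."""
    present = set(words_of(text))
    return all(t.lower() in present for t in terms)

def has_phrase(text, phrase):
    """Quotation operator: phrase words appear adjacent and in order."""
    words, target = words_of(text), phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

tweet = "Day of the Dragon at Cornell!"
approach_1 = has_all(tweet, ["Dragon", "Day", "Cornell"])
approach_2 = has_phrase(tweet, "Dragon Day") and has_all(tweet, ["Cornell"])
print(approach_1, approach_2)  # True False
```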
Approach 3: Key Terms, Phrases, and Other Digital Properties
Okay. Here’s what I just know a Campus Sonar analyst would do with unfettered access to advanced operators. (“Dragon Day” AND Cornell) OR title:(“Dragon Day” AND Cornell) OR (“Dragon Day” AND author:(Cornell)) OR (“Dragon Day” AND @Cornell) OR (#Cornell AND “Dragon Day”) OR links:(Dragon AND Day AND Cornell) OR url:(Dragon AND Day AND Cornell) OR #DragonDay OR #DragonDay2019.
This results in a staggering 275 mentions from 135 authors, with 265 relevant mentions (96.4 percent). The number of relevant mentions from Approach 3 is 4.6 times the number of relevant mentions from Approach 2!
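The relevance rates and the 4.6 multiple for Dragon Day follow directly from the reported counts, as this short Python check shows. The function name is our own.

```python
# Figures from the Dragon Day example: (captured, relevant after validation)
dragon_day = {
    "Approach 1": (217, 59),
    "Approach 2": (57, 57),
    "Approach 3": (275, 265),
}

def relevance(captured, relevant):
    """Share of captured mentions that survived manual validation."""
    return relevant / captured

for name, (captured, relevant) in dragon_day.items():
    print(f"{name}: {relevance(captured, relevant):.1%}")

# Relevant mentions from Approach 3 compared with Approach 2
ratio = dragon_day["Approach 3"][1] / dragon_day["Approach 2"][1]
print(f"{ratio:.1f}x")  # 4.6x
```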
A boon to being a human and running this query: The #DragonDay2019 hashtag was added after some research into how individuals talked about the event on Twitter. This is the kind of research Campus Sonar analysts perform for clients.
Again, bonus points if you’re wondering about those mentions Approach 2 missed, but Approach 1 identified, and why I didn’t account for those in Approach 3. For the volume of irrelevant data that approach (dragon AND day) brought in, the payoff seemed low. We traded those few mentions for efficiency and a cleaner dataset. This tradeoff determination is again uniquely human and based on experience, intuition, and the knowledge from testing Approach 1.
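A rough way to put numbers on that tradeoff uses the counts already reported for Approaches 1 and 2. The Python sketch below assumes every Approach 2 match is also an Approach 1 match, so the difference in relevant mentions is what the phrase search missed. That assumption follows from the query logic, but a real tool could behave differently.

```python
captured_1, relevant_1 = 217, 59  # Approach 1: Dragon AND Day AND Cornell
relevant_2 = 57                   # Approach 2: "Dragon Day" AND Cornell

irrelevant_1 = captured_1 - relevant_1   # mentions a human would have to discard
extra_relevant = relevant_1 - relevant_2 # relevant mentions only Approach 1 found

print(irrelevant_1)                   # 158
print(extra_relevant)                 # 2
print(irrelevant_1 / extra_relevant)  # 79.0
```

On these figures, keeping the loose clause would mean cleaning out 79 irrelevant mentions for each extra relevant one.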
Content Sources: Comparing the Approaches
We already looked at sentiment for Campus Sonar … let’s look at content sources for Dragon Day to see how they stack up across approaches, both with and without data validation. “Content source” refers to the type of site where a mention surfaced online.
Approach 1 pulled in a moderate amount of data with little relevance and paints a deceiving picture of where content about Dragon Day surfaces. It suggests that most content surfaces in the news (38 percent) and on blogs (23 percent). This is very different from Approach 2, which suggests that most of the Dragon Day content is on Twitter (72 percent) and the news (14 percent). Both approaches underestimate how much content actually emerges on Twitter compared to Approach 3, where 94 percent of content is on Twitter.
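The way irrelevant mentions distort a content source breakdown can be sketched in Python with a toy dataset. The records below are invented purely for illustration and are not the Dragon Day data. Each one pairs a content source with whether a human marked the mention relevant.

```python
from collections import Counter

# Invented records for illustration: (content source, marked relevant by a human?)
mentions = [
    ("news", False), ("news", False), ("news", True),
    ("blog", False), ("blog", False),
    ("twitter", True), ("twitter", True), ("twitter", True),
]

def source_share(records):
    """Return each content source's share of the given mentions."""
    counts = Counter(source for source, _ in records)
    total = sum(counts.values())
    return {source: count / total for source, count in counts.items()}

raw = source_share(mentions)                         # before validation
clean = source_share([m for m in mentions if m[1]])  # after validation

print(raw["twitter"])    # 0.375
print(clean["twitter"])  # 0.75
```

In this toy case, Twitter's share doubles once the irrelevant news and blog mentions are removed, the same kind of shift described above.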
These different data views show that a marketing or communications team using one of the first two approaches could be led in the wrong direction if they use social listening to better understand online conversations about a particular topic.
So, So Nerdy—Now What?
There are a few gold nuggets that are key takeaways for you.
- Test your query first! Don’t assume anything. Lessons learned from early iterations of a query will inform later decisions for its final construction. That perfected query results in a more voluminous and relevant dataset, which in turn means more accurate insights to guide action!
- Advanced operators are paramount to increasing data volume and data relevance. Again, more volume and more relevance mean revealing a more complete vision of online conversation about a topic, and that means better insights that hopefully drive better outcomes!
- Manual data validation increases data relevance. When conducting analysis and generating insights, you don’t want the truth of online conversation clouded by irrelevant mentions. As we saw, that can have a significant impact on what you believe is going on online. Cleaning takes time, but it pays dividends.
- Naturally, more relevant data gives a more accurate representation of the topic at hand. This means resulting actions and insights are data-driven!
In essence: to better understand any topic, harness the power of good software and even better humans. A good social listening tool will offer a variety of operators in order to find matching online mentions using a variety of digital properties, and a human analyst offers expertise in using those operators, research, coding, and analysis.
The importance of good software and even better humans is clear. We believe software alone isn't enough for a strong, successful social listening program in higher ed. But our clients can tell you even better than we can.