Doing It by the Book: Data Segmentation, Validation, and Analysis

Y’all. Codebooks used in the research process seem insignificant, straightforward, unassuming ... and frankly ... boring. They’re technical. They’re documentation. They’re nitpicky. You split hairs in them. But guess what? They are, I argue, the most important, essential, and interesting element of a successful, rigorous, and adventuresome research process—especially with qualitative work (research dealing in words and not necessarily numbers). Wondering what in the world a codebook is? Read on.

Adventuring with Codebooks: An Apparent Oxymoron

Campus Sonar analysts (and other researchers too) create codebooks at the beginning of the research process, and they’re adjusted and consulted throughout research and data validation to reflect the overarching research goals or questions and answers. When we draft codebooks at Campus Sonar, we ask What questions are we chasing answers to? because we don’t want to finish an awesome project and realize we caught what we were chasing, but we chased the wrong thing all along! These codebooks serve as our steady companions, ever present and providing direction, as we explore and adventure through the boundless and expansive world of data in our search for meaning. 

In this post I’ll cover what codebooks are, how they’re made, why they matter, and how they’re used in various aspects of our research (with examples, of course). Gather round, my nerdlings, and feast your eyes on this post about one of the driest but most important aspects of research: codebooks for data segmentation, validation, and analysis.

It's dangerous to go alone!

Codebooks, as I’ve already gushed, are our item of choice ... once started and in action, it feels like the moment Link receives his sword for the first time in the Legend of Zelda games: “It’s dangerous to go alone! Take this.” With it, we can adventure through the story of our data with the security and safety of our trusty sword.

We wield this item throughout our entire process, relying on parts of the codebook to write our rules, validate data, and analyze the data, so getting it right matters.

De-coding Codebooks

According to SAGE, a codebook is “a list of codes with code definitions, allowing researchers to keep track of how codes are being used to make sense of data.”

At Campus Sonar, we include a few key aspects of data in each codebook.

  • Code: The category, tag, or label we apply to each mention. A mention is any piece of online data from any source, including original posts, shares/retweets, or comments/replies.
    • This may be a topic or theme or audience (e.g., athletics, doggos, or pie).
    • The labels can change if codes are combined or altered in some way. The list of codes may also grow if the analysis is inductive (codes emerge from the data) rather than deductive (codes are predetermined).
  • Definition: Encapsulates what the code means or includes. 
    • It’s crafted based on our knowledge of the client and client-specified needs. 
    • It provides boundaries for the rules we’ll write to segment this data.
    • It may change slightly during the validation process. 
    • It’s always reviewed internally and in some cases externally with the client.
  • Operationalization: Represents how an abstract topic may be measured ... or in our case, how the code may appear in online conversations. 
    • These are more specific examples of what’s included in the code, based on experience or a cursory look at the data.
    • This may significantly impact the terms and phrases used in the rules that segment the data for analysis.
    • It tends to expand or contract as data is cleaned.
    • It’s always reviewed internally and sometimes externally with the client.
  • Examples: We include examples as concrete manifestations of operationalizations that fall within the code definition.
    • They are very specific and expanded upon throughout the research process.
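If it helps to see those four pieces side by side, here’s a minimal sketch of a codebook entry as a data structure. To be clear, this is illustrative only; our codebooks are working documents, not software, and every name in this snippet is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CodebookEntry:
    """One entry in a codebook: a code plus everything needed to apply it consistently."""
    code: str        # the category, tag, or label applied to each mention
    definition: str  # what the code means or includes; bounds the rules we write
    operationalization: list[str] = field(default_factory=list)  # how the code may appear in online conversation
    examples: list[str] = field(default_factory=list)            # concrete mentions that fall within the definition

# A toy entry using one of the playful codes above.
doggos = CodebookEntry(
    code="doggos",
    definition="Mentions of dogs on or around campus.",
    operationalization=["therapy dogs", "campus pets", "puppy study breaks"],
    examples=["Therapy dogs in the library during finals week!"],
)
```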

But really, why does it matter?

It sounds like a lot of hullabaloo, doesn’t it? All this creating codebooks, definitions, operationalizations, and examples to ensure the proper application of codes. 

It can be, but codebooks are so important to the entire research process. They set us up for success in terms of reliability, validity, and, frankly, efficiency throughout the rule-writing, data validation, and analysis portions of our research.

We use codebooks for a few reasons.

  • Consistency! It’s a place for us to store key definitions that drive the entire analysis. If a code definition changes or we add to our operationalization and examples, we need a place to store that information. 
  • Focus! Codebooks help us focus our attention on the specified definitions. We can ignore the noise when we know exactly what we are looking for.
  • Documentation! This ensures replicability. We should be able to pass our codebooks and data to any person reasonably trained in social listening and have them achieve similar results to ours. 

But how does it all come together, you wonder? Let me walk you through it. We start with a code or category, which is based on a specific client question. For example, How are college access and success discussed in relation to Institution X? (This is a real example, by the way.)

The code or category in this example is access and success. From there, we work with the client to identify what access and success mean from their perspective, because it's a pretty broad theme. 

Based on the words and phrases they use and our own research, we create a definition: Conversations related to student access, as well as what resources and support are provided by the college to help students achieve academic and non-academic personal and professional success at school and beyond graduation. It’s still pretty broad, but it’s a net tightening around our core interests when it comes to access and success.

Operationalization comes next. This process is helped by discussions with the client, other survey work being done at the institution, and online research. We added more to the code, including references to admissions, financial aid, graduation, employment, and other enrichment opportunities such as student organizations, advising, volunteering, access to high-quality faculty and staff, and access to additional opportunities to engage with academic material (e.g., conferences and speakers).

Concrete examples emerged as we applied the definition and operationalization to the resulting dataset and identified core themes, such as mentions from or about career services, mentions about disability resources, or discussions about graduation.
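Pulling that walkthrough together, the access and success entry could be written out like so, reusing the hypothetical CodebookEntry sketch from earlier. The field contents are quoted from the example above; the representation itself is assumed.

```python
# The "access and success" code from this walkthrough, expressed with the
# hypothetical CodebookEntry structure sketched earlier.
access_and_success = CodebookEntry(
    code="access and success",
    definition=(
        "Conversations related to student access, as well as what resources and "
        "support are provided by the college to help students achieve academic and "
        "non-academic personal and professional success at school and beyond graduation."
    ),
    operationalization=[
        "admissions", "financial aid", "graduation", "employment",
        "student organizations", "advising", "volunteering",
        "access to high-quality faculty and staff",
        "engagement with academic material (conferences, speakers)",
    ],
    examples=[
        "mentions from or about career services",
        "mentions about disability resources",
        "discussions about graduation",
    ],
)
```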

As we go through and validate the segmented data, we make sure mentions belong in their appropriate categories. This usually isn’t too heavy of a lift because we wrote the rules with the definition and operationalization in mind. 

However, when we clean non-segmented data to make sure we didn’t miss anything, we rely heavily on the codebook. This is where definitions, operationalizations, and examples may change with the process. We look at each mention with an eye toward whether it fits the definition and our operationalization, relying on a method similar to the constant comparative method used in qualitative analysis. Boiled down to its most basic level, it’s exactly what it sounds like: constant comparison of incidents. As we see new trends or patterns and determine inclusion, we may make changes to the definition, operationalization, or examples. But we never lose sight of the ultimate research question from the client.
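Analysts make that comparison by hand, but the shape of the loop is simple enough to sketch. Assuming the hypothetical CodebookEntry from earlier, with the human judgment call stubbed out as a function, the logic looks roughly like this:

```python
def review_unsegmented(mentions, codebook, analyst_judges_fit):
    """Schematic of the constant comparative pass over non-segmented data.

    `analyst_judges_fit` stands in for the human judgment call: does this
    mention fit a code's definition and operationalization?
    """
    uncoded = []
    for mention in mentions:
        matched = False
        for entry in codebook:
            if analyst_judges_fit(mention, entry):
                entry.examples.append(mention)  # the entry's examples grow as we go
                matched = True
                break
        if not matched:
            # May signal a new trend: revisit the definition or operationalization.
            uncoded.append(mention)
    return uncoded
```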

We may use a similarly formal or less formal version of this process when doing qualitative analysis within the category-specific research. Since we’re inherently mixed methods, reporting on both numbers and words, we have to have a reliable method to investigate topics of conversation. While we may not always create codebooks, we use a similar constant comparative method to catalogue popular topics of conversation.

Please don’t make me do it. Don’t make me remind you how much value a human analyst brings to this process overall. 

Did you go read that? Okay, then let me demonstrate that value by showing how this process plays out in projects.

Tidying Our Data Warehouse

We Sonarians talk about data cleaning or data validation a lot. Like it’s our job. Oh wait. #DadJoke

For real though, what does it mean? Data cleaning or data validation is our way of ensuring that not only is the dataset free from significant amounts of irrelevant data, but the data we’re analyzing to answer specific questions actually speaks to that particular topic. Let’s break that down further.

A Shared and Transferable Understanding of "Relevance"

At the broadest level, Sonarians want to make sure that after we write a query and send those little crawlers out to all corners of the web, they’re retrieving relevant data for us to use. By relevant, we typically mean that the data returned is in fact about the institution in question and doesn’t contain spam. 

I’ve already told you about the important up-front work analysts do to minimize irrelevancies and maximize institutional relevance in data. But we also identify a handful of topics that may appear relevant to the institution, but are actually spammy. Without giving it all away, I can say that essay writing services are a perfect example of this. Experience has shown us what those irrelevant areas are, and we’ve written an impressive amount of Boolean to identify those types of mentions, suck them up, and boot them out of the dataset so they don’t dilute our analysis.

We take out the trash.

For example, it would be important for us to kick out mentions of university plazas when searching for Plaza University. 
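Our real rules are written in a social listening platform’s Boolean syntax, which I won’t reproduce here, but the logic translates to a few lines of code. In this sketch the patterns are invented stand-ins for the kinds of exclusions described above, not our actual rules.

```python
import re

# Invented stand-ins for the spammy topics described above.
SPAM_PATTERNS = [
    re.compile(r"essay\s+writing\s+service", re.IGNORECASE),
]
# Keep "Plaza University" while kicking out generic "university plaza" mentions.
NAME_CONFUSION = re.compile(r"university\s+plazas?", re.IGNORECASE)

def is_relevant(mention_text: str) -> bool:
    """Return True if a mention survives the trash-removal pass."""
    if any(p.search(mention_text) for p in SPAM_PATTERNS):
        return False
    if NAME_CONFUSION.search(mention_text) and "plaza university" not in mention_text.lower():
        return False
    return True

assert is_relevant("Plaza University announces new scholarships")
assert not is_relevant("Best essay writing service for Plaza University students!")
```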

When we’re doing our initial institutional-level relevancy cleaning routine, we aren’t referring to a codebook in the research sense of the term, but we do have a shared understanding (definition, operationalizations, and examples) of what makes a mention relevant to an institution versus not. We don’t waver from that definition across institutions or clients.

The Nitty Gritty of Bespoke Products

A big part of some of our more bespoke products that are hyper-focused on specific colleges or universities, such as targeted analyses, relies on Sonarians’ abilities to understand our clients and the project at hand so we can deliver a meaningful and value-packed report full of actionable insights.

We just talked about our universal definition of what relevance means in our higher education datasets. A similar universality doesn’t exist for bespoke products because they’re all unique to the client and situation. That means it’s essential for analysts to have a complete grasp of the research question the client wants answered, along with a significant understanding of the institution.

This is where the value of a solid codebook cannot be overstated.

We’ve been asked, in the past, to perform targeted analyses of what online conversation looked like for a particular institution around specific themes, some of which included business and innovation. Let’s take a closer look at innovation.

Innovation is the code.

Based on conversations with the client and online research, the following definition was piloted: The innovation category represents new and creative approaches.  

Within that definition, mentions of innovation, creativity, patents, sustainability, green practices, etc. were determined to be appropriate operationalizations.

Once in the data, we found conversations related to bio-design and materials science to be appropriate for inclusion based on the spirit of those topics relative to the category, definition, and operationalizations. These became specific examples that influenced our operationalization and definition. 

By the time the project was completed, a more specific definition had emerged: Innovative, scientific, creative, sustainable, or eco-friendly approach to design, production, development, marketing, and manufacturing related to Institution X.
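Traced through the hypothetical CodebookEntry sketch from earlier, the innovation entry’s evolution looks like this; the two definitions and the operationalizations are the ones quoted above.

```python
# The innovation code as piloted, using the hypothetical CodebookEntry structure.
innovation = CodebookEntry(
    code="innovation",
    definition="The innovation category represents new and creative approaches.",
    operationalization=["innovation", "creativity", "patents", "sustainability", "green practices"],
)

# Concrete examples found in the data fed back into the entry ...
innovation.examples += ["bio-design conversations", "materials science conversations"]

# ... and by project's end, the definition had tightened considerably.
innovation.definition = (
    "Innovative, scientific, creative, sustainable, or eco-friendly approach to "
    "design, production, development, marketing, and manufacturing related to Institution X."
)
```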

Organizing Thematic Analyses

Although analysts may not always create a formal codebook as described—it really depends on the client and the deliverable—a similar process occurs if we report on cursory thematic analyses that may be performed on your data. We’ll look for patterns or themes that emerge in online conversations, then we’ll group (or re-group) them using the constant comparative method.

For example, perhaps a cursory look through admissions forum conversations reveals discussion of some common topics. We may see conversations about housing, financial aid, textbooks, or dining. Separating these items out may not be the best way to communicate such findings to the client, especially if there are only a few mentions that are specifically about each of those items. Instead, an overarching theme (code) may be student logistics, which includes any aspect of the student experience excluding academics (definition), which may include housing, financial aid, textbooks, or dining (operationalization). 

Perhaps we continue going through mentions and find some about scholarships. We may decide to pull financial aid from the student logistics code and create a separate theme about finances (code), which includes financial aspects of the college experience (definition), which may include financial aid, scholarships, FAFSA, grants, work-study, etc. (operationalization).
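If you picture each theme as a code with its operationalization list, that re-grouping is just moving items between entries. A quick hypothetical sketch:

```python
# Hypothetical themes as simple code -> operationalization mappings.
themes = {
    "student logistics": ["housing", "financial aid", "textbooks", "dining"],
}

# Scholarship mentions surface, so finances gets pulled out as its own code.
themes["student logistics"].remove("financial aid")
themes["finances"] = ["financial aid", "scholarships", "FAFSA", "grants", "work-study"]
```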

Adventuring with Codebooks: Not So Oxymoronic

In the end, we avoid unpleasant surprises, such as getting to the end of a research project and realizing we didn’t answer the client’s question, by writing a solid codebook based on a deep understanding of our clients and questions that carries us through data segmentation, data validation, and analysis. We avoid writing rules that don’t capture what we need them to. We avoid miscategorizing mentions during data validation. We avoid seeing patterns that aren’t real and providing incorrect insights during analysis. We avoid getting lost in our adventure. 

All in all and without context, the concept of codebooks is pretty boring in and of itself. Codebooks are technical pieces of documentation in a research process driven by rigor. 

But in the dorkiest way possible, it’s how they function in our work that makes them thrilling. In the context of research and data, a codebook in and of itself is the outline of an adventure story. We create it and continue to build on that creation as we quest through the data, iteratively using the codebook to guide our every move. When used throughout the research process, that outline builds out into an exciting narrative full of twists, turns, and unexpected cliffhangers. That’s the kind of story I love to write.

Now, I’ve gotta run. Adventure is calling, and with my codebook in hand, I’m outta here.


Don't miss a single post from Campus Sonar—subscribe to our monthly newsletter to get social listening news delivered right to your inbox.

This post originally appeared on Campus Sonar's Brain Waves blog.