Redesigning an Unmoderated
Testing Program
at YouTube

The unmoderated testing program my team ran had outgrown its own infrastructure. I redesigned the workflow and data architecture without being asked.

Company
Google
Role
User Experience Researcher
Stakeholders
UXRs, Program Manager, Research Manager
Year
2021

I redesigned a distributed unmoderated testing program that tracked usability. This self-initiated effort involved building out a research pipeline and restructuring the program's data using dimensional modeling.

  • Our team regularly discussed how to label videos of unmoderated usability tests consistently. I came to realize that consistent labeling depended not just on shared judgment, but on a quality end-to-end process.

    I drafted a process redesign on my own, drawing on my growing familiarity with tools like UserTesting, Qualtrics, and Google Sheets, as well as the database design concepts I was learning in a SQL course at the time. My vision was to centralize logic in Qualtrics and standardize the data architecture.

  • The program measured usability for 40 user journeys across 5 device types. Test waves had to be carefully coordinated to satisfy screening criteria, task prerequisites, and a target sample size of 20. The pre-launch process was tedious and locked task assignment into a fixed sequence.

    This is where I saw the potential of Qualtrics. It unlocked randomization, even task assignment, constraint tracking, and user metadata storage (e.g., Premium vs. non-Premium, iOS vs. Android, device ownership). Together, these ensured that only relevant tasks were assigned, in a logical sequence that resembled the many ways people actually use YouTube.

    UserTesting was not forgotten. It offered a user panel and screen recording capabilities that the program relied on. Integration let each tool do what it did best.
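
    The assignment logic above can be sketched in plain Python. This is a minimal illustration, not the program's actual implementation (which lived in Qualtrics survey flow): the task names, screening traits, and prerequisite links here are hypothetical.

```python
import random

# Hypothetical task definitions: each task lists the screening traits
# it requires and an optional prerequisite task.
TASKS = {
    "search_video":   {"requires": set(),       "prereq": None},
    "cast_to_tv":     {"requires": {"owns_tv"}, "prereq": "search_video"},
    "download_video": {"requires": {"premium"}, "prereq": "search_video"},
}

TARGET_N = 20  # target sample size per task


def assign_tasks(participant_traits, completed_counts, rng=random):
    """Assign only relevant tasks, randomized, with prerequisites first
    and tasks at quota skipped."""
    eligible = [
        name for name, t in TASKS.items()
        if t["requires"] <= participant_traits          # constraints met
        and completed_counts.get(name, 0) < TARGET_N    # quota not yet full
    ]
    rng.shuffle(eligible)  # randomize order across participants

    ordered = []

    def place(name):
        # Insert each task after its prerequisite, if that prerequisite
        # is also being assigned to this participant.
        if name in ordered or name not in eligible:
            return
        prereq = TASKS[name]["prereq"]
        if prereq in eligible:
            place(prereq)
        ordered.append(name)

    for name in eligible:
        place(name)
    return ordered


# A Premium participant who owns a TV, in a wave where the
# download task has already hit its quota of 20:
plan = assign_tasks({"premium", "owns_tv"}, {"download_video": 20})
# plan is always ["search_video", "cast_to_tv"]
```

    Even with the shuffle, the prerequisite pass guarantees a logical sequence, which is the same property the Qualtrics survey flow enforced.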

  • Research participants were sourced and screened in UserTesting, then linked to Qualtrics for task assignment. I used URL parameters to pass the study ID and participant ID from one platform to the other. This gave the two datasets a shared key for joining later, while preserving the division of labor I had designed between the tools.
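
    The handoff works like this minimal sketch. The base URL and parameter names are placeholders, not the program's actual field names:

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical Qualtrics survey URL.
QUALTRICS_BASE = "https://example.qualtrics.com/jfe/form/SV_abc123"


def build_handoff_url(study_id: str, participant_id: str) -> str:
    """Embed the UserTesting study and participant IDs as URL
    parameters, giving both datasets a shared join key."""
    return QUALTRICS_BASE + "?" + urlencode(
        {"study_id": study_id, "participant_id": participant_id}
    )


def read_handoff_params(url: str) -> dict:
    """Recover the embedded IDs on the Qualtrics side."""
    qs = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in qs.items()}


url = build_handoff_url("wave-07", "p-1942")
# read_handoff_params(url) returns
# {"study_id": "wave-07", "participant_id": "p-1942"}
```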

  • The program had become a web of tangled spreadsheets. Many contributors combined new and imported columns, which led to heavy overlap and a blurred source of truth. For example, my peers and I created custom spreadsheets that pulled scoring-relevant columns into one place. We'd then port our scores over to an intimidating, multi-source spreadsheet where it was easy to overwrite data, break a formula, or otherwise mess things up.

    Some contributors copied and pasted columns, while others relied on vlookups. Some were acquainted with the setup, while others, like new team members, had to decipher it. The solution was to separate spreadsheets by entity, drawing lines based on grain: the launch of a data-collection wave was the coarsest grain, participant data a finer grain, and, since each participant was assigned multiple tasks, task scores the finest grain of all.

  • Wide formats were readable but not sustainable. I reshaped spreadsheets from wide to long to create stable structures. Long formats meant that new waves of testing added rows, not columns. The org-wide dashboard that consumed some of this data was already struggling with multiple, poorly managed sources. I couldn't fix that, but I could make the data feeding into it cleaner.
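
    The wide-to-long reshape is the standard "melt" operation. A minimal pure-Python version (column names and scores are hypothetical):

```python
# Wide shape: each new wave forces a new column.
wide = [
    {"task": "search_video", "wave1_score": 4, "wave2_score": 5},
    {"task": "cast_to_tv",   "wave1_score": 2, "wave2_score": 3},
]


def melt(rows, id_col, var_name, value_name):
    """Reshape wide rows into long rows: one measurement per row."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col == id_col:
                continue
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: val}
            )
    return long_rows


long_table = melt(wide, "task", "wave", "score")
# A third wave now appends rows to long_table instead of
# adding a column to every formula-laden wide sheet.
```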

  • Separating data by entity addressed longstanding issues, but it created a new one: team members still needed views that combined fields across sheets. More specifically, the program manager used views of score aggregations, researchers still needed custom workspaces to rate video sessions, and others pulled ad hoc views when the dashboard wasn't enough.

    I defined a system of primary and foreign keys so sheets could be joined without duplicating raw data. I wrote VLOOKUP formulas that referenced column names, so team members could move columns without breaking anything. And I imported external spreadsheets so that changes to shared definitions, like scoring guidelines and task success criteria, propagated automatically to every sheet referencing them.

    I left the team with a short set of maintenance rules: "keep raw and calculated data separate", "don't rename columns", "use this formula to build your configured view."
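
    The join pattern behind those configured views can be sketched as follows. This is an illustrative model of the Sheets formulas, not their actual syntax, and the entity names are hypothetical:

```python
# Primary-keyed participant data; raw data stays in its own sheet.
participants = {
    "p-01": {"device": "iOS", "segment": "Premium"},
}

# Task scores carry participant_id as a foreign key.
task_scores = [
    {"participant_id": "p-01", "task": "search_video", "score": 4},
]


def configured_view(columns):
    """Build a combined view selected by column *name*, so reordering
    columns in the source sheets cannot break the output, which is the
    same property the name-based VLOOKUP formulas provided."""
    view = []
    for row in task_scores:
        joined = {**row, **participants[row["participant_id"]]}
        view.append({col: joined[col] for col in columns})
    return view


# A researcher's rating workspace might request:
# configured_view(["task", "score", "device"])
```

    Raw rows stay in their entity sheets; the view is computed at read time, so nothing is copied and nothing drifts out of sync.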