ML Platform Benchmarking at Google
ML teams at Google had no baseline for platform effectiveness. UX teams had no repeatable system for producing one. I built a survey instrument and analysis pipeline that did.
I owned survey programming end to end, led the analysis, and partnered with a lead UXR to build a quarterly benchmarking program. I handed off a modular codebase with documentation to support launches after I left.
-
I inherited a survey program that was still being defined. The first launch had revealed disagreement about what to measure, and by the time I joined to build a revised version, stakeholders were still negotiating scope, expanding the program beyond the original team brief to align with a Google-wide OKR. Alignment didn't conclude before launch. It continued through survey programming and into the analysis phase, with new cuts, filters, and metrics requested as the data came in.
The technical decisions I made throughout — JavaScript for survey logic, hierarchical dataframes, modular Python files — were all responses to the same condition: a program that needed to absorb change without being rebuilt from scratch each time.
-
The initially planned analysis was simple: differences between task ratings. But large differences in task frequency complicated things. Exploration became necessary, and the analyses multiplied. I adopted hierarchical dataframes for the flexibility a flat structure couldn't give. When I noticed the lead UXR doing her own ad hoc analysis on spreadsheet exports, I started sharing exports proactively.
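A minimal sketch of what a hierarchical structure like this can look like in pandas. The platform, task, and metric names are hypothetical; the real schema isn't shown in this write-up:

```python
import pandas as pd

# Hypothetical ratings data (platform, task, and metric names are illustrative).
records = [
    ("PlatformA", "train_model", "effectiveness", 5),
    ("PlatformA", "train_model", "effectiveness", 4),
    ("PlatformA", "tune_params", "effectiveness", 3),
    ("PlatformB", "train_model", "effectiveness", 2),
    ("PlatformB", "tune_params", "effectiveness", 4),
]
df = pd.DataFrame(records, columns=["platform", "task", "metric", "rating"])

# A hierarchical (MultiIndex) frame: new cuts become .loc selections
# and groupby levels instead of bespoke filtering code each time.
hdf = df.set_index(["platform", "task", "metric"]).sort_index()

# Slice one platform across all tasks:
platform_a = hdf.loc["PlatformA"]

# Aggregate at any level without reshaping:
means = hdf.groupby(level=["platform", "task"])["rating"].mean()
```

The point of the structure is that a late-arriving request ("cut this by task, but only for one platform") becomes an index operation rather than a rewrite.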
Toward the end, knowing the program needed to repeat without me, I refactored the Colab notebook into modular Python files so the next person wouldn't walk into a sea of code.
-
The survey logic was too interconnected for Qualtrics' native features to handle cleanly. Participants moved through a branching sequence of theme selection, platform filtering, and task exclusions that fed into one another across questions. Tracking that in multiple places while the program kept evolving was a risk I didn't want to take. JavaScript let me define the logic once, update it in one place, and build for future rounds rather than just the current one. The Qualtrics API's outdated documentation cost me a few hours of trial and error.
-
When the survey response counts came back lower than expected, I knew a t-test would be risky for significance testing, given its normality assumptions. I welcomed this bump in the road, though, because I had recently been introduced to the bootstrap method: it makes no distributional assumptions and copes better with small samples.
Later, when the lead UXR needed to present the findings to stakeholders, I generated bootstrap histograms to help her explain the method. She hadn't used it before, and the visual context made the difference.
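A sketch of what a percentile bootstrap for a difference in mean ratings can look like. The rating values and function name are illustrative, not the project's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means of two samples.

    Resamples each group with replacement and takes percentiles of the
    resampled differences -- no normality assumption required.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical task ratings for two platforms (illustrative numbers).
a = [5, 4, 4, 5, 3, 4]
b = [2, 3, 2, 3, 2]
lo, hi = bootstrap_diff_ci(a, b)
# If the interval excludes 0, the difference is significant at that alpha.
significant = not (lo <= 0 <= hi)
```

The `diffs` array is also what a histogram for stakeholders would be drawn from: the spread of resampled differences is the method, made visible.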
-
At first I used a fixed 0.05 alpha level. The bootstrap analysis produced more significant results than I expected. Time to check my work. I hadn't accounted for multiple comparisons. A Bonferroni correction fixed it.
The second problem was more subtle. We'd aligned on a minimum response threshold per comparison group to avoid wild west analysis, which meant comparison counts varied by platform. I'd been correcting for total possible comparisons rather than actual ones. When I caught it, I wrote a function to calculate the alpha dynamically, which also handled last-minute response additions without manual recalculation each time.
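A sketch of what that dynamic correction can look like. The threshold value, group names, and pairwise-comparison structure are assumptions; the write-up doesn't show the real ones:

```python
from typing import Mapping

def corrected_alpha(group_counts: Mapping[str, int],
                    min_n: int = 10,
                    base_alpha: float = 0.05) -> float:
    """Bonferroni alpha based on the comparisons actually run.

    Groups below the minimum response threshold are excluded, so the
    number of valid pairwise comparisons (and hence the correction)
    updates as responses come in -- no manual recalculation.
    """
    valid = [g for g, n in group_counts.items() if n >= min_n]
    k = len(valid) * (len(valid) - 1) // 2  # pairwise comparisons
    if k == 0:
        return base_alpha  # nothing to correct for
    return base_alpha / k

# Hypothetical platform response counts (illustrative).
counts = {"PlatformA": 24, "PlatformB": 15, "PlatformC": 7, "PlatformD": 12}
alpha = corrected_alpha(counts)  # PlatformC drops out: 3 groups, 3 comparisons
```

Dividing by the actual comparison count rather than the total possible one is the fix described above: fewer comparisons means a less punishing alpha, but an honest one.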
-
At times it felt like I was holding the line on what we could credibly claim. Stakeholders kept requesting data drills: significance testing at levels whose sample sizes couldn't support it, or significance incorrectly inferred from margins of error. Sometimes a well-explained limitation was enough. Other times, the best thing I could do was document caveats.
-
Stakeholders needed a weighted effectiveness score per platform. How to weight it was decided in rooms I couldn't join. I shared analysis to inform those discussions anyway: comparing candidate dimensions by identifying the highest-signal tasks per metric, to understand whether the dimensions captured distinct or overlapping information.
-
There was no peer review on this project. I designed around that quietly by building verification into the analysis as I went. Some checks were preventative: validating that n thresholds were honored, confirming delta signs were consistent with significance outcomes, making sure visual and tabular results told the same story. Others caught real errors: pulling from the wrong spreadsheet, averaging averages instead of returning to raw data, handling NaN values downstream when they should have been addressed upstream. The Bonferroni correction came from a similar instinct.
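A sketch of what checks like these can look like as assertions inside the analysis code. The column names and results table are hypothetical stand-ins for the project's actual outputs:

```python
import pandas as pd

# Hypothetical results table: one row per comparison (names illustrative).
results = pd.DataFrame({
    "group_a_n":   [24, 15, 12],
    "group_b_n":   [15, 12, 24],
    "delta":       [1.8, 0.2, -1.1],    # mean(a) - mean(b)
    "ci_low":      [0.9, -0.5, -1.9],
    "ci_high":     [2.6, 0.9, -0.4],
    "significant": [True, False, True],
})

MIN_N = 10

# Threshold check: every comparison honored the minimum response count.
assert (results[["group_a_n", "group_b_n"]] >= MIN_N).all().all()

# Sign check: a significant result's interval must not straddle zero,
# and each delta must fall inside its own interval.
excludes_zero = ~((results["ci_low"] <= 0) & (results["ci_high"] >= 0))
assert (results["significant"] == excludes_zero).all()
assert results["delta"].between(results["ci_low"], results["ci_high"]).all()

# Upstream NaN check: fail loudly rather than let NaNs propagate downstream.
assert not results.isna().any().any()
```

Checks like these run on every re-execution of the notebook, which is what makes them a partial substitute for a second pair of eyes.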
Toward the end I raised the peer review gap as a consideration for future round timelines. It wasn't just a personal workaround. It was a program design issue.