Ecommerce Behavior Analytics

A Kaggle dataset with 42 million ecommerce events. I built a dbt pipeline to transform them into sessionization, funnel conversion, and cohort retention models.

Company
Independent
Role
Analytics Engineer
Dataset
42M ecommerce events
Year
2026

I independently built a dbt pipeline against a Kaggle ecommerce dataset, modeling 42 million events across staging, intermediate, and mart layers. The project produces sessionization, funnel conversion, and cohort retention models, with every decision documented in the project README.

  • My plan was to build locally with PostgreSQL for fast iteration, then migrate to BigQuery to handle the data volume once the models were stable. But when I tried to materialize the mart layer with dbt, I hit memory limits. The migration needed to happen earlier.

    The migration involved installing the dbt-bigquery adapter, configuring Google Cloud Platform, and updating my profiles.yml and sources.yml files. On GCP, I needed to create a project, enable the BigQuery API, configure a service account, and load the data via Google Cloud Storage.
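
    A minimal profiles.yml target for the dbt-bigquery adapter looks roughly like this; the profile name, project ID, dataset, and keyfile path below are placeholders, not my actual configuration:

    ```yaml
    ecommerce_behavior:            # profile name referenced in dbt_project.yml
      target: bigquery
      outputs:
        bigquery:
          type: bigquery
          method: service-account
          project: my-gcp-project        # placeholder GCP project ID
          dataset: ecommerce_analytics   # placeholder BigQuery dataset
          keyfile: /path/to/service-account.json
          threads: 4
          location: US
    ```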

    The first dbt run against BigQuery surfaced three syntax errors: interval arithmetic, function argument order, and cast syntax, all differences between the PostgreSQL and BigQuery dialects. None was conceptually difficult, but each required identifying the BigQuery equivalent before proceeding.
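
    For illustration, the three dialect differences looked roughly like this; the expressions are representative, not my exact model code:

    ```sql
    -- Interval arithmetic: PostgreSQL
    event_time - interval '30 minutes'
    -- BigQuery equivalent
    timestamp_sub(event_time, interval 30 minute)

    -- Argument order: BigQuery's timestamp_diff takes (later, earlier, unit)
    timestamp_diff(next_event_time, event_time, minute)

    -- Cast syntax: PostgreSQL shorthand
    event_time::timestamp
    -- BigQuery standard SQL
    cast(event_time as timestamp)
    ```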

  • The goal was to build reusable models. Staging needed to remain the cleaning layer: dropping columns, casting types, removing structural noise. Sessionization encodes business logic: the time gap that defines where one session ends and the next begins. I chose the standard 30-minute threshold, a choice that could change based on the time-gap distribution, the business question behind the analysis, or the user segment being targeted. By defining session IDs at the intermediate layer, changes propagate to downstream models without reaching back into staging.
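
    The sessionization logic can be sketched as an intermediate dbt model. This assumes a staging model named stg_events with user_id and event_time columns; the names are illustrative, not taken from the project:

    ```sql
    -- int_sessions.sql: assign session IDs using a 30-minute inactivity gap
    with ordered as (
        select
            user_id,
            event_time,
            lag(event_time) over (
                partition by user_id order by event_time
            ) as prev_event_time
        from {{ ref('stg_events') }}
    ),

    flagged as (
        select
            *,
            -- a gap longer than 30 minutes starts a new session
            case
                when prev_event_time is null
                  or timestamp_diff(event_time, prev_event_time, minute) > 30
                then 1 else 0
            end as is_new_session
        from ordered
    )

    select
        *,
        -- running count of session starts yields a per-user session number
        concat(
            cast(user_id as string), '-',
            cast(sum(is_new_session) over (
                partition by user_id order by event_time
            ) as string)
        ) as session_id
    from flagged
    ```

    Because the threshold appears in exactly one place, revisiting the 30-minute choice means editing one model and letting dbt rebuild everything downstream.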

  • When I began analysis, I wrote nested SQL queries to build up to the conversion rate calculation. Translating that code to dbt meant identifying the line between the intermediate and mart layers. The innermost query pivots the data to capture each user's first timestamp per funnel step. Any downstream model that needs funnel entry data can reference int_funnel without reimplementing the pivot. The rest of the logic, which counts users per step, enforces step order, and computes the conversion rate, moved downstream to the mart layer. Intermediate models hold building blocks. Marts hold answers.
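
    The pivot at the heart of int_funnel might look like the following, assuming event_type values of 'view', 'cart', and 'purchase' and the same illustrative stg_events model as above:

    ```sql
    -- int_funnel.sql: one row per user, first timestamp per funnel step
    select
        user_id,
        min(case when event_type = 'view'     then event_time end) as view_time,
        min(case when event_type = 'cart'     then event_time end) as cart_time,
        min(case when event_type = 'purchase' then event_time end) as purchase_time
    from {{ ref('stg_events') }}
    group by user_id
    ```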

  • It seemed reasonable for both mart tables to use a wide format, until I took a closer look. If the funnel changed, mrt_conversion would need to change as well. By updating the grain to funnel transitions, nothing needs restructuring when a funnel step is added or removed. The from_step and to_step columns push changes to the row level and support filtering and aggregation in BI tools. mrt_retention stayed wide: the retention grid is consumed as a matrix, cohorts on rows and days on columns, and that matrix is the insight. A long format would force the consuming tool to unpivot just to resurface it.

    I used UNION ALL to stack the funnel transition rows, but this meant the subquery that aggregated user counts had to run twice, neither clean nor performant against 42 million rows. I used a CTE to define the aggregation once instead.
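
    A sketch of that transition-grain mart, with the aggregation defined once in a CTE and reused by every UNION ALL branch (step-order enforcement is omitted for brevity, and safe_divide guards against empty steps):

    ```sql
    -- mrt_conversion.sql: one row per funnel transition
    with step_counts as (
        -- aggregate once; count() skips users who never reached a step
        select
            count(view_time)     as viewers,
            count(cart_time)     as carters,
            count(purchase_time) as purchasers
        from {{ ref('int_funnel') }}
    )

    select 'view' as from_step, 'cart' as to_step,
           safe_divide(carters, viewers) as conversion_rate
    from step_counts

    union all

    select 'cart' as from_step, 'purchase' as to_step,
           safe_divide(purchasers, carters) as conversion_rate
    from step_counts
    ```

    Adding a funnel step now means adding a count and one more UNION ALL branch, rather than reshaping the table's columns.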

  • Testing was built into the process from the beginning. Each layer got its own schema.yml: staging tests to ensure the structural integrity of source data, intermediate tests to confirm grain and business logic, and mart tests to assert complete and valid outputs.

    The not_null test on int_funnel.view_time failed on the first run. 160 rows had null view_time values, meaning users with no view events were passing through because the funnel entry condition was missing. Without the test, the conversion rate would have been silently incorrect.
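
    The intermediate-layer tests can be expressed in a schema.yml along these lines; this is a sketch, and only the view_time test is taken directly from the project:

    ```yaml
    version: 2

    models:
      - name: int_funnel
        columns:
          - name: user_id
            tests:
              - not_null
              - unique        # grain: one row per user
          - name: view_time
            tests:
              - not_null      # this test caught the 160 users with no view events
    ```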

  • The pipeline answers retention and conversion questions.

    Behavior changes after users add to their cart. 11.1% of users who viewed a product in October 2019 added it to their cart, a fairly low rate. Of those, 49.8% completed a purchase, which suggests that users who add to their cart are largely intending to buy.

    A large share of visits come from one-time users rather than habitually engaged ones. Day 1 retention averaged roughly 15 to 20% of the day 0 user count across cohorts, dropping further by day 4.