Mixpanel calculates retention wrong, here’s how to fix it

Tags
Published
Author
They do it based on the count of people that did something in an interval, and the subsequent weeks after they continue to do it. Meaning that everyone in the January cohort that performed the event 2 months out show up in the February cohort events 1 month out. This is incorrect, users should not appear in multiple cohorts. Retention is an insight that shows you users that continue to do some action, some time range out from the first time they did it. For example the number of users that continue to login after their first login. This is based on their “On or After” method, which they recommend.
Not only are they duplicating users across intervals (users in week 1 will also appear in week 2), but when they go to show you the average they are putting users in multiple times in the average, which ruins its value. They’re doing a sort of “rolling event retention” which is not by users, but by the occurance of events then uniqued by users. They are effectively moving the goal post as time goes on. Or rather “retention of the event from the given interval in isolation”, which is not usable for user retention.
They aren’t actually looking at the first time they appear in the time range (still incorrect), nor the first time they appear ever (correct).
The correct way is to first take the first time users appear, then filter out those that haven’t started within the time window you are looking at (say 8 months). Then you find all the weeks those users appear in within the 8 months, subtract current week - start week to get the delta, and then you count the number of users in each start week, delta combo. Then you can divide to get the %, look at other metrics like net dollar retention, etc.
This is not user retnetion, this is sequencing of events from a start date, which is not user retention. This causes retention to include far too many users in each interval, and be a lot higher than real calculations.
I’ve confirmed that the number of events is the same in bigquery, mixpanel, and posthog.
with firsts as ( select user_id , date_trunc(min(date(timestamp)), week) week from `tangia-prod.tangia.interaction_event_created` where price != 0 and processor IN ('stripe', 'twitch', 'credit') and date(timestamp) > date_sub(current_date(), interval 8 month) group by user_id ) , weeks as ( select user_id , date_trunc(date(timestamp), week) as week from `tangia-prod.tangia.interaction_event_created` where user_id in (select user_id from firsts) and price != 0 and processor IN ('stripe', 'twitch', 'credit') and date(timestamp) > date_sub(current_date(), interval 8 month) group by user_id, week ) , diffs as ( select weeks.user_id user_id , weeks.week week , date_diff(weeks.week, firsts.week, week) delta , firsts.week as first_week from weeks left join firsts on firsts.user_id = weeks.user_id where date_diff(weeks.week, firsts.week, week) is not null and date_diff(weeks.week, firsts.week, week) >= 0 ) , counts as ( select count(user_id) as users , week , delta , first_week from diffs group by week, delta, first_week ) , first_counts as ( select users, week from counts where delta = 0 ) select counts.users users , counts.week week , counts.delta delta , counts.first_week first_week , first_counts.users first_users , if(first_counts.users is null, 1, counts.users / first_counts.users) as retention from counts left join first_counts on counts.first_week = first_counts.week order by first_week, delta
With posthog, I see the exact same results as this BigQuery SQL, so good to know that I am correct!