Group event counts indexed by relative day

Assume you have event counts that are already aggregated per user and day:

```python
import pandas as pd

# 0. Create a sample DataFrame
df = pd.DataFrame({
    'user': [1, 1, 2, 1, 2, 3, 2, 3, 3],
    'count': [2, 3, 7, 2, 4, 4, 7, 8, 1],
    'date': pd.to_datetime([
        '2022-01-01', '2022-01-03', '2022-01-06',
        '2022-01-14', '2022-01-14', '2022-01-21',
        '2022-02-03', '2022-02-03', '2022-02-12',
    ]),
})
```

The sample data frame looks like this:

```
   user  count       date
0     1      2 2022-01-01
1     1      3 2022-01-03
2     2      7 2022-01-06
3     1      2 2022-01-14
4     2      4 2022-01-14
5     3      4 2022-01-21
6     2      7 2022-02-03
7     3      8 2022-02-03
8     3      1 2022-02-12
```

1. Add a `start_date` column to `df` by using `.transform('min')` on the user-grouped date column.
2. Calculate the relative `day` of each event with respect to its user's start date, `(df.date - df.start_date).dt.days + 1`, and add that offset to `df` as another column.
3. Pivot the resulting table using the new `day` column as the index, the `user` column as columns, and the `count` column as values: `df.pivot(index='day', columns='user', values='count')`.
4. Finally, `.fillna(0.)` the holes in the pivoted data frame for users that had no events on a given relative day.

```python
# 1. Calculate the minimum date for each user
df['start_date'] = df.groupby('user').date.transform('min')

# 2. Calculate the day difference for each user (the first day is day 1)
df['day'] = (df['date'] - df['start_date']).dt.days + 1

# 3. Pivot the DataFrame: one row per relative day, one column per user
pivoted = df.pivot(index='day', columns='user', values='count')

# 4. Fill the holes with zeros
pivoted = pivoted.fillna(0.)
```

The resulting pivoted data frame is:

```
user    1    2    3
day
1     2.0  7.0  4.0
3     3.0  0.0  0.0
9     0.0  4.0  0.0
14    2.0  0.0  8.0
23    0.0  0.0  1.0
29    0.0  7.0  0.0
```

To then get the log-normal mean of the counts per day (the mean taken in log space and transformed back), use:

```python
data = pivoted.apply('log1p').mean(axis='columns').apply('expm1')
```

`log1p` computes `log(1 + x)` and `expm1` computes `exp(x) - 1`, so 1 is added before the logarithm and subtracted again after the mean. This lets the calculation handle the zero counts.
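As a sketch of what that one-liner computes, written out for a single relative day (the symbols $c_u$ for user $u$'s count on that day and $n$ for the number of users are notation introduced here, not part of the code):

$$
m = \exp\left(\frac{1}{n}\sum_{u=1}^{n}\ln(1 + c_u)\right) - 1 = \left(\prod_{u=1}^{n}(1 + c_u)\right)^{1/n} - 1
$$

In other words, it is the geometric mean of the counts shifted by one, shifted back afterwards. For day 1 of the sample data, with counts 2, 7 and 4:

$$
m = (3 \cdot 8 \cdot 5)^{1/3} - 1 = 120^{1/3} - 1 \approx 3.932
$$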
This produces the following series:

```
day
1     3.932424
3     0.587401
9     0.709976
14    2.000000
23    0.259921
29    1.000000
```

Compared to the normal mean, which would produce:

```
day
1     4.333333
3     1.000000
9     1.333333
14    3.333333
23    0.333333
29    2.333333
```

If you now want to plot this data, you will have to fill in the days on which no user had any events:

```python
# Cover every relative day between the first and the last observed one;
# name='day' keeps the index label after reindexing
idx = pd.RangeIndex(data.index.min(), data.index.max() + 1, name='day')

# Alternatively, if you have a fixed range, say 14 days
idx = pd.RangeIndex(1, 15, name='day')

series = data.reindex(idx, fill_value=0.)
```

With the fixed 14-day range, this transforms the data into the following series for plotting:

```
day
1     3.932424
2     0.000000
3     0.587401
4     0.000000
5     0.000000
6     0.000000
7     0.000000
8     0.000000
9     0.709976
10    0.000000
11    0.000000
12    0.000000
13    0.000000
14    2.000000
```
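For reuse, the steps above could be collected into one function. The following is only a minimal sketch: the helper name `relative_day_log_mean` and its `n_days` parameter are made up for illustration, and it assumes the input has the `user`, `count` and `date` columns of the sample data with at most one row per user and date (otherwise `pivot` raises an error). It calls `np.log1p` and `np.expm1` directly instead of passing their names as strings.

```python
import numpy as np
import pandas as pd


def relative_day_log_mean(df, n_days=None):
    """Hypothetical helper: log1p-mean of counts per relative day."""
    # Relative day of each event, counted from the user's first event (day 1)
    start = df.groupby("user")["date"].transform("min")
    day = (df["date"] - start).dt.days + 1

    # One row per relative day, one column per user; missing cells become 0
    pivoted = (
        df.assign(day=day)
        .pivot(index="day", columns="user", values="count")
        .fillna(0.0)
    )

    # Mean in log1p space, mapped back with expm1
    data = np.expm1(np.log1p(pivoted).mean(axis="columns"))

    # Fill the gaps so that every day from 1 to the last day has a value
    last = int(data.index.max()) if n_days is None else n_days
    idx = pd.RangeIndex(1, last + 1, name="day")
    return data.reindex(idx, fill_value=0.0)


# Same sample data as above
df = pd.DataFrame({
    "user": [1, 1, 2, 1, 2, 3, 2, 3, 3],
    "count": [2, 3, 7, 2, 4, 4, 7, 8, 1],
    "date": pd.to_datetime([
        "2022-01-01", "2022-01-03", "2022-01-06",
        "2022-01-14", "2022-01-14", "2022-01-21",
        "2022-02-03", "2022-02-03", "2022-02-12",
    ]),
})

series = relative_day_log_mean(df, n_days=14)
print(series)
```

With matplotlib installed, something like `series.plot(kind="bar")` should then draw one bar per relative day.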