DEV Community

Cover image for Using RudderStack To Backfill Data In Mixpanel
Team RudderStack for RudderStack

Posted on • Originally published at rudderstack.com

Using RudderStack To Backfill Data In Mixpanel

We recently announced our partnership with Mixpanel, and for the last several months we've been busy building dashboards driving insights as a result of connecting RudderStack to Mixpanel. Unfortunately, our analysis was limited to the timeframe from when we started submitting data. To report from an earlier start date, we needed a way to load previously processed RudderStack data into Mixpanel.

To accomplish this, we explored two options:

  • Replay all of the events from our S3 bucket and send them to Mixpanel
  • Create a RudderStack Warehouse Action to replay the data already stored in our snowflake warehouse

The first option, replaying events from our S3 bucket and sending them to Mixpanel, was the most obvious. With a new user transformation to update the timestamp for this backfill source, we thought we'd be up and running in no time. In general this approach makes a lot of sense, but we've made considerable changes to our stack in the past six months, and we changed our naming conventions when we implemented our Tracking Plans. So this option meant we'd be creating a bunch of extra user profile and event properties in Mixpanel which have been previously deprecated within our RudderStack app. We could clean up this noise within the user transformation, but we decided it might be easier to do this within the data warehouse.

So, we went with option number two: create a RudderStack Warehouse Action to replay the data already stored in our snowflake warehouse. By leveraging Snowflake, we could clean, prep and test the data loading it into a staging project within Mixpanel prior to loading it into production. This route also allowed us to normalize the data schema and pair down the number of properties we wanted to load to only those important to our specific Mixpanel use cases. Yes, similar to the S3 option, this could have been done with user transformations, but we decided cleaning up the data and normalizing schemas in a database would be easier. Here's how we did it:

Step 1: Determine the right Track, Identify and Page calls

If you are new to RudderStack, you may not be familiar with the schema RudderStack creates to store the different event types and track calls in the data warehouse. For our use case we were principally focused on user conversion activity pre and post conversion, so we wanted to process prior identify calls along with the page and track calls from our website. All of these prior events would be loaded into a single "replay" table along with an event type column - which would be used in the user transformation from the Warehouse Action. If we only needed to replay our page calls, we could have used the warehouse action on the page_view table from within Snowflake, but because we have over a dozen different event types, we made the decision to consolidate all of the events into a single replay and only run the Warehouse Action once as opposed to running multiple Warehouse Actions, one for each event type table in Snowflake.

Step 2: Create the staging table in Snowflake

As we mentioned above, we wanted to replay events from a variety of different page, track and identify tables, each of which has more columns than we need. We decided to consolidate the various tables into a single replay table that would be the source for a single warehouse action. This also allowed us to rename events and properties from older events that have since changed. If we did not rename the columns in the new table, we could still create formulas within Mixpanel to combine the various properties, but it's much easier to clean this up in snowflake and/or a user transformation.

Review Mixpanel lexicon

When creating the new table, it was helpful to review the Mixpanel Lexicon in our production instance to review the event and user profile properties. It's important to note that both RudderStack and Mixpanel are case sensitive and since the RudderStack schema in Snowflake is created in all capitals, these columns would need to be modified within the user transformation.

Step 3: Define User Transformation mapping

The RudderStack Warehouse Action will load each row from our replay table as an identify call with each column as a trait. Therefore, we created a User Transformation to change the event type to page or track if necessary and move the traits to the appropriate properties or context objects. We also mapped the email address to the userId property and explicitly set the timestamp to our original timestamp value (or otherwise the Rudderstack Destination Transformer would use the current timestamp for the event).

We also added a new property called "backfiled_date" to make it easy to filter out these events in Mixpanel if we ever need to.


export function transformEvent(event) {

        event.timestamp = event.traits.TIMESTAMP
        event.userId = event.traits.CONTEXT_TRAITS_EMAIL
        event.new_event = event.traits.EVENT 

        if (
            event.type == 'identify'  
            )  {   
                event.context["traits"];
                event.context["page"];

                let page = {
                        initial_referrer : event.CONTEXT_PAGE_INITIAL_REFERRER,
                        initial_referring_domain : event.CONTEXT_PAGE_INITIAL_REFERRING_DOMAIN,
                        path : event.CONTEXT_PAGE_PATH,
                        referrer : event.CONTEXT_PAGE_REFERRER,
                        search : event.CONTEXT_PAGE_SEARCH,
                        title : event.CONTEXT_PAGE_TITLE,
                        url : event.CONTEXT_PAGE_URL
                }
                event.context = {page : page}

                let traits = {
                        account_type : event.traits.CONTEXT_TRAITS_ACCOUNT_TYPE,
                        utm_campaign : event.traits.CONTEXT_TRAITS_UTM_CAMPAIGN,
                        utm_content : event.traits.CONTEXT_TRAITS_UTM_CONTENT,
                        utm_medium : event.traits.CONTEXT_TRAITS_UTM_MEDIUM,
                        utm_source : event.traits.CONTEXT_TRAITS_UTM_SOURCE,
                        utm_term : event.traits.CONTEXT_TRAITS_UTM_TERM,
                }
                event.context = {traits : traits}

                // Mixpanel Mapped Fields
                event.context.traits.email = event.traits.CONTEXT_TRAITS_EMAIL
                event.context.traits.firstName = event.traits.CONTEXT_TRAITS_FIRST_NAME
                event.context.traits.lastName = event.traits.CONTEXT_TRAITS_LAST_NAME
                event.context.traits.backfill_date = "2021-11-07T12:00:00.00Z"
            }

        if (
            (event.new_event == 'form_submit'  || event.new_event == 'webhook_source_event' || event.new_event == 'user_signed_up')
            )  {   
                event.type = 'track';    
                event.event = event.traits.EVENT_TEXT
                event.name = event.EVENT_TEXT
                event.properties.form_id = event.context.traits.form_id                                               
            }             
        if (
            event.new_event == 'page'  
            )  {   
                event.type = 'page';    
                event.event = "page_view"
                event.name = "page_view"
            } 

             delete event.traits;
             return event
}
Enter fullscreen mode Exit fullscreen mode

Step 4: Set Up test environment in Mixpanel

If possible, it is always recommended to create a new project within Mixpanel to test your data replay prior to loading the data into production.

Quick Tip: Mixpanel dropdown menus only display events and their properties for events within the last 30 days. This was confusing to us initially because we could see the events in the eventstream window (when we set the timeframe back several months) but did not see them when we were QA'ing the data in the report builder. In order to populate these menus we had to submit a sample page and track call with current timestamp.

Step 5: Create Warehouse Action & send test events

With the data staged in the replay table and the User Transformation created, the last step was to actually create the RudderStack Warehouse Action to stream the replay events to Mixpanel. We used the Full Sync option and then disabled the source after the initial sync since we only needed to send the data once to the test environment (Check out our Snowflake Warehouse Actions Sources documentation for specifics on how we connected our replay table). Once the QA was completed, we simply updated the Mixpanel Destination API Key and Secret to the production process and reran a full sync.

With the historical data loaded into production we were able to gain insights from a full year's worth of data in Mixpanel, which made our sales and marketing teams very happy.

Try it for yourself

If you would like to learn more about backfilling data in Mixpanel with RudderStack, sign up for free, check out more specifics in our Mixpanel Documentation, and send us a note if you have any questions.

Top comments (0)