5 Learnings From Creating a Sessions Fact Table From GA4 Data in BigQuery SQL

5 Learnings From Creating a Sessions Fact Table From GA4 Data in BigQuery SQL

With the new Google Analytics and Google’s decision to allow a direct integration between GA4 and BigQuery, a new world of possibilities has opened up for us. The direct integration allows us to export GA4 raw data into a BigQuery dataset, and from there, we can use SQL to model the data into a usable and query-able form for easier access.

Ralph Kimball teaches a helpful framework for modeling data into concepts called dimension tables and fact tables. Read here for a great primer on it.

In this post, I show you how to create a sessions fact table in BigQuery, where one row in the table represents a single unique session.

Each row will have columns that are “attributes” for each session, essentially describing characteristics for each session.

Firstly, Why Is a Sessions Fact Table Helpful?

A sessions fact table is helpful, because once it’s created, you can query it easily to answer key business questions and also build more SQL data models on top of it.

An example of a sessions fact table would include a session key column to uniquely identify a single session and different columns to describe that session like user id, session start date, session start time, count of pageviews, and number of signup completions.

Below is how we would describe the schema of the table.

Column nameDatatypeDescription
session_keySTRINGUnique identifier for a session
user_idSTRINGIdentifier for each user
session_start_dateDATEDate the session started
session_start_timeTIMETime of day the session started
count_page_viewsINTEGERNumber of pageviews navigated on website during the session.
count_signupINTEGERNumber of signup completions during the session.
An example sessions fact table

Notice how each column name is lowercase and separated by an underscore instead of a space.

Each column includes a datatype, which is how we store the data in the database. (Assigning a datatype for each column allows us to do some very special things with the data, like manipulating, transforming, and searching very easily.)

In creating this table, I found a number of things interesting about the GA4 export specifically and how GA4 collects data in general. Below are my findings…

1. You Need to Combine User Pseudo Id and Session Id to Get a Unique Session Key

This one is a big one. Initially, I had the incorrect assumption that the “session_id” found in the raw export could represent a single unique session. But I found that not to be the case.

In many instances, I saw:

  • Duplicate session ids assigned to different user pseudo ids – (You would expect all session ids to be unique and no duplicates)
  • Count of unique user pseudo id is greater than count of unique session ids – (You would expect there to be more unique session ids than unique user ids, because a single user can visit a website multiple times aka multiple sessions)

These instances proved that you need to find a different way to get a unique “session id”.

Luckily, Google’s Developer Docs here mentions the solution.

The ga_session_id is actually the unique session key for each individual user.

That’s why we were seeing duplicate session ids across different users – because they’re only unique per user.

How to Get Unique Session Key in GA4:

A session key is a unique identifier for each individual session. To get the unique session key in GA4 across all datasets, you will need to combine the field user_pseudo_id and ga_session_id.

To do that, you can concatenate the two like so:

CONCAT(user_pseudo_id, CAST(session_id as STRING)) as session_key

The above code is assuming you’ve already unnested the ga_session_id field from the raw events table.

2. Some Session Ids May Be Null

When combining the two fields: session id and user id, I happened to find that some session id’s ended up being null.

Some people believe it to be due to people denying consent in GA4 settings and ad blockers.

3. User Pseudo Id Is Never Null

As of now, we have not seen any user_pseudo_id fields null. Of course, that could change depending on the situation you’re in and how you have your tracking set up.

4. If Session Id is Null, Then Session Key Will Also Be Null

With the concatenation of session id field and user id fields, if a session id is null, then the resulting concatenated field of session key will also be null.

For this reason, when you create a sessions fact table, you can add a conditional WHERE clause statement to filter out any session key that is null. AKA: WHERE session_key is not null

Since we found our user_pseudo_id field to not contain any null values, whenever we add both the concatenated session_key field and user_id field in the same table, without filtering out for null session key values, we would end up getting a greater count of user_id than session_key. Thus, better to filter out null session keys.

5. A Simple SQL Model For a Sessions Fact Table

And finally, we get to the meat of the post.

Below is a simple SQL model for creating a sessions fact table from GA4 data in BigQuery.

Table Schema:

Column nameDatatypeDescription
session_keySTRINGUnique identifier for a session
user_idSTRINGIdentifier for each user
session_start_dateDATEDate the session started
session_start_timeTIMETime of day the session started
count_page_viewsINTEGERNumber of pageviews navigated on website during the session.
count_signupINTEGERNumber of signup completions during the session.
Table schema for a sessions fact table

SQL Query Model:

WITH t1 as (
  SELECT
    session_key
    , user_id
    , MIN(event_date) as session_start_date
    , EXTRACT(TIME FROM MIN(TIMESTAMP_MICROS(event_timestamp))) as session_start_time
    , COUNTIF(event_name = 'page_view') as count_page_views
    , COUNTIF(event_name = 'signup_conversion') as count_signup
  FROM `ls-funnel-stg.google_analytics_4.base_events`
  WHERE session_key is not null
  GROUP BY 1,2
  ORDER BY 3,4
)
SELECT * FROM t1

Notes on SQL Query Model:

  • Model is built on top of a base_events table (Comment below if you’d like the SQL for this)
  • The event_name “signup_conversion” is dependent on how you named your conversion event when you set up tracking for that. Use the actual event name you used for your own scenario.
  • Every row represents one unique session and its attributes
  • There are no duplicate session_key values in this table

Conclusion

And that’s it! If there is anything I missed or you’d like clarification on, drop a comment below and I’ll get to it as soon as I can.

How to Export Your Universal Analytics (GA3) Data the Fastest & Easiest Way [w/ Free Templates Included]
A Beginner Data Engineer Project: Extract Random Jokes via API with Python (Part 2)
A Beginner Data Engineer Project: Extract Random Jokes via API with Python (Part 1)
What Are Python Lambda Functions?
How to Make a BigQuery Real-Time MTD Job Cost Monitoring Dashboard (Part 1): Setting Up the Data
3 Things You Need to Get Paid More
How to Connect BigQuery with dbt
How to Model GA4 Data Easier in BigQuery with this Helpful Tool
When to Use IF NULL THEN 0 (How to Handle Null Values in SQL)
GA4 & BigQuery: A SQL Model for a Base Events Table (How to Unnest Fields)

Related Posts