library(tidyverse)
library(bigrquery)
library(skimr)
library(janitor)
library(factoextra)
library(ids)
library(here)
library(psych)
library(gt)
Messing with models: k-means clustering of Google Analytics 4 data
Marketing campaigns target specific audiences. Some form of customer segmentation is performed to identify these audiences. These segments can be identified using several approaches. This post describes one such approach: using data and machine learning to fit clustering models to Google Analytics data for an e-commerce store.
The goal is simple: create customer segments a marketing team could use to craft and target specific marketing campaigns. To achieve this goal, this post overviews the use of a straightforward, useful machine learning algorithm called k-means clustering. Packages and functions from R, a statistical programming language, are used for the segmentation process.
ggplot_theme <-
  theme_bw() +
  theme(
    text = element_text(size = 12),
    title = element_text(size = 16, face = "bold")
  )

# Override the default ggplot2 theme
theme_set(ggplot_theme)
Extract and explore data
First, we need some data. This section briefly discusses the data extraction process. I’ve detailed the extraction of Google Analytics data in a previous post. For the sake of time, then, this topic won’t be discussed in-depth. If you’re already familiar with how to access this data, feel free to skip this section.
The data used here is obfuscated Google Analytics data for the Google Merchandise Store. This data is stored and accessed via BigQuery, a cloud-based data warehouse useful for analytics purposes.
This data is intended for example tutorials, so the conclusions drawn here should not be used to infer anything about true purchasing behavior. The purpose of this post is to show how to perform k-means clustering, not to derive true conclusions about Google Merchandise Store customers.
Before data extraction, let’s do some exploration of the source data. The focus here is to get a sense of what’s in the data, while also creating an understanding of the source data’s structure. This mostly involves identifying the available fields and data types.
The bigrquery package provides the bq_table_fields() function to retrieve this information. The following code example shows how to use this function to return field names within the dataset's tables:
table_ecommerce <-
  bq_table(
    "bigquery-public-data",
    "ga4_obfuscated_sample_ecommerce",
    "events_20210101"
  )

bq_table_fields(table_ecommerce)
Then, the following code submits a SQL query to return the ga4_obfuscated_sample_ecommerce data from BigQuery. It's important to note, similar to what was done in past posts, I'm querying data associated with transactions occurring over the U.S. Christmas holiday season. Here's the query string if you want specific details:
<- "
query select
event_date,
user_pseudo_id,
ecommerce.transaction_id,
items.item_category,
items.quantity,
ecommerce.purchase_revenue_in_usd
from `<your-project-name>.ga4_obfuscated_sample_ecommerce.events_*`,
unnest(items) as items
where _table_suffix between '20201101' and '20201231' and
event_name = 'purchase'
order by user_pseudo_id, transaction_id
"
data_ga_transactions <- bq_project_query(
  "<your-project-name>",
  query
) |>
  bq_table_download()
Before moving forward, some data validation is in order. Columns containing missing values are the biggest concern.
# Verify if there are any missing values
map(data_ga_transactions, \(x) any(is.na(x)))
$event_date
[1] FALSE
$user_pseudo_id
[1] FALSE
$transaction_id
[1] TRUE
$item_category
[1] FALSE
$quantity
[1] TRUE
$purchase_revenue_in_usd
[1] FALSE
No surprise: some features contain missing values. transaction_id
and quantity
both contain missing values. Our data wrangling step will need to address these issues. Let’s look closer and explore why missing values might be present within the data. This code can be used to do this:
# Examples with a missing `transaction_id`
data_ga_transactions |> filter(is.na(transaction_id))
# A tibble: 59 × 6
event_date user_pseudo_id transaction_id item_category quantity purchase_revenue_in_usd
<dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 20201101 1494019. <NA> New 1 25
2 20201101 2422026. <NA> New 2 72
3 20201101 2422026. <NA> Fun 1 72
4 20201101 29640693. <NA> Apparel 1 55
5 20201101 29640693. <NA> Shop by Brand 2 59
6 20201101 3297047. <NA> Apparel 1 87
7 20201101 3297047. <NA> Apparel 1 87
8 20201101 3297047. <NA> Apparel 1 87
9 20201101 3297047. <NA> Apparel 1 87
10 20201101 33027284. <NA> Accessories 1 63
# ℹ 49 more rows
# Examples where a `quantity` is missing
data_ga_transactions |> filter(is.na(quantity))
# A tibble: 148 × 6
event_date user_pseudo_id transaction_id item_category quantity purchase_revenue_in_usd
<dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 20201217 1086663. (not set) (not set) NA 0
2 20201120 11080045. (not set) (not set) NA 0
3 20201204 12422651. (not set) (not set) NA 0
4 20201213 13309292. (not set) (not set) NA 0
5 20201126 13429550. (not set) (not set) NA 0
6 20201213 1396855. (not set) (not set) NA 0
7 20201127 1423291. (not set) (not set) NA 0
8 20201210 1540308157. (not set) (not set) NA 0
9 20201201 15915163. (not set) (not set) NA 0
10 20201120 15980073. (not set) (not set) NA 0
# ℹ 138 more rows
For the transaction_id
column, 59 rows contain missing values. If you dig a little further, you’ll notice the majority of these missing transaction_id
s occur near the beginning of data collection (i.e., 2020-11-01
). This was likely a measurement issue with the Google Analytics setup, which would warrant further exploration. Since information about how Google Analytics was set up for the Google Merchandise Store isn't available to me (i.e., I can't speak with the developers), this issue can't be explored further. The only option, then, is to simply drop these rows.
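If you want to do that digging yourself, a quick tally of missing transaction_id values by date (using the data_ga_transactions object from above) shows when they occurred:

# Count missing transaction_id values by event date
data_ga_transactions |>
  filter(is.na(transaction_id)) |>
  count(event_date, sort = TRUE)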
Missing quantity
values seem to be associated with examples where transaction_id
and item_category
both contain a (not set)
character string. Again, this could be a measurement issue worth further exploration. Given the available information, these examples will also be dropped.
At this point, dplyr
’s glimpse()
and skimr
’s skim()
functions are used for some additional exploratory analysis. glimpse()
is great for getting info about the data's structure. skim()
provides summary statistics for each variable within the dataset, regardless of type.
glimpse(data_ga_transactions)
Rows: 13,113
Columns: 6
$ event_date <dbl> 20201210, 20201210, 20201210, 20201103, 20201103, 20201103, 202011…
$ user_pseudo_id <dbl> 10111056, 10111056, 10111056, 1014825, 1014825, 1014825, 1014825, …
$ transaction_id <chr> "741471", "741471", "741471", "(not set)", "(not set)", "(not set)…
$ item_category <chr> "Apparel", "Apparel", "Apparel", "Shop by Brand", "Office", "Shop …
$ quantity <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ purchase_revenue_in_usd <dbl> 94, 94, 94, 183, 183, 183, 183, 183, 183, 86, 86, 86, 86, 86, 86, …
skim(data_ga_transactions)
Name | data_ga_transactions |
Number of rows | 13113 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
transaction_id | 59 | 1 | 2 | 9 | 0 | 3563 | 0 |
item_category | 0 | 1 | 0 | 23 | 189 | 22 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
event_date | 0 | 1.00 | 20201167.86 | 4.740000e+01 | 20201101 | 20201121 | 20201202 | 20201212 | 20201231 | ▇▁▁▂▇ |
user_pseudo_id | 0 | 1.00 | 289012023.38 | 1.274012e+09 | 1014825 | 5706716 | 25060616 | 62921408 | 9202815833 | ▇▁▁▁▁ |
quantity | 148 | 0.99 | 1.49 | 2.910000e+00 | 1 | 1 | 1 | 1 | 160 | ▇▁▁▁▁ |
purchase_revenue_in_usd | 0 | 1.00 | 106.68 | 1.171100e+02 | 0 | 40 | 71 | 137 | 1530 | ▇▁▁▁▁ |
The data contains a total of 13,113 rows with six features. After reviewing the data types and comparing some of the example values outputted from glimpse()
, it’s concerning that some variables are of type character rather than type numeric. Values of type integer would be best in this case. Let’s do some more digging.
# Only printing the first 10 values for brevity
unique(data_ga_transactions$transaction_id) |>
head(n = 10)
[1] "741471" "(not set)" "983645" "406646" "2105" "886501" "76937" "29460"
[9] "339943" "614046"
As expected, some transaction_id
s are of type character, which is keeping the column from being a true numeric variable. This is due to some rows containing the (not set)
character string. Since the transaction_id
is critical for our analysis, these rows will need to be filtered out during data wrangling.
Initial data exploration is complete. We now have enough information to begin data wrangling. The wrangling step will strive to format the data into a structure necessary for performing k-means clustering. The following code chunk contains the code to do this. After, there’s a step-by-step explanation of what this code is doing.
data_user_items <-
  data_ga_transactions |>
  filter(transaction_id != "(not set)") |>
  drop_na() |>
  mutate(
    user_id = random_id(),
    .by = user_pseudo_id,
    .after = user_pseudo_id
  ) |>
  select(-user_pseudo_id) |>
  summarise(
    quantity = sum(quantity),
    .by = c(user_id, item_category)
  ) |>
  mutate(
    item_category = case_when(
      item_category == "" ~ "Unknown",
      TRUE ~ as.character(item_category)
    )
  ) |>
  pivot_wider(
    names_from = item_category,
    values_from = quantity,
    values_fill = 0
  ) |>
  clean_names()
In the code above, the (not set)
issue in the transaction_id
feature is addressed first. Any row with the (not set)
value is simply filtered from the data. drop_na()
follows. This function drops any rows that contain a NA
value. As a result of dropping NA
s, we’re left with 11,615 rows: a loss of 1,498 examples (~11.4%).
Merely dropping examples is convenient in this case, but it may not be ideal or even valid in every context. You’ll want to explore what is appropriate for the data you’re working with before applying this strategy.
Some mutating and summarization of the data is next. The mutation step applies a random id in place of the user_pseudo_id
. This provides an extra layer of privacy for users, since the data is being used outside of its original source system.
This may be a bit excessive, but I argue it's important. It's best to do what we can to add another layer of privacy to information being used outside of its original source system. Since this data is public, it's not too much of a concern here. However, this may be an important consideration when you're working with 'real world' Google Analytics data.
I’m not a privacy expert, so you’ll want to identify best practice for the context you work in.
With the modelling goal top-of-mind, aggregation will sum values to individual users rather than transactions. This way, we’ll be able to use k-means clustering to identify customer cohorts, rather than transaction cohorts. To do this, the data is grouped by user_id
and item_category
and the quantity
feature is summed.
Post aggregation, item_category
s that don’t contain any values need to be addressed. To address this, any missing string is replaced with ‘Unknown’ using the case_when()
function. Finally, tidyr
’s pivot_wider()
is used to transform the data from long format to wide format. When taken together, the transformation results in a data set where each feature, other than user_id
, is a sum of the item categories purchased by each user during the period. Each example, then, could comprise multiple transactions occurring over the period.
Post wrangling, readr
’s write_csv()
function is used to save the data to disk. This way a request isn’t sent to BigQuery every time the post is built. You don’t have to do this, but it’s useful for limiting query costs associated with the service, though it’s pretty economical to query data in this way.
write_csv(data_user_items, "data_user_items.csv")
data_user_items <- read_csv("data_user_items.csv")
Rows: 2941 Columns: 22
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): user_id
dbl (21): apparel, bags, shop_by_brand, drinkware, new, clearance, accessories, campus_collectio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data exploration for modeling
Before modeling, let’s do some data exploration. The goal during exploration is to ensure data meets the assumptions of k-means clustering. First, let’s get a general sense of the structure of our wrangled data. dplyr
’s glimpse()
is once again useful here.
glimpse(data_user_items)
Rows: 2,941
Columns: 22
$ user_id <chr> "vqnzdyytdx", "lhkctjeylp", "fkgfswdpur", "bycrrxogpw", "xdecczvgb…
$ apparel <dbl> 3, 0, 3, 0, 0, 0, 4, 15, 1, 1, 0, 1, 2, 0, 2, 3, 0, 0, 0, 0, 1, 6,…
$ bags <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ shop_by_brand <dbl> 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 6, 0, 0, 0, 2, 0, 0, 0, 0, 4, 0, 0, …
$ drinkware <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, …
$ new <dbl> 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 6, 2, 0, 0, 0, 0, 1, 1, 0, 4, 0, 0, …
$ clearance <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, …
$ accessories <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 3, 0, 0, 1, 0, 0, 0, …
$ campus_collection <dbl> 0, 0, 0, 1, 0, 0, 0, 27, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0,…
$ office <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ lifestyle <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ small_goods <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ uncategorized_items <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, …
$ stationery <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ google <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ writing_instruments <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fun <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ unknown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ electronics_accessories <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ notebooks_journals <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gift_cards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ black_lives_matter <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
For this time period, the data contains 2,941 customers and 21 features available for the k-means clustering model. The user_id
feature will be excluded from the modeling. Features represent counts of items purchased within each item category for each user during the period. skimr::skim()
is again handy for getting a sense of the shape of the data.
skim(data_user_items)
Name | data_user_items |
Number of rows | 2941 |
Number of columns | 22 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 21 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
user_id | 0 | 1 | 10 | 10 | 0 | 2941 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
apparel | 0 | 1 | 1.42 | 2.13 | 0 | 0 | 1 | 2 | 28 | ▇▁▁▁▁ |
bags | 0 | 1 | 0.28 | 2.02 | 0 | 0 | 0 | 0 | 75 | ▇▁▁▁▁ |
shop_by_brand | 0 | 1 | 0.30 | 1.32 | 0 | 0 | 0 | 0 | 40 | ▇▁▁▁▁ |
drinkware | 0 | 1 | 0.30 | 1.49 | 0 | 0 | 0 | 0 | 30 | ▇▁▁▁▁ |
new | 0 | 1 | 0.56 | 2.32 | 0 | 0 | 0 | 0 | 87 | ▇▁▁▁▁ |
clearance | 0 | 1 | 0.23 | 1.03 | 0 | 0 | 0 | 0 | 20 | ▇▁▁▁▁ |
accessories | 0 | 1 | 0.49 | 2.23 | 0 | 0 | 0 | 0 | 48 | ▇▁▁▁▁ |
campus_collection | 0 | 1 | 0.58 | 3.40 | 0 | 0 | 0 | 0 | 128 | ▇▁▁▁▁ |
office | 0 | 1 | 0.56 | 3.21 | 0 | 0 | 0 | 0 | 80 | ▇▁▁▁▁ |
lifestyle | 0 | 1 | 0.21 | 0.97 | 0 | 0 | 0 | 0 | 20 | ▇▁▁▁▁ |
small_goods | 0 | 1 | 0.05 | 0.28 | 0 | 0 | 0 | 0 | 4 | ▇▁▁▁▁ |
uncategorized_items | 0 | 1 | 0.21 | 1.32 | 0 | 0 | 0 | 0 | 48 | ▇▁▁▁▁ |
stationery | 0 | 1 | 0.19 | 2.16 | 0 | 0 | 0 | 0 | 100 | ▇▁▁▁▁ |
google | 0 | 1 | 0.19 | 3.39 | 0 | 0 | 0 | 0 | 160 | ▇▁▁▁▁ |
writing_instruments | 0 | 1 | 0.10 | 0.99 | 0 | 0 | 0 | 0 | 40 | ▇▁▁▁▁ |
fun | 0 | 1 | 0.04 | 1.40 | 0 | 0 | 0 | 0 | 75 | ▇▁▁▁▁ |
unknown | 0 | 1 | 0.16 | 1.08 | 0 | 0 | 0 | 0 | 40 | ▇▁▁▁▁ |
electronics_accessories | 0 | 1 | 0.01 | 0.24 | 0 | 0 | 0 | 0 | 12 | ▇▁▁▁▁ |
notebooks_journals | 0 | 1 | 0.01 | 0.30 | 0 | 0 | 0 | 0 | 12 | ▇▁▁▁▁ |
gift_cards | 0 | 1 | 0.02 | 0.40 | 0 | 0 | 0 | 0 | 15 | ▇▁▁▁▁ |
black_lives_matter | 0 | 1 | 0.00 | 0.02 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
Upon visual inspection, the shapes of the distributions are concerning. Each histogram exhibits highly skewed data. Outliers are most likely present, and they will need to be addressed. Otherwise, they'll negatively impact the k-means clustering algorithm. The mean values for the features also indicate a likely issue: limited purchase frequency for certain item categories.
Before moving ahead with removing examples, dropping some features might be worth exploring. The objective here is to identify item categories with limited purchase frequency. Any item category with a limited purchase frequency could likely be dropped.
summarise(
  data_user_items,
  across(apparel:black_lives_matter, sum)
) |>
  glimpse()
Rows: 1
Columns: 21
$ apparel <dbl> 4169
$ bags <dbl> 832
$ shop_by_brand <dbl> 889
$ drinkware <dbl> 893
$ new <dbl> 1634
$ clearance <dbl> 684
$ accessories <dbl> 1438
$ campus_collection <dbl> 1700
$ office <dbl> 1633
$ lifestyle <dbl> 607
$ small_goods <dbl> 143
$ uncategorized_items <dbl> 608
$ stationery <dbl> 554
$ google <dbl> 572
$ writing_instruments <dbl> 291
$ fun <dbl> 122
$ unknown <dbl> 463
$ electronics_accessories <dbl> 35
$ notebooks_journals <dbl> 43
$ gift_cards <dbl> 46
$ black_lives_matter <dbl> 1
Reviewing the output, small_goods
; fun
; electronics_accessories
; notebooks_journals
; gift_cards
; and black_lives_matter
have a small enough purchase frequency to be dropped. Since we don't have information on what is being purchased, the unknown
feature could also be dropped. Here's the code to drop these features:
data_items <-
  data_user_items |>
  select(
    -c(
      small_goods,
      fun,
      electronics_accessories,
      notebooks_journals,
      gift_cards,
      black_lives_matter,
      unknown
    )
  )
While the skimr::skim()
output includes histograms, we’ll use ggplot2
to examine the distributions in more detail.
data_item_hist <- data_items |>
  pivot_longer(
    cols = apparel:stationery,
    values_to = "items"
  )
ggplot(data_item_hist, aes(x = items)) +
geom_histogram(binwidth = 1) +
facet_wrap(~name, ncol = 4, scales = "free") +
labs(
title = "Distribution of purchases by item category",
y = "",
x = ""
)
These histograms further confirm the presence of skewed data. They also provide further evidence of another characteristic of concern: low item purchase frequency. Both of these issues will need to be addressed before applying k-means clustering.
Let’s first normalize the data, so we can get values across features to be within a similar range. To do this, we’ll transform features into z-scores. The summary()
function can be used to confirm the transformation was applied. This step can be done by utilizing the following code:
scale() accepts a data frame with only numeric features, so the user_id column is removed before scaling and added back afterwards.
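As a quick aside before applying it: scale() simply computes a z-score, (x - mean(x)) / sd(x), for each column. Here's a minimal check of that equivalence on a single feature; this is a side check, not part of the wrangling itself:

# scale() on one column is equivalent to manually computing z-scores
all.equal(
  as.numeric(scale(data_items$apparel)),
  (data_items$apparel - mean(data_items$apparel)) / sd(data_items$apparel)
)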
data_items_stnd <-
  as_tibble(scale(select(data_items, -user_id))) |>
  mutate(user_id = data_items$user_id, .before = 1)
# Verify the standardization was applied
summary(data_items_stnd)
user_id apparel bags shop_by_brand drinkware
Length:2941 Min. :-0.6655 Min. :-0.1399 Min. :-0.2297 Min. :-0.2032
Class :character 1st Qu.:-0.6655 1st Qu.:-0.1399 1st Qu.:-0.2297 1st Qu.:-0.2032
Mode :character Median :-0.1960 Median :-0.1399 Median :-0.2297 Median :-0.2032
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.2734 3rd Qu.:-0.1399 3rd Qu.:-0.2297 3rd Qu.:-0.2032
Max. :12.4797 Max. :36.9390 Max. :30.1695 Max. :19.8776
new clearance accessories campus_collection office
Min. :-0.2391 Min. :-0.2252 Min. :-0.219 Min. :-0.1701 Min. :-0.1728
1st Qu.:-0.2391 1st Qu.:-0.2252 1st Qu.:-0.219 1st Qu.:-0.1701 1st Qu.:-0.1728
Median :-0.2391 Median :-0.2252 Median :-0.219 Median :-0.1701 Median :-0.1728
Mean : 0.0000 Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
3rd Qu.:-0.2391 3rd Qu.:-0.2252 3rd Qu.:-0.219 3rd Qu.:-0.1701 3rd Qu.:-0.1728
Max. :37.2019 Max. :19.1369 Max. :21.277 Max. :37.4975 Max. :24.7254
lifestyle uncategorized_items stationery google writing_instruments
Min. :-0.2135 Min. :-0.1566 Min. :-0.08721 Min. :-0.05737 Min. :-0.1002
1st Qu.:-0.2135 1st Qu.:-0.1566 1st Qu.:-0.08721 1st Qu.:-0.05737 1st Qu.:-0.1002
Median :-0.2135 Median :-0.1566 Median :-0.08721 Median :-0.05737 Median :-0.1002
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
3rd Qu.:-0.2135 3rd Qu.:-0.1566 3rd Qu.:-0.08721 3rd Qu.:-0.05737 3rd Qu.:-0.1002
Max. :20.4745 Max. :36.2012 Max. :46.21115 Max. :47.13470 Max. :40.4114
Reviewing the summary()
output post normalization, the maximum values are concerning. There are some users who purchased items within categories at a very significant rate. Although this is a nice problem to have for the merchandise store (i.e., big purchases are good for the bottom line), it may cause problems when specifying the clustering algorithm.
A couple of approaches could be taken here. For one, the outliers could be retained, though this would likely hamper the k-means algorithm's ability to identify useful clusters. The second option is to drop examples we consider to be outliers. Let's think this through a bit.
More than likely, any example with the number of items purchased above 3 standard deviations beyond the mean should be dropped. The merchandise store is likely meant for business-to-consumer (B2C) sales, rather than business-to-business (B2B) sales. As such, the amount of items purchased during a typical customer transaction will likely be of a volume that is reasonable for consumer purchases (e.g., who needs more than 50 items of stationery?). Such large purchases are likely a B2B transaction, where large volumes of items are being bought. Given the intent of the store to be B2C, then examples exhibiting such large purchase volumes should be dropped.
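To make that threshold concrete, 3 standard deviations above the mean works out to fairly modest counts in the original units; for example, using the skim() values above, apparel is roughly 1.42 + 3 × 2.13 ≈ 8 items. A quick way to compute these cutoffs for every category (a side check, assuming the data_items object from above):

# Raw-count cutoffs implied by "mean + 3 standard deviations" for each category
data_items |>
  summarise(across(-user_id, \(x) mean(x) + 3 * sd(x))) |>
  glimpse()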
Additional information could be used to further verify this assumption. In a real setting, a Customer Relationship Management (CRM) system would likely have information about who these customers are, which could be used to confirm whether a purchase was for a business or an individual. Since obtaining this additional information isn't possible here, dropping these outliers before clustering is the best option. With all that said, here's the code to further explore customers considered to be outliers. head()
is used here to limit the output.
data_items_stnd |>
  filter(if_any(-c(user_id), ~ . > 3)) |>
  head(n = 10)
# A tibble: 10 × 15
user_id apparel bags shop_by_brand drinkware new clearance accessories campus_collection
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 qifwtgogoz 6.38 -0.140 -0.230 -0.203 -0.239 -0.225 -0.219 7.78
2 dootkphyen -0.665 -0.140 4.33 -0.203 2.34 -0.225 -0.219 -0.170
3 qksqqiqfjm -0.665 -0.140 1.29 -0.203 0.191 0.743 0.677 0.124
4 cdykzwveuk -0.665 -0.140 -0.230 10.5 -0.239 -0.225 -0.219 -0.170
5 yigpconufy -0.665 -0.140 -0.230 -0.203 6.65 -0.225 -0.219 -0.170
6 ubblgupnrd 4.97 -0.140 -0.230 5.82 -0.239 -0.225 -0.219 1.30
7 shyjuzbuci -0.665 36.9 -0.230 -0.203 -0.239 -0.225 -0.219 -0.170
8 cbpqtertvd -0.665 -0.140 4.33 -0.203 -0.239 -0.225 -0.219 -0.170
9 djoibydgth 0.743 -0.140 0.530 2.47 1.05 -0.225 -0.219 -0.170
10 zqgnutolrn 3.09 -0.140 -0.230 -0.203 12.7 -0.225 -0.219 -0.170
# ℹ 6 more variables: office <dbl>, lifestyle <dbl>, uncategorized_items <dbl>, stationery <dbl>,
# google <dbl>, writing_instruments <dbl>
Here’s the code to filter out customers considered to be outliers:
data_items_stnd <- data_items_stnd |>
  select(-user_id) |>
  filter(!if_any(everything(), ~ . > 3))
summary(data_items_stnd)
apparel bags shop_by_brand drinkware new
Min. :-0.66549 Min. :-0.13986 Min. :-0.22973 Min. :-0.20324 Min. :-0.23910
1st Qu.:-0.66549 1st Qu.:-0.13986 1st Qu.:-0.22973 1st Qu.:-0.20324 1st Qu.:-0.23910
Median :-0.19602 Median :-0.13986 Median :-0.22973 Median :-0.20324 Median :-0.23910
Mean :-0.09975 Mean :-0.05728 Mean :-0.07586 Mean :-0.08145 Mean :-0.08105
3rd Qu.: 0.27345 3rd Qu.:-0.13986 3rd Qu.:-0.22973 3rd Qu.:-0.20324 3rd Qu.:-0.23910
Max. : 2.62080 Max. : 2.82645 Max. : 2.81020 Max. : 2.47420 Max. : 2.77339
clearance accessories campus_collection office lifestyle
Min. :-0.22516 Min. :-0.21897 Min. :-0.17010 Min. :-0.17281 Min. :-0.21349
1st Qu.:-0.22516 1st Qu.:-0.21897 1st Qu.:-0.17010 1st Qu.:-0.17281 1st Qu.:-0.21349
Median :-0.22516 Median :-0.21897 Median :-0.17010 Median :-0.17281 Median :-0.21349
Mean :-0.07933 Mean :-0.08889 Mean :-0.05533 Mean :-0.09425 Mean :-0.07426
3rd Qu.:-0.22516 3rd Qu.:-0.21897 3rd Qu.:-0.17010 3rd Qu.:-0.17281 3rd Qu.:-0.21349
Max. : 2.67916 Max. : 2.91591 Max. : 2.77268 Max. : 2.93947 Max. : 2.88970
uncategorized_items stationery google writing_instruments
Min. :-0.15659 Min. :-0.08721 Min. :-0.05737 Min. :-0.10021
1st Qu.:-0.15659 1st Qu.:-0.08721 1st Qu.:-0.05737 1st Qu.:-0.10021
Median :-0.15659 Median :-0.08721 Median :-0.05737 Median :-0.10021
Mean :-0.04983 Mean :-0.05096 Mean :-0.03856 Mean :-0.05943
3rd Qu.:-0.15659 3rd Qu.:-0.08721 3rd Qu.:-0.05737 3rd Qu.:-0.10021
Max. : 2.87322 Max. : 2.69069 Max. : 1.71234 Max. : 2.93816
Let’s take a look at the histograms again, just to get a sense if dropping outliers helped.
data_items_stnd |>
  pivot_longer(
    cols = apparel:stationery,
    values_to = "items"
  ) |>
  ggplot(aes(x = items)) +
geom_histogram(binwidth = 1) +
facet_wrap(~name, ncol = 4, scales = "free") +
labs(
title = "Distribution of purchases by item category post wrangling",
y = "",
x = ""
)
Although this helped with the issues caused by outliers, we still have to contend with the fact that some customers just don’t buy certain items. We’ll want to keep this in mind when drawing conclusions from the final clusters.
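One way to quantify that sparsity is to look at the share of customers with zero purchases in each remaining category (a side check, using the unstandardized data_items object from earlier):

# Proportion of customers with zero purchases in each item category
data_items |>
  summarise(across(-user_id, \(x) mean(x == 0))) |>
  glimpse()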
Determine a k-value
Now that the data is in a format acceptable for modeling, we need to explore a value for the number of cluster centers, the k-value. Various methods can be used to determine this value. I’ll rely on three methods here: the elbow method; the average silhouette method; and using the realities imposed by the business case (Lantz 2023). Each will be discussed more in-depth in the following sections.
The first method is the elbow method, where we visually examine an elbow plot of the total within sum of squares based on the number of potential clusters used for the model. The fviz_nbclust()
function from the factoextra
package is useful here. We first pass in the data we intend to use for modeling, then base R’s stats
kmeans()
function. We’re also interested in creating the plot using the within sum of squares method, so we specifiy that using the method
argument.
fviz_nbclust(data_items_stnd, kmeans, method = "wss")
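If you'd rather build the elbow plot by hand instead of relying on fviz_nbclust(), a minimal sketch using kmeans() and ggplot2 would look something like this (it produces the same curve; the k range of 1 to 10 and the nstart value are arbitrary choices on my part):

# Manual elbow plot: total within-cluster sum of squares for k = 1 to 10
tibble(k = 1:10) |>
  mutate(
    tot_withinss = map_dbl(
      k,
      \(x) kmeans(data_items_stnd, centers = x, nstart = 25)$tot.withinss
    )
  ) |>
  ggplot(aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of clusters (k)", y = "Total within-cluster sum of squares")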
Based on visual examination of the plot, six clusters seem to be a good starting point. The average silhouette method is another visualization useful for confirming the number of cluster groups for our cluster modelling.
fviz_nbclust(data_items_stnd, kmeans, method = "silhouette")
As expected, the silhouette method provides additional evidence for six clusters.
Considering the business case is another useful method for determining the k-value. For instance, say our goal is to cluster the data based on the number of campaigns our marketing team has capacity to manage. The number used here is completely arbitrary (i.e., I don’t have a marketing team to confirm capacity). Thus, for the sake of example, let’s say we have a marketing team capable of managing 3 campaigns over the holiday season.
So, based on the information above, let’s examine k-means clustering using 3 and 6 groups.
Specify the model
Now the modeling step. We’ll use base R’s kmeans()
function from the stats
package. Since a clustering model with 3 or 6 clusters is being explored here, purrr
’s map()
function is used to iterate the model specification. map()
returns a list object, where each element is a model output: one for the three-cluster model and one for the six-cluster model. set.seed()
is also used for the reproducibility of the results.
set.seed(20240722)
mdl_clusters <-
  map(c(3, 6), \(x) kmeans(data_items_stnd, centers = x))
Evaluating model performance
To assess model fit, we’ll look to the cluster sizes for both clustering models. purrr
’s map
function makes this easy. Use the following code to return the size
element of kmeans
output:
map(mdl_clusters, "size")
[[1]]
[1] 1642 762 278
[[2]]
[1] 223 296 1284 660 80 139
Identifying imbalanced groups is the priority here. Some imbalance is tolerable, but major imbalances might indicate model fit issues. The six-cluster model looks fairly balanced, with only one group containing a small subset of customers (~80 customers). The three-cluster model has one large group followed by progressively smaller groups. Overall, the balance across the different groups seems acceptable here.
Visualizing the clusters is also useful for model assessment. The factoextra
package is once again helpful. The fviz_cluster()
function from the package can be used to visualize the clusters. The function first takes our model output (mdl_clsuters[[1]]
) and the initial data object (data_item_stnd
) as arguments. Both the three and six cluster model are visualized using the example code below.
fviz_cluster(mdl_clusters[[1]], data = data_items_stnd, pointsize = 3) +
labs(
title = "Three-cluster model of Google Merchandise Store customers",
subtitle = "View of cluster groupings during the U.S. holiday season"
) +
  theme_replace(ggplot_theme)
fviz_cluster(mdl_clusters[[2]], data = data_items_stnd, pointsize = 3) +
labs(
title = "Six-cluster model of Google Merchandise Store customers",
subtitle = "View of cluster groupings during the U.S. holiday season"
) +
  theme_replace(ggplot_theme)
fviz_cluster()
uses dimension reduction methods to allow for plotting of a multi-dimensional dataset into a two-dimensional representation. The output is useful for identifying general clustering patterns within the data, which could provide additional information about the shape of the clusters. For instance, this type of visualization can be used to visually identify any outliers which may be influencing the shape clusters take.
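By default this projection is a principal components analysis when the data has more than two features. A rough manual sketch of the same idea for the three-cluster model (assuming the mdl_clusters and data_items_stnd objects from above):

# Project the standardized data onto its first two principal components
pca_items <- prcomp(data_items_stnd)

tibble(
  pc1 = pca_items$x[, 1],
  pc2 = pca_items$x[, 2],
  cluster = factor(mdl_clusters[[1]]$cluster)
) |>
  ggplot(aes(x = pc1, y = pc2, color = cluster)) +
  geom_point(alpha = 0.6) +
  labs(x = "Principal component 1", y = "Principal component 2")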
Indeed, the plot for both models shows some cluster overlap. This is an indication that the clusters for this data may not be as distinct as we would like. There might be some common products every customer buys, and a few peripheral products that other clusters purchase beyond these common products. These initial results indicate the presence of ‘upsell’ opportunities for the marketing team. You have your core items that most customers purchase, but some clusters seem to purchase items beyond the core products. Thus, some of the marketing campaigns might strategise ways to highlight the upselling of some products.
Draw conclusions about clusters
The next step is to examine the cluster centers and factor loadings. The goal is to derive conclusions about each cluster from this information. Let’s first draw conclusions using our three-cluster model. We’ll look to identify specific audience segments based on item category loadings.
The kmeans()
’s output has an object labelled centers
, which is a matrix of cluster centers. We then inspect these values, where each row (labelled on the left) is a cluster and each column is a feature.
mdl_clusters[[1]]$centers
apparel bags shop_by_brand drinkware new clearance accessories
1 -0.4476288 -0.06097525 -0.06495514 -0.08339455 -0.09966977 -0.2251558 -0.096510610
2 0.7127262 -0.06395056 -0.09907281 -0.07850703 -0.07644876 -0.1933938 -0.103778685
3 -0.2720182 -0.01715296 -0.07663600 -0.07803938 0.01632421 1.0946694 -0.003105749
campus_collection office lifestyle uncategorized_items stationery google
1 -0.05701557 -0.1077973177 -0.09379899 -0.073095026 -0.06268202 -0.04012104
2 -0.09325080 -0.0992918176 -0.05330947 -0.005497098 -0.05926368 -0.04110829
3 0.05854479 -0.0004035005 -0.01628649 -0.033980814 0.04102364 -0.02235330
writing_instruments
1 -0.062586565
2 -0.072300004
3 -0.005490095
mdl_clusters[[1]]$centers |>
  as_tibble() |>
  mutate(
    cluster = 1:3,
    across(where(is.double), \(x) round(x, digits = 4)),
    .before = 1
  ) |>
  gt() |>
  data_color(
    columns = !("cluster"),
    rows = 1:3,
    direction = "row",
    method = "numeric",
    palette = "Blues"
  ) |>
  tab_header(
    title = md(
      "**Three-cluster k-means model of Google Merchandise store customers factor loadings**"
    )
  ) |>
  tab_source_note(
    source_note = md(
      "Source: Google Analytics data for the Google Merchandise Store"
    )
  ) |>
  opt_interactive()
After inspecting the three-cluster k-means model factor loadings, a few groups emerge. Cluster 1 (n = 1,642) seems to be the 'anything but apparel' customers. This customer segment really doesn't purchase any specific item, but when they shop, they're purchasing items other than apparel. Perhaps a campaign could be created to improve customers' familiarity with products other than apparel that are available in the merchandise store.
Cluster 2 (n = 762) are the ‘fashion fanatics’. In fact, it seems this group is mainly purchasing apparel. Maybe our product and marketing teams could consider releasing a new apparel line around the holiday season. A campaign focused on highlighting different apparel pieces could also be explored.
Cluster 3 (n = 278) are ‘discount diggers’. Indeed, this data covers the holiday season, so maybe some customers around this time are trying to find a unique gift, but want to do so on a budget. Perhaps a campaign focused on ‘holiday gift deals’ might appeal to these types of customers.
If more nuance is required for the segmentation discussion, the factor loadings for the six-cluster model can be examined. Again, these results suggest the presence of ‘anything but apparel customers’; ‘fashion fanatics’; and ‘discount diggers’. However, three additional groups emerge from the six cluster model.
mdl_clusters[[2]]$centers |>
  as_tibble() |>
  mutate(
    cluster = 1:6,
    across(where(is.double), \(x) round(x, digits = 4)),
    .before = 1
  ) |>
  gt() |>
  data_color(
    columns = !("cluster"),
    rows = 1:6,
    direction = "row",
    method = "numeric",
    palette = "Blues"
  ) |>
  tab_header(
    title = md(
      "**Six-cluster k-means model of Google Merchandise store customers factor loadings**"
    )
  ) |>
  tab_source_note(
    source_note = md(
      "Source: Google Analytics data for the Google Merchandise Store"
    )
  ) |>
  opt_interactive()
The first additional segment emerging from the six-cluster model includes ‘lifestyle looters’ (n = 223): the customers who purchase products that fit or enhance their lifestyle. Perhaps there’s room for a campaign focused on highlighting how the Google Merchandise Store’s products fit within the lives of its customers: most likely people who work in tech.
The second segment are the ‘brand buyers’ (n = 296). These customers are mostly interested in purchasing branded items. Thus, a campaign highlighting the various branded items that are available might be explored.
The final group to emerge is our ‘accessory enthusiasts’ (n = 139). These are customers most interested in purchasing accessories. Perhaps a focus on accessories could be another campaign our marketing team might look at to create.
Depending on the model reviewed, clustering resulted in the identification of three customer segments at which campaigns could be targeted: 'anything but apparel', 'fashion fanatics', and 'discount diggers'. If an expanded list of segments is required, the six-cluster model provides additional information, including segments like 'lifestyle looters', 'brand buyers', and 'accessory enthusiasts'. Indeed, the segment names are up for debate. I would lean on my marketing team to workshop them some more. Analysts are poor at naming things.
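To actually hand these segments off, cluster labels eventually need to be attached back to individual users. The wrangled data used for modeling dropped user_id, so this sketch assumes a hypothetical data_items_users tibble that retained user_id for the same filtered rows passed to kmeans():

# Hypothetical: data_items_users keeps user_id for the same rows used in modeling
data_user_segments <- data_items_users |>
  mutate(cluster = mdl_clusters[[1]]$cluster) |>
  select(user_id, cluster)

# Number of users in each of the three segments
count(data_user_segments, cluster)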
Wrap up
This post was a tutorial on how to perform clustering using Google Analytics e-commerce data. The k-means algorithm was used to identify clusters within obfuscated analytics data from the Google Merchandise store. Information about the clusters was used to generate various customer segments. The intent was to use these clusters to better inform future marketing campaigns and targeting strategies.
The kmeans()
function from the base R stats
package was used to specify two clustering models. The elbow method, silhouette method, and information about the business case were used to determine the k-values for the clustering models. The fviz_nbclust()
function from the factoextra
package was useful for creating visualizations to further confirm the selected k-values. It was determined both a three and six cluster model would be effective to meet our modelling goal for this data. Lastly, the fviz_cluster()
function from the factoextra
package was used to create visualizations of each model’s clusters.
In truth, this dataset lacked any particularly interesting cluster groups. I was hoping for more cross-product segments that could be used for customer segmentation. Unfortunately, this wasn't the case. Many of the Google Merchandise Store's customers fell within a single item category, likely because of low purchase frequency across item categories, with customers only buying one or two products per purchase. Nonetheless, we were still able to identify various customer segments useful for targeting purposes and provide some additional support for potential marketing campaigns.
So there you have it, another messing with models post. I hope you found something useful. If not, I hope it was somewhat informative and you found a few takeaways.
Until next time, keep messing with models.
References
Citation
@misc{berke2024,
author = {Berke, Collin K},
title = {Messing with Models: K-Means Clustering of {Google}
{Analytics} 4 Data},
date = {2024-09-14},
langid = {en}
}