library(tidyverse)
library(bigrquery)
library(here)
library(glue)
library(skimr)
library(plotly)
library(reactable)
library(arules)
library(arulesViz)
Messing with models: Market basket analysis of online merchandise store data
Background
According to estimates cited by Forbes, the 2024 global e-commerce market is expected to be worth over $6.3 trillion (Snyder and Aditham 2024). Behind this figure is an unfathomable amount of transaction data. If you work in marketing, you have more than likely encountered this type of data, where customer purchases are tracked and each item is logged. How, then, can this data be used to create actionable insights about customers’ purchasing behavior? Market basket analysis is one such tool.
Market basket analysis is used to identify actionable rules about customers’ purchase behavior. As an unsupervised machine learning method, market basket analysis utilizes an algorithm to create association rules from a set of unlabeled data (Lantz 2023). If you’re working with transaction data, this type of analysis is essential to know. It’s an interpretable, useful machine learning technique.
This post explores the use of market basket analysis to identify clear and interesting insights about customer purchase behavior, in the context of an online merchandise store. Specifically, this post uses obfuscated Google Analytics data from the Google Analytics Merchandise Store.
The data used in this post is example data, and it is used for tutorial purposes. Conclusions drawn here are not representative, at least to my knowledge, of real customer purchasing behavior.
Specifically, this post overviews the steps involved when performing market basket analysis using Google Analytics data. The data used in this analysis represents transaction data collected during the 2020 US holiday season (2020-11-01 to 2020-12-31). First, I highlight the use of Google BigQuery and the bigrquery
R package to extract Google Analytics data. Second, I overview the wrangling and exploratory steps involved when performing a market basket analysis. Then, I utilize the arules
package to generate association rules from the data. Finally, the last section discusses rule interpretation, where I aim to identify clear and interesting insights from the generated rules.
Extract Google Analytics data
I’m going to take a moment to describe the data extraction process. If you’re already familiar with how to extract data from Google BigQuery, you might skip ahead to the next section.
Google Analytics provides several interfaces to access analytics data. I’ve found the BigQuery export very useful for this type of work. To extract analytics data in this way, I utilize the bigrquery
package, which provides an interface to submit queries to the service and return data to your R session.
BigQuery is a cloud service which incurs costs based on use, and some setup steps are required to authenticate and use the service. These steps are outside the scope of this post; however, Google and the bigrquery
package have some great resources on how to get started.
The bigrquery
package makes the process of submitting queries to BigQuery pretty straightforward. Here is a list summarizing the steps:
- Set up a variable that contains the BigQuery dataset name.
- Create a variable containing your query string.
- Use the package’s bq_project_query() and bq_table_download() functions to submit the query and return data.
These steps look like the following in code.
Towards the end of this example code, I write a .csv
file using readr
’s write_csv()
function. This way I don’t need to re-submit my query to BigQuery every time the post is built. I’ve found this to be useful in other analysis contexts as well, as it limits the amount of times I need to use the service, and it helps reduce cost.
data_ga <- bq_dataset(
  "bigquery-public-data",
  "ga4_obfuscated_sample_ecommerce"
)
# Check if the dataset exists
bq_dataset_exists(data_ga)
<- "
query select
event_date,
ecommerce.transaction_id,
items.item_name as item_name
from `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`, unnest(items) as items
where _table_suffix between '20201101' and '20201231' and
event_name = 'purchase'
order by event_date, transaction_id
"
data_ga_transactions <- bq_project_query(
  "insert-your-project-name",
  query
) |>
  bq_table_download()
# Write and read the data to limit calls to BigQuery
write_csv(data_ga_transactions, "data_ga_transactions.csv")
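As an aside, if you rebuild a post or report like this often, one pattern I sometimes reach for is to guard the query behind a file-existence check so BigQuery is only queried when no local copy exists. This is just a sketch of that idea using the same objects as above, not something the code above does:

# A hedged caching sketch: only query BigQuery when no local copy of the
# extract exists, otherwise reuse the .csv written earlier
if (!file.exists("data_ga_transactions.csv")) {
  data_ga_transactions <- bq_project_query(
    "insert-your-project-name",
    query
  ) |>
    bq_table_download()

  write_csv(data_ga_transactions, "data_ga_transactions.csv")
}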
Now that I have a .csv
file containing transactions from 2020-11-01 to 2020-12-31, I can use readr
’s read_csv()
function to import data into the session. Again, this helps limit the number of times I need to query the dataset and, as a result, reduces cost.
data_ga_transactions <- read_csv("data_ga_transactions.csv")
Rows: 13113 Columns: 3
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): transaction_id, item_name
dbl (1): event_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before we go about wrangling data, let’s get a sense of its structure. dplyr
’s glimpse()
function can be used to do this.
glimpse(data_ga_transactions)
Rows: 13,113
Columns: 3
$ event_date <dbl> 20201101, 20201101, 20201101, 20201101, 20201101, 20201101, 20201101, 20201…
$ transaction_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ item_name <chr> "Android Iconic Backpack", "Google Cork Card Holder", "Google Sherpa Zip Ho…
At first glance, we see the dataset contains 13,113 logged events (i.e., rows) and three columns: event_date
, transaction_id
, and item_name
. Examining the first few examples, a few issues are immediately apparent. Our data wrangling steps will need to address these issues. For one, we’ll need to do some date parsing. We’ll also need to address the NA
(i.e., missing) values in the transaction_id
column.
Wrangle data
Address missing transaction ids
After reviewing the dataset’s structure, my first question is how many examples are missing a transaction_id
? transaction_id
s are critical here, as they are used to group items into transactions. To answer this question, I used the following code:
data_ga_transactions |>
  mutate(
    has_id = case_when(
      is.na(transaction_id) | transaction_id == "(not set)" ~ FALSE,
      TRUE ~ TRUE
    )
  ) |>
  count(has_id) |>
  mutate(prop = n / sum(n))
# A tibble: 2 × 3
has_id n prop
<lgl> <int> <dbl>
1 FALSE 1498 0.114
2 TRUE 11615 0.886
Out of the 13,113 events, 1,498 (or 11.4%) are missing a transaction_id
. Missing transaction_id
’s can take two forms. First, a missing value can be an NA
value. Second, missing values occur when examples contain the (not set)
character string.
I address missing values by dropping them. Indeed, other approaches are available to handle missing values. Given the data and context you’re working within, you may decide dropping over 11% of examples is not appropriate. As such, the use of nearest neighbor or imputation methods might be explored.
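Before dropping these rows, it can also be worth checking whether the missing ids cluster on particular days. Here is a small diagnostic sketch (not a required step in the analysis that follows) that tallies missing versus present ids by day:

# A small diagnostic sketch: tally missing vs. present transaction ids by day
data_ga_transactions |>
  mutate(missing_id = is.na(transaction_id) | transaction_id == "(not set)") |>
  count(event_date, missing_id) |>
  pivot_wider(names_from = missing_id, values_from = n, values_fill = 0)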
Although the temporal aspects of the data are not relevant for this analysis, for consistency, I’m also going to parse the event_date
column into type date
. lubridate
’s ymd()
function can be used for this task.
Here’s the code to perform the wrangling steps described above:
data_ga_transactions <-
  data_ga_transactions |>
  drop_na(transaction_id) |>
  filter(transaction_id != "(not set)") |>
  mutate(event_date = ymd(event_date)) |>
  arrange(event_date, transaction_id)
We can verify these steps have been applied by once again using dplyr
’s glimpse()
function. Base R’s summary()
function is also useful to verify the data is as expected.
glimpse(data_ga_transactions)
Rows: 11,615
Columns: 3
$ event_date <date> 2020-11-11, 2020-11-11, 2020-11-11, 2020-11-11, 2020-11-11, 2020-11-11, 20…
$ transaction_id <chr> "133395", "133395", "133395", "133395", "133395", "133395", "133395", "1342…
$ item_name <chr> "Google Recycled Writing Set", "Google Emoji Sticker Pack", "Android Iconic…
summary(data_ga_transactions)
event_date transaction_id item_name
Min. :2020-11-11 Length:11615 Length:11615
1st Qu.:2020-11-24 Class :character Class :character
Median :2020-12-04 Mode :character Mode :character
Mean :2020-12-04
3rd Qu.:2020-12-13
Max. :2020-12-31
Looks good from a structural standpoint. However, I did notice that our wrangling procedure resulted in the removal of events occurring before 2020-11-11. This may indicate some data issues prior to this date, which could be worth further exploration. Given that I don’t have the ability to speak with the developers of the Google Merchandise Store to explore a potential measurement issue, I’m just going to move forward with the analysis. Indeed, our initial goal was to identify association rules during the holiday shopping season, so our data is still within the date range intended for our analysis.
Explore data
Data exploration is the next step. How many transactions are represented in the data? Let’s use summarise()
with n_distinct()
on the transaction_id
to get a unique count of the total number of transactions. Our distinct count reveals a total of 3,562 transactions are present within the data. We’ll use each transaction to generate association rules here in a little bit.
# Number of transactions
data_ga_transactions |>
  summarise(transactions = n_distinct(transaction_id))
# A tibble: 1 × 1
transactions
<int>
1 3562
skim()
from the skimr
package can now be used to calculate summary statistics for our data. Given the type of data we’re working with, there’s not much to report from this output. It does confirm the transaction count we calculated above, and we can also see that 371 unique items were purchased during this 51-day period.
skim(data_ga_transactions)
Name | data_ga_transactions |
Number of rows | 11615 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 2 |
Date | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
transaction_id | 0 | 1 | 2 | 6 | 0 | 3562 | 0 |
item_name | 0 | 1 | 8 | 41 | 0 | 371 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
event_date | 0 | 1 | 2020-11-11 | 2020-12-31 | 2020-12-04 | 51 |
Let’s get a sense of what the most purchased items were during this time. Here we’ll use dplyr
’s count()
function.
We can view the top ten items purchased during this period with the use of a simple bar chart. I’m using plotly
for its interactivity. You could use any other plotting package. I’ve also created a table embedded in a tabset using reactable
’s reactable()
function. This is useful if you want to explore all the items within the dataset.
data_item_counts <-
  data_ga_transactions |>
  count(item_name, name = "purchased", sort = TRUE)
plot_ly(
  data = slice_head(data_item_counts, n = 10),
  x = ~purchased,
  y = ~item_name,
  type = "bar",
  orientation = "h"
) |>
  layout(
    title = list(
      text = "<b>Top 10 Google Merchandise Store items purchased during the 2020 holiday season (2020-11-11 to 2020-12-31)</b>",
      font = list(size = 24, family = "Arial"),
      x = 0.1
    ),
    yaxis = list(title = "", categoryorder = "total ascending"),
    xaxis = list(title = "Number of items purchased"),
    margin = list(t = 105)
  )
options(reactable.theme = reactableTheme(
  color = "#000000",
  backgroundColor = "#ffffff"
))

reactable(
  data_ga_transactions |> count(item_name, sort = TRUE),
  searchable = TRUE,
  columns = list(
    item_name = colDef(name = "Item name"),
    n = colDef(name = "Purchases")
  )
)
Create a sparse matrix
We’re now at the point where we can start performing our market basket analysis. To perform the analysis, I’ll be using the arules
R package (Hahsler, Gruen, and Hornik 2005). Specifically, I’ll be using the package’s apriori()
function. This function will use the apriori algorithm to create the association rules for us (Lantz 2023).
Before we can go about creating association rules, though, we need to transform the data_ga_transactions
tibble into an object that can be used by arules
’ functions. Indeed, the arules
package works well when writing code in a tidyverse style, especially if you’re used to working with tibbles. In fact, the transactions()
function accepts a tibble as an input and outputs the data structure needed by the package’s functions. You do, however, need to be aware of, and specify, whether your data is in ‘wide’ or ‘long’ format. Since this post’s data is initially in long format, we specify "long"
for the format
argument in the transactions()
function.
ga_transactions <- data_ga_transactions |>
  select(-event_date) |>
  mutate(across(transaction_id:item_name, factor)) |>
  transactions(format = "long")
ga_transactions
transactions in sparse format with
3562 transactions (rows) and
371 items (columns)
Now we have a ga_transactions
object we can use with arules
’ functions. Specifically, this transactions()
function is creating a sparse matrix, where the matrix has 3,562 rows (the number of transactions) and 371 columns (the number of unique items). If we do some simple multiplication, we can calculate that the matrix contains over 1.3 million spaces.
3562 * 371
[1] 1321502
Explore the sparse matrix
Each space within the matrix either contains a value (i.e., an item was purchased during a transaction) or it does not (i.e., an item was not purchased during a transaction). It’s essentially a 1 or 0 representing whether an item was purchased within a transaction. In fact, it’s called a sparse matrix because many of the matrix’s spaces will not contain a value.
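If you’d like to see this structure directly, one quick option (a small sketch, not a required step) is to coerce a few transactions into a dense matrix and count how many of the 371 item columns are switched on for each basket:

# A small sketch: coerce the first five transactions to a dense matrix
# (TRUE/1 = item present in the basket) and count the items in each basket
dense_peek <- as(ga_transactions[1:5], "matrix")
dim(dense_peek)
rowSums(dense_peek)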
We can now pass the ga_transactions
object into summary()
. This prints some summary information about the transactions to the console.
summary(ga_transactions)
transactions as itemMatrix in sparse format with
3562 rows (elements/itemsets/transactions) and
371 columns (items) and a density of 0.007850158
most frequent items:
Super G Unisex Joggers Google Crewneck Sweatshirt Navy Google Camp Mug Ivory
200 180 177
Google Zip Hoodie F/C Google Heathered Pom Beanie (Other)
159 148 9510
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1266 824 536 295 211 134 100 60 37 23 21 18 10 5 4 3 4 2 1 1
21 22 23 24 26 27 29
1 1 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 2.912 4.000 29.000
includes extended item information - examples:
labels
1 #IamRemarkable Journal
2 #IamRemarkable Ladies T-Shirt
3 #IamRemarkable Lapel Pin
includes extended transaction information - examples:
transactionID
1 100659
2 100953
3 101047
In addition, the summary contains information about various items and transactions. Some of this info we’ve already identified. Other information is new. The first section of the output confirms what we found in our exploratory analysis, the number of transactions and unique items. This section does provide one additional useful metric, density
.
The density
metric represents the proportion of the matrix’s cells that contain a value (Lantz 2023). In other words, it gives us a sense of the proportion of non-zero cells within the data. For this dataset, less than 1% (roughly 0.8%) of the matrix’s cells contain a value. Although this is a small amount, it’s not surprising when you consider we’re exploring transactions for an online merchandise store. If we were exploring rules for, say, a grocery store, product purchases would probably be more patterned, and some common items (e.g., milk) would be purchased more frequently.
With these values in mind, we can also calculate the total number of items purchased during the period and the average number of items within a transaction. Here’s the code to perform these calculations using the summary information.
# Over 1.3M different positions within the matrix
mat_pos <- 3562 * 371

# Total number of items purchased during the period (ignoring duplicates)
items_purchased <- mat_pos * .007850158

# Average number of items in each transaction
items_purchased / 3562
[1] 2.912409
The next section of the output details the most frequent items purchased. If you’re comparing these numbers to what’s above, you may notice the most frequent item values don’t match. The sparse matrix does not store information about how many items were purchased during each transaction. Rather, it stores binary values. 1 = at least one of the items was purchased during the transaction and 0 = no items of this type were purchased during the transaction. Thus, it’s measuring the presence of an item in a transaction rather than the frequency of items. Nevertheless, the item rankings closely match what we found above.
The output then provides a distribution of transaction lengths. The majority of transactions (1,266) only contained one item. We can also see there was one transaction that contained 29 different items. Following the distribution output, some additional summary statistics for the transactions are provided. Here we can see transactions, on average, had a total of 2.912 items.
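If you want these transaction-length numbers without re-reading the printed summary, arules’ size() generic returns the number of items in each transaction. A small sketch:

# A small sketch: size() returns the item count per transaction, so the length
# distribution above can be recomputed directly
summary(size(ga_transactions))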
The final two sections of the output don’t really reveal any additional information about the transactions. Rather, they provide some example item labels and transaction ids to verify the data was imported correctly.
Although I mention these last two sections are not particularly useful, they were essential in helping me identify that the transactions()
function had a format
argument. When first attempting to transform this data, I noticed some unexpected values that led me to learn about this argument. Lesson learned: check your data.
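As an aside, if your data had instead arrived as one set of item names per transaction (rather than the long tibble used here), a common arules idiom is to split the items by transaction id and coerce the resulting list. A hedged sketch, where ga_transactions_alt is a hypothetical name used only for illustration:

# A hedged alternative sketch: build the transactions object from a list of
# item vectors, one per transaction id (ga_transactions_alt is hypothetical)
ga_transactions_alt <- as(
  split(data_ga_transactions$item_name, data_ga_transactions$transaction_id),
  "transactions"
)
summary(ga_transactions_alt)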
Inspect transactions further
Beyond summary information, the arules
package provides functions to further explore transactions. For instance, say we wanted to explore the first 10 items in our transactions. To do this, we convert the matrix into a long format data.frame
. This data.frame
will contain two columns: TID
and item
. arules
’ toLongFormat()
function does this transformation for us. Here we need to set the function’s decode
argument to TRUE
, so as to return the item’s actual name in the printed output.
When we run the following code, the first ten rows of our converted matrix will be printed to the console as a two-column data frame. One column contains the transaction ID, and the second contains the item name.
head(toLongFormat(ga_transactions, decode = TRUE), n = 10)
TID item
1 1 Super G Unisex Joggers
2 2 Google Beekeepers Tee Mint
3 2 Google Bellevue Campus Sticker
4 2 Google Black Cork Journal
5 2 Google Blue YoYo
6 2 Google Boulder Campus Sticker
7 2 Google Cambridge Campus Sticker
8 2 Google Camp Mug Gray
9 2 Google Chicago Campus Unisex Tee
10 2 Google Dino Game Tee
If we’re interested in exploring complete sets of transactions, arules
’ inspect
generic function is useful. Say we want to inspect the item sets for the first five transactions. We can do this by using the following code.
inspect(ga_transactions[1:5])
Warning in seq.default(length = NCOL(transactionInfo)): partial argument match of 'length' to
'length.out'
items transactionID
[1] {Super G Unisex Joggers} 100659
[2] {Google Beekeepers Tee Mint,
Google Bellevue Campus Sticker,
Google Black Cork Journal,
Google Blue YoYo,
Google Boulder Campus Sticker,
Google Cambridge Campus Sticker,
Google Camp Mug Gray,
Google Chicago Campus Unisex Tee,
Google Dino Game Tee,
Google Emoji Magnet Set,
Google Felt Refillable Journal,
Google Kirkland Campus Sticker,
Google LA Campus Sticker,
Google Mural Notebook,
Google NYC Campus Ladies Tee,
Google NYC Campus Sticker,
Google Packable Bag Black,
Google Red YoYo,
Google Seaport Tote,
Google Seattle Campus Sticker} 100953
[3] {Google Crewneck Sweatshirt Green,
Google SF Campus Zip Hoodie} 101047
[4] {Google Cork Journal,
Google Cup Cap Tumbler Grey,
Google Leather Strap Hat Blue,
Google Pen Bright Blue,
Google Pen Citron,
Google Perk Thermal Tumbler,
Google Speckled Beanie Grey,
Google Speckled Beanie Navy,
YouTube Leather Strap Hat Black} 101412
[5] {Google Campus Bike Tote Navy,
Google Red Speckled Tee,
Supernatural Paper Lunch Sack,
White Google Shoreline Bottle} 101669
The itemFrequency()
generic function is also useful for making quick calculations on individual items. Say we want to calculate the proportion of transactions that included two items of interest, the Super G Unisex Joggers and the Google Camp Mug Gray. We can do this with the following code:
itemFrequency(
  ga_transactions[, c("Super G Unisex Joggers", "Google Camp Mug Gray")]
)
Super G Unisex Joggers Google Camp Mug Gray
0.05614823 0.02751263
We can also get the count values by passing absolute
to itemFrequency()
’s type
argument. This matches what we saw in the summary output of the ga_transactions object. However, this can be useful for finding the frequency of items that are not among the top items purchased.
itemFrequency(ga_transactions[, c("Super G Unisex Joggers", "Google Camp Mug Gray")], type = "absolute")
Super G Unisex Joggers Google Camp Mug Gray
200 98
Perhaps we would also like to visualize these results. arules’
itemFrequencyPlot()
function allows us to do this. Take for example the following code, which outputs a plot of the top 20 items with the greatest support.
itemFrequencyPlot(ga_transactions, topN = 20)
In addition to these plots, we can also output some visualizations of the sparse matrix. Since the matrix contains over 1.3 million spaces, we’re limited in the output we can generate. To output a visualization of the sparse matrix, we can use arules
’ image()
generic function. The first following code chunk outputs a visualization of the first 100 transactions, while the second is a random sample of 100 transactions.
image(ga_transactions[1:100])
image(sample(ga_transactions, 100))
Such visualizations of the matrix may not lead to immediate inferences about our data. However, they are useful for two reasons. For one, they’re useful for identifying data issues. For instance, if we see a line of boxes filled in with a grey pixel for every transaction, this may indicate an item has been mislabelled in our transactions. Second, these visualizations might provide evidence of unique patterns worth further exploration, perhaps seasonal items. Again, on its face, this kind of plot may not make clear, insightful conclusions immediately apparent, but it can point us in a direction for further exploration.
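As a follow-up to the first point, if a column in one of these images looks suspiciously dense, a quick way to find the item responsible (again, a sketch rather than a required step) is to rank items by support:

# A small sketch: rank items by support to spot anything appearing in an
# implausibly large share of transactions
head(sort(itemFrequency(ga_transactions), decreasing = TRUE), n = 5)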
Train the model
Brief overview of association rules
Although others have gone more in-depth on this topic (Lantz 2023), it’s worth taking a moment to discuss what an association rule is before we use a model to create them. Let’s say we have the following five transactions:
{t-shirt, sweater, hat}
{t-shirt, hat}
{socks, beanie}
{t-shirt, sweater, hat, socks, beanie, pen}
{socks, pen}
Each line represents an individual transaction. What we aim to do is use an algorithm to identify rules that tell us, if someone buys one item, which other items they will also likely buy. For example, does buying a t-shirt lead someone to also buy a hat? If so, given our data, can we quantify how confident we are in this rule? Not everyone buying a t-shirt will buy a hat.
When we create rules, they’ll be made up of a left-hand (i.e., an antecedent) and right-hand side (i.e., a consequent). They’ll look something like this:
{t-shirt} => {hat}
# or
{t-shirt, hat} => {sweater}
Rule interpretation is pretty straightforward. In simple terms, the first rule states that when a customer purchases a t-shirt, they’ll also likely purchase a hat. For the second, if a customer purchases a t-shirt and a hat, then they’ll also likely purchase a sweater. It’s important to recognize that the left-hand side can be one or many items. Now that we understand the general makeup of a rule, let’s explore some metrics to quantify each.
Market basket analysis metrics
Before overviewing the model specification steps, we need to understand the key metrics calculated with a market basket analysis. Specifically, we need a little background on the following:
- Support
- Confidence
- Lift
In this section, I’ll spend a little time defining and describing each. However, others provide a more thorough treatment of each metric (see Lantz 2023, chap. 8; Kadlaskar 2021; Li 2017), and I suggest checking out each for more detailed information.
Support
Support is calculated using the following formula:
\[ support(x) = \frac{count(x)}{N} \]
count(x)
is the number of transactions containing a specific set of items. N
is the total number of transactions within our data.
Simply put, support tells us how frequently an item (or set of items) occurs across the transactions in our data.
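As a quick worked example using the five toy transactions above: the t-shirt appears in three of the five baskets, so

\[ support(\{\text{t-shirt}\}) = \frac{3}{5} = 0.6 \]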
Confidence
Alongside support, confidence is another metric provided by the analysis. It’s built upon support and calculated using the following formula:
\[ confidence(X\rightarrow Y) = \frac{support(X, Y)}{support(X)} \]
Confidence is the proportion of transactions containing item (or itemset) X that also contain item (or itemset) Y. In other words, it tells us how often the presence of X in a transaction results in the presence of Y.
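Continuing the toy example, all three transactions that contain a t-shirt also contain a hat, so

\[ confidence(\{\text{t-shirt}\} \rightarrow \{\text{hat}\}) = \frac{support(\{\text{t-shirt, hat}\})}{support(\{\text{t-shirt}\})} = \frac{3/5}{3/5} = 1.0 \]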
Both confidence and support are important; both are parameters we’ll set when specifying our model. These two parameters are the dials we adjust to narrow or expand our rule set generated from the model.
Lift
Lift’s importance will become more evident once we specify our model and look at some rule sets. But let’s discuss its definition. It’s calculated using the following formula:
\[ lift(X \rightarrow Y) = \frac{confidence(X \rightarrow Y)}{support(Y)} \]
Put into simple terms, lift tells us how much more likely an item or itemset is to be purchased alongside another item or itemset, relative to its typical rate of purchase. In even simpler terms, the higher the lift, the stronger the evidence of a true connection between items or item sets.
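To finish the toy example, the hat appears in three of the five transactions, so

\[ lift(\{\text{t-shirt}\} \rightarrow \{\text{hat}\}) = \frac{1.0}{3/5} \approx 1.67 \]

In other words, a hat is about 1.67 times more likely to be purchased when a t-shirt is in the basket than its baseline purchase rate alone would suggest.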
Use the apriori()
function
Now let’s do some modeling. Here we’ll use arules
’ apriori()
function to specify our model. This step requires a little trial and error, as there’s no exact method for picking support
and confidence
values. As such, let’s just start with apriori
’s defaults for the support
, confidence
, and minlen
parameters. The code looks like the following:
# Start with the default support, confidence, minlen
# support: 0.1
# confidence: 0.8
# minlen: 2
mdl_ga_rules <- apriori(
  ga_transactions,
  parameter = list(support = 0.1, confidence = 0.8, minlen = 2)
)
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.8 0.1 1 none FALSE TRUE 5 0.1 2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 356
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[371 item(s), 3562 transaction(s)] done [0.00s].
sorting and recoding items ... [0 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
mdl_ga_rules
set of 0 rules
Zero rules were created. This was due to the default support
and confidence
parameters being too restrictive. We will need to modify these values so the algorithm is able to identify rules from the data. However, we also need to be aware that loosening these parameters could increase the size of our rule set, and a rule set that is too large can be unreasonable to explore. This is where the trial and error comes into play.
So, then, are there any means for determining reasonable starting points? Yes, we just need to consider the business case. Let’s start with support, a parameter measuring how frequently an item or item set occurs within the transactions. A good starting point is to think about how many times a typical item might be purchased throughout the measured period (Lantz 2023).
Since this is an online merchandise store, my expectation for how often a typical item or item set is purchased is quite low. I’d expect a typical item to be purchased at least once a week, that is, a total of 8 times during the period. A reasonable starting point for support, then, would be:
# Determining a reasonable support parameter
8 / 3562
[1] 0.002245929
When it comes to confidence, it is more about picking a starting point and adjusting from there. For our specific case, I’ll start at .1.
minlen
represents the minimum length a rule (including both the right- and left-hand sides) needs to be before it’s considered for inclusion by the algorithm. It is the last parameter to be set. Our exploratory analysis showed most transactions contained only a few items, so I believe setting minlen = 2
is sufficient given our data.
Here’s the updated code for our model with the adjusted parameters:
# Use parameters we think are reasonable
# support: 0.002
# confidence: 0.1
# minlen: 2
mdl_ga_rules <- apriori(
  ga_transactions,
  parameter = list(support = 0.002, confidence = 0.1, minlen = 2)
)
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.1 0.1 1 none FALSE TRUE 5 0.002 2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 7
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[371 item(s), 3562 transaction(s)] done [0.00s].
sorting and recoding items ... [298 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [277 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
mdl_ga_rules
set of 277 rules
With more reasonable parameters in place, the model identified 277 rules. Are 277 rules too many or too few? You’ll have to decide. For this specific case, 277 rules seems reasonable.
Evaluate model performance
We now need to assess model performance. To obtain information useful for evaluating model performance, we pass the mdl_ga_rules
object to summary()
. Printed to the console is summary information about our rule set.
summary(mdl_ga_rules)
set of 277 rules
rule length distribution (lhs + rhs):sizes
2 3
226 51
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 2.000 2.184 2.000 3.000
summary of quality measures:
support confidence coverage lift count
Min. :0.002246 Min. :0.1017 Min. :0.002246 Min. : 2.016 Min. : 8.00
1st Qu.:0.002246 1st Qu.:0.1574 1st Qu.:0.006738 1st Qu.: 9.054 1st Qu.: 8.00
Median :0.002807 Median :0.2619 Median :0.012072 Median : 17.208 Median :10.00
Mean :0.003169 Mean :0.3466 Mean :0.013363 Mean : 30.519 Mean :11.29
3rd Qu.:0.003369 3rd Qu.:0.4545 3rd Qu.:0.019371 3rd Qu.: 43.975 3rd Qu.:12.00
Max. :0.008703 Max. :1.0000 Max. :0.049691 Max. :168.615 Max. :31.00
mining info:
data ntransactions support confidence
ga_transactions 3562 0.002 0.1
call
apriori(data = ga_transactions, parameter = list(support = 0.002, confidence = 0.1, minlen = 2))
This output contains several sections, each providing detailed information about the rule set generated by the algorithm. The first portion of the summary is pretty straightforward. Here, again, we confirm the algorithm identified 277 rules. In addition, similar to the summary of our transactions object above, we get a distribution and summary of rule lengths. There’s not much to note here, other than the algorithm identified rules of length 2 (226 rules) and 3 (51 rules), with a mean rule length of 2.18.
Let’s start by focusing our attention on the summary of quality measures
portion of the output. This section gives us a sense of how well the model performed with our rule sets. It contains summary information from the three metrics we defined above, along with some summaries of additional metrics.
Let’s move from the simplest to the more complex. First, we have count
. count
is simply the numerator for the support
calculation. Looking at the beginning of this part of the output, we can see the support
and confidence
metrics. The important thing to note here is whether any values sit close to the parameters we set when specifying the model (.002 or .1). If values cluster near the parameter values we set, this may be an indication we were too restrictive (Lantz 2023). Here, because we have a range of values, I’m not too concerned with the parameter values used.
coverage
is the next metric to inspect. According to Lantz (2023), coverage
has a useful real-world interpretation. In short, coverage
represents the chance that a rule applies to any given transaction. Using the summary estimates here, the least applicable rule covers less than 1% of transactions (roughly 0.2%, to be more exact). The maximum suggests one rule covers almost 5% of transactions.
The last metric to review from this output is lift
. Lift was defined above, so I won’t spend much time rehashing it here. Rather, let’s interpret it and put it into accessible terms.
Lift is an indicator helpful for identifying rules with a true connection between items. That is, a lift greater than one implies the items appear together more often than would be expected by chance (Lantz 2023). The minimum value in this section indicates that, even for the weakest rule, the right-hand item is about twice as likely to be purchased when the left-hand item set is present. Considering the maximum value, one rule in our rule set shows items appearing together more than 168 times more often than expected by chance.
Although we can use these metrics to get a sense of how well a rule represents connections between item purchases, we also need to account for the number of transactions for each rule, as rules with limited transactions can bias the lift measurement. Nevertheless, it’s a metric, as we’ll show later, that’s helpful for ranking rules for additional assessment.
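One practical way to guard against this, sketched below rather than prescribed, is to filter the rule set on its quality measures before ranking by lift. The mdl_ga_rules_stable name and the count threshold of 15 are assumptions for illustration only:

# A hedged sketch: keep only rules backed by a reasonable number of
# transactions before ranking by lift (threshold chosen arbitrarily)
mdl_ga_rules_stable <- subset(mdl_ga_rules, count >= 15 & lift > 1)
mdl_ga_rules_stable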
The final section of the output, mining info
, is a copy of what we already specified when we created the model. It simply reports the number of transactions and what we set for our support and confidence parameters. In addition, it provides the full model call. There’s not much to report from this section, other than it’s good to confirm that what was submitted was what we expected.
Inspect the rules
We can now begin exploring our rules for insights. arules
’ inspect()
generic function is used for rule exploration. Perhaps we just want to view the top 20 rules by lift. To do this, we can sort()
our output, use some vector subsetting, and wrap this code within inspect()
.
inspect(sort(mdl_ga_rules, by = "lift")[1:20])
Warning in seq.default(length = NCOL(quality)): partial argument match of 'length' to 'length.out'
lhs rhs support confidence coverage lift count
[1] {Google NYC Campus Mug,
Google Seattle Campus Mug} => {Google Kirkland Campus Mug} 0.002245929 0.6153846 0.003649635 168.61538 8
[2] {Google Pen Bright Blue,
Google Pen Citron} => {Google Pen Grass Green} 0.003649635 0.8666667 0.004211117 147.00317 13
[3] {Google Confetti Slim Task Pad,
Google Crew Combed Cotton Sock} => {Google See-No Hear-No Set} 0.002245929 1.0000000 0.002245929 127.21429 8
[4] {Google Crew Combed Cotton Sock,
Google See-No Hear-No Set} => {Google Confetti Slim Task Pad} 0.002245929 0.8888889 0.002526670 126.64889 8
[5] {Google Pen Bright Blue,
Google Pen Neon Coral} => {Google Pen Grass Green} 0.002526670 0.6428571 0.003930376 109.04082 9
[6] {Google Pen Grass Green,
Google Pen Neon Coral} => {Google Pen Bright Blue} 0.002526670 1.0000000 0.002526670 107.93939 9
[7] {Google Pen Citron,
Google Pen Grass Green} => {Google Pen Bright Blue} 0.003649635 1.0000000 0.003649635 107.93939 13
[8] {Google Clear Framed Blue Shades,
Google Clear Framed Gray Shades} => {Google Clear Framed Yellow Shades} 0.002807412 0.9090909 0.003088153 104.45748 10
[9] {Google Pen Bright Blue,
Google Pen White} => {Google Pen Grass Green} 0.002245929 0.6153846 0.003649635 104.38095 8
[10] {Google Pen Bright Blue,
Google Pen Lilac} => {Google Pen Neon Coral} 0.002245929 0.7272727 0.003088153 103.62182 8
[11] {Google Boulder Campus Mug,
Google NYC Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.8888889 0.002526670 98.94444 8
[12] {Google Austin Campus Mug,
Google Seattle Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.8888889 0.002526670 98.94444 8
[13] {Google Pen Grass Green} => {Google Pen Bright Blue} 0.004772600 0.8095238 0.005895564 87.37951 17
[14] {Google Pen Bright Blue} => {Google Pen Grass Green} 0.004772600 0.5151515 0.009264458 87.37951 17
[15] {Google Pen Grass Green,
Google Pen White} => {Google Pen Bright Blue} 0.002245929 0.8000000 0.002807412 86.35152 8
[16] {Google NYC Campus Mug,
Google Seattle Campus Mug} => {Google Cambridge Campus Mug} 0.002807412 0.7692308 0.003649635 85.62500 10
[17] {Google Pen Bright Blue,
Google Pen Neon Coral} => {Google Pen Lilac} 0.002245929 0.5714286 0.003930376 84.80952 8
[18] {Android Iconic Crewneck Sweatshirt,
Android SM S/F18 Sticker Sheet} => {Google Land & Sea Journal Set} 0.002807412 1.0000000 0.002807412 82.83721 10
[19] {Google NYC Campus Mug,
Google PNW Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.7272727 0.003088153 80.95455 8
[20] {Google Kirkland Campus Mug,
Google NYC Campus Mug} => {Google Seattle Campus Mug} 0.002245929 1.0000000 0.002245929 79.15556 8
If you’re more comfortable working in a tidyverse style, you can obtain the same result by doing the following:
mdl_ga_rules |>
  head(20, by = "lift") |>
  as("data.frame") |>
  tibble()
# A tibble: 20 × 6
rules support confidence coverage lift count
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 {Google NYC Campus Mug,Google Seattle Campus Mug} => {Go… 0.00225 0.615 0.00365 169. 8
2 {Google Pen Bright Blue,Google Pen Citron} => {Google Pe… 0.00365 0.867 0.00421 147. 13
3 {Google Confetti Slim Task Pad,Google Crew Combed Cotton… 0.00225 1 0.00225 127. 8
4 {Google Crew Combed Cotton Sock,Google See-No Hear-No Se… 0.00225 0.889 0.00253 127. 8
5 {Google Pen Bright Blue,Google Pen Neon Coral} => {Googl… 0.00253 0.643 0.00393 109. 9
6 {Google Pen Grass Green,Google Pen Neon Coral} => {Googl… 0.00253 1 0.00253 108. 9
7 {Google Pen Citron,Google Pen Grass Green} => {Google Pe… 0.00365 1 0.00365 108. 13
8 {Google Clear Framed Blue Shades,Google Clear Framed Gra… 0.00281 0.909 0.00309 104. 10
9 {Google Pen Bright Blue,Google Pen White} => {Google Pen… 0.00225 0.615 0.00365 104. 8
10 {Google Pen Bright Blue,Google Pen Lilac} => {Google Pen… 0.00225 0.727 0.00309 104. 8
11 {Google Boulder Campus Mug,Google NYC Campus Mug} => {Go… 0.00225 0.889 0.00253 98.9 8
12 {Google Austin Campus Mug,Google Seattle Campus Mug} => … 0.00225 0.889 0.00253 98.9 8
13 {Google Pen Grass Green} => {Google Pen Bright Blue} 0.00477 0.810 0.00590 87.4 17
14 {Google Pen Bright Blue} => {Google Pen Grass Green} 0.00477 0.515 0.00926 87.4 17
15 {Google Pen Grass Green,Google Pen White} => {Google Pen… 0.00225 0.8 0.00281 86.4 8
16 {Google NYC Campus Mug,Google Seattle Campus Mug} => {Go… 0.00281 0.769 0.00365 85.6 10
17 {Google Pen Bright Blue,Google Pen Neon Coral} => {Googl… 0.00225 0.571 0.00393 84.8 8
18 {Android Iconic Crewneck Sweatshirt,Android SM S/F18 Sti… 0.00281 1 0.00281 82.8 10
19 {Google NYC Campus Mug,Google PNW Campus Mug} => {Google… 0.00225 0.727 0.00309 81.0 8
20 {Google Kirkland Campus Mug,Google NYC Campus Mug} => {G… 0.00225 1 0.00225 79.2 8
The code should be pretty self-explanatory. All we’re doing is using head()
to return the first 20 rules by the lift metric. Then, we convert that output to a data.frame
using base R’s as()
function. The tibble()
function from the tibble package transforms the output into a tibble for us.
Interpret rules
Let’s use our first rule as an example, where we’ll seek to interpret it.
first_rule <- mdl_ga_rules |>
  head(1, by = "lift") |>
  as("data.frame") |>
  tibble()

first_rule$rules
[1] "{Google NYC Campus Mug,Google Seattle Campus Mug} => {Google Kirkland Campus Mug}"
Written out, the rule means: if someone buys the Google NYC Campus Mug and the Google Seattle Campus Mug, then they’ll also likely purchase the Google Kirkland Campus Mug. Support is 0.00225, which indicates this rule applies to roughly 0.2% of transactions. The confidence of about 0.62 tells us that when these two mugs are purchased together, the third mug is also purchased in around 62% of those transactions. Moreover, the lift metric indicates the presence of a strong rule: customers who buy the first two mugs are about 169 times more likely to purchase the third mug than its baseline purchase rate would suggest.
Why might this be? Perhaps it’s customers who, by purchasing the first two mugs, are primed to just go ahead and buy the third to finish the set. This might be a great cross-selling opportunity. Maybe we present this rule to our developers and suggest a store feature that encourages customers–when they buy a certain set of items–to complete sets of items within their purchases. Maybe the marketing team could think up some type of pricing scheme along with this feature to further encourage the additional purchase to complete the set.
Despite the strength of this rule, we also need to consider it alongside the count
metric. If you recall, count
represents the number of transactions in which this rule’s item set appears. Eight transactions might be too few, and the lift
metric may be biased here as a result. We may want to include additional transaction data to further explore this rule or simply be aware this limitation exists when making decisions.
Subset rules
If you recall, earlier in our analysis the Google Camp Mug was identified as being a frequently purchased item. Say our marketing team wants to develop a campaign around this one product and would like to review all the rules associated with it. arules
’ subset()
generic function in conjunction with some infix operators (e.g., %in%
) and inspect()
is useful to complete this task. Here is what the code looks like to return these rules:
inspect(subset(mdl_ga_rules, items %in% "Google Camp Mug Ivory"))
lhs rhs support confidence coverage
[1] {Google Flat Front Bag Grey} => {Google Camp Mug Ivory} 0.005053341 0.2571429 0.01965188
[2] {Google Camp Mug Ivory} => {Google Flat Front Bag Grey} 0.005053341 0.1016949 0.04969118
[3] {Google Unisex Eco Tee Black} => {Google Camp Mug Ivory} 0.002245929 0.1311475 0.01712521
[4] {Google Large Tote White} => {Google Camp Mug Ivory} 0.002807412 0.1851852 0.01516002
[5] {Google Magnet} => {Google Camp Mug Ivory} 0.002807412 0.1562500 0.01796743
[6] {Google Camp Mug Gray} => {Google Camp Mug Ivory} 0.005053341 0.1836735 0.02751263
[7] {Google Camp Mug Ivory} => {Google Camp Mug Gray} 0.005053341 0.1016949 0.04969118
lift count
[1] 5.174818 18
[2] 5.174818 18
[3] 2.639252 8
[4] 3.726721 10
[5] 3.144421 10
[6] 3.696299 18
[7] 3.696299 18
Say for example the marketing team finds these rules are not enough to build a campaign around, so they request all rules associated with any mug. The %pin%
infix operator can be used for partial matching.
inspect(subset(mdl_ga_rules, items %pin% "Mug"))
Warning in seq.default(length = NCOL(quality)): partial argument match of 'length' to 'length.out'
lhs rhs support confidence coverage lift count
[1] {Google Austin Campus Tote} => {Google Austin Campus Mug} 0.002245929 0.6153846 0.003649635 36.533333 8
[2] {Google Austin Campus Mug} => {Google Austin Campus Tote} 0.002245929 0.1333333 0.016844469 36.533333 8
[3] {Google Kirkland Campus Mug} => {Google Seattle Campus Mug} 0.003368894 0.9230769 0.003649635 73.066667 12
[4] {Google Seattle Campus Mug} => {Google Kirkland Campus Mug} 0.003368894 0.2666667 0.012633352 73.066667 12
[5] {Google Kirkland Campus Mug} => {Google NYC Campus Mug} 0.002245929 0.6153846 0.003649635 27.061728 8
[6] {Google Boulder Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.3478261 0.006457047 38.717391 8
[7] {Google Cambridge Campus Mug} => {Google Boulder Campus Mug} 0.002245929 0.2500000 0.008983717 38.717391 8
[8] {Google Boulder Campus Mug} => {Google NYC Campus Mug} 0.002526670 0.3913043 0.006457047 17.207729 9
[9] {Google NYC Campus Mug} => {Google Boulder Campus Mug} 0.002526670 0.1111111 0.022740034 17.207729 9
[10] {Google NYC Campus Bottle} => {Google NYC Campus Mug} 0.002245929 0.2962963 0.007580011 13.029721 8
[11] {YouTube Play Mug} => {YouTube Leather Strap Hat Black} 0.002245929 0.1568627 0.014317799 10.347131 8
[12] {YouTube Leather Strap Hat Black} => {YouTube Play Mug} 0.002245929 0.1481481 0.015160022 10.347131 8
[13] {YouTube Play Mug} => {YouTube Twill Sandwich Cap Black} 0.003368894 0.2352941 0.014317799 12.325260 12
[14] {YouTube Twill Sandwich Cap Black} => {YouTube Play Mug} 0.003368894 0.1764706 0.019090399 12.325260 12
[15] {Google Flat Front Bag Grey} => {Google Camp Mug Ivory} 0.005053341 0.2571429 0.019651881 5.174818 18
[16] {Google Camp Mug Ivory} => {Google Flat Front Bag Grey} 0.005053341 0.1016949 0.049691185 5.174818 18
[17] {Google Unisex Eco Tee Black} => {Google Camp Mug Ivory} 0.002245929 0.1311475 0.017125211 2.639252 8
[18] {Google Chicago Campus Mug} => {Google LA Campus Mug} 0.002245929 0.1702128 0.013194834 14.099951 8
[19] {Google LA Campus Mug} => {Google Chicago Campus Mug} 0.002245929 0.1860465 0.012071870 14.099951 8
[20] {Google Chicago Campus Mug} => {Google Austin Campus Mug} 0.002526670 0.1914894 0.013194834 11.368085 9
[21] {Google Austin Campus Mug} => {Google Chicago Campus Mug} 0.002526670 0.1500000 0.016844469 11.368085 9
[22] {Google Chicago Campus Mug} => {Google NYC Campus Mug} 0.002526670 0.1914894 0.013194834 8.420804 9
[23] {Google NYC Campus Mug} => {Google Chicago Campus Mug} 0.002526670 0.1111111 0.022740034 8.420804 9
[24] {Google LA Campus Sticker} => {Google LA Campus Mug} 0.003649635 0.4193548 0.008702976 34.738185 13
[25] {Google LA Campus Mug} => {Google LA Campus Sticker} 0.003649635 0.3023256 0.012071870 34.738185 13
[26] {Google NYC Campus Zip Hoodie} => {Google NYC Campus Mug} 0.003088153 0.1929825 0.016002246 8.486463 11
[27] {Google NYC Campus Mug} => {Google NYC Campus Zip Hoodie} 0.003088153 0.1358025 0.022740034 8.486463 11
[28] {Google LA Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.1860465 0.012071870 20.709302 8
[29] {Google Cambridge Campus Mug} => {Google LA Campus Mug} 0.002245929 0.2500000 0.008983717 20.709302 8
[30] {Google LA Campus Mug} => {Google Austin Campus Mug} 0.002245929 0.1860465 0.012071870 11.044961 8
[31] {Google Austin Campus Mug} => {Google LA Campus Mug} 0.002245929 0.1333333 0.016844469 11.044961 8
[32] {Google LA Campus Mug} => {Google NYC Campus Mug} 0.003930376 0.3255814 0.012071870 14.317542 14
[33] {Google NYC Campus Mug} => {Google LA Campus Mug} 0.003930376 0.1728395 0.022740034 14.317542 14
[34] {Google Cambridge Campus Mug} => {Google PNW Campus Mug} 0.002245929 0.2500000 0.008983717 20.238636 8
[35] {Google PNW Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.1818182 0.012352611 20.238636 8
[36] {Google Cambridge Campus Mug} => {Google Seattle Campus Mug} 0.003088153 0.3437500 0.008983717 27.209722 11
[37] {Google Seattle Campus Mug} => {Google Cambridge Campus Mug} 0.003088153 0.2444444 0.012633352 27.209722 11
[38] {Google Cambridge Campus Mug} => {Google Austin Campus Mug} 0.003088153 0.3437500 0.008983717 20.407292 11
[39] {Google Austin Campus Mug} => {Google Cambridge Campus Mug} 0.003088153 0.1833333 0.016844469 20.407292 11
[40] {Google Cambridge Campus Mug} => {Google NYC Campus Mug} 0.005614823 0.6250000 0.008983717 27.484568 20
[41] {Google NYC Campus Mug} => {Google Cambridge Campus Mug} 0.005614823 0.2469136 0.022740034 27.484568 20
[42] {Google PNW Campus Mug} => {Google Seattle Campus Mug} 0.004491859 0.3636364 0.012352611 28.783838 16
[43] {Google Seattle Campus Mug} => {Google PNW Campus Mug} 0.004491859 0.3555556 0.012633352 28.783838 16
[44] {Google PNW Campus Mug} => {Google NYC Campus Mug} 0.003088153 0.2500000 0.012352611 10.993827 11
[45] {Google NYC Campus Mug} => {Google PNW Campus Mug} 0.003088153 0.1358025 0.022740034 10.993827 11
[46] {Google Sunnyvale Campus Mug} => {Google Austin Campus Mug} 0.002245929 0.1600000 0.014037058 9.498667 8
[47] {Google Austin Campus Mug} => {Google Sunnyvale Campus Mug} 0.002245929 0.1333333 0.016844469 9.498667 8
[48] {Google Sunnyvale Campus Mug} => {Google NYC Campus Mug} 0.002526670 0.1800000 0.014037058 7.915556 9
[49] {Google NYC Campus Mug} => {Google Sunnyvale Campus Mug} 0.002526670 0.1111111 0.022740034 7.915556 9
[50] {Google Seattle Campus Mug} => {Google Austin Campus Mug} 0.002526670 0.2000000 0.012633352 11.873333 9
[51] {Google Austin Campus Mug} => {Google Seattle Campus Mug} 0.002526670 0.1500000 0.016844469 11.873333 9
[52] {Google Seattle Campus Mug} => {Google NYC Campus Mug} 0.003649635 0.2888889 0.012633352 12.703978 13
[53] {Google NYC Campus Mug} => {Google Seattle Campus Mug} 0.003649635 0.1604938 0.022740034 12.703978 13
[54] {Google Large Tote White} => {Google Camp Mug Ivory} 0.002807412 0.1851852 0.015160022 3.726721 10
[55] {Google NYC Campus Sticker} => {Google NYC Campus Mug} 0.003368894 0.3157895 0.010668164 13.886940 12
[56] {Google NYC Campus Mug} => {Google NYC Campus Sticker} 0.003368894 0.1481481 0.022740034 13.886940 12
[57] {Google Austin Campus Mug} => {Google NYC Campus Mug} 0.004211117 0.2500000 0.016844469 10.993827 15
[58] {Google NYC Campus Mug} => {Google Austin Campus Mug} 0.004211117 0.1851852 0.022740034 10.993827 15
[59] {Google Magnet} => {Google Camp Mug Ivory} 0.002807412 0.1562500 0.017967434 3.144421 10
[60] {Google Camp Mug Gray} => {Google Camp Mug Ivory} 0.005053341 0.1836735 0.027512633 3.696299 18
[61] {Google Camp Mug Ivory} => {Google Camp Mug Gray} 0.005053341 0.1016949 0.049691185 3.696299 18
[62] {Google Kirkland Campus Mug,
Google Seattle Campus Mug} => {Google NYC Campus Mug} 0.002245929 0.6666667 0.003368894 29.316872 8
[63] {Google Kirkland Campus Mug,
Google NYC Campus Mug} => {Google Seattle Campus Mug} 0.002245929 1.0000000 0.002245929 79.155556 8
[64] {Google NYC Campus Mug,
Google Seattle Campus Mug} => {Google Kirkland Campus Mug} 0.002245929 0.6153846 0.003649635 168.615385 8
[65] {Google Boulder Campus Mug,
Google Cambridge Campus Mug} => {Google NYC Campus Mug} 0.002245929 1.0000000 0.002245929 43.975309 8
[66] {Google Boulder Campus Mug,
Google NYC Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.8888889 0.002526670 98.944444 8
[67] {Google Cambridge Campus Mug,
Google NYC Campus Mug} => {Google Boulder Campus Mug} 0.002245929 0.4000000 0.005614823 61.947826 8
[68] {Google Cambridge Campus Mug,
Google LA Campus Mug} => {Google NYC Campus Mug} 0.002245929 1.0000000 0.002245929 43.975309 8
[69] {Google LA Campus Mug,
Google NYC Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.5714286 0.003930376 63.607143 8
[70] {Google Cambridge Campus Mug,
Google NYC Campus Mug} => {Google LA Campus Mug} 0.002245929 0.4000000 0.005614823 33.134884 8
[71] {Google Cambridge Campus Mug,
Google PNW Campus Mug} => {Google NYC Campus Mug} 0.002245929 1.0000000 0.002245929 43.975309 8
[72] {Google Cambridge Campus Mug,
Google NYC Campus Mug} => {Google PNW Campus Mug} 0.002245929 0.4000000 0.005614823 32.381818 8
[73] {Google NYC Campus Mug,
Google PNW Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.7272727 0.003088153 80.954545 8
[74] {Google Cambridge Campus Mug,
Google Seattle Campus Mug} => {Google Austin Campus Mug} 0.002245929 0.7272727 0.003088153 43.175758 8
[75] {Google Austin Campus Mug,
Google Cambridge Campus Mug} => {Google Seattle Campus Mug} 0.002245929 0.7272727 0.003088153 57.567677 8
[76] {Google Austin Campus Mug,
Google Seattle Campus Mug} => {Google Cambridge Campus Mug} 0.002245929 0.8888889 0.002526670 98.944444 8
[77] {Google Cambridge Campus Mug,
Google Seattle Campus Mug} => {Google NYC Campus Mug} 0.002807412 0.9090909 0.003088153 39.977553 10
[78] {Google Cambridge Campus Mug,
Google NYC Campus Mug} => {Google Seattle Campus Mug} 0.002807412 0.5000000 0.005614823 39.577778 10
[79] {Google NYC Campus Mug,
Google Seattle Campus Mug} => {Google Cambridge Campus Mug} 0.002807412 0.7692308 0.003649635 85.625000 10
[80] {Google Austin Campus Mug,
Google Cambridge Campus Mug} => {Google NYC Campus Mug} 0.002526670 0.8181818 0.003088153 35.979798 9
[81] {Google Cambridge Campus Mug,
Google NYC Campus Mug} => {Google Austin Campus Mug} 0.002526670 0.4500000 0.005614823 26.715000 9
[82] {Google Austin Campus Mug,
Google NYC Campus Mug} => {Google Cambridge Campus Mug} 0.002526670 0.6000000 0.004211117 66.787500 9
This can also be achieved in a more tidyverse style by doing the following:
# Google Camp Mug Ivory tidy way
mdl_ga_rules |>
  as("data.frame") |>
  tibble() |>
  filter(str_detect(rules, "Google Camp Mug Ivory"))
# A tibble: 7 × 6
rules support confidence coverage lift count
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 {Google Flat Front Bag Grey} => {Google Camp Mug Ivory} 0.00505 0.257 0.0197 5.17 18
2 {Google Camp Mug Ivory} => {Google Flat Front Bag Grey} 0.00505 0.102 0.0497 5.17 18
3 {Google Unisex Eco Tee Black} => {Google Camp Mug Ivory} 0.00225 0.131 0.0171 2.64 8
4 {Google Large Tote White} => {Google Camp Mug Ivory} 0.00281 0.185 0.0152 3.73 10
5 {Google Magnet} => {Google Camp Mug Ivory} 0.00281 0.156 0.0180 3.14 10
6 {Google Camp Mug Gray} => {Google Camp Mug Ivory} 0.00505 0.184 0.0275 3.70 18
7 {Google Camp Mug Ivory} => {Google Camp Mug Gray} 0.00505 0.102 0.0497 3.70 18
# All mugs tidy way
mdl_ga_rules |>
  as("data.frame") |>
  tibble() |>
  filter(str_detect(rules, "Mug"))
# A tibble: 82 × 6
rules support confidence coverage lift count
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 {Google Austin Campus Tote} => {Google Austin Campus Mug} 0.00225 0.615 0.00365 36.5 8
2 {Google Austin Campus Mug} => {Google Austin Campus Tote} 0.00225 0.133 0.0168 36.5 8
3 {Google Kirkland Campus Mug} => {Google Seattle Campus M… 0.00337 0.923 0.00365 73.1 12
4 {Google Seattle Campus Mug} => {Google Kirkland Campus M… 0.00337 0.267 0.0126 73.1 12
5 {Google Kirkland Campus Mug} => {Google NYC Campus Mug} 0.00225 0.615 0.00365 27.1 8
6 {Google Boulder Campus Mug} => {Google Cambridge Campus … 0.00225 0.348 0.00646 38.7 8
7 {Google Cambridge Campus Mug} => {Google Boulder Campus … 0.00225 0.25 0.00898 38.7 8
8 {Google Boulder Campus Mug} => {Google NYC Campus Mug} 0.00253 0.391 0.00646 17.2 9
9 {Google NYC Campus Mug} => {Google Boulder Campus Mug} 0.00253 0.111 0.0227 17.2 9
10 {Google NYC Campus Bottle} => {Google NYC Campus Mug} 0.00225 0.296 0.00758 13.0 8
# ℹ 72 more rows
Visualize our model
Slicing and dicing rules can be useful. However, say we want to explore all the rules visually. Within the arules
family of packages there’s an arulesViz
package (Hahsler 2017), which provides a simple plot()
generic to create static and interactive plots. Let’s explore some of this functionality with the rule set created above.
We start with a simple scatter plot of our rules. confidence
for each rule is plotted on the y-axis, support
is on the x-axis, and the brightness of each point represents the rule’s lift
. The hover tool provides additional useful information about each rule represented in the plot.
plot(mdl_ga_rules, engine = "html")
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Now, we can also use the plot()
generic function to create a network graph of the rule sets. To do this, we just set method = "graph"
. You’ll also notice I set the argument limit = 300
. This is because rule sets can be quite large, and plotting too many rules interactively can cause performance issues. Thus, the generic defaults to the top 100 rules based on lift. Given we have a relatively small rule set (i.e., 277 rules), I bumped this up a bit.
plot(mdl_ga_rules, method = "graph", engine = "html", limit = 300)
Warning: Too many rules supplied. Only plotting the best 100 using 'lift' (change control parameter
max if needed).
The interactive network is useful for exploring various rules, along with rules associated with specific items. You can either click on individual nodes within the graph to highlight connections, or you can use the drop-down to select individual components. Give it a try.
Identify clear, insightful rules
Now that we have a set of rules to explore, our next task is to identify clear and interesting rules for whoever needs results from this analysis. According to Lantz (2023), this task can be challenging. We want to avoid rules that are clear but obvious; such rules will likely already be known by the marketing team (e.g., {peanut butter, jelly} => {bread}). Also, rules that are interesting but not clear may simply be anomalies and not worth pursuing.
This stage of the process is more subjective than objective. As such, we may need to collaborate with other professionals more knowledgeable of the business context. Working with other knowledgeable individuals will help us further identify clear, interesting rules to share. Nonetheless, our model has gotten us from raw transaction data to a more focused set of rules worth additional exploration.
Sort and export
Not all collaborators will have the time to review every rule. So, how do we get these rules into a format useful for others to review? It’s pretty straightforward; we just sort and export the rules to a .csv
file.
To share these rules with other collaborators, I’m going to sort and slice the top 100 rules based on the lift
metric, transform them into a tibble, and write a .csv
file for stakeholders to review. A file in this format should be easy to open in programs familiar to your typical business user (e.g., Excel, Google Sheets).
rules_top <- mdl_ga_rules |>
  head(100, by = "lift") |>
  as("data.frame") |>
  tibble()

write_csv(
  rules_top,
  glue("{str_replace_all(Sys.Date(), '-', '_')}_data_top_rules.csv")
)
Wrap-up
This wraps up our analysis of Google Merchandise Store transactions for the 2020 holiday season. In this post, I overviewed the steps involved when performing a market basket analysis using Google Analytics data. This post started with a brief discussion on how to extract Google Analytics data stored in BigQuery using the bigrquery
package. I covered the general wrangling and exploratory analysis necessary to perform a market basket analysis, which involved transforming transaction data into a sparse matrix and using that matrix to calculate simple summaries about various transactions and items. Using the arules
package, I created association rules using the apriori
algorithm, as implemented in the apriori()
function. This post then covered some of the basic definitions of key metrics used for specifying and interpreting this type of model (e.g., support, confidence, and lift). A section was also devoted to the interpretation and visualization of the outputted rule sets. Finally, this post finished with a description of how to export rules so they can easily be shared with collaborators.
Taken as a whole, market basket analysis is an interpretable, useful tool for analyzing e-commerce transaction data. Hopefully you can add it to your toolbox and find it useful for identifying impactful rules that inform your marketing efforts.
Until next time, keep messing with models.
References
Citation
@misc{berke2024,
author = {Berke, Collin K},
title = {Messing with Models: {Market} Basket Analysis of Online
Merchandise Store Data},
date = {2024-06-11},
langid = {en}
}