library(tidyverse)
library(ids)
Summarize logical vectors to calculate numeric summaries
mean() and sum()
Background
Today I relearned that you can easily calculate counts and proportions from a logical vector (e.g., c(TRUE, FALSE, FALSE, TRUE)) in R.
I’ve been re-reading the second edition of R for Data Science for a Data Science Learning Community bookclub (check us out). While reading Chapter 12: Logical vectors, I was reminded that counts and proportions can be calculated from a logical vector.
I wanted to share what I learned out loud, so others have another example. I also hope writing this post helps me remember it in the future.
Summaries from logical vectors
The concept is simple:
sum() gives the number of TRUEs and mean() gives the proportion of TRUEs (because mean() is just sum() divided by length()).
This works because R treats TRUE as 1 and FALSE as 0 whenever a logical vector is used in a numeric context.
Let’s look at an example of this in action. We start by creating an example dataset, which builds on the example used in the book:
data_user_engagement <- data.frame(
  date = sort(rep(seq(as_date("2024-12-02"), as_date("2024-12-08"), by = 1), times = 100)),
  user_id = rep(random_id(bytes = 4, n = 100), times = 7),
  time_engaged_sec = sample(c(1:100), 700, replace = TRUE)
) |>
  tibble()
This dataset is loosely based on the domain I work in: digital analytics. It’s modeled after event-based time series data for a week of website visits. The dataset contains the following columns:
date - The date the event occurred.
user_id - A 4-byte user ID.
time_engaged_sec - Time spent engaged during the event (e.g., time spent on a webpage).
Some questions we might ask about this dataset are: How many events each day were low-engagement events (10 seconds or less)? What proportion of each day's events were engaged events (30 seconds or more)? Here’s the code to answer these questions, leveraging the summarization of logical vectors:
data_user_engagement |>
  group_by(date) |>
  summarise(
    count_low_engagement = sum(time_engaged_sec <= 10, na.rm = TRUE),
    proportion_engaged = mean(time_engaged_sec >= 30, na.rm = TRUE)
  )
# A tibble: 7 × 3
  date       count_low_engagement proportion_engaged
  <date>                    <int>              <dbl>
1 2024-12-02                   13               0.7
2 2024-12-03                    8               0.67
3 2024-12-04                    7               0.78
4 2024-12-05                   10               0.69
5 2024-12-06                   10               0.72
6 2024-12-07                   11               0.72
7 2024-12-08                    7               0.79
At first glance, you might ask: where are the logical vectors? They’re created inside the sum() and mean() calls when we use the <= and >= operators. That is, the expression time_engaged_sec <= 10 first creates a logical vector in the background, and then sum() or mean() is computed on that logical vector.
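To see that intermediate step on its own, here is a small sketch outside of summarise(). The vector secs and its values are invented for illustration and are not taken from the dataset above; it includes one NA to show what na.rm = TRUE does.

```r
# Hypothetical engagement times in seconds, with one missing value
secs <- c(5, 42, 10, NA, 73)

# The comparison alone produces the logical vector
is_low <- secs <= 10
is_low
#> [1]  TRUE FALSE  TRUE    NA FALSE

# Without na.rm = TRUE, the missing value propagates to the result
sum(is_low)
#> [1] NA

# Count of low-engagement values, ignoring the NA
low_count <- sum(is_low, na.rm = TRUE)
low_count
#> [1] 2

# Proportion of non-missing values at or above 30 seconds (2 of 4)
engaged_prop <- mean(secs >= 30, na.rm = TRUE)
engaged_prop
#> [1] 0.5
```

Note that with na.rm = TRUE the proportion is taken over the non-missing values only, so the denominator here is 4 rather than 5.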
Pretty neat, huh?!
Wrap up
There are many other uses for logical vectors, but this was the most useful one I recently relearned. Check out Chapter 12: Logical vectors from the R4DS book to learn more.
One more tool to add to the analysis toolbox. Thanks for spending time learning with me.
Citation
@misc{berke2024,
author = {Berke, Collin K},
title = {Summarize Logical Vectors to Calculate Numeric Summaries},
date = {2024-12-08},
langid = {en}
}