Summarize logical vectors to calculate numeric summaries

til

data wrangling

logical vectors

summary statistics

Need proportion and count summaries from a logical vector? Use mean() and sum()

Author

Collin K. Berke, Ph.D.

Published

December 8, 2024

Background

Today I relearned you can easily calculate counts and proportions with a logical vector (e.g., c(TRUE, FALSE, FALSE, TRUE)) in R.

library(tidyverse)
library(ids)

I’ve been re-reading the second edition of R for Data Science for a Data Science Learning Community book club (check us out). While reading Chapter 12: Logical vectors, I was reminded that counts and proportions can be calculated from a logical vector.

I wanted to share what I learned out loud, so others have another example. I also hope writing this post helps me remember it in the future.

Summaries from logical vectors

The concept is simple:

sum() gives the number of TRUEs and mean() gives the proportion of TRUEs (because mean() is just sum() divided by length())

This works because R coerces TRUE to 1 and FALSE to 0 whenever a logical vector is used in a numeric context.
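Here is a minimal sketch of that idea on the small vector from the Background section. The variable names (x, n_true, prop_true) are only illustrative.

```r
# A small logical vector
x <- c(TRUE, FALSE, FALSE, TRUE)

# In a numeric context, TRUE is treated as 1 and FALSE as 0
as.integer(x)

n_true <- sum(x)      # number of TRUEs
prop_true <- mean(x)  # proportion of TRUEs

# mean() of a logical vector equals sum() divided by length()
identical(prop_true, sum(x) / length(x))
```

Two of the four values are TRUE, so the count is 2 and the proportion is 0.5.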

Let’s look at an example of this in action. We start by creating an example dataset, which builds on the example used in the book:

data_user_engagement <- data.frame(
  date = sort(rep(seq(as_date("2024-12-02"), as_date("2024-12-08"), by = 1), times = 100)),
  user_id = rep(random_id(bytes = 4, n = 100), times = 7),
  time_engaged_sec = sample(c(1:100), 700, replace = TRUE)
) |>
  as_tibble()

This dataset is loosely based on the domain I work in: digital analytics. It’s modeled after event-based time series data for a week of website visits. The dataset contains the following columns:

date - The date the event occurred.
user_id - A 4-byte user ID.
time_engaged_sec - Time spent engaged during the event (e.g., time spent on a webpage).

Some questions we might ask about this dataset are: How many events each day were low-engagement events (10 seconds or less)? What proportion of events each day were engaged events (30 seconds or more)? Here’s the code to answer these questions by summarizing logical vectors:

data_user_engagement |>
  group_by(date) |>
  summarise(
    count_low_engagement = sum(time_engaged_sec <= 10, na.rm = TRUE),
    proportion_engaged = mean(time_engaged_sec >= 30, na.rm = TRUE)
  )

# A tibble: 7 × 3
  date       count_low_engagement proportion_engaged
  <date>                    <int>              <dbl>
1 2024-12-02                   13               0.7 
2 2024-12-03                    8               0.67
3 2024-12-04                    7               0.78
4 2024-12-05                   10               0.69
5 2024-12-06                   10               0.72
6 2024-12-07                   11               0.72
7 2024-12-08                    7               0.79

At first glance, you might ask: where are the logical vectors? They’re created inside the calls to sum() and mean() by the <= and >= operators. That is, the expression time_engaged_sec <= 10 is evaluated first and returns a logical vector, and then sum() or mean() is computed on that logical vector.
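To make the intermediate step visible, here is a small sketch that stores the logical vector before summarizing it. The values are made up for illustration and are not taken from the dataset above, and is_low_engagement, count_low, and prop_engaged are hypothetical names. One value is NA to show what na.rm = TRUE does.

```r
# Hypothetical engagement times in seconds (illustrative values only)
time_engaged_sec <- c(5, 45, 10, 80, NA)

# The comparison on its own returns a logical vector, one element per value;
# a comparison with NA yields NA
is_low_engagement <- time_engaged_sec <= 10
is_low_engagement

# sum() and mean() then summarize that vector; na.rm = TRUE drops the NA first
count_low <- sum(is_low_engagement, na.rm = TRUE)
prop_engaged <- mean(time_engaged_sec >= 30, na.rm = TRUE)
```

Without na.rm = TRUE, a single NA in the logical vector would make both summaries NA.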

Pretty neat, huh?!

Wrap up

There are many other uses for logical vectors, but this was the most useful one I recently relearned. Check out Chapter 12: Logical vectors from the R4DS book to learn more.

One more tool to add to the analysis toolbox. Thanks for spending time learning with me.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@misc{berke2024,
  author = {Berke, Collin K},
  title = {Summarize Logical Vectors to Calculate Numeric Summaries},
  date = {2024-12-08},
  langid = {en}
}

For attribution, please cite this work as:

Berke, Collin K. 2024. “Summarize Logical Vectors to Calculate Numeric Summaries.” December 8, 2024.