Notes on identifying explicit and implicit missing values

til

notes

links

data wrangling

Highlights from the DSLC book club discussion of Chapter 18: Missing values from R4DS

Author

Collin K. Berke, Ph.D.

Published

February 1, 2025

library(tidyverse)
library(skimr)
library(nycflights13)

Today I learned more about identifying explicit and missing values in R. During our weekly Data Science Learning Community’s (DSLC) bookclub meeting for the R for Data Science (R4DS) book, I was re-introduced to several methods to identify explicit and implicit missing values. Much of what is covered here comes from Chapter 18: Missing values of the book. I wanted to share what I’ve learned, in hopes I can better remember this information in the future.

Identify and count missing values

When faced with a dataset, the need to identify the presence of and estimate the extent of explicit missing values (i.e., NAs) may come up. I made note of four different approaches from our discussion, which all output the same information.

Base R’s `sapply()`

The first suggestion was to use base R’s sapply() with an anonymous function. There’s two variations: one that identifys the presence of any NAs across the columns. The second provides a count of NAs for each column.

sapply(starwars, function(x) any(is.na(x)))

      name     height       mass hair_color skin_color  eye_color birth_year        sex     gender 
     FALSE       TRUE       TRUE       TRUE      FALSE      FALSE       TRUE       TRUE       TRUE 
 homeworld    species      films   vehicles  starships 
      TRUE       TRUE      FALSE      FALSE      FALSE

sapply(starwars, function(x) sum(is.na(x)))

      name     height       mass hair_color skin_color  eye_color birth_year        sex     gender 
         0          6         28          5          0          0         44          4          4 
 homeworld    species      films   vehicles  starships 
        10          4          0          0          0

`purrr::map_df` with `any()` and `is.na()`

Similar to the base R approach is the use of purrr::map_df() with an anonymous function. I’m quite partial to this approach, as it’s even more succinct, though it requires purrr as a dependency. However, if you’re already importing the tidyverse into your session, then why not go ahead and use it?

map_df(starwars, \(x) any(is.na(x)))

# A tibble: 1 × 14
  name  height mass  hair_color skin_color eye_color birth_year sex   gender homeworld species films
  <lgl> <lgl>  <lgl> <lgl>      <lgl>      <lgl>     <lgl>      <lgl> <lgl>  <lgl>     <lgl>   <lgl>
1 FALSE TRUE   TRUE  TRUE       FALSE      FALSE     TRUE       TRUE  TRUE   TRUE      TRUE    FALSE
# ℹ 2 more variables: vehicles <lgl>, starships <lgl>

map_df(starwars, \(x) sum(is.na(x)))

# A tibble: 1 × 14
   name height  mass hair_color skin_color eye_color birth_year   sex gender homeworld species films
  <int>  <int> <int>      <int>      <int>     <int>      <int> <int>  <int>     <int>   <int> <int>
1     0      6    28          5          0         0         44     4      4        10       4     0
# ℹ 2 more variables: vehicles <int>, starships <int>

`dplyr::summarise()`

Another approach involved the use of dplyr’s summarise() along with across(), everything(), and an anonymous function. This approach was meant only to count the amount of missing values within each column.

starwars |>
  summarise(across(everything(), \(x) sum(is.na(x))))

# A tibble: 1 × 14
   name height  mass hair_color skin_color eye_color birth_year   sex gender homeworld species films
  <int>  <int> <int>      <int>      <int>     <int>      <int> <int>  <int>     <int>   <int> <int>
1     0      6    28          5          0         0         44     4      4        10       4     0
# ℹ 2 more variables: vehicles <int>, starships <int>

`skimr::skim()`

skimr::skim() was also discussed, though the output is more verbose than the other options. The output contains a sum of the number of missing values within each column. This is certainly the most succinct way to obtain this information, and it provides additional summary information about your data. However, it may be more information then you need to answer your question about the presence of missing values in your data.

skim(starwars)

Data summary
Name	starwars
Number of rows	87
Number of columns	14
_______________________
Column type frequency:
character	8
list	3
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	3	21	87
hair_color	5	0.94	4	13	11
skin_color	0	1.00	3	19	31
eye_color	0	1.00	3	13	15
sex	4	0.95	4	14	4
gender	4	0.95	8	9	2
homeworld	10	0.89	4	14	48
species	4	0.95	3	14	37

Variable type: list

skim_variable	complete_rate	n_unique	min_length	max_length
films	1	24	1	7
vehicles	1	11	0	2
starships	1	16	0	5

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
height	6	0.93	174.60	34.77	66	167.0	180	191.0	264	▂▁▇▅▁
mass	28	0.68	97.31	169.46	15	55.6	79	84.5	1358	▇▁▁▁▁
birth_year	44	0.49	87.57	154.69	8	35.0	52	72.0	896	▇▁▁▁▁

Identify implicit missing values with `dplyr::anti_join()`

Implicit missing values were also an interesting topic of discussion. According to the R4DS book, implicit missing values are absent data without any observable reference to it being missing within the data.

This was such an important topic that even an exercise was devoted to it. During the reading, we learned how to use dplyr::anti_join() to help identify implicit missing values.

Here’s the solution, graciously shared by Stephan Koenig with the group (see around 01m38s):

flights |>
  select(tailnum, carrier) |>
  anti_join(planes, by = join_by(tailnum)) |>
  distinct(tailnum, carrier) |>
  count(carrier, sort = TRUE)

# A tibble: 10 × 2
   carrier     n
   <chr>   <int>
 1 AA        430
 2 MQ        234
 3 UA         23
 4 FL         12
 5 DL         10
 6 US          9
 7 B6          3
 8 F9          3
 9 WN          3
10 9E          1

I really liked this approach. It was useful to see the utility of count() to show which carriers had the most implicitly missing values within the flights data. However, it took me a beat to fully understand what was happening. I think this was due to the volume of data within the flights dataset used for the example. As such, I thought up a simpler dataset, semi-related to the domain I work in, to help clarify my understanding of what’s happening here.

Let’s say we’re a marketer, and we run marketing campaigns to sell products through an ecommerce store. As part of this store, we collect data about specific marketing campaigns and customers’ purchases. Data include a campaign data set, which contains metadata for each campaign. For instance, data on the type of campaign being run. We also have an events tibble, which contains data about whether a customer purchased items while visiting our ecommerce store, and if so, how much revenue was generated from their purchase.

campaign <- tibble(
  campaign_id = c(1, 2),
  campaign_type = c("email", "social media"),
  campaign_name = c("Spring sale", "Winter clearance")
)

events <- tibble(
  campaign_id = c(1, 2, 2, 1, 3, 3),
  type = c("no purchase", "purchase", "purchase", "no purchase", "no purchase", "purchase"),
  revenue = c(0, 11.25, 25.37, 0, 0, 7.45)
)

The implicitly missing data is in the campaign dataset. Specifically, we’re missing campaign information for campaign_id = 3. That is, there is no explicit reference for us to see that data is missing if we only looked at the campaign dataset. We would only know we have missing values by joining these two datasets together.

Missing data could be made more explicit in our campaign dataset by performing a left join. Then, the approaches discussed above and count() can be used to identify and sum up how many events are missing campaign information. We do something like this:

# Identify the presence of missing
events |>
  left_join(campaign) |>
  map_df(\(x) any(is.na(x)))

Joining with `by = join_by(campaign_id)`

# A tibble: 1 × 5
  campaign_id type  revenue campaign_type campaign_name
  <lgl>       <lgl> <lgl>   <lgl>         <lgl>        
1 FALSE       FALSE FALSE   TRUE          TRUE

# Count how many events are missing
events |>
  anti_join(campaign, by = join_by(campaign_id)) |>
  count(campaign_id, sort = TRUE)

# A tibble: 1 × 2
  campaign_id     n
        <dbl> <int>
1           3     2

Indeed, this seems trivial given the size of the data. However, if the data had thousands or tens of thousands of purchase events and multiple campaigns running during a specific time period, this approach would be quite handy to help quickly identify which campaigns were missing data.

Wrap up

Knowing how to identify missing values within data is important. There are different strategies for identifying explicit or implict missing values. In this post, I highlighted and made notes about the approaches discussed in the R4DS book and our DSLC book club meeting.

Happy data wrangling!

Reuse

CC BY 4.0

Citation

BibTeX citation:

@misc{berke2025,
  author = {Berke, Collin K},
  title = {Notes on Identifying Explicit and Implicit Missing Values},
  date = {2025-02-01},
  langid = {en}
}

For attribution, please cite this work as:

Berke, Collin K. 2025. “Notes on Identifying Explicit and Implicit Missing Values.” February 1, 2025.

Identify and count missing values

Base R’s sapply()

purrr::map_df with any() and is.na()

dplyr::summarise()

skimr::skim()

Identify implicit missing values with dplyr::anti_join()