Notes on identifying explicit and implicit missing values
Today I learned more about identifying explicit and missing values in R. During our weekly Data Science Learning Community’s (DSLC) bookclub meeting for the R for Data Science (R4DS) book, I was re-introduced to several methods to identify explicit and implicit missing values. Much of what is covered here comes from Chapter 18: Missing values of the book. I wanted to share what I’ve learned, in hopes I can better remember this information in the future.
Identify and count missing values
When faced with a dataset, the need to identify the presence of and estimate the extent of explicit missing values (i.e., NA
s) may come up. I made note of four different approaches from our discussion, which all output the same information.
Base R’s sapply()
The first suggestion was to use base R’s sapply()
with an anonymous function. There’s two variations: one that identifys the presence of any NA
s across the columns. The second provides a count of NA
s for each column.
name height mass hair_color skin_color eye_color birth_year sex gender
FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
homeworld species films vehicles starships
TRUE TRUE FALSE FALSE FALSE
purrr::map_df
with any()
and is.na()
Similar to the base R approach is the use of purrr::map_df()
with an anonymous function. I’m quite partial to this approach, as it’s even more succinct, though it requires purrr
as a dependency. However, if you’re already importing the tidyverse
into your session, then why not go ahead and use it?
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
# ℹ 2 more variables: vehicles <lgl>, starships <lgl>
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 6 28 5 0 0 44 4 4 10 4 0
# ℹ 2 more variables: vehicles <int>, starships <int>
dplyr::summarise()
Another approach involved the use of dplyr
’s summarise()
along with across()
, everything()
, and an anonymous function. This approach was meant only to count the amount of missing values within each column.
starwars |>
summarise(across(everything(), \(x) sum(is.na(x))))
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 6 28 5 0 0 44 4 4 10 4 0
# ℹ 2 more variables: vehicles <int>, starships <int>
skimr::skim()
skimr::skim()
was also discussed, though the output is more verbose than the other options. The output contains a sum of the number of missing values within each column. This is certainly the most succinct way to obtain this information, and it provides additional summary information about your data. However, it may be more information then you need to answer your question about the presence of missing values in your data.
skim(starwars)
Name | starwars |
Number of rows | 87 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 8 |
list | 3 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
hair_color | 5 | 0.94 | 4 | 13 | 0 | 11 | 0 |
skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
films | 0 | 1 | 24 | 1 | 7 |
vehicles | 0 | 1 | 11 | 0 | 2 |
starships | 0 | 1 | 16 | 0 | 5 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 6 | 0.93 | 174.60 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▂▁▇▅▁ |
mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
Identify implicit missing values with dplyr::anti_join()
Implicit missing values were also an interesting topic of discussion. According to the R4DS book, implicit missing values are absent data without any observable reference to it being missing within the data.
This was such an important topic that even an exercise was devoted to it. During the reading, we learned how to use dplyr::anti_join()
to help identify implicit missing values.
Here’s the solution, graciously shared by Stephan Koenig with the group (see around 01m38s):
flights |>
select(tailnum, carrier) |>
anti_join(planes, by = join_by(tailnum)) |>
distinct(tailnum, carrier) |>
count(carrier, sort = TRUE)
# A tibble: 10 × 2
carrier n
<chr> <int>
1 AA 430
2 MQ 234
3 UA 23
4 FL 12
5 DL 10
6 US 9
7 B6 3
8 F9 3
9 WN 3
10 9E 1
I really liked this approach. It was useful to see the utility of count()
to show which carriers had the most implicitly missing values within the flights data. However, it took me a beat to fully understand what was happening. I think this was due to the volume of data within the flights
dataset used for the example. As such, I thought up a simpler dataset, semi-related to the domain I work in, to help clarify my understanding of what’s happening here.
Let’s say we’re a marketer, and we run marketing campaigns to sell products through an ecommerce store. As part of this store, we collect data about specific marketing campaigns and customers’ purchases. Data include a campaign
data set, which contains metadata for each campaign. For instance, data on the type of campaign being run. We also have an events
tibble, which contains data about whether a customer purchased items while visiting our ecommerce store, and if so, how much revenue was generated from their purchase.
campaign <- tibble(
campaign_id = c(1, 2),
campaign_type = c("email", "social media"),
campaign_name = c("Spring sale", "Winter clearance")
)
events <- tibble(
campaign_id = c(1, 2, 2, 1, 3, 3),
type = c("no purchase", "purchase", "purchase", "no purchase", "no purchase", "purchase"),
revenue = c(0, 11.25, 25.37, 0, 0, 7.45)
)
The implicitly missing data is in the campaign
dataset. Specifically, we’re missing campaign information for campaign_id = 3
. That is, there is no explicit reference for us to see that data is missing if we only looked at the campaign
dataset. We would only know we have missing values by joining these two datasets together.
Missing data could be made more explicit in our campaign
dataset by performing a left join. Then, the approaches discussed above and count()
can be used to identify and sum up how many events are missing campaign information. We do something like this:
Joining with `by = join_by(campaign_id)`
# A tibble: 1 × 5
campaign_id type revenue campaign_type campaign_name
<lgl> <lgl> <lgl> <lgl> <lgl>
1 FALSE FALSE FALSE TRUE TRUE
# Count how many events are missing
events |>
anti_join(campaign, by = join_by(campaign_id)) |>
count(campaign_id, sort = TRUE)
# A tibble: 1 × 2
campaign_id n
<dbl> <int>
1 3 2
Indeed, this seems trivial given the size of the data. However, if the data had thousands or tens of thousands of purchase events and multiple campaigns running during a specific time period, this approach would be quite handy to help quickly identify which campaigns were missing data.
Wrap up
Knowing how to identify missing values within data is important. There are different strategies for identifying explicit or implict missing values. In this post, I highlighted and made notes about the approaches discussed in the R4DS book and our DSLC book club meeting.
Happy data wrangling!
Reuse
Citation
@misc{berke2025,
author = {Berke, Collin K},
title = {Notes on Identifying Explicit and Implicit Missing Values},
date = {2025-02-01},
langid = {en}
}