Need to store data in a place that’s not persistent, use a temporary directory
Author
Collin K. Berke, Ph.D.
Published
November 3, 2023
Today I learned how to store data in R’s per-session temporary directory.
Recently, I’ve been working on an R package for a project. This package contains some internal data, which is intended to be updated from time-to-time. As part of the data update process, I’m required to download a set of .zip files from cloud storage, unzip, wrangle, and make the data available in the package via the data folder.
Given the data I’m working with, I wanted to avoid storing pre-wrangled data in the data-raw directory of the package. My main concern was an accidental check-in of pre-proccessed data into version control. So, I sought out a means to solve this problem.
This post aims to overview an approach using R’s per-session temporary directory to store data temporarily. Specifically, this post will discuss the use of base::tempdir() and other system file management functions made available in R to store data in this directory.
Warning
Using R’s per-session temporary directory may not be the right solution for your specific situation. If you’re working with sensitive data, make sure you follow your organization’s guidelines on where to store, access, and properly use your data.
I am not a security expert.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
What are temporary directories?
Note
As of this writing, I drafted this post on a computer running a Mac operating system. Some of what gets discussed here may not apply to Windows or Linux systems. The ideas and application should be similar, though I haven’t fully explored the differences.
The temporary directory, simply, is a location on your system. You can store files in this location just like any other directory. The difference is data stored within a temporary directory are not meant to be persistent, and your system will delete them automatically. File deletion either occurs when the system is shut down or after a set amount of time.
If you’re working on a Mac operating system, you can get the path to the temporary directory by running the following in your terminal:
echo$TMPDIR
When I last ran this command on my system, echo returned the following path (later we’ll use base::tempdir() to get and use this path in R).
/var/folders/_4/t3mpnn5n5rzg4sdq2fr_n4y80000gn/T/
This directory is located at the system level. The cd command can be used to navigate to it from the terminal. You may have to back up a few directories if your root starts at the user level, though. This is pretty standard, especially if you’re working on a Mac.
Note
Since I’m drafting this post on my personal machine, I’m not aware if you need admin privileges to access this folder. As such, you may run into issues if you’re not an admin on your machine.
With my curiosity peaked, I sought more information about what this directory was used for on a MacOS. Oddly enough, there is very little about this directory online. From what I can deduce, the /var directory is mainly a per-user cache for temporary files, and it provides security benefits beyond other cache locations on a Mac system (again, I’m not a security expert, so my previous statement may be inaccurate). Being that this location is temporary, this cache gets cleared every time the system restarts or every three days.
Although there’s a lack of information about this directory online, I did come across a few blog posts and a Stack Overflow answer that were helpful in understanding this temporary directory in more depth: post 1; post 2; post 3. You might find these useful if you want to learn more. However, for me, the above is as far as I wanted to go to understand its purpose.
Access the temporary directory using base::tempdir()
The temporary directory R uses for the current session is labeled using the RtmpXXXXXX pattern. The final six characters of the path (i.e., the Xs) are determined by the system. Note, tempdir() doesn’t create this directory, it just prints the temporary directory’s path to the console. This directory is created every time a R session begins.
Since the temporary directory is just like any location on your computer, you can navigate to it from your terminal during an active R session. With your terminal pointing to the temporary directory, you can use the following code to find R’s per-session temporary directory:
la|grep"Rtmp"
Let’s take a peak at what’s in this directory. R’s list.files() function can be helpful in this case.
list.files(tempdir())
character(0)
Most R setups should start with an empty per-session directory. So the above should return character(0). Despite being empty now, list.files() will become handy again once we start to write files to this location.
Writing files to the temporary directory
Now that we know a little more about this temp directory and where it is located on our system, let’s write some data to it. We can do this by doing something like the following.
tempfile() creates unique file names, which concatenates together the file path, the character vector passed to the pattern argument, a random string in hex, and the character vector inputed to the fileext argument. When you list the files in the temporary directory now, you’ll see the initial mtcars.csv file along with a file that looks something like this: mtcars7eb3503ac74c.csv. The random hex string ensures files remain unique.
Indeed, the above is just one way to write files to the temporary directory. You can use other methods to read and write files at this location. However, you now know what is needed to interact with this directory, read and write files to and from it. At this point you can do any data wrangling steps your project requires. After which, we can go about deleting our files from this directory.
Deleting files with file.remove()
Although these files will eventually be removed by the system, we should be proactive and clean up after ourselves.
Note
If you’re using this approach within functions, especially if their intended to be used by other users, you’ll want to be clear they will write data to and remove data from the user’s system.
Indeed, it’s considered poor practice to change the R landscape on a user’s computer without good reason. So the least we can do here is clean up after ourselves.
To delete our files we wrote to the temporary directory, run the following in the console:
The arguments of the list.files() function should be pretty straightforward. We want file paths to be full length (i.e., full.names = TRUE) and to list only files with the .csv extension (i.e., pattern = ".csv"). Then, we use these full file paths within the file.remove() function, which will remove the files from R’s temporary directory.
Wrap-up
Today I learned more about R’s per-session temporary directory, and how it can be used to write files not intended for persistent storage. I also learned how to use several base R functions to create files within this temporary directory by using tempfile() and tempdir(). I also demonstrated how the list.files() function can be used to list files within any directory on your system, specifically using it to list files in R’s temporary directory. Finally, I highlighted how files in the temporary directory can be deleted using the file.remove() function.
Have fun using R’s per-session temporary directory. Cheers 🎉!