Philippe Massicotte - Reading multiple CSV files using readr

If you are new to R, chances are that you have used read.csv() to import CSV files. While this function works perfectly fine, it can only read one file at a time. Hence, new R programmers often read multiple files one after the other and combine the data afterward.

# Read all the data files
df1 <- read.csv("file1.csv")
df2 <- read.csv("file2.csv")
df3 <- read.csv("file3.csv")
df4 <- read.csv("file4.csv")
df5 <- read.csv("file5.csv")

# Combine all the data frames together
big_df <- rbind(df1, df2, df3, df4, df5)

While this works fine when you have only a few files, it becomes tedious as the number of files grows. A better approach is to build a vector of file names and read them all at once. For quite a while, I have been using map_df() from the purrr package in combination with read_csv() from readr.

library(purrr)
library(readr)

# Create a vector of file names
files <- c("file1.csv", "file2.csv", "file3.csv", "file4.csv", "file5.csv")

# Read and combine all data files into a single data frame
big_df <- map_df(files, read_csv)
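
As a side note, map_df() has been superseded since purrr 1.0.0 in favour of map() followed by list_rbind(). The sketch below shows that variant on a few small demo files; the temporary directory, file names and columns are made up for illustration and are not part of the benchmark.

```r
library(readr)
library(purrr)

# Hypothetical demo files written to a temporary directory
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
for (i in 1:3) {
  write_csv(
    data.frame(id = i, value = c(1.5, 2.5)),
    file.path(tmp_dir, paste0("file", i, ".csv"))
  )
}
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a list of tibbles, then row-bind the list
big_df <- files |>
  map(\(f) read_csv(f, show_col_types = FALSE)) |>
  list_rbind()

# Three files with two rows and two columns each
dim(big_df)
```

This needs R 4.1 or later for the |> pipe and the \(f) lambda syntax.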

Since the release of readr 2.0.0, the read_csv() function can directly take a vector of files as input, eliminating the need to use the map_df() function. Hence, we can now read multiple files as follows:

# Read and combine all data files into a single data frame without using the
# map_df function
big_df <- read_csv(files)
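
When several files are read in one call, it is often useful to know which file each row came from. read_csv() has an id argument for this: it adds a column holding the source file path. Here is a minimal sketch, again with made-up demo files; the column name source_file is an arbitrary choice.

```r
library(readr)

# Hypothetical demo files written to a temporary directory
tmp_dir <- tempfile("csv_id_demo_")
dir.create(tmp_dir)
write_csv(data.frame(x = 1:2), file.path(tmp_dir, "a.csv"))
write_csv(data.frame(x = 3:5), file.path(tmp_dir, "b.csv"))
files <- file.path(tmp_dir, c("a.csv", "b.csv"))

# Read both files at once and record each row's source file
big_df <- read_csv(files, id = "source_file", show_col_types = FALSE)

# Count the rows coming from each file
table(basename(big_df$source_file))
```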

In this short blog post, I wanted to benchmark the speed difference between map_df(files, read_csv) and read_csv(files). To do so, let’s first generate some data files.

library(nycflights13)

# Write one CSV file per carrier: .x is the carrier's data, .y its code
purrr::iwalk(
  split(flights, flights$carrier),
  ~ data.table::fwrite(.x, glue::glue("/tmp/flights_{.y}.csv"))
)

files <- fs::dir_ls(path = "/tmp", glob = "*flights*csv")
files
#> /tmp/flights_9E.csv /tmp/flights_AA.csv /tmp/flights_AS.csv /tmp/flights_B6.csv 
#> /tmp/flights_DL.csv /tmp/flights_EV.csv /tmp/flights_F9.csv /tmp/flights_FL.csv 
#> /tmp/flights_HA.csv /tmp/flights_MQ.csv /tmp/flights_OO.csv /tmp/flights_UA.csv 
#> /tmp/flights_US.csv /tmp/flights_VX.csv /tmp/flights_WN.csv /tmp/flights_YV.csv

Let’s look at the contents of the first file.

read_csv(files[[1]])
#> # A tibble: 18,460 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
#>  1  2013     1     1      810            810         0     1048           1037
#>  2  2013     1     1     1451           1500        -9     1634           1636
#>  3  2013     1     1     1452           1455        -3     1637           1639
#>  4  2013     1     1     1454           1500        -6     1635           1636
#>  5  2013     1     1     1507           1515        -8     1651           1656
#>  6  2013     1     1     1530           1530         0     1650           1655
#>  7  2013     1     1     1546           1540         6     1753           1748
#>  8  2013     1     1     1550           1550         0     1844           1831
#>  9  2013     1     1     1552           1600        -8     1749           1757
#> 10  2013     1     1     1554           1600        -6     1701           1734
#> # ℹ 18,450 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <dbl>, flight <dbl>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
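
In the output above, the carrier column is reported as <dbl> even though carrier codes are text. When the guessed type is not the one you want, the col_types argument can set it explicitly, which is what the benchmark calls do. A minimal sketch on a tiny made-up file (the two rows below are illustrative, not taken from the real data):

```r
library(readr)

# Hypothetical two-row file that mimics the carrier column
csv_path <- tempfile(fileext = ".csv")
writeLines(c("carrier,flight", "9E,3538", "9E,4105"), csv_path)

# Force carrier to be read as character; other columns are still guessed
flights_9e <- read_csv(
  csv_path,
  col_types = cols(carrier = col_character())
)

class(flights_9e$carrier)
```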

Now that the data files have been created, we can compare the two reading options.

res <- microbenchmark::microbenchmark(
  map_df_read_csv = map_df(files, read_csv, col_types = cols(carrier = col_character())),
  read_csv = read_csv(files, col_types = cols(carrier = col_character())),
  times = 100
)

res
#> Unit: milliseconds
#>             expr      min       lq     mean   median       uq      max neval
#>  map_df_read_csv 622.2706 691.0963 724.5537 747.4462 757.6883 799.8428   100
#>         read_csv 177.7021 183.8535 192.6876 187.7292 193.5388 311.9672   100

autoplot(res)

Using read_csv() directly is much faster than the map_df(files, read_csv) combination: roughly four times faster in this benchmark (median of about 188 ms versus 747 ms).
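
To put a number on the difference, the ratio of the median timings can be computed directly. The values below are copied by hand from the benchmark output printed above; the variable names are only for illustration.

```r
# Median timings in milliseconds, copied from the benchmark output
median_ms <- c(map_df_read_csv = 747.4462, read_csv = 187.7292)

# How many times slower map_df(files, read_csv) is than read_csv(files)
speedup <- median_ms[["map_df_read_csv"]] / median_ms[["read_csv"]]
round(speedup, 2)
#> [1] 3.98
```

These timings come from a single machine and a set of 16 fairly small files, so the ratio may well differ on other data.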

Session info

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24)
#>  os       Linux Mint 21.3
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_CA:en
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Montreal
#>  date     2024-05-03
#>  pandoc   2.9.2.1 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  ! package      * version date (UTC) lib source
#>  P bit            4.0.5   2022-11-15 [?] RSPM
#>  P bit64          4.0.5   2020-08-30 [?] RSPM
#>  P cachem         1.0.8   2023-05-01 [?] RSPM
#>  P cli            3.6.2   2023-12-11 [?] RSPM
#>  P colorspace     2.1-0   2023-01-23 [?] RSPM
#>  P crayon         1.5.2   2022-09-29 [?] RSPM
#>  P data.table     1.15.4  2024-03-30 [?] RSPM
#>  P devtools       2.4.5   2022-10-11 [?] RSPM (R 4.4.0)
#>  P digest         0.6.35  2024-03-11 [?] RSPM
#>  P dplyr        * 1.1.4   2023-11-17 [?] RSPM
#>  P ellipsis       0.3.2   2021-04-29 [?] RSPM
#>  P evaluate       0.23    2023-11-01 [?] RSPM
#>  P extrafont      0.19    2023-01-18 [?] RSPM
#>  P extrafontdb    1.0     2012-06-11 [?] RSPM
#>  P fansi          1.0.6   2023-12-08 [?] RSPM
#>  P fastmap        1.1.1   2023-02-24 [?] RSPM
#>  P forcats      * 1.0.0   2023-01-29 [?] RSPM
#>  P fs             1.6.4   2024-04-25 [?] CRAN (R 4.4.0)
#>  P generics       0.1.3   2022-07-05 [?] RSPM
#>  P ggplot2      * 3.5.1   2024-04-23 [?] RSPM
#>  P ggpmthemes   * 0.0.2   2024-04-25 [?] Github (pmassicotte/ggpmthemes@993d61e)
#>  P glue           1.7.0   2024-01-09 [?] RSPM
#>  P gtable         0.3.5   2024-04-22 [?] RSPM
#>  P hms            1.1.3   2023-03-21 [?] RSPM
#>  P htmltools      0.5.8.1 2024-04-04 [?] RSPM
#>  P htmlwidgets    1.6.4   2023-12-06 [?] RSPM
#>  P httpuv         1.6.15  2024-03-26 [?] RSPM
#>  P jsonlite       1.8.8   2023-12-04 [?] RSPM
#>  P knitr          1.46    2024-04-06 [?] RSPM
#>  P later          1.3.2   2023-12-06 [?] RSPM
#>  P lifecycle      1.0.4   2023-11-07 [?] RSPM
#>  P lubridate    * 1.9.3   2023-09-27 [?] RSPM
#>  P magrittr       2.0.3   2022-03-30 [?] RSPM
#>  P memoise        2.0.1   2021-11-26 [?] RSPM
#>  P mime           0.12    2021-09-28 [?] RSPM
#>  P miniUI         0.1.1.1 2018-05-18 [?] RSPM (R 4.4.0)
#>  P munsell        0.5.1   2024-04-01 [?] RSPM
#>  P nycflights13 * 1.0.2   2021-04-12 [?] RSPM
#>  P pillar         1.9.0   2023-03-22 [?] RSPM
#>  P pkgbuild       1.4.4   2024-03-17 [?] RSPM (R 4.4.0)
#>  P pkgconfig      2.0.3   2019-09-22 [?] RSPM
#>  P pkgload        1.3.4   2024-01-16 [?] RSPM (R 4.4.0)
#>  P processx       3.8.4   2024-03-16 [?] RSPM
#>  P profvis        0.3.8   2023-05-02 [?] RSPM (R 4.4.0)
#>  P promises       1.3.0   2024-04-05 [?] RSPM
#>  P ps             1.7.6   2024-01-18 [?] RSPM
#>  P purrr        * 1.0.2   2023-08-10 [?] RSPM
#>  P quarto       * 1.4     2024-03-06 [?] RSPM
#>  P R.cache        0.16.0  2022-07-21 [?] RSPM
#>  P R.methodsS3    1.8.2   2022-06-13 [?] RSPM
#>  P R.oo           1.26.0  2024-01-24 [?] RSPM
#>  P R.utils        2.12.3  2023-11-18 [?] RSPM
#>  P R6             2.5.1   2021-08-19 [?] RSPM
#>  P Rcpp           1.0.12  2024-01-09 [?] RSPM
#>  P readr        * 2.1.5   2024-01-10 [?] RSPM
#>  P remotes        2.5.0   2024-03-17 [?] RSPM (R 4.4.0)
#>  P renv           1.0.7   2024-04-11 [?] RSPM (R 4.4.0)
#>  P rlang          1.1.3   2024-01-10 [?] RSPM
#>  P rmarkdown      2.26    2024-03-05 [?] RSPM
#>  P rstudioapi     0.16.0  2024-03-24 [?] RSPM
#>  P Rttf2pt1       1.3.12  2023-01-22 [?] RSPM
#>  P scales         1.3.0   2023-11-28 [?] RSPM
#>  P sessioninfo    1.2.2   2021-12-06 [?] RSPM (R 4.4.0)
#>  P shiny          1.8.1.1 2024-04-02 [?] RSPM (R 4.4.0)
#>  P stringi        1.8.3   2023-12-11 [?] RSPM
#>  P stringr      * 1.5.1   2023-11-14 [?] RSPM
#>  P styler       * 1.10.3  2024-04-07 [?] RSPM
#>  P tibble       * 3.2.1   2023-03-20 [?] RSPM
#>  P tidyr        * 1.3.1   2024-01-24 [?] RSPM
#>  P tidyselect     1.2.1   2024-03-11 [?] RSPM
#>  P tidyverse    * 2.0.0   2023-02-22 [?] RSPM
#>  P timechange     0.3.0   2024-01-18 [?] RSPM
#>  P tzdb           0.4.0   2023-05-12 [?] RSPM
#>  P urlchecker     1.0.1   2021-11-30 [?] RSPM (R 4.4.0)
#>  P usethis        2.2.3   2024-02-19 [?] RSPM (R 4.4.0)
#>  P utf8           1.2.4   2023-10-22 [?] RSPM
#>  P vctrs          0.6.5   2023-12-01 [?] RSPM
#>  P vroom          1.6.5   2023-12-05 [?] RSPM
#>  P withr          3.0.0   2024-01-16 [?] RSPM
#>  P xfun           0.43    2024-03-25 [?] RSPM
#>  P xtable         1.8-4   2019-04-21 [?] RSPM (R 4.4.0)
#>  P yaml           2.3.8   2023-12-11 [?] RSPM
#> 
#>  [1] /tmp/RtmpAEibNS/renv-use-libpath-25a31f3c131d5d
#>  [2] /tmp/RtmpAEibNS/renv-sandbox
#> 
#>  P ── Loaded and on-disk path mismatch.
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────