If you are new to R, chances are that you have used read.csv() to import CSV files. While this function works perfectly fine, it can only read one file at a time. Hence, new R programmers often read multiple files one after another and combine the data afterward.
While this works fine with only a few files, it becomes tedious as the number of files grows. A better approach is to build a vector of file names and read them all at once. For quite a while, I have been doing this with map_df() from the purrr package in combination with read_csv() from readr.
# Load purrr (map_df) and readr (read_csv), both attached by the tidyverse
library(tidyverse)
# Create a vector of file names
files <- c("file1.csv", "file2.csv", "file3.csv", "file4.csv", "file5.csv")
# Read and combine all data files into a single data frame
big_df <- map_df(files, read_csv)
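As a side note, purrr 1.0.0 superseded map_df() in favour of map() followed by list_rbind(). Here is a minimal sketch of the equivalent call, assuming purrr >= 1.0.0 and R >= 4.1. The temporary directory, file names and demo_* objects are made up for illustration.

```r
library(readr)
library(purrr)

# Hypothetical demo files written to a temporary directory
demo_dir <- tempfile("demo_csv_")
dir.create(demo_dir)
write_csv(data.frame(id = 1:2, value = c("a", "b")), file.path(demo_dir, "file1.csv"))
write_csv(data.frame(id = 3:4, value = c("c", "d")), file.path(demo_dir, "file2.csv"))
demo_files <- file.path(demo_dir, c("file1.csv", "file2.csv"))

# Read each file into a list of tibbles, then row-bind the list
demo_df <- map(demo_files, \(f) read_csv(f, show_col_types = FALSE)) |>
  list_rbind()
```

The result should be the same single data frame that map_df() returns, just built in two explicit steps.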
Since the release of readr 2.0.0, the read_csv() function can directly take a vector of files as input, eliminating the need for the map_df() function. Hence, we can now read multiple files as follows:
# Read and combine all data files into a single data frame without using the
# map_df function
big_df <- read_csv(files)
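When combining files this way, it is often useful to know which file each row came from. read_csv() has an id argument for that. Below is a small sketch, assuming readr >= 2.0.0 and files that share the same columns. The temporary files and the column name source_file are only for illustration.

```r
library(readr)

# Hypothetical demo files written to a temporary directory
demo_dir <- tempfile("demo_csv_")
dir.create(demo_dir)
write_csv(data.frame(x = 1:2), file.path(demo_dir, "a.csv"))
write_csv(data.frame(x = 3:5), file.path(demo_dir, "b.csv"))
demo_files <- file.path(demo_dir, c("a.csv", "b.csv"))

# id = "source_file" adds a column holding the path each row was read from
demo_df <- read_csv(demo_files, id = "source_file", show_col_types = FALSE)
```

Here demo_df has one row per input row plus a source_file column, which can then be used to group or filter by file.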
In this short blog post, I wanted to benchmark the speed difference between map_df(files, read_csv) and read_csv(files). To do so, let's first generate some data files.
Photo by Marc Sendra Martorell on Unsplash
library(nycflights13)
# Write one CSV file per carrier: .x is the carrier's flights, .y the carrier code
purrr::iwalk(
  split(flights, flights$carrier),
  ~ data.table::fwrite(.x, glue::glue("/tmp/flights_{.y}.csv"))
)
files <- fs::dir_ls(path = "/tmp", glob = "*flights*csv")
files
#> /tmp/flights_9E.csv /tmp/flights_AA.csv /tmp/flights_AS.csv /tmp/flights_B6.csv
#> /tmp/flights_DL.csv /tmp/flights_EV.csv /tmp/flights_F9.csv /tmp/flights_FL.csv
#> /tmp/flights_HA.csv /tmp/flights_MQ.csv /tmp/flights_OO.csv /tmp/flights_UA.csv
#> /tmp/flights_US.csv /tmp/flights_VX.csv /tmp/flights_WN.csv /tmp/flights_YV.csv
We can look at what the data look like.
read_csv(files[[1]])
#> # A tibble: 18,460 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 810 810 0 1048 1037
#> 2 2013 1 1 1451 1500 -9 1634 1636
#> 3 2013 1 1 1452 1455 -3 1637 1639
#> 4 2013 1 1 1454 1500 -6 1635 1636
#> 5 2013 1 1 1507 1515 -8 1651 1656
#> 6 2013 1 1 1530 1530 0 1650 1655
#> 7 2013 1 1 1546 1540 6 1753 1748
#> 8 2013 1 1 1550 1550 0 1844 1831
#> 9 2013 1 1 1552 1600 -8 1749 1757
#> 10 2013 1 1 1554 1600 -6 1701 1734
#> # ℹ 18,450 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <dbl>, flight <dbl>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
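Now that the data files have been created, we can compare the two reading options. Because carrier was guessed as a double in the file shown above, both calls set its column type to character explicitly.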
res <- microbenchmark::microbenchmark(
map_df_read_csv = map_df(files, read_csv, col_types = cols(carrier = col_character())),
read_csv = read_csv(files, col_types = cols(carrier = col_character())),
times = 100
)
res
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> map_df_read_csv 622.2706 691.0963 724.5537 747.4462 757.6883 799.8428 100
#> read_csv 177.7021 183.8535 192.6876 187.7292 193.5388 311.9672 100
autoplot(res)
Based on the median timings (about 188 ms versus 747 ms), using read_csv() directly is roughly four times faster than the map_df(files, read_csv) combination.
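The size of the speed-up can be checked directly from the benchmark table. The sketch below simply copies the two median values reported above into a vector (median_ms and speedup are names chosen for this illustration) and takes their ratio.

```r
# Median timings (milliseconds) copied from the benchmark output above
median_ms <- c(map_df_read_csv = 747.4462, read_csv = 187.7292)

# How many times faster read_csv(files) is, based on the medians
speedup <- unname(median_ms["map_df_read_csv"] / median_ms["read_csv"])
round(speedup, 2)
#> [1] 3.98
```

These numbers come from a single machine and a single dataset, so the exact ratio will likely differ elsewhere.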
Session info
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.0 (2024-04-24)
#> os Linux Mint 21.3
#> system x86_64, linux-gnu
#> ui X11
#> language en_CA:en
#> collate en_CA.UTF-8
#> ctype en_CA.UTF-8
#> tz America/Montreal
#> date 2024-05-03
#> pandoc 2.9.2.1 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> ! package * version date (UTC) lib source
#> P bit 4.0.5 2022-11-15 [?] RSPM
#> P bit64 4.0.5 2020-08-30 [?] RSPM
#> P cachem 1.0.8 2023-05-01 [?] RSPM
#> P cli 3.6.2 2023-12-11 [?] RSPM
#> P colorspace 2.1-0 2023-01-23 [?] RSPM
#> P crayon 1.5.2 2022-09-29 [?] RSPM
#> P data.table 1.15.4 2024-03-30 [?] RSPM
#> P devtools 2.4.5 2022-10-11 [?] RSPM (R 4.4.0)
#> P digest 0.6.35 2024-03-11 [?] RSPM
#> P dplyr * 1.1.4 2023-11-17 [?] RSPM
#> P ellipsis 0.3.2 2021-04-29 [?] RSPM
#> P evaluate 0.23 2023-11-01 [?] RSPM
#> P extrafont 0.19 2023-01-18 [?] RSPM
#> P extrafontdb 1.0 2012-06-11 [?] RSPM
#> P fansi 1.0.6 2023-12-08 [?] RSPM
#> P fastmap 1.1.1 2023-02-24 [?] RSPM
#> P forcats * 1.0.0 2023-01-29 [?] RSPM
#> P fs 1.6.4 2024-04-25 [?] CRAN (R 4.4.0)
#> P generics 0.1.3 2022-07-05 [?] RSPM
#> P ggplot2 * 3.5.1 2024-04-23 [?] RSPM
#> P ggpmthemes * 0.0.2 2024-04-25 [?] Github (pmassicotte/ggpmthemes@993d61e)
#> P glue 1.7.0 2024-01-09 [?] RSPM
#> P gtable 0.3.5 2024-04-22 [?] RSPM
#> P hms 1.1.3 2023-03-21 [?] RSPM
#> P htmltools 0.5.8.1 2024-04-04 [?] RSPM
#> P htmlwidgets 1.6.4 2023-12-06 [?] RSPM
#> P httpuv 1.6.15 2024-03-26 [?] RSPM
#> P jsonlite 1.8.8 2023-12-04 [?] RSPM
#> P knitr 1.46 2024-04-06 [?] RSPM
#> P later 1.3.2 2023-12-06 [?] RSPM
#> P lifecycle 1.0.4 2023-11-07 [?] RSPM
#> P lubridate * 1.9.3 2023-09-27 [?] RSPM
#> P magrittr 2.0.3 2022-03-30 [?] RSPM
#> P memoise 2.0.1 2021-11-26 [?] RSPM
#> P mime 0.12 2021-09-28 [?] RSPM
#> P miniUI 0.1.1.1 2018-05-18 [?] RSPM (R 4.4.0)
#> P munsell 0.5.1 2024-04-01 [?] RSPM
#> P nycflights13 * 1.0.2 2021-04-12 [?] RSPM
#> P pillar 1.9.0 2023-03-22 [?] RSPM
#> P pkgbuild 1.4.4 2024-03-17 [?] RSPM (R 4.4.0)
#> P pkgconfig 2.0.3 2019-09-22 [?] RSPM
#> P pkgload 1.3.4 2024-01-16 [?] RSPM (R 4.4.0)
#> P processx 3.8.4 2024-03-16 [?] RSPM
#> P profvis 0.3.8 2023-05-02 [?] RSPM (R 4.4.0)
#> P promises 1.3.0 2024-04-05 [?] RSPM
#> P ps 1.7.6 2024-01-18 [?] RSPM
#> P purrr * 1.0.2 2023-08-10 [?] RSPM
#> P quarto * 1.4 2024-03-06 [?] RSPM
#> P R.cache 0.16.0 2022-07-21 [?] RSPM
#> P R.methodsS3 1.8.2 2022-06-13 [?] RSPM
#> P R.oo 1.26.0 2024-01-24 [?] RSPM
#> P R.utils 2.12.3 2023-11-18 [?] RSPM
#> P R6 2.5.1 2021-08-19 [?] RSPM
#> P Rcpp 1.0.12 2024-01-09 [?] RSPM
#> P readr * 2.1.5 2024-01-10 [?] RSPM
#> P remotes 2.5.0 2024-03-17 [?] RSPM (R 4.4.0)
#> P renv 1.0.7 2024-04-11 [?] RSPM (R 4.4.0)
#> P rlang 1.1.3 2024-01-10 [?] RSPM
#> P rmarkdown 2.26 2024-03-05 [?] RSPM
#> P rstudioapi 0.16.0 2024-03-24 [?] RSPM
#> P Rttf2pt1 1.3.12 2023-01-22 [?] RSPM
#> P scales 1.3.0 2023-11-28 [?] RSPM
#> P sessioninfo 1.2.2 2021-12-06 [?] RSPM (R 4.4.0)
#> P shiny 1.8.1.1 2024-04-02 [?] RSPM (R 4.4.0)
#> P stringi 1.8.3 2023-12-11 [?] RSPM
#> P stringr * 1.5.1 2023-11-14 [?] RSPM
#> P styler * 1.10.3 2024-04-07 [?] RSPM
#> P tibble * 3.2.1 2023-03-20 [?] RSPM
#> P tidyr * 1.3.1 2024-01-24 [?] RSPM
#> P tidyselect 1.2.1 2024-03-11 [?] RSPM
#> P tidyverse * 2.0.0 2023-02-22 [?] RSPM
#> P timechange 0.3.0 2024-01-18 [?] RSPM
#> P tzdb 0.4.0 2023-05-12 [?] RSPM
#> P urlchecker 1.0.1 2021-11-30 [?] RSPM (R 4.4.0)
#> P usethis 2.2.3 2024-02-19 [?] RSPM (R 4.4.0)
#> P utf8 1.2.4 2023-10-22 [?] RSPM
#> P vctrs 0.6.5 2023-12-01 [?] RSPM
#> P vroom 1.6.5 2023-12-05 [?] RSPM
#> P withr 3.0.0 2024-01-16 [?] RSPM
#> P xfun 0.43 2024-03-25 [?] RSPM
#> P xtable 1.8-4 2019-04-21 [?] RSPM (R 4.4.0)
#> P yaml 2.3.8 2023-12-11 [?] RSPM
#>
#> [1] /tmp/RtmpAEibNS/renv-use-libpath-25a31f3c131d5d
#> [2] /tmp/RtmpAEibNS/renv-sandbox
#>
#> P ── Loaded and on-disk path mismatch.
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────