How to export .csv files from R >100x faster for big data

TLDR; You should use data.table::fwrite() to export .csv files from R, because it is more than 100 times faster than the other methods. If you want to export your data as an Excel .xlsx file, you should use openxlsx, but it is not recommended for big data. If you want to learn more about Data Science in R, I recommend this book*.

Image by author

Why do you have to export to .csv?

There is this case when you want to exchange your data with customers or other people like your colleagues, and for some of them a database format like SQLite is not usable, so you have to move to other options like Excel. I do not like Excel either, but it is still widely used and a common format, so as a data analyst you have to work with it. That is why I recently compared the export options for R, to save time and frustration.

Create a fake dataset with fakir

{fakir} is a good library from ThinkR for creating fake data; you should check it out via this link. For the test, I created a fairly big table with 1,000,000 rows and 25 columns. That is of course not big data, but it is already hard to handle in Excel, which was enough for the test purpose. I compared 4 common libraries and 7 methods, see below. For additional comparison, I also added an .xlsx export.

Test Setup:

My PC has 8 GB RAM and an Intel Core i5 @ 2.4 GHz with 4 cores, so nothing special. If you have a better PC, the export should be faster.
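If you want to compare against your own machine, base R can report the basics (a small sketch; nothing here is specific to the benchmark):

## Check your own setup from R
parallel::detectCores()   # number of logical cores
Sys.info()[["sysname"]]   # operating system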

## Import libraries
library(tidyverse)
# install.packages("fakir",
# repos = c("thinkropen" = "https://thinkr-open.r-universe.dev"))
# install.packages("charlatan")
library(fakir)
library(data.table)
library(openxlsx)
library(xlsx)
library(vroom)
## Construct dataset
data <- fake_ticket_client(vol = 1000000)
Test Dataset, Image by author
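A quick sanity check on the generated table (a minimal sketch, using only the libraries loaded above):

## Optional sanity check on the fake dataset
dim(data)             # expect 1000000 rows and 25 columns
dplyr::glimpse(data)  # column names and types at a glance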

Next, I wrapped each method in a pair of Sys.time() calls and took the difference between start and end, see the code below. I also saved all outputs to compare the file sizes.
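If you want to avoid repeating the boilerplate, the same timing pattern can be wrapped in a small helper (a minimal sketch; benchmark_export() is my own convenience function, not from any package):

## Optional helper that wraps the Sys.time() pattern used below.
benchmark_export <- function(label, expr) {
  start.time <- Sys.time()
  force(expr)  # lazy evaluation: the export runs here, inside the timing
  time.taken <- Sys.time() - start.time
  print(paste0(label, ":"))
  print(time.taken)
  invisible(time.taken)
}
# Usage: benchmark_export("fwrite", data.table::fwrite(data, "fwrite_export.csv"))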

## 1 Benchmark openxlsx
start.time <- Sys.time()
openxlsx::write.xlsx(data, file = "openxslx_export.xlsx")
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
# Time difference of 6.246952 mins
## 2 Benchmark xlsx
start.time <- Sys.time()
xlsx::write.xlsx(data, "xslx_export.xlsx", sheetName = "Sheet1",
col.names = TRUE, row.names = TRUE, append = FALSE)
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
# Time difference of 7.059051 mins
## 3 Benchmark xlsx2
start.time <- Sys.time()
xlsx::write.xlsx2(data, "xslx2_export.xlsx", sheetName = "Sheet1",
col.names = TRUE, row.names = TRUE, append = FALSE)
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
# JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError"
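The xlsx package runs on rJava, so it was the JVM heap that ran out of memory here, not R itself. A possible workaround, which I did not test in this benchmark, is to raise the Java heap limit before the package is loaded:

## Possible fix for the Java OutOfMemoryError (untested here):
## the option must be set *before* rJava/xlsx are loaded.
options(java.parameters = "-Xmx4g")  # allow the JVM up to 4 GB
library(xlsx)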
## 4 Benchmark CSV export utils
start.time <- Sys.time()
utils::write.csv(data, file = "utils_write.csv_export.csv")
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
# Time difference of 1.901615 mins
## 5 Benchmark CSV2 export utils
start.time <- Sys.time()
utils::write.csv2(data, file = "utils_write.csv2_export.csv")
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
# Time difference of 3.071534 mins
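Note that write.csv2() writes semicolon-separated values with a comma as the decimal mark (the European convention), which is why it is timed as a separate method.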
## 6 Benchmark CSV data.table fwrite
start.time <- Sys.time()
data.table::fwrite(data, file = "fwrite_export.csv")
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
# Time difference of 1.205034 secs
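Part of fwrite's speed comes from multithreading. It picks a sensible default, but the thread count can also be controlled explicitly (a small sketch using options that data.table provides):

## fwrite is multithreaded; the thread count can be tuned if needed.
data.table::setDTthreads(4)  # cap data.table at 4 threads
data.table::fwrite(data, file = "fwrite_export.csv",
                   nThread = data.table::getDTthreads())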
## 7 Benchmark CSV vroom
start.time <- Sys.time()
vroom::vroom_write(data, file = "vroom_write_export.csv", delim = ",")
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
# Time difference of 14.00942 secs

The Results

As a result, you can see that of the 7 methods, only 6 worked for me. xlsx::write.xlsx2 did not work and threw a Java out-of-memory error; maybe my machine has too little memory, but that is a useful result in itself. Below, you see the timings: data.table::fwrite won by a big margin, ahead of {utils} by more than 100 seconds. The xlsx exports take a lot longer, more than 6 minutes!

Image by author

In terms of output size, openxlsx::write.xlsx produces a file about half the size of the .csv files.

Image by author

If you compare both dimensions, you can decide which one you want to optimize. I would go for fwrite or vroom and prefer to save time, since they are about 100 times faster while the output is only about 2 times bigger. Below you will find the code 👇🏻
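If file size matters more to you, a middle ground I did not benchmark above is writing compressed .csv directly. Both fwrite and vroom support this (a small sketch; the compression options shown come from the respective packages):

## Gzip-compressed CSV straight from fwrite:
data.table::fwrite(data, file = "fwrite_export.csv.gz", compress = "gzip")
## vroom_write compresses automatically based on the file extension:
vroom::vroom_write(data, file = "vroom_export.csv.gz", delim = ",")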

What would you prefer? Let me know in the comments!

Image by author

Thanks for reading my article, I hope you liked it. Please feel free to like, share, and comment on it. Follow me for more content about cryptos, stocks, data analysis, and web development.

  • Read my other article about stocks and reddit:

Here you can find the code:

If you want to learn more about Data Science in R, I recommend this book*.

Antonio Blago

I am a data analyst/scientist and web developer in Python, discovering the unlimited world of coding and data. Visit me at www.antonioblago.com.