Data processing and feature engineering are important and time-consuming steps when developing machine learning models. I usually end up with several different versions of the data after this stage: for example, one file containing only normalized data, one containing standardized data, and several files with different feature subsets after feature selection. I then apply the appropriate machine learning models to each file and compare the cross-validation results.
It is also important to note that saving a file to disk and loading it back involves serialization and deserialization, which adds to the time taken.
So the time to save the data to disk, the time to load it back, and the memory used are important factors, and choosing the right file format can save us a lot of time.
Data Formats
- csv
- h5
- pytables(hdf5)
- npy
- npz
- joblib
- Todo: hickle.
- Dropped: pickle (pickle cannot handle larger data sizes; based on my experiments, the pickle format hits a tipping point at around 2 GB).
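For reference, here is a minimal sketch of how a single array can be written and read back in each of these formats. The file names, and the choice of h5py for the plain h5 files and PyTables for the hdf5 ones, are my own illustration, not necessarily the exact calls used in the benchmark:

```python
import numpy as np
import h5py
import tables
import joblib

arr = np.random.rand(1000, 1000)

# csv: plain text, one row per line
np.savetxt("arr.csv", arr, delimiter=",")
arr_csv = np.loadtxt("arr.csv", delimiter=",")

# h5: a binary HDF5 dataset via h5py
with h5py.File("arr.h5", "w") as f:
    f.create_dataset("data", data=arr)
with h5py.File("arr.h5", "r") as f:
    arr_h5 = f["data"][:]

# pytables (hdf5): HDF5 through the PyTables API
with tables.open_file("arr_pt.h5", mode="w") as f:
    f.create_array(f.root, "data", arr)
with tables.open_file("arr_pt.h5", mode="r") as f:
    arr_pt = f.root.data.read()

# npy / npz: NumPy's native binary formats
np.save("arr.npy", arr)
arr_npy = np.load("arr.npy")
np.savez("arr.npz", data=arr)
arr_npz = np.load("arr.npz")["data"]

# joblib: persistence optimised for objects containing large NumPy arrays
joblib.dump(arr, "arr.joblib")
arr_jl = joblib.load("arr.joblib")
```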
Array Sizes
Instead of benchmarking just one size, I took an asymptotic-analysis approach: measure the size and time across a range of array sizes (a sketch of how such arrays can be generated follows the list).
- 100 x 100
- 1000 x 1000
- 10000 x 10000
- 20000 x 20000
- 30000 x 30000
- 40000 x 40000
- 50000 x 50000
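The arrays themselves are plain dense float64 matrices. The use of `np.random.rand` below is my assumption; any dense float64 data gives the same sizes:

```python
import numpy as np

sizes = [100, 1_000, 10_000, 20_000, 30_000, 40_000, 50_000]

def make_array(n):
    # Dense float64 matrix; a 50000 x 50000 one alone needs ~20 GB of RAM,
    # so arrays are generated one at a time rather than all kept in memory.
    return np.random.rand(n, n)
```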
In-Memory Sizes
The following table shows the in-memory size consumed after creating each array (a quick check with NumPy follows the table):
| Array Size | In Memory (MB) |
|---|---|
| 100 x 100 | 0.08 |
| 1000 x 1000 | 8 |
| 10000 x 10000 | 800 |
| 20000 x 20000 | 3200 |
| 30000 x 30000 | 7200 |
| 40000 x 40000 | 12800 |
| 50000 x 50000 | 20000 |
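These numbers match what NumPy reports for dense float64 arrays (8 bytes per element); a quick sanity check using `nbytes`:

```python
import numpy as np

for n in [100, 1_000, 10_000]:
    arr = np.random.rand(n, n)
    # nbytes = number of elements * 8 bytes per float64 element
    print(n, arr.nbytes / 1e6, "MB")
# 100   -> 0.08 MB
# 1000  -> 8.0 MB
# 10000 -> 800.0 MB
```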
Let's see the same data visually:
Size on disk (in MB):
- The CSV format consumed the largest size on disk compared to the other data formats.
- CSV consumes roughly 3x the disk space of each of the other formats.
- We can see from the chart below that the size of the CSV file grows much faster than the others and pulls far away from them.
- Surprisingly, except for CSV, all the other file formats consume roughly the same disk space as their in-memory size.
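The on-disk sizes can be read back with `os.path.getsize`; a small sketch, assuming the files written in the earlier snippet:

```python
import os

for path in ["arr.csv", "arr.h5", "arr_pt.h5", "arr.npy", "arr.npz", "arr.joblib"]:
    if os.path.exists(path):
        print(path, round(os.path.getsize(path) / 1e6, 2), "MB")
```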
Time to save files to disk (in minutes):
- Overall, h5 and hdf5 take the least time to save, followed by joblib and npy, and then npz.
- CSV again performed the worst.
- On average, CSV takes about 70x longer than the fastest of the other formats.
- As you can see from the visual below, CSV performs the worst.
- While the other formats take less than a minute to save a 50000 x 50000 array, CSV takes approximately 23 minutes.
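A minimal sketch of how such save timings can be collected with `time.perf_counter` (the benchmark's actual timing harness may differ):

```python
import time
import numpy as np

def time_save(arr, save_fn):
    """Return the wall-clock time in seconds for a single save call."""
    start = time.perf_counter()
    save_fn(arr)
    return time.perf_counter() - start

arr = np.random.rand(10_000, 10_000)
elapsed = time_save(arr, lambda a: np.save("arr.npy", a))
print(f"npy save: {elapsed / 60:.2f} minutes")
```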
Time to load files from disk (in minutes):
- Overall, the npy format performs the best.
- CSV performs the worst.
- On average, CSV takes about 140x longer than npy to load the data from disk.
- As you can see from the visual below, the time taken to load CSV increases much more steeply across the array sizes, whereas the other file formats show a roughly linear pattern.
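Load times can be measured the same way; for example, comparing CSV and npy for the same array (files from the earlier snippets; again only a sketch):

```python
import time
import numpy as np

for name, load_fn in [("csv", lambda: np.loadtxt("arr.csv", delimiter=",")),
                      ("npy", lambda: np.load("arr.npy"))]:
    start = time.perf_counter()
    load_fn()
    print(f"{name} load: {(time.perf_counter() - start) / 60:.2f} minutes")
```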
Strange results:
- During my experiments, I used the memory_profiler package to check the in-memory size after loading each file format. I noticed strange results: sometimes the reported memory for the largest array was smaller than for smaller arrays. I still have to investigate this.
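For reference, this is the kind of check memory_profiler allows; a sketch of the idea, not the exact script I used:

```python
import numpy as np
from memory_profiler import memory_usage

def load_npy():
    return np.load("arr.npy")

# memory_usage samples the process memory (in MiB) while load_npy runs
samples = memory_usage((load_npy, (), {}))
print("peak memory during load:", max(samples), "MiB")
```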