## Apache Parquet

Excellent all-around load times, plus portability to non-Python tools.

- Write: `df.to_parquet(file_name)`
- Read: `df = pd.read_parquet(file_name)`
- Requires either the [PyArrow](https://arrow.apache.org/docs/python/install.html) (`pyarrow`, recommended) or [fastparquet](https://github.com/dask/fastparquet) (`fastparquet`) engine.

## CSVs on GPUs with [cuDF](https://docs.rapids.ai/api/cudf/stable/)

For ultra-fast loading of even huge CSV files:

- Write: `df.to_csv(file_name)`
- Read: `df = cudf.read_csv(file_name)`
- Requires [installing and setting up cuDF](https://github.com/rapidsai/cudf?tab=readme-ov-file#installation), plus a CUDA-capable GPU.

## [PyTables](https://www.pytables.org/usersguide/introduction.html) HDF5

Use PyTables for speed with large datasets (>50 MB).

- Write: `df.to_hdf(file_name, key="df")` (a `key` is required, since one HDF5 file can hold several objects)
- Read: `df = pd.read_hdf(file_name)`
- Requires [installing](https://www.pytables.org/usersguide/installation.html#id1) the `tables` package.

Minimal round-trip sketches for each option follow below.

Ref: [Loading data into a Pandas DataFrame – a performance study](https://www.architecture-performance.fr/ap_blog/loading-data-into-a-pandas-dataframe-a-performance-study/)
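## Minimal examples

A minimal Parquet round-trip, assuming `pyarrow` is installed (the file name `data.parquet` is arbitrary):

```python
import pandas as pd

# Any DataFrame works; this one mixes numeric and string columns.
df = pd.DataFrame({"id": [1, 2, 3], "label": ["x", "y", "z"]})

# Write, then read back; pandas picks the pyarrow engine when it is installed.
df.to_parquet("data.parquet")
df2 = pd.read_parquet("data.parquet")

assert df.equals(df2)  # values survive the round trip
```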
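A cuDF sketch, assuming cuDF is installed per the link above and a CUDA GPU is available; `data.csv` and `out.csv` are hypothetical file names:

```python
import cudf  # GPU DataFrame library from RAPIDS; needs a CUDA GPU

# Read the CSV directly into GPU memory.
gdf = cudf.read_csv("data.csv")

# Much of the pandas API works unchanged on the GPU frame.
print(gdf.head())

# Write back out from the GPU, or hand the data to pandas on the host.
gdf.to_csv("out.csv", index=False)
df = gdf.to_pandas()
```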
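An HDF5 round-trip, assuming the `tables` package is installed; the key name `"df"` is arbitrary:

```python
import pandas as pd  # HDF5 support comes from the `tables` package

df = pd.DataFrame({"value": range(5)})

# One HDF5 file can store several objects, so each needs a key.
df.to_hdf("data.h5", key="df", mode="w")
df2 = pd.read_hdf("data.h5", key="df")

assert df.equals(df2)
```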