PySpark is the de facto standard tool for handling big data today.
If you're interested in using PySpark to process full snapshots and catalogs from IllustrisTNG,
feel free to check out my tutorials on GitHub (minimal sketches of the main steps follow the reference list below).
[1] Converting HDF5 to Parquet
https://github.com/shongscience/illustris-pyspark/blob/main/tutorial/notebook/snapshot/tng-snapshot-convert-h5-to-parquet-v1-pilot-eda.ipynb
[2] Counting All DM Particles from Converted Snapshot Parquets
https://github.com/shongscience/illustris-pyspark/blob/main/tutorial/notebook/snapshot/tng50-snap84-test-conversion-eda-v1.ipynb
[3] Processing the Group Catalog with Spark Dataframe and Pandas
https://github.com/shongscience/illustris-pyspark/blob/main/tutorial/notebook/groupcatalog/tutorial-tng300vs50-groupcat-eda-v1-pilot.ipynb
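To give a rough idea of step [1], here is a minimal sketch of converting one snapshot chunk from HDF5 to Parquet with h5py and pyarrow. The file names, the choice of PartType1 fields, and the output column names are only examples, not the exact code in the notebook:

    import h5py
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    src = "snap_084.0.hdf5"          # one HDF5 chunk of a snapshot (example path)
    dst = "snap_084.0.dm.parquet"    # per-chunk Parquet output (example path)

    with h5py.File(src, "r") as f:
        coords = f["PartType1/Coordinates"][:]   # DM particle positions, shape (N, 3)
        pids   = f["PartType1/ParticleIDs"][:]

    table = pa.table({
        "pid": pids,
        "x": coords[:, 0].astype(np.float64),
        "y": coords[:, 1].astype(np.float64),
        "z": coords[:, 2].astype(np.float64),
    })
    pq.write_table(table, dst, compression="snappy")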
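And a similarly minimal sketch for steps [2] and [3]: reading the converted Parquet files back with PySpark, counting all DM particles, and pulling a small group-catalog selection into pandas. The paths, column names, and the mass cut are placeholders, not the notebooks' exact values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tng-dm-count").getOrCreate()

    # Read every per-chunk Parquet file produced by the conversion step.
    dm = spark.read.parquet("hdfs:///tng50/snap084/dm/")        # example path
    print("total DM particles:", dm.count())

    # Group-catalog style workflow: filter/aggregate in Spark,
    # then pull the small result into pandas for plotting.
    halos = spark.read.parquet("hdfs:///tng50/groupcat084/")    # example path
    massive = (halos
               .filter(halos["GroupMass"] > 100.0)              # example column and cut
               .select("GroupMass", "GroupPos_x", "GroupPos_y", "GroupPos_z")
               .toPandas())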
(FYI, my YARN cluster has 270 vCPUs and 2.0 TB of memory.)
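For reference, starting a Spark session on YARN at roughly that scale looks like the sketch below; the executor counts and memory settings are illustrative assumptions, not a tuned recommendation:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("tng-snapshot-eda")
             .master("yarn")
             .config("spark.executor.instances", "50")   # assumed sizing, adjust to your cluster
             .config("spark.executor.cores", "5")
             .config("spark.executor.memory", "32g")
             .getOrCreate())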
Suggestion:
HDF5 is not well suited to cloud and distributed storage systems such as S3, Google Cloud Storage (GCS), or Hadoop's HDFS.
In practice, I always download HDF5 files to local storage before further processing, rather than working directly on distributed file systems.
I hope that someday HDF5 can be restructured into the Apache Arrow format to enable seamless integration with S3, GCS, or Hadoop environments.
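This is exactly where the Parquet route pays off: once converted, Spark can scan the data in place on HDFS, S3, or GCS, with no local download step. A minimal sketch, assuming the buckets/paths below are placeholders and the relevant connectors (e.g. hadoop-aws for s3a, the GCS connector for gs) are installed on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tng-parquet-on-cloud").getOrCreate()

    # The same read works against distributed and cloud storage alike.
    dm_hdfs = spark.read.parquet("hdfs:///tng50/snap084/dm/")
    dm_s3   = spark.read.parquet("s3a://my-bucket/tng50/snap084/dm/")
    dm_gcs  = spark.read.parquet("gs://my-bucket/tng50/snap084/dm/")
    print(dm_hdfs.count())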
Thank you, and happy discoveries in the IllustrisTNG mines!