I made a post about working with distributed file systems (like HDFS or S3) from PySpark without additional dependencies. The idea is to call the JVM's "org.apache.hadoop.fs.FileSystem" API directly via Py4J.
#pyspark #hadoop #s3 #hdfs #py4j
https://semyonsinchenko.github.io/ssinchenko/post/working-with-fs-pyspark/
Working With File System from PySpark
Motivation

All of us work with the file system in our jobs. Almost every pipeline or application has some kind of file-based configuration, typically JSON or YAML files. For data pipelines, it is also sometimes important to write results or state in a human-readable format, or to serialize artifacts, like a matplotlib plot, into bytes and write them to disk.