I made a post about working with distributed file systems (like HDFS or S3) from PySpark without additional dependencies. The idea is to call the JVM's "org.apache.hadoop.fs.FileSystem" API directly via Py4J.
#pyspark #hadoop #s3 #hdfs #py4j
https://semyonsinchenko.github.io/ssinchenko/post/working-with-fs-pyspark/
Working With File System from PySpark
Motivation

All of us work with the file system in our jobs. Almost every pipeline or application has some kind of file-based configuration, typically JSON or YAML files. For data pipelines, it is also sometimes important to write results or state in a human-readable format, or to serialize artifacts, like a matplotlib plot, into bytes and write them to disk.