TIL about the Pile dataset: 886 GB of text data (not 886 TB, oops), created in 2020, which can be used for various purposes, including LLM training.

#Programming #Pile #program #OpenSource #LLM #slop #AI #technology #dataset

https://en.wikipedia.org/wiki/The_Pile_%28dataset%29?wprov=sfla1

@Dendrobatus_Azureus

I wonder what bzip3 would do to it.

@rl_dane

When I have the bandwidth, I will download the set and play with it, including with archivers.
However, I'm certain it's available in zip format.
Check the torrents.

@Dendrobatus_Azureus

I don't have the room for it, lol.

My curiosity and love for lossless compression are making me want to find some, though. XD