hey so remember how I wanted a project gutenberg corpus with every plaintext file in an easy-to-use format? https://mastodon.social/@aparrish/100511033258021934

well I wanted it so bad I guess that I went ahead and made it https://github.com/aparrish/gutenberg-dammit

aparrish/gutenberg-dammit

gutenberg-dammit - I wanted all of plaintext Project Gutenberg in an easy-to-use format, so I made this

a quick exercise with this corpus: "Flower blank," alphabetized bigrams beginning with "flowers" from every Project Gutenberg book labelled as "Poetry"

https://gist.github.com/aparrish/fdcbdbd50e363995d086f0e0c3f95769

excerpt:

flowers a
flowers ablaze
flowers about
flowers above
flowers absorb
flowers accompanying
flowers adorn
flowers advance
flowers afford
flowers affray
flowers aflame
flowers after
flowers again
flowers against
flowers alighting
flowers alive
flowers all
flowers allied
flowers ally
flowers almost
flowers aloft
...

Flower blank / alphabetized bigrams beginning with "flowers" from every Project Gutenberg book labelled as "poetry"
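
for anyone who wants to try the same kind of exercise, here is a minimal sketch of how a list like this could be built. it is not the code from the gist: the function name `bigrams_starting_with`, the regex tokenizer, and the `sample_texts` strings are all made up for illustration, and it assumes you already have the poetry texts loaded as plain Python strings.

```python
import re


def bigrams_starting_with(texts, first_word="flowers"):
    """Return the unique bigrams whose first word is `first_word`, alphabetized."""
    found = set()
    for text in texts:
        # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
        # Punctuation is dropped, so bigrams can span commas and line breaks.
        words = re.findall(r"[a-z']+", text.lower())
        for first, second in zip(words, words[1:]):
            if first == first_word:
                found.add(f"{first} {second}")
    return sorted(found)


# Stand-in strings, not real corpus text.
sample_texts = [
    "The flowers above the wall, and flowers again in spring.",
    "Flowers all aflame; flowers above.",
]
flower_blank = bigrams_starting_with(sample_texts)
print("\n".join(flower_blank))
```

collecting into a set before sorting is what drops the repeats, so a bigram that shows up in a thousand poems still gets one line.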

if you were going to download this today, maybe hold off—I found a frustrating bug where some utf8-encoded texts were being decoded incorrectly with a different encoding, leading to hilarious mojibake when they came out the other end—will post a fix in a few hrs

none of this would be a problem if the reported charset in the metadata was always the correct charset. but there are a lot of texts that report "us-ascii" when what they really mean is "ascii with occasional 8-bit chars just for fun!"

also lots of files (>1%?) that say "yeah I'm utf8 sure whatever" but are actually ISO-8859-1 (according to chardet at least)

lessons learned: (a) never trust someone's claim about the encoding of a text file (b) character encodings are bad and trying to digitize text in the first place was a bad idea
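
a rough sketch of what "don't trust the declared charset" can look like in practice, using only the standard library. this is an illustration and not necessarily how gutenberg-dammit does it: the function name `decode_with_fallback` is hypothetical, and the final ISO-8859-1 fallback is a guess that will be wrong for some files (a detector like chardet could be swapped in for that last step).

```python
def decode_with_fallback(raw, declared="us-ascii"):
    """Decode bytes, treating the declared charset as a hint, not a fact.

    Returns (text, encoding_actually_used).
    """
    # Strict decoders first: bytes that decode cleanly as UTF-8 are very
    # unlikely to be some other 8-bit encoding, so UTF-8 outranks the label.
    for candidate in ("ascii", "utf-8", declared):
        try:
            return raw.decode(candidate), candidate
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes do not fit this encoding.
            # LookupError: the declared name is not a codec Python knows.
            continue
    # Assumption: fall back to ISO-8859-1, which maps every byte to a
    # character and therefore never fails (but may still be the wrong guess).
    return raw.decode("iso-8859-1"), "iso-8859-1"


# UTF-8 bytes mislabelled as ISO-8859-1: the label is ignored.
text_a, used_a = decode_with_fallback("caf\u00e9".encode("utf-8"), "iso-8859-1")

# ISO-8859-1 bytes mislabelled as UTF-8: strict decoding fails, fallback kicks in.
text_b, used_b = decode_with_fallback(b"caf\xe9", "utf-8")
```

trying UTF-8 before the declared charset is the part that avoids mojibake: if the declared charset is a permissive 8-bit one, decoding with it would "succeed" on UTF-8 bytes and quietly produce garbage.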

@aparrish and chardet can't always get it right. e.g. I recently ran into CSV files that weren't encoded in ISO-8859-1, but instead a MacOS encoding from a similar era.
okay I just pushed an update and uploaded a new archive with (more) correct character encoding—enjoy https://github.com/aparrish/gutenberg-dammit

@aparrish I thought that "ascii with occasional 8-bit chars just for fun!" was the new mandatory charset standard.
@dredmorbius @aparrish I thought it was just a longwinded kind of captcha.
@aparrish be the datastore you want to see in the world
@aparrish Holy spit this is amazing and something I’ve wanted to do for a few months now, and you’ve done a better job than I was even imagining. Thank you for this!
@aparrish I'm definitely going to play with this later! Thanks for doing this!