Mastodawn

Darius: Moving Accts, See Bio Aug 10, 2018

hey so remember how I wanted a project gutenberg corpus with every plaintext file in an easy-to-use format? https://mastodon.social/@aparrish/100511033258021934

well I wanted it so bad I guess that I went ahead and made it https://github.com/aparrish/gutenberg-dammit

aparrish/gutenberg-dammit

gutenberg-dammit - I wanted all of plaintext Project Gutenberg in an easy-to-use format, so I made this

Allison Parrish Aug 10, 2018

a quick exercise with this corpus: "Flower blank," alphabetized bigrams beginning with "flowers" from every Project Gutenberg book labelled as "Poetry"

https://gist.github.com/aparrish/fdcbdbd50e363995d086f0e0c3f95769

excerpt:

flowers a
flowers ablaze
flowers about
flowers above
flowers absorb
flowers accompanying
flowers adorn
flowers advance
flowers afford
flowers affray
flowers aflame
flowers after
flowers again
flowers against
flowers alighting
flowers alive
flowers all
flowers allied
flowers ally
flowers almost
flowers aloft
...
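
For anyone who wants to try something similar, here is a minimal sketch of the idea, not the code from the gist. It assumes the poems are already loaded as plain strings. The `sample_poems` list, the function name, and the simple lowercase regex tokenizer are all hypothetical stand-ins.

```python
import re

def bigrams_starting_with(texts, first_word):
    """Return the sorted, de-duplicated bigrams whose first token is first_word."""
    found = set()
    for text in texts:
        # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes
        tokens = re.findall(r"[a-z']+", text.lower())
        for left, right in zip(tokens, tokens[1:]):
            if left == first_word:
                found.add(f"{left} {right}")
    return sorted(found)

# Hypothetical stand-in for the poetry texts in the corpus
sample_poems = [
    "Flowers above the wall, and flowers about the gate.",
    "The flowers again return; flowers all in bloom.",
]

for bigram in bigrams_starting_with(sample_poems, "flowers"):
    print(bigram)
```

Collecting into a set before sorting keeps each bigram once, which matches the shape of the excerpt above.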

Allison Parrish Aug 11, 2018

if you were going to download this today, maybe hold off—I found a frustrating bug where some utf8-encoded texts were being decoded incorrectly with a different encoding, leading to hilarious mojibake when they came out the other end—will post a fix in a few hrs

none of this would be a problem if the reported charset in the metadata was always the correct charset. but there are a lot of texts that report "us-ascii" when what they really mean is "ascii with occasional 8-bit chars just for fun!"
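
A small sketch of both symptoms, using only the Python standard library. The sample strings and the helper name `non_ascii_bytes` are made up for illustration, and none of this is taken from the gutenberg-dammit code.

```python
def non_ascii_bytes(raw):
    """Return (offset, byte value) for every byte outside the 7-bit ASCII range."""
    return [(i, b) for i, b in enumerate(raw) if b > 0x7F]

# Mojibake: UTF-8 bytes read back with the wrong (single-byte) encoding
utf8_bytes = "café".encode("utf-8")
print(utf8_bytes.decode("iso-8859-1"))   # cafÃ©

# A file that claims "us-ascii" but contains one 8-bit byte
claimed_ascii = "caf\u00e9 au lait".encode("iso-8859-1")
print(non_ascii_bytes(claimed_ascii))    # [(3, 233)]
```

An empty result from `non_ascii_bytes` means the "us-ascii" label is at least plausible. Anything else means the label cannot be taken at face value.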

Allison Parrish Aug 11, 2018

also lots of files (>1%?) that say "yeah I'm utf8 sure whatever" but are actually ISO-8859-1 (according to chardet at least)

lessons learned: (a) never trust someone's claim about the encoding of a text file (b) character encodings are bad and trying to digitize text in the first place was a bad idea
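
One way to act on lesson (a) is to treat the declared charset as a first guess only. This is a rough sketch using built-in codecs. It is not how gutenberg-dammit or chardet work, and the function name and fallback order are assumptions. The last step always succeeds, because ISO-8859-1 maps every byte to a character, so the result can still be wrong for files in some other 8-bit encoding.

```python
def decode_with_fallback(raw, declared="us-ascii"):
    """Try the declared charset strictly, then UTF-8, then ISO-8859-1.

    Returns (text, encoding_used).
    """
    for encoding in (declared, "utf-8"):
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: move on to the next guess
            continue
    # ISO-8859-1 accepts any byte sequence, so this never raises
    return raw.decode("iso-8859-1"), "iso-8859-1"

# Labelled "us-ascii" but really UTF-8
print(decode_with_fallback("café".encode("utf-8"), declared="us-ascii"))
# Labelled "utf-8" but really ISO-8859-1
print(decode_with_fallback("café".encode("iso-8859-1"), declared="utf-8"))
```

Strict decoding is what makes this useful: a wrong label raises an error instead of quietly producing mojibake.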

Kevin Turner Aug 11, 2018

@aparrish and chardet can't always get it right. e.g. I recently ran into CSV files that weren't encoded in ISO-8859-1, but instead a MacOS encoding from a similar era.
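
A tiny illustration of why this case is hard to detect. It assumes the encoding in question was something like Mac OS Roman (Python's `mac_roman` codec), which the post does not actually specify. Both decodes below succeed without error, so nothing flags the wrong one.

```python
text = "café"

# Assumption: the file was written with Mac OS Roman
raw = text.encode("mac_roman")

# Decoding with the right codec round-trips cleanly
print(raw.decode("mac_roman"))

# ISO-8859-1 also decodes without raising, but yields a different string
print(repr(raw.decode("iso-8859-1")))
```

Since every byte is valid in both encodings, a detector can only guess from how plausible the resulting characters look.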

Allison Parrish Aug 12, 2018

okay I just pushed an update and uploaded a new archive with (more) correct character encoding—enjoy https://github.com/aparrish/gutenberg-dammit

Doc Edward Morbius ❌ Aug 11, 2018

@aparrish I thought that "ascii with occasional 8-bit chars just for fun!" was the new mandatory charset standard.

Lew Perin Aug 11, 2018

@dredmorbius @aparrish I thought it was just a longwinded kind of captcha.

Steve has ☕️ for brains Aug 10, 2018

@aparrish be the datastore you want to see in the world

Mike Lynch Aug 10, 2018

@aparrish Thank you!

psthfr 🐟 Aug 10, 2018

@aparrish omg thank you <3

Colin Mitchell Aug 10, 2018

@aparrish holy cow!

Brian P. Aug 11, 2018

@aparrish Holy spit this is amazing and something I’ve wanted to do for a few months now, and you’ve done a better job than I was even imagining. Thank you for this!

users interact / digital purse Aug 11, 2018

@aparrish this is amazing!

Andy Gorman Aug 11, 2018

@aparrish I'm definitely going to play with this later! Thanks for doing this!