just listened to @pluralistic's How To Think About Scraping, which is a REALLY good summary of the (imo) correct view of scraping and training in relation to AI. you should check it out

https://craphound.com/news/2023/09/24/how-to-think-about-scraping/


I will add exactly one extremely tangential nitpick, which is that linguistics did *not* rely entirely or even mostly on written corpora, and casual speech was *not* mostly a black box, until the internet came along. but that requires some context on how linguistics is done, so lemme give that
historically, when trying to build a theory or model of language, we linguists (i'm a former academic linguist, so i claim the "we" :p) do a combination of empirical modelling and theoretical noodling

when we're trying to work on language data, pure empirical work is functionally impossible

if we start from, let's say, written documents, whether it's formal writing or casual, we still just have a giant pile of text

and while there is a lot of stuff we can do to build purely empirical models of text, such as create markov models, or these days create various deep net things, there's a lot we CAN'T do
those models are extremely constrained in their utility for understanding language, because the results they spit out are, at best, difficult to analyze, and at worst, in the case of neural nets, almost complete black boxes
and moreover, the models produced, even when they're analyzable, generally aren't able to provide insight into the kinds of phenomena that linguists are interested in. you the linguist have to provide that insight yourself
there are, in principle, more sophisticated mathematical models that we can try to create, beyond something like a markov chain, but the more complicated the model, the more difficult it is to generate
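quick concrete aside: here's a toy sketch of the kind of bigram markov model i mean (the corpus and all the code are entirely my own illustration, nothing canonical) -- you count which words follow which, and then you can babble by sampling the chain:

```python
import random
from collections import defaultdict

# a toy "corpus": a stand-in for the giant pile of text
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse ate the cheese",
]

# count bigram transitions: word -> list of observed next words
transitions = defaultdict(list)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(words, words[1:]):
        transitions[prev].append(nxt)

def generate(max_len=10):
    """Babble a sentence by random-walking the bigram chain."""
    word, out = "<s>", []
    while len(out) < max_len:
        word = random.choice(transitions[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```

and that's kind of the point: a model like this can babble semi-plausibly, but it tells you nothing about *why* those sentences hang together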
there's a well-known proof (PROOF! as in, mathematically true, period, end of story) that deriving a context-free grammar from a set of example sentences is ***completely intractable*** in all but the most trivial cases
which is not to say that you can't derive SOME kind of grammar, it's just not remotely guaranteed to be correct, or even good, and is going to be very heavily driven by heuristic techniques
and most of the phenomena that linguists care about are far, far more complex than what we can represent with context-free grammars, and deriving THOSE is even more complicated, and basically impossible
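to give a flavor of *why* (this is just an illustrative toy i made up, not the actual proof): here are two structurally different context-free grammars that generate exactly the same sentences. no pile of example sentences alone can tell you which grammar is the "right" one, because the data underdetermines the grammar:

```python
def language(rules, start="S", max_len=6):
    """Enumerate every terminal string a CFG derives, up to max_len symbols."""
    strings, frontier = set(), {(start,)}
    while frontier:
        new = set()
        for form in frontier:
            # find the leftmost nonterminal, if any
            i = next((j for j, sym in enumerate(form) if sym in rules), None)
            if i is None:  # all terminals: a derived sentence
                if len(form) <= max_len:
                    strings.add("".join(form))
                continue
            for rhs in rules[form[i]]:
                expanded = form[:i] + rhs + form[i + 1:]
                # prune forms whose terminals already exceed the length bound
                if sum(1 for sym in expanded if sym not in rules) <= max_len:
                    new.add(expanded)
        frontier = new
    return strings

# two different grammars for the same toy language {a^n b^n : n >= 1}
grammar1 = {"S": [("a", "S", "b"), ("a", "b")]}
grammar2 = {"S": [("a", "T")], "T": [("S", "b"), ("b",)]}

print(language(grammar1) == language(grammar2))  # True: same sentences, different grammars
```

(note this little enumerator only terminates because every production in these toy grammars adds at least one terminal -- a real grammar inducer has no such luxuries)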
and this is not taking into account the fact that historically, computers were slow. i mean, REALLY slow. prior to about 2000, you could barely do any large computational work anyway, and in the 1960s and 1970s when most modern theories were being born? forget it, no one used computers at all

which means the value of corpus based linguistics was EXTREMELY minimal

so how was, and is, linguistics actually done? how do we get around the problems of tractability?

we DON'T rely on pure empirical modelling!

but that's how most science is done, so this shouldn't shock you. we don't merely gather up LOTS of data and then do a best fit etc etc.

no, we have some data, maybe even a very small amount, and we develop a hypothesis, and we then test that, and refine our hypothesis, etc etc

most theoretical linguistics is done not on formal written language at all, not on large corpora at all, but rather on relatively small sample sizes, collected under specific methodologies meant to get data extremely relevant to the phenomenon under investigation
typically, this is in the form of judgment elicitations, where a linguist uses their theory to guide the construction of a sentence that is relevant to the phenomenon, and then they get people (starting with themselves) to judge the sentence by how "good" it sounds in various ways
this requires careful thought about confounding factors (sentences can sound better or worse depending on other sentences around them, depending on dialect, etc.)
these judgments are known to be rather robust. at least one study from the early 2010s showed that individual judgments on the part of the *linguist themself* are extremely reflective of the broader judgments of their speech community

what this means is that theoreticians have the ability to investigate phenomena by looking at pretty much any kind of sentence

and that includes both extremely rare kinds of sentences

AND ALSO extremely informal sentences

most theoretical linguists work heavily in the domain of both at once

that is to say, because theoreticians are interested in natural language, we TEND to avoid formal language entirely because it's artificial

and because we can rely on judgments, we can actually do a LOT of work on JUST informal speech

now, it's true, as Doctorow mentioned, that all of this work is very heavily dependent on grad students etc. at least if you're collecting data or doing standard analyses, tho in practice theoretical work is actually very individualized

a lot of it is actually the researchers themselves doing elicitations

still a small, focused, specialized group of people, tho!

but that's ok, because big data actually isn't any help here anyway

this is how most theoretical linguistics was done in the past BEFORE the internet, and it's how most theoretical linguistics is done NOW even with the internet

very very very few theoreticians are out there collecting gobs of data and modelling it

and if you want a sense of how deeply informal this language can get, the Generative Semanticists, such as Haj Ross, @georgelakoff, Jim McCawley, and so many other great linguists, used to actively seek out super casual, often vulgar or obnoxious, sentences to study

the act itself was also very informal and jocular. here's something McCawley once wrote, under a pseudonym pretending to be a Maoist critiquing McCawley

"This note is concerned with a counterexample to the outrageous claim made by the bourgeois imperialist linguist McCawley. . . . Consider the idiomatic sense of shove X up Y’s ass. As is well known, Y must be coreferential to the indirect object of the next higher clause (including the deleted indirect object of a deleted performative verb)"

here's some more McCawley:

Consideration of these examples makes it fairly clear that the "fuck" of (12a)–(20a) (henceforth fuck1) and the "fuck" of (2) (henceforth fuck2) are two distinct homophonous lexical items. These two lexical items have totally different selectional restrictions, as is shown by the examples:

(26) Fuck these irregular verbs.

(27) *John fucked these irregular verbs.

(28) Fuck communism.

(29) *John fucked communism.

and here are some more, of various origins:

1.
a. The fact that Max plorbed Betty did not convince Pete to caress her on the lips. (Postal 1988a [1969]:74)

b. Mary tried to give John a blow job, but she choked on it [ambiguous, depending on Mary’s success]. (Douloureux 1992 [1971]:48)

c. Let’s fuck. (R. Lakoff 1977:82)

2.
a. Hey, if John went to Chicago, that means we’ll soon have a big supply of dope. (Schmerling 1971:249)

b. My cache of marijuana got found by Fido, the police dog. (R. Lakoff 1971b:154)

c. Fred does nothing but smoke hashish and play the sarod; John is similar. (McCawley 1976b [1972]:304)

3.
a. The M.C. introduced Mick Jagger’s penis as being large enough to amaze the most jaded of groupies. (Borkin 1984 [1974]:18)

b. Paul is dead and I do not believe he is dead. (G. Lakoff 1975)

c. She left one too many a boy behind. He committed suicide. (Bob Dylan, cited in Zwicky 1976:683)

and some more:

4.

a. Amerika’s [sic] claim that it was difficult to control Vietnamese aggression in Vietnam surprised no one. (Grinder 1970:300)

b. *The shit that John took weighed 600 grams. (McCawley 1988 [1971]:96)

c. *I don’t want to kiss no gorillas. (Postal 1974:236)

it's pretty clear from these that INFORMAL language was the focus of at least this body of research -- but really, this is what most of theoretical linguistics looks like

usually our examples aren't so vulgar -- the Generative Semanticists were trying to bring some levity to an otherwise dry tradition of dull example sentences, so they intentionally constructed examples that were obnoxious and funny and gross and whatnot, because science should be fun -- but nevertheless, most theoreticians work on casual speech, not formal speech

and indeed most theoretical work is somewhat allergic to formal speech. we reject it as ENTIRELY uninsightful into language because it's FAKE

we theoreticians want to study the human mind's capacity for language, not what Strunk and White and the rest of the Rome LARPers think

"formal" english has rules like Never Split an Infinitive

but that shit was literally invented because in LATIN you couldn't split an infinitive and people like Strunk and White wanted to model English on Latin because of Rome LARPing

that's horseshit and not science

so linguists don't use that shit

we use casual language

none of this is to take away from @pluralistic's podcast episode, it's just an excuse to infodump about something i'm passionate about that most people don't know anything about :p