just listened to @pluralistic's How To Think About Scraping, which is a REALLY good summary of the (imo) correct view of scraping and training in relation to AI. you should check it out

https://craphound.com/news/2023/09/24/how-to-think-about-scraping/


I will add exactly one extremely tangential nitpick, which is that linguistics did *not* rely entirely or even mostly on written corpora, and casual speech was *not* mostly a black box, until the internet came along. but that requires some context on how linguistics is done, so lemme give that
historically, when trying to build a theory or model of language, we linguists (i'm a former academic linguist, so i claim the "we" :p) do a combination of empirical modelling and theoretical noodling

when we're trying to work on language data, pure empirical work is functionally impossible

if we start from, let's say, written documents, whether it's formal writing or casual, we still just have a giant pile of text

and while there is a lot of stuff we can do to build purely empirical models of text, such as creating markov models, or these days various deep net things, there's a lot we CAN'T do
those models are extremely constrained in their utility for understanding language, because the results they spit out are, at best, difficult to analyze, and at worst, in the case of neural nets, almost complete black boxes
and moreover, the models produced, even when they're analyzable, generally aren't able to provide insight into the kinds of phenomena that linguists are interested in. you the linguist have to provide that insight yourself
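to make that point concrete, here's a minimal sketch of the kind of "purely empirical" model in question (the toy corpus and names are mine, purely illustrative): a bigram markov model will happily generate plausible-looking text, but its transition table is just co-occurrence counts. nothing in it tells you anything about syntax, or any other phenomenon a linguist would actually care about

```python
import random
from collections import defaultdict

# toy corpus, purely illustrative
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# bigram markov model: map each word to the list of words observed after it
transitions = defaultdict(list)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(words, words[1:]):
        transitions[prev].append(nxt)

def generate(max_len=10):
    """sample a 'sentence' by walking the transition table"""
    word, out = "<s>", []
    while len(out) < max_len:
        word = random.choice(transitions[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```

the model "knows" that "sat" is always followed by "on", but it has no notion of noun, verb, constituent, or anything else. all of that insight has to come from the linguist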
there are, in principle, more sophisticated mathematical models that we can try to create, beyond something like a markov chain, but the more complicated the model, the more difficult it is to generate
there's a well known proof (PROOF! as in, mathematically true, period, end of story) that deriving a context free grammar from a corpus of example sentences is ***completely intractable*** in all but the most trivial cases
which is not to say that you can't derive SOME kind of grammar, it's just not remotely guaranteed to be correct, or even good, and is going to be very heavily driven by heuristic techniques
and most of the phenomena that linguists care about are far far more complex than what we can represent with context free grammars, and deriving THOSE is even more complicated, and basically impossible
and this is not taking into account the fact that historically, computers were slow. i mean, REALLY slow. prior to about 2000, you could barely do any large computational work anyway, and in the 1960s and 1970s when most modern theories were being born? forget it, no one used computers at all

which means the value of corpus based linguistics was EXTREMELY minimal

so how was, and is, linguistics actually done? how do we get around the problems of tractability?

we DON'T rely on pure empirical modelling!

but that's how most science is done, so this shouldn't shock you. we don't merely gather up LOTS of data and then do a best fit etc etc.

no, we have some data, maybe even a very small amount, and we develop a hypothesis, and we then test that, and refine our hypothesis, etc etc

most theoretical linguistics is done not on formal written language at all, not on large corpora at all, but rather on relatively small sample sizes, collected under specific methodologies meant to get data extremely relevant to the phenomenon under investigation
typically, this is in the form of judgment elicitations, where a linguist uses their theory to guide the construction of a sentence that is relevant to the phenomenon, and then they get people (starting with themselves) to judge the sentence by how "good" it sounds in various ways
this requires careful thought about confounding factors (sentences can sound better or worse depending on other sentences around them, depending on dialect, etc.)
these judgments are known to be rather robust. at least one study from the early 2010s showed that individual judgments on the part of the *linguist themself* are extremely reflective of the broader judgments of their speech community

what this means is that theoreticians have the ability to investigate phenomena by looking at pretty much any kind of sentence

and that includes both extremely rare kinds of sentences

AND ALSO extremely informal sentences

most theoretical linguists work heavily in the domain of both at once

that is to say, because theoreticians are interested in natural language, we TEND to avoid formal language entirely because it's artificial

and because we can rely on judgments, we can actually do a LOT of work on JUST informal speech

now, it's true, as Doctorow mentioned, that all of this work is very heavily dependent on grad students etc., at least if you're collecting data or doing standard analyses, tho in practice theoretical work is actually very individualized
@beka_valentine @pluralistic will do exactly that now. thanks for sharing! sounds interesting
@beka_valentine @pluralistic super interesting points! Might have to get that book
@scoutaloud @pluralistic you should! it's very good. both cory's and brian's are worth getting