The tables include the file types of embedded files, the depth of embedding, language ID and a number of other features, including the #outOfVocabulary statistic.
Stats like these matter a lot when ingesting files for #search, #digipres and #fileforensics.
Take a look at the *-1k.csv files. If you're interested, I can share the config file that extracted that info.
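
For a quick first look at those files, here is a minimal Python sketch that lists each file's column names and row count. It assumes the files are UTF-8 CSVs with a header row and sit in the current directory; the function names are illustrative only and are not part of the extraction tooling.

```python
import csv
import glob
import os


def summarize_csv(path):
    """Return (column_names, row_count) for one CSV file."""
    # Assumption: the first row is a header and the file is UTF-8.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count


def summarize_dir(directory="."):
    """Summarize every file matching *-1k.csv in the given directory."""
    pattern = os.path.join(directory, "*-1k.csv")
    return {
        os.path.basename(path): summarize_csv(path)
        for path in sorted(glob.glob(pattern))
    }


if __name__ == "__main__":
    for name, (columns, rows) in summarize_dir().items():
        print(f"{name}: {rows} rows, columns: {', '.join(columns)}")
```

Run it from the folder that holds the CSVs to see which features each table actually carries before digging into individual columns.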