Write a Linux command line to traverse a directory of files and summarize the number of instances of each file extension found (e.g., .docx, .jpg, .pdf, etc). Fold all file extensions to lower case-- ".JPG" and ".jpg" should be reported together. Files that have no extension should be reported as "other". List the extensions found in descending numeric order by the number of occurrences of each extension.

#Linux #DFIR #CommandLine #Trivia

I took yesterday's Linux DFIR command line trivia from one of our old Command Line Kung Fu blog postings (Episode 99). It's an interesting challenge and useful for quickly summarizing types of files in a directory by simply collecting and counting the file extensions.

I also liked it because we get to use the "${var##pattern}" expansion, similar to my solution from a couple of days ago using "${var%%pattern}". So there's some nice symmetry.

If you read the original blog posting, my first solution used a funky "sed" substitution to whittle down to the file extensions. It was actually loyal reader and friend of the blog Jeff Haemer who suggested this much cleaner version (and to @barubary who reminded me to quote my variables) :

find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sort | uniq -c | sort -n

"find" gets us a list of files in the directory and then we feed that into a loop that uses "${f##*.}" to remove everything up until the final "." in the file path.

Some file names are not going to have a "." and so the original file path will be unchanged. The "sed" expression after the loop marks these files as being in category "other". Finally we shift everything to lowercase so "GIF" and "gif" are recognized as the same type. Then our usual command line histogram idiom rounds things out.

We could add some other fix-ups to make things nicer:

find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sed -r 's/^jpg$/jpeg/; s/(~|,v)$//' |
sort | uniq -c | sort -n

I've dropped in another "sed" expression before we start counting things. We make "jpg" and "jpeg" count the same and remove some trailing file extensions for backup files to reduce clutter in the output. This just shows that you can arbitrarily tweak the output to suit your needs.

Props to @apgarcia for getting very close to the final solution on this one!

#Linux #DFIR #CommandLine #Trivia

@hal_pomeranz i came across a really cool alternative to the 'sort | uniq -c | sort -rn' histogram pattern:

https://github.com/red-data-tools/YouPlot

see screenshot...

GitHub - red-data-tools/YouPlot: A command line tool that draw plots on the terminal.

A command line tool that draw plots on the terminal. - red-data-tools/YouPlot

GitHub