Mastodawn

Amin, minor deity of the legume realm Feb 4

@rl_dane

Oh, didn't know about -c. I usually just pipe to wc -l I guess.

-c, -l, -h, -H, and -q are my favorite #grep flags. :D

Huh, that almost became a [Marcel Duchamp] reference. 😅
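
For the -c versus "pipe to wc -l" comparison above, a minimal sketch with made-up sample lines (the data and the variable names are assumptions for illustration). Note that -c counts matching lines, not individual matches, which is also what the wc -l pipeline ends up counting.

```shell
# Hypothetical sample input; any text would do.
data=$(printf '%s\n' 'apple' 'banana' 'apple pie' 'cherry')

# Count by piping the matching lines to wc -l.
# (tr strips the padding some wc implementations add.)
via_wc=$(printf '%s\n' "$data" | grep 'apple' | wc -l | tr -d ' ')

# Let grep count the matching lines itself.
via_c=$(printf '%s\n' "$data" | grep -c 'apple')

echo "wc -l: $via_wc, -c: $via_c"
```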


Amin, minor deity of the legume realm Feb 4

@rl_dane

I just use -v and -E

sotolf Feb 4

@amin @rl_dane you guys use flags?... :p

thedoctor Feb 4

@amin @rl_dane @sotolf You guys still use grep instead of ripgrep. Tst

@thedoctor @amin @sotolf

...and bash instead of zsh
...and grep/awk/sed instead of jq
...and firefox instead of chrome
...and the fediverse instead of facebook

Face it... I'm an unpopular-opinion neckbeard level boss. XD

cc: @mirabilos

thedoctor Feb 5

@rl_dane Those are so not comparable!

@amin @sotolf @mirabilos

sotolf Feb 5

@thedoctor @rl_dane @amin @mirabilos At least bash vs. zsh is comparable to grep vs. ripgrep, as zsh is just a strictly better bash ;)

Amin, minor deity of the legume realm Feb 5

@sotolf @thedoctor @rl_dane @mirabilos

Mm, not really though? ripgrep is meant for bulk grepping of files

sotolf Feb 5

@amin @thedoctor @rl_dane @mirabilos I think I had it installed, I just never remembered to use it :p

Amin, minor deity of the legume realm Feb 5

@sotolf @thedoctor @rl_dane @mirabilos

I mostly just use it to run rg TODO and see all the spots in a codebase I marked as still needing work.

@amin @sotolf @thedoctor @mirabilos

Why is ripgrep better than just grep -R?

@kabel42 @amin @sotolf @thedoctor @mirabilos

@rl_dane @amin @sotolf @thedoctor @mirabilos it's somehow a lot faster if you want to grep a few GiB of code, like 15 minutes to 30 seconds

Interesting! I wonder what kind of algorithmic optimizations (as opposed to compiler optimizations) they're using to do that, and if regular (GNU/BSD) grep could do the same.

Because I'll wear clown shoes and a tutu before changing to a "rewrite the world in rust!" utility 😂

@rl_dane @amin @sotolf @thedoctor @mirabilos From what little I have read, it makes some assumptions about what you are grepping and uses different defaults. Doing the same in existing grep would probably break compatibility.

@kabel42 @rl_dane @amin @sotolf @thedoctor eww, it’s not even a drop-in then…

(For not-a-drop-in, I found pcregrep interesting. Sadly, Debian recently dropped it, but in the versions which don’t have pcregrep any more, you can use grep -P for many use cases. pcre2grep is not a drop-in for pcregrep either…)

R.L. Dane

🍵

I was a total PCRE stan in the olden days, but I've steered more towards regular extended regexp for compatibility. I do miss \d, \w and \s, though. [[:space:]] feels so clumsy to type and use several times in a regex, I'll sometimes put a sp="[[:space:]]" line at the start of a script, and you'll see several invocations of "${sp}" in my regex strings.

But again... compatibility. ;)
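
A rough sketch of the sp="[[:space:]]" trick described above. The second shortcut (d), the pattern, and the sample lines are assumptions made up for illustration; only the sp variable comes from the post.

```shell
# Shortcuts for the POSIX classes that are clumsy to type repeatedly.
sp='[[:space:]]'
d='[[:digit:]]'   # hypothetical second shortcut, same idea

# ERE for lines like "key = 42": a word, "=", digits,
# with optional whitespace around each part.
re="^[[:alnum:]_]+${sp}*=${sp}*${d}+${sp}*\$"

# Count the sample lines that fit the pattern.
matches=$(printf 'retries = 3\nname=bob\ntimeout\t=\t30\n' | grep -cE "$re")
echo "$matches"
```

Only the first and third sample lines have a numeric value, so those are the ones counted.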

Is there a big difference between (GNU) grep -P and pcregrep? I hadn't heard of that utility before.

@amin @kabel42 @rl_dane @sotolf @thedoctor I never used \d and the likes, always felt them much too complicated. I almost never use POSIX character classes (besides the BSD [[:<:]] and [[:>:]]), rather I just hit [ tab space ] quickly.

GNU grep -P does a PCRE grep; it doesn’t support all of the extra flags of pcregrep though, and before the version in (IIRC) trixie it was very broken.

Are [[:<:]] and [[:>:]] the same as \< and \>?

@rl_dane @amin @kabel42 @sotolf @thedoctor obviously not, because it’s written differently ;)

re_format(7) knows:

     There are two special cases** of bracket expressions: the bracket expres-
     sions '[[:<:]]' and '[[:>:]]' match the null string at the beginning and
     end of a word, respectively. A word is defined as a sequence of charac-
     ters starting and ending with a word character which is neither preceded
     nor followed by word characters. A word character is an alnum character
     (as defined by ctype(3)) or an underscore. This is an extension, compati-
     ble with but not specified by POSIX, and should be used with caution in
     software intended to be portable to other systems.


(as for the mark:)
     POSIX leaves some aspects of RE syntax and semantics open; '**' marks de-
     cisions on these aspects that may not be fully portable to other POSIX
     implementations.

The definition for \< / \> differs between less, perlre, pcre, … I believe, but they all are somewhat similar.

@rl_dane @amin @kabel42 @sotolf @thedoctor perlre(1) actually has…

     A word boundary ("\b") is a spot between two characters that
     has a "\w" on one side of it and a "\W" on the other side of
     it (in either order), counting the imaginary characters off
     the beginning and end of the string as matching a "\W".

… so the \< probably comes from less(1)?

… hm, no. But, where then?

I used to use \b a lot, but \< and \> are just as easy to use, and POSIX. ;)

\w is nice, though. I think the closest POSIX one is [[:graph:]]? (Not super close, though)

@rl_dane @amin @kabel42 @sotolf @thedoctor \< and \> are not POSIX.

perlre(1) \w is identical to POSIX [a-zA-Z0-9_] in the C locale, so [[:alnum:]_] if you have support for POSIX character classes.

Ah, yes. [[:alnum:]] was the one I was thinking of.

@rl_dane @amin @kabel42 @sotolf @thedoctor but [[:alnum:]_]

Waiiiiit, what does the underscore before the second bracket do? I've never seen that before.

No mention of it in RE_FORMAT(7) on FreeBSD.

@rl_dane @amin @kabel42 @sotolf @thedoctor the exact same thing as the underscore in [a-zA-Z0-9_], and I’d be surprised if the FreeBSD manpage would not document it
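
To make the underscore point concrete, a small sketch (the sample string and variable names are invented here): inside the outer brackets, _ is just one more member of the set, next to the [:alnum:] class. The C locale is forced so both spellings cover the same characters.

```shell
input='foo_bar-baz 42'

# Replace every "word" character with X, using the class spelling...
a=$(printf '%s\n' "$input" | LC_ALL=C sed 's/[[:alnum:]_]/X/g')

# ...and using the explicit ranges.
b=$(printf '%s\n' "$input" | LC_ALL=C sed 's/[a-zA-Z0-9_]/X/g')

echo "$a"
echo "$b"
```

Both commands leave only the hyphen and the space untouched.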

@mirabilos @rl_dane @amin @sotolf @thedoctor yay clear and unmistakable syntax

@kabel42 @mirabilos @amin @sotolf @thedoctor

@kabel42 @rl_dane @amin @sotolf @thedoctor what? It is. Regex is simple.

Doctor Strangepattern or: How I Learned to Stop Worrying and Love the Write-Once-Read-Never Nature of Regexp

@rl_dane @kabel42 @amin @sotolf @thedoctor oh, reading and understanding my own regexen is easy, it’s only other people’s…

I think it's like almost any terse "programming" language where it takes some time to find the same neural pattern in your own head that produced it, so you can "remember" what you were doing. ^___^

In the past, I have literally used shell loops to construct regexp variables on the fly, rather than having completely incomprehensible "line noise" regexps. 😄

@rl_dane @kabel42 @amin @sotolf @thedoctor remember though that ksh extglobs are special at parse time, so you cannot do e.g. foo='@(0|[1-9]*([0-9]))'; [[ $1 = $foo ]], you have to use eval (eurghs, best to avoid, especially in functions and loops due to the hard parse overhead each time)

@sotolf @kabel42 @mirabilos @amin @thedoctor

sotolf Feb 9

@rl_dane @kabel42 @mirabilos @amin @thedoctor I would lie if I said that I haven't just started a new regex because I wasn't sure how to change the old one without breaking it.

I feel that. XD

@rl_dane @amin @kabel42 @sotolf @thedoctor let me blow your mind if that was news to you:

[[:alpha:][:digit:]_]

@mirabilos @rl_dane @amin @sotolf @thedoctor yay context sensitive [], there is no way that can go wrong \s

@mirabilos @rl_dane @amin @sotolf @thedoctor what would [:alpha:] do, and [:alpha]?

@kabel42 @rl_dane @amin @sotolf @thedoctor the same as [:ahlp]

@kabel42 @rl_dane @amin @sotolf @thedoctor see re_format(7)

RTFM re_format(7)

@mirabilos @rl_dane @amin @sotolf @thedoctor ok, so context sensitive, [ in [ is different from normal [

@kabel42 @rl_dane @amin @sotolf @thedoctor context-switching, [ opens a completely different parse in which EVERYTHING is different

@kabel42 @rl_dane @amin @sotolf @thedoctor even .
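
A quick sketch of the "even ." remark and of the combined-class bracket from a few posts up. The sample lines and variable names are assumptions for illustration.

```shell
# Outside brackets, . matches any character: both lines match.
outside=$(printf '%s\n' 'a.b' 'axb' | grep -c 'a.b')

# Inside brackets, . is a literal dot: only "a.b" matches.
inside=$(printf '%s\n' 'a.b' 'axb' | grep -c 'a[.]b')

# Several classes plus a plain character in one bracket expression:
# matches "x", "7" and "_", but not "-".
combined=$(printf '%s\n' x 7 _ - | LC_ALL=C grep -c '[[:alpha:][:digit:]_]')

echo "$outside $inside $combined"
```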

kabel, repeat after me:

THIS IS FINE
I WILL LEARN TO LOVE REGEX
THIS IS FINE
I WILL LEARN TO LOVE REGEX
THIS IS FINE
I WILL LEARN TO LOVE REGEX
THIS IS FINE
I WILL LEARN TO LOVE REGEX
THIS IS FINE
I WILL LEARN TO LOVE REGEX

(Because it really is fine, and I really do love #regex. #StockholmSyndrome??? You be the judge. XD )

@rl_dane @kabel42 @amin @sotolf @thedoctor what? regexen are great!

They’re basically the one thing ed(1) has but EDLIN.EXE doesn’t which make it actually usable.

@rl_dane @mirabilos @amin @sotolf @thedoctor i like regex, but the notation in Uni Math class was more sane

@kabel42 @rl_dane @amin @sotolf @thedoctor … no.

@mirabilos @rl_dane @amin @sotolf @thedoctor No manual entry for re_format in section 7

@kabel42 @mirabilos @amin @sotolf @thedoctor

@kabel42 @rl_dane @amin @sotolf @thedoctor then click on the link

in [[:alpha:]] the outer brackets denote the fact that you're defining a character class (terminology???), and the inner [:alpha:] is a character class/shortcut for [a-zA-Z].

Someone please correct my terminology.

@rl_dane @kabel42 @amin @sotolf @thedoctor the outer ones delineate a bracket expression, the rest is correct

Thanks. :)

@rl_dane @kabel42 @amin @sotolf @thedoctor though of course it is [[:alpha:]] that is equivalent to [a-zA-Z], not [:alpha:]

… and that is only true for the C locale. In C.UTF-8 [[:alpha:]] also matches α.
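
The locale dependence can be sketched like this. The input is "a" and "α" (written as UTF-8 octal escapes); the variable names are assumptions, and the second command only does something different if the system actually provides a C.UTF-8 locale, so no result is claimed for it here.

```shell
# In the C locale only the ASCII letter counts as [[:alpha:]].
c_count=$(printf 'a\n\316\261\n' | LC_ALL=C grep -c '[[:alpha:]]')

# With a UTF-8 locale the result depends on the system's locale data;
# if C.UTF-8 is missing, grep typically falls back to C behaviour.
utf8_count=$(printf 'a\n\316\261\n' | LC_ALL=C.UTF-8 grep -c '[[:alpha:]]')

echo "C: $c_count, C.UTF-8: $utf8_count"
```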

@kabel42 @rl_dane @amin @sotolf @thedoctor it’s actually not, the first unescaped [ switches from RE context to RE-Bracket context in the bracket-begin state, in which you can have an optional ^ (except in shellglobs where it is spelt !), then an optional ] not taken as the end of the RE-Bracket, then an optional -, then any amount of expressions of the type a-z, [:charclass:], [=equivalenceclass=], x, then an optional -, then a closing ] which terminates the RE-Bracket context.

@kabel42 @rl_dane @amin @sotolf @thedoctor (I erred: you can have either the ] or the - at the beginning, not both)
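
A minimal sketch of the positional rules just described, using invented one-character sample lines: a ] placed first, a - placed last, and a leading ^ each get their special treatment from position alone.

```shell
# ']' right after the opening '[' is a literal member: matches "]" and "a".
first=$(printf '%s\n' ']' a b | grep -c '[]a]')

# '-' right before the closing ']' is a literal member: matches "-" and "a".
last=$(printf '%s\n' - a b | grep -c '[a-]')

# '^' right after the opening '[' negates the set: matches "b" and "c".
neg=$(printf '%s\n' a b c | grep -c '[^a]')

echo "$first $last $neg"
```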

@kabel42 @rl_dane @amin @sotolf @thedoctor (and I forgot collating elements, which is totally fucked up, [a[.ch.]] in e.g. es_ES.UTF-8 matches either a or ch, so a bracket expression in POSIX has a variable matching length…)

@mirabilos @rl_dane @amin @sotolf @thedoctor yeah, i hate it

@kabel42 @rl_dane @amin @sotolf @thedoctor these are rare-to-never-used features, thankfully

@kabel42 @rl_dane @amin @sotolf @thedoctor tbh the only time I use something other than simple chars and ranges in bracket expressions is the BSD [[:<:]] and [[:>:]] extension (which matches a zero-length string)

@mirabilos @rl_dane @amin @sotolf @thedoctor as in '^$'?

@kabel42 @rl_dane @amin @sotolf @thedoctor no, the zero-length string between a nōn-word‑ and a word character
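
Since [[:<:]] and [[:>:]] are a BSD extension and \< and \> are not POSIX either, here is one possible portable approximation in plain ERE. This is not something proposed in the thread, just a sketch under that assumption, and it is not a true zero-length match: it consumes the neighbouring non-word character. The word "cat" and the sample lines are made up.

```shell
# "cat" must be preceded by start-of-line or a non-word character,
# and followed by end-of-line or a non-word character.
re='(^|[^[:alnum:]_])cat($|[^[:alnum:]_])'

# "cat" and "a cat sat" qualify; "concat" and "cat_food" do not.
count=$(printf '%s\n' 'cat' 'concat' 'a cat sat' 'cat_food' | LC_ALL=C grep -cE "$re")
echo "$count"
```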

Ok re_format(7) is very terse when defining equivalence classes (#TIL!!!)

Are they just for visually/linguistically-similar characters, like "e" and "é"?

@rl_dane @kabel42 @amin @sotolf @thedoctor it’s a POSIX locales thing best to not wonder about and never use.

@rl_dane @kabel42 @amin @sotolf @thedoctor

     Within a bracket expression, a collating element enclosed in '[=' and
     '=]' is an equivalence class, standing for the sequences of characters of
     all collating elements equivalent to that one, including itself.

This basically means, if you have a locale whose LC_COLLATE does things like case-insensitive sorting (like de_DE.UTF-8 does, which changes the output order of ls(1) and totally fucks me up), then [[=a=]] matches [Aa], for example.

Other locales do even weirder things to LC_COLLATE, see the files under /usr/share/i18n/locales/ on your glibc system for the gory details (sanity-preserving hint: don’t.)

Oh DUH. Ok. XD

@rl_dane @amin @kabel42 @sotolf @thedoctor (also, though capitalised in the header, manpage names are case-sensitive)

Fair point. I had seen people reproduce the header style before, so I wasn't sure if that was canonical.

@rl_dane @amin @kabel42 @sotolf @thedoctor the less(1) manpage is full of lies.

The older one in MirBSD:

     /pattern
           Search forward in the file for the N-th line containing the pat-
           tern. N defaults to 1. The pattern is a regular expression, as
           recognized by ed(1). The search starts at the second line displayed

No, less(1) uses different REs than ed(1), which uses POSIX BRE.

The newer one in Debian:

       /pattern
              Search forward in the file for the N-th line containing the pat‐
              tern.  N defaults to 1.  The pattern is a regular expression, as
              recognized by the regular expression library  supplied  by  your
              system.   The search starts at the first line displayed (but see

Just as big a lie, glibc’s regexp (as documented by Linux man-pages) also does not support \< or \>.

Really! I don't recall \< and \> ever not working for me.

@rl_dane @amin @kabel42 @sotolf @thedoctor then, my dear, you’re suffering from GNU extensions. (Probably. I still haven’t figured out where it came from. No manpage on my Debian I tried documents it, other than grep(1).)

Hmm, I wonder if it would be different on Alpine Linux, as that's a relatively non-GNU distro.

@rl_dane @amin @kabel42 @sotolf @thedoctor nah, busybox is full-on GNU compatible

@rl_dane @amin @kabel42 @sotolf @thedoctor also, please write Alpine Linux, so I don’t confuse it with the MUA. Thanks.

@rl_dane @amin @kabel42 @sotolf @thedoctor just like we write Linux Mint as Linux Mint, to not confuse it with MiNT.

What's "MiNT?"

@rl_dane @amin @kabel42 @sotolf @thedoctor unixoid for Atari that supplants TOS