Mastodawn

things that are much harder to describe accurately with a regex than you'd think, an incomplete list:

* floating point numbers
* IPv6 addresses
* IPv*4* addresses (depending on how you define them and how picky you are about the numeric ranges)
* ...
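
To make the first bullet concrete, here is a minimal Python sketch of a decimal floating point matcher. The grammar it encodes (optional sign, digits with an optional fraction or a bare fraction, optional exponent) is an assumption chosen for illustration, and `FLOAT_RE` and `looks_like_float` are hypothetical names, not anything from the thread.

```python
import re

# Assumed grammar: optional sign, then either digits with an optional
# fraction ("1", "1.", "1.5") or a bare fraction (".5"), then an
# optional exponent. [0-9] is used instead of \d so that non-ASCII
# digits are not accepted.
FLOAT_RE = re.compile(r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")


def looks_like_float(text: str) -> bool:
    """Return True if the whole string matches the assumed grammar."""
    return FLOAT_RE.fullmatch(text) is not None
```

Even this version disagrees with Python's own `float()`, which also accepts spellings such as `"inf"`, `"nan"`, and `"1_000.0"`. Which of those should count depends entirely on whose definition of "floating point number" applies.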

Eli the Bearded Mar 3

@zwol

Email addresses.
URLs.
Phone numbers.

Alyssa Coghlan Mar 3

@elithebearded @zwol

Dates and times (and how)
URLs and URIs (they contain some of the things already mentioned)

Alyssa Coghlan Mar 3

@elithebearded @zwol Oops, URLs was a duplicate.

SnoopJ Mar 3

@ancoghlan @elithebearded @zwol at least there is official guidance on the latter now, so people have something to move *to* instead of ad-hoc implementations (many of which will be sub-par regexes) https://www.unicode.org/reports/tr58/

UTS #58: Unicode Link Detection and Formatting: URLs and Email Addresses

pancomputans Mar 3

@zwol Extra difficulty: recognize Fortran output of a floating point number written with a less-than-perfect edit descriptor.

Zack Weinberg Mar 3

@pancomputans Day job brain is SCREAMING

(day job involves several file formats designed by Fortran programmers in the 1970s and possibly even earlier, y'see)

nick Mar 3

@zwol yeah, we used to use IPv4 addresses as an example when teaching regexes, because it seems like it'll be easy enough until you start actually doing it, the point of the lesson being "use regexes for what regexes are good at and then do the rest in some other language."
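
A minimal Python sketch of that lesson, assuming the strictest reading of "IPv4 address": exactly four decimal octets, each 0 to 255, with no leading zeros. The leading-zero rule and every name below are assumptions made for illustration. The first function lets the regex handle only the shape and does the range check in ordinary code; the second pushes the range check into the pattern for comparison.

```python
import re

# Regex does only what it is good at: four dot-separated runs of 1-3 digits.
SHAPE_RE = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

# The all-regex alternative: each octet's numeric range spelled out by hand.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
PURE_RE = re.compile(rf"{OCTET}\.{OCTET}\.{OCTET}\.{OCTET}")


def is_dotted_quad(text: str) -> bool:
    """Shape check by regex, range check in plain Python."""
    match = SHAPE_RE.fullmatch(text)
    if match is None:
        return False
    for part in match.groups():
        # Assumed policy: reject leading zeros such as "010".
        if len(part) > 1 and part.startswith("0"):
            return False
        if int(part) > 255:
            return False
    return True


def is_dotted_quad_pure_regex(text: str) -> bool:
    """Same policy, with the range check encoded in the pattern."""
    return PURE_RE.fullmatch(text) is not None
```

Both functions agree on inputs like `"192.168.0.1"` and `"256.1.1.1"`, but only the first one stays readable if the policy changes, for example to allow leading zeros.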

veetee Mar 3

@nickzoic @zwol loopback? that's easy: 0x7f000001

Stylus Mar 3

@zwol

curse you for making me type test cases like this into a REPL

>>> import socket
>>> socket.gethostbyname("10.010.0x10")
'10.8.0.16'

but heck even IBM is guilty of grossly oversimplifying things https://www.ibm.com/docs/en/ts4500-tape-library?topic=functionality-ipv4-ipv6-address-formats

Though it appears that the exact syntax of numeric IPv4 addresses is "whatever inet_aton does" and has never been well specified in a published RFC? https://datatracker.ietf.org/doc/html/draft-main-ipaddr-text-rep-02
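
For reference, a rough pure-Python sketch of the loose rules that classic `inet_aton` implementations are generally understood to follow: one to four dot-separated parts, each decimal, octal (leading `0`), or hexadecimal (leading `0x`), with the last part filling all remaining bytes. This is an approximation written for illustration, not a transcription of any particular libc. `loose_ipv4` and `_parse_part` are hypothetical names, and rejecting a bare `0x` is a choice made here.

```python
import string


def _parse_part(part: str) -> int:
    """Parse one component as hex (0x...), octal (0...), or decimal."""
    if part.lower().startswith("0x"):
        digits, base, allowed = part[2:], 16, string.hexdigits
    elif len(part) > 1 and part.startswith("0"):
        digits, base, allowed = part[1:], 8, string.octdigits
    else:
        digits, base, allowed = part, 10, string.digits
    # Validate by hand: int() alone would also accept signs and underscores.
    if not digits or any(ch not in allowed for ch in digits):
        raise ValueError(f"bad address component: {part!r}")
    return int(digits, base)


def loose_ipv4(text: str) -> str:
    """Normalize an inet_aton-style address to a plain dotted quad."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"wrong number of components: {text!r}")
    *head, last = [_parse_part(p) for p in parts]
    # Every part except the last is a single byte.
    if any(value > 0xFF for value in head):
        raise ValueError(f"component out of range: {text!r}")
    # The last part fills whatever bytes remain (32, 24, 16, or 8 bits).
    if last >= 1 << (8 * (4 - len(head))):
        raise ValueError(f"last component out of range: {text!r}")
    number = last
    for index, value in enumerate(head):
        number |= value << (24 - 8 * index)
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

Under these assumed rules `loose_ipv4("10.010.0x10")` gives `'10.8.0.16'`, matching the REPL output above, and `loose_ipv4("0x7f000001")` gives `'127.0.0.1'`.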

IPv4 and IPv6 address formats

Octets or segments, or a combination of both, make up Internet Protocol version 4 (IPv4) and Internet Protocol version 6 (IPv6) addresses.

Zack Weinberg Mar 3

@stylus I wonder why that draft stalled out. The BNFs in there look quite sensible.

jwz Mar 3

@zwol Oh no https://jwz.org/b/yjDH

Just gonna leave this regexp here

How to handle emoji: Where other methods are not available, you can use the following regex (for Unicode 11.0 emoji). For clarity, it escapes all characters that can be invisible or are non-spacing -- otherwise you see some odd constructions like ([♀♂])?+ that are really (\x{200D}[♀♂]\x{FE0F})?+. ...