Article 3XMT7 CodeSOD: Look Ahead. Look Out!

CodeSOD: Look Ahead. Look Out!

by
Remy Porter
from The Daily WTF on (#3XMT7)

I'm an old person. It's the sort of thing that happens when you aren't looking. All the kids these days are writing Slack and Discord bots in JavaScript, and I remember writing my first chatbots in Perl and hooking them into IRC. Fortunately, all the WTFs in my Perl chatbots have been lost to time.

"P" has a peer who wants to scrape all the image URLs out of a Discord chat channel. Those URLs will be fetched, then passed through an image processing pipeline to organize and catalog frequently used images, regardless of their origin.

Our intrepid scraper, however, doesn't want to run the risk of trying to request a URL that might be invalid. So they need a way to accurately validate every URL.

Now, the trick to URLs, and URIs in general, is that they have a grammar that seems simple but is deceptively complex and doesn't lend itself to precise validation via regular expressions. If you were a sane person, you'd generally just ballpark it into the neighborhood and handle exceptions, or maybe copy/paste from StackOverflow and call it a day.

This developer spent 7 hours developing their own regular expression to validate a URL. They tested it with every URL they could think of, and it passed with 100% accuracy, which sounds like the kind of robust testing we'd expect from the person who wrote this:

 const regex = /((?:(http|https|Http|Https|rtsp|Rtsp):\/\/(?:(?: [a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\: (?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})? \@)?)?((?:(?:[a-zA-Z0-9][a-zA-Z0-9\-]{0,64}\.)+(?:(?:aero|arpa|asia|a [cdefgilmnoqrstuwxz])|(?:biz|b[abdefghijmnorstvwyz])|(?:cat|com|coop|c [acdfghiklmnoruvxyz])|d[ejkmoz]|(?:edu|e[cegrstu])|f[ijkmor]|(?:gov|g [abdefghilmnpqrstuwy])|h[kmnrtu]|(?:info|int|i[delmnoqrst])|(?:jobs|j[emop] )|k[eghimnrwyz]|l[abcikrstuvy]|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz]) |(?:name|net|n[acefgilopruz])|(?:org|om)|(?:pro|p[aefghklmnrstwy])|qa|r [eouw]|s[abcdeghijklmnortuvyz]|(?:tel|travel|t[cdfghjklmnoprtvwz])|u [agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))|(?:(?:25[0-5]|2[0-4][0-9]|[0-1] [0-9]{2}|[1-9][0-9]|[1-9])\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9] |[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25 [0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])))(?:\:\d{1,5})?)(\/(?:(?: [a-zA-Z0-9\;\/\?\:\@\&\=\#\~\-\.\+\!\*\'\(\)\,\_])|(?:\%[a-fA-F0-9]{2}))*)? (?:\b|$)/gi

Note each use of (?:). These are "look ahead" matches, which will execute depending on what comes after them. Using one or two of these in a regex makes its massively more complicated. This regex uses 32 look ahead expressions, taking "unreadability" to a new height, and flirting with the lesser demons which serve Zalgo. Bonus points for making the comparison case insensitive, but also checking for both http and Http, just in case.

"There's no module on NPM that can do all of this!" the developer proudly proclaimed. They then presumably uploaded it to NPM as a "microframework", used it within a few of their own modules, and then those modules got used by some other modules, and now 75% of the web depends on this regex.

proget-icon.png [Advertisement] ProGet can centralize your organization's software applications and components to provide uniform access to developers and servers. Check it out! TheDailyWtf?d=yIl2AUoC8zA1BC6HwZwxIc
External Content
Source RSS or Atom Feed
Feed Location http://syndication.thedailywtf.com/TheDailyWtf
Feed Title The Daily WTF
Feed Link http://thedailywtf.com/
Reply 0 comments