castarco,
@castarco@hachyderm.io avatar

#TIL Today I learnt that adding ? after * transforms a #regex expression from being "greedy" into "lazy" (important for performance, safe validators, and protection against DoS attacks).

I don't know how I missed this bit of knowledge for so long. :blobfoxbox:

barubary,

@castarco I don't see how it protects against DoS attacks. The DoS problem is caused by badly written regexes that take "forever" to fail, so the regex engine is busy trying all possible variations to make the regex match a given string (assuming a standard backtracking implementation).

Making quantifiers non-greedy (by appending ?) does not change the number of ways a regex can match, nor does it shortcut the matching process. It simply changes the order in which alternatives are tried.

In the case of X*, the normal order is to try to match as many X's as possible and only "give back" matches if the rest of the regex fails to match. With X*?, the regex engine will try to match as few repetitions as possible (i.e. 0 at first) and only consume more if the rest of the regex fails to match. In either case all possibilities are tried before giving up.

castarco,
@castarco@hachyderm.io avatar

@barubary It's not a panacea at all. True. And of course it cannot be applied always, there are semantics involved.

But in many circumstances it can help to reduce the chances of catastrophic backtracking because it forces backtracking much sooner (the exploration tree is much smaller as a consequence).

barubary,

@castarco Can you show me an example of a regex where non-greedy matching reduces the number of alternatives tried?

castarco,
@castarco@hachyderm.io avatar

@barubary

Sure. What follows is a dumb example ( executed in https://regex101.com/ ), but illustrates my point.

In this particular case you could say that ? is semantically required for <script> because we could have more than one, but many times we don't have this distinction and it still affects how many steps the #regex has to perform.

(Sorry for having the text selected in the 2nd image, I was copying it for the alt of the images 😅 )

[Result: 1 match, 75 steps, 0.0ms Regexp (with the ? symbol): /([sS]*?)</script>/gi

Text:

<main> Hello World <script>console.log("hello!"); More stuff Just a decoy!](https://media.hachyderm.io/media_attachments/files/111/914/833/409/432/020/original/3925f50f868f8a82.png)
barubary,

@castarco Oh, that doesn't count. The regex finds a match, so there is no catastrophic backtracking.

(As for semantic correctness, both regexes are wrong.)

barubary,

@castarco PS: You can easily get the opposite result by changing the test string: https://regex101.com/r/pRY7Gw/1

(3720 steps for <script>.*?</script>, 30 steps for <script>.*</script>)

  • All
  • Subscribed
  • Moderated
  • Favorites
  • til
  • kavyap
  • ngwrru68w68
  • osvaldo12
  • DreamBathrooms
  • mdbf
  • magazineikmin
  • thenastyranch
  • Youngstown
  • khanakhh
  • everett
  • slotface
  • tacticalgear
  • rosin
  • normalnudes
  • megavids
  • Leos
  • GTA5RPClips
  • ethstaker
  • InstantRegret
  • cubers
  • modclub
  • Durango
  • provamag3
  • cisconetworking
  • tester
  • anitta
  • JUstTest
  • lostlight
  • All magazines