Catastrophic Backtracking
Catastrophic Backtracking (灾难性回溯)
At first sight, some regular expressions seem simple but may take too much time to execute. That can even make the engine of JavaScript to “hang”. In such cases, the browser suggests killing the script and reloading the page. But, it’s not a good idea. for JavaScript engine it can become a fragility. (乍一看,一些正则表达式看起来很简单,但执行起来可能花费太多时间。 这甚至可以使JavaScript引擎“挂起”。 在这种情况下,浏览器建议删除脚本并重新加载页面。 但是,这不是一个好主意。对于JavaScript引擎来说,它可能会变得脆弱。)
In this chapter, we will show you the ways to overcome such situations. (在本章中,我们将向您展示克服这些情况的方法。)
An Example of a Regexp
An Example of a Regexp (正则表达式示例)
For a better understanding of the issue, let’s start at examples. (为了更好地了解问题,让我们从示例开始。)
Imagine having a string and intending to check whether it involves words with an optional \s? space after each of them. You can use the ^(\w+\s?)*$ regexp, specifying 0 or more words, like this:
let regexp = /^(\w+\s?)*$/;
console.log(regexp.test("Welcome to w3cdoc")); // true
console.log(regexp.test("Bad characters: $@#")); // false
This code works, and the result is correct. But, it takes too much time on certain things. (此代码有效,结果正确。但是,在某些事情上花费太多时间。)
Running the code below, there won’t be anything, as the JavaScript engine will “hang”. After a while, it will suggest reloading the page, so you should be careful:
let regexp = /^(\w+\s?)*$/;
let str = "The input string, that takes a lot of time or to hang when makes this regex";
// taking a lot of time
console.log(regexp.test(str));
Not all regular expression engines can handle it. (并非所有正则表达式引擎都可以处理它。)
Words and Strings
Words and Strings (单词和字符串)
So, the reason of the hanging is that a word may be represented as one \w+ or multiple:
(input) (输入)
(inpu)(t) ((inpu) (t))
(inp)(u)(t) ((inp) (u) (t))
(in)(p)(ut) ((in) (p) (ut))
…
As the string ends with an exclamation sign !, it’s obvious that there is no match. But, at the end, the regular expression considers a \w character or a space \s. But the engine is not aware of that. (由于字符串以感叹号结尾! ,很明显没有匹配项。但是,在最后,正则表达式考虑一个\ w字符或一个空格。但发动机并不知道这一点。)
As there are too many combinations to search for, it takes too much time. (由于需要搜索的组合太多,因此需要花费太多时间。)
Solutions
Solutions (服务 )
There exist two primary solutions to the problem. (这个问题有两个主要解决方案。)
The first solution is to lower the number of possible combinations. For that purpose, you can just rewrite the regexp, using its equivalent, like in the example below:
let regexp = /^(\w+\s)*\w*$/;
let str = "The input string, that takes a lot of time or to hang when makes this regex";
console.log(regexp.test(str)); // false
As you can see in the example above, the * goes after \w+\s instead of \w+\s?. (正如您在上面的示例中看到的, *位于\ w +\ s之后,而不是\ w +\ s?。)
Representing one word of the string with multiple successive \w+ becomes impossible. (用多个连续的\ w +表示字符串中的一个单词变得不可能。)
The (\w+\s?)* pattern can match the word string as two \w+, like this:
\w+\w+ (% w/w)
string (字符串)
In this case, there may be \w+\s or \w+\s\w+\s, and not \w+\w+. (在这种情况下,可能有\ w +\ s或\ w +\ s\ w +\ s ,而不是\ w +\ w +。)
Preventing Backtracking
Preventing Backtracking (防止回溯)
The second solution is forbidding to backtrack for the quantifier. (第二种解决方案是禁止回溯量词。)
The engine of the regexp tries different combinations that can be wrong for a human. (正则表达式的引擎尝试不同的组合,这对人类来说可能是错误的。)
For instance, it’s obvious that in the (\d+)*$ regexp + couldn’t backtrack. (例如,很明显,在(\ d +) * $ regexp +中无法回溯。)
Replacing one \d+ with two separate \d+\d+ will not change anything:
\d+…….. (D)
(123456789)! (123456789)
\d+…\d+…. (%d天)
(1234)(56789)! ((1234) (56789) !)
To prevent the problems, modern engines use greedy quantifiers that don’t backtrack. (为了防止这些问题,现代引擎使用了不回溯的贪婪量词。)
Also, there are so-called “atomic capturing groups” used to disable backtracking in parentheses. However, they are not supported by JavaScript. But, another way is efficiently used by JavaScript. (此外,还有所谓的“原子捕获组” ,用于禁用括号中的回溯。但是, JavaScript不支持它们。但是, JavaScript有效地使用了另一种方式。)
Lookahead
Lookahead is used for preventing backtracking. (Lookahead用于防止回溯。)
Let’s interpret how it works:
Lookahead ?= searches for the longest word \w+, beginning at the current position. The parentheses’ contents with ?=… is not remembered by the engine. So, wrapping into parentheses will make the engine memorize the contents. It is referenced in the pattern as \1 by all of us. (Lookahead? =从当前位置开始搜索最长的单词\ w +。 引擎不会记住括号中带有? =…的内容。 因此,包裹在括号中将使引擎记住内容。 在模式中,我们所有人都将其称为\ 1。)
So, when there is a word \w+, you should match it as \1. (因此,当有一个单词\ w +时,您应该将其匹配为\ 1。)
The reason is that a lookahead detects a word \w+ as a whole and you capture it in the pattern using \1. Essentially, a possessive plus + quantifier is implemented, capturing the whole word and not a part of it. (原因是前瞻检测到一个单词\ w +作为一个整体,而您使用\ 1在模式中捕获它。本质上,实现了占有式加号+量词,捕获整个单词,而不是其中的一部分。)
For example, in the word JavaScript, it can not only match, leaving out Java to correspond to the rest part of the pattern. (例如,在单词JavaScript中,它不仅可以匹配,还可以省略Java以对应模式的其余部分。)
Let’s see the following example:
console.log("JavaScript".match(/\w+Script/)); // JavaScript
console.log("JavaScript".match(/(?=(\w+))\1Script/)); // null
The first option of the example above, the \w+ captures the word JavaScript totally, then + backtracks it character by character, trying to match the rest of the pattern, till \w+ matches Java. In the second option, (?=(\w+)) searches for and detects JavaScript, which is included in the pattern as a whole by \1. So, no way remains for finding Script after it. (上面示例的第一个选项,\ w +完全捕获单词JavaScript ,然后+逐个字符回溯,尝试匹配模式的其余部分,直到\ w +匹配Java。在第二个选项中, (? = (\ w +))搜索并检测JavaScript ,\ 1将其作为一个整体包含在模式中。因此,没有办法在它之后找到Script。)
If you put a more complicated regexp to (?=(\w+))\1 rather than \w, you should prevent backtracking for + after that. (如果将更复杂的正则表达式置于(? = (\ w +))\ 1而不是\ w ,则在此之后应避免回溯+。)
In the example below, lookahead is used for preventing backtracking:
let regexp = /^((?=(\w+))\2\s?)*$/;
console.log(regexp.test("Welcome to w3cdoc")); // true
let str = "The input string, that takes a lot of time or to hang when makes this regex";
console.log(regexp.test(str)); // false, works and fast!
In the example above, instead of \1 is used \2 as additional outer parentheses are there. To prevent a mess with numbers, you can name the parentheses, like here:
// parentheses are named ?<word>, referenced as \k<word>
let regexp = /^((?=(?<word>\w+))\k<word>\s?)*$/;
let str = "The input string, that takes a lot of time or to hang when makes this regex";
console.log(regexp.test(str)); // false
console.log(regexp.test("A correct string")); // true
Summary
Summary (概要)
In this chapter, we covered a common problem in JavaScript, known as “catastrophic backtracking”. (在本章中,我们介绍了JavaScript中的一个常见问题,即“灾难性回溯”。)
Here we overviewed the two main solutions to that problem: lowering the possible combinations count and preventing backtracking.
The most common and efficient way to prevent backtracking is using lookahead. (防止回溯的最常见和最有效的方法是使用Lookahead。)