Greedy and Lazy Quantifiers
Greedy and Lazy Quantifiers (贪婪和懒惰量词)
There are two operation modes for quantifiers in JavaScript. (JavaScript中的量词有两种操作模式。)
In this chapter, we will see how the search works with greedy and lazy quantifiers. (在本章中,我们将了解搜索如何使用贪婪和懒惰的量词。)
Imagine you have a text and need to replace all the quotes “…” with guillemet marks «…». For instance, “Welcome to w3cdoc”, must become «Welcome to w3cdoc». (假设您有一个文本,需要将所有引号“…”替换为guillemet marks «…»。例如, “欢迎使用w3cdoc”必须变为«欢迎使用w3cdoc»。)
Here, the first step should be locating the quoted strings and then replacing them. At first sight, the regular expression /".+"/g can seem appropriate but it is not. (在这里,第一步应该是找到引用的字符串,然后替换它们。 乍一看,正则表达式/". +"/g似乎是合适的,但事实并非如此。)
Let’s see an example:
let regexp = /".+"/g;
let str = 'a "Javascript" and "Css" books';
console.log(str.match(regexp)); // "Javascript" and "Css"
So, you can see that is doesn’t work properly, in the example above. (因此,您可以看到,在上面的示例中,这是无法正常工作的。)
It finds one “Javascript” and “Css” instead of two matches “Javascript” and “Css”. (它找到一个“Javascript”和“Css” ,而不是两个匹配的“Javascript”和“Css”。)
For the proper work, it is necessary to use the greedy search. (为了正确的工作,有必要使用贪婪的搜索。)
Greedy Search
Greedy Search (贪婪搜索)
A regular expression engine uses the algorithm below for finding a match:
For each position in the string, the pattern is matched at that position. (-对于字符串中的每个位置,图案在该位置匹配。)
In case there is not any match, it is necessary to go to the next position. (-如果没有任何匹配项,则需要转到下一个位置。)
Let’s see how the search will work for the “.+” pattern. (让我们看看搜索将如何用于“. +”模式。)
The initial pattern character is the " quote. The regular expression attempts to detect it at the zero position of the source string a “Javascript” and “Css” books , yet there is a between. So no match will be found at once:
The quote is found. Afterward, the engine attempts to find a match for the rest of the pattern. As in this case, the character is a dot, the string letter will be ‘J’. The process is demonstrated in the picture below:
Because of the quantifier .+, the dot repeats. The regexp engine adds one character after another to the match. It stops when the end of the string is reached:
In this stage, the engine ends repeating .+, trying to find the next character of the pattern. It will be the " quote. But a problem has occurred: the string has added, and no more characters have remained. The regexp engine understands that has got too many .+ and starts backtracking. That is, it cuts the match for the qualifier by a single character, like in the picture below:
So, it assumes that .+ finishes one character before the string end, trying to match the rest of the pattern from that position. In case there was a quote there, the search would finish but ’s’ is the last character. Therefore, the engine reduces the number of the repetitions .+ by another character. It looks like this:
So, the ‘"’ quote doesn’t correspond to ‘k’. The engine continues to backtrack, reducing the count of repetition for ‘.’ till the rest of the pattern matches.
The full match is there. The initial match is “Javascript” and “Css”. In case the regexp has a g flag, then the search will keep on from the place the first match ends.
So, it can be assumed that in the greedy mode, the quantifier replays as much as possible. (因此,可以假设在贪婪模式下,量词尽可能多地重放。)
Now, let’s see what a lazy mode can do. (现在,让我们看看懒惰模式可以做什么。)
Lazy Mode
Lazy Mode (慵懒模式)
This is the opposite of the greedy mode. It considers repeating a minimal number of times. It can be enabled by using a question mark ‘?’ after the quantifier. As a result, it becomes either *? or +? and even ?? for ‘?’. The /".+?"/g regexp works properly, finding “Javascript” and “Css” , like this:
let regexp = /".+?"/g;
let str = 'a "Javascript" and "Css" books';
console.log(str.match(regexp)); // "Javascript" and "Css"
Now, let’s investigate it step-by-step. (现在,让我们逐步调查一下。)
The initial step is finding the pattern start ‘"’ in the third position:
The second step is finding a match for the dot ‘.’, like this:
As we have a lazy mode for +?, the engine doesn’t attempt to match a dot once more but stops and tries to match the rest of the pattern ‘"’:
then the regexp engine raises the number of repetitions for the dot attempting once more:
Again there is a failure. And the number of repetitions raises again and again. So, it happens until the rest of the pattern is found.
The following search starts begins from the end of the current match, yielding another result:
So, the lazy mode is only enabled for the quantifier with ?. (因此,懒惰模式仅对带有?的量词启用。)
The rest of the quantifiers stay greedy. (其余的量词保持贪婪。)
For instance:
console.log("123 456".match(/\d+ \d+?/)); // 123 4
Optimizations
Modern regexp engines are capable of optimizing internal algorithms for faster work. So, it is likely that they will work differently from the algorithm, described above. As a rule, they are used internally for optimizing things. Complex regular expressions are not easy to optimize, so for them, the search can work exactly as described. (现代正则表达式引擎能够优化内部算法以加快工作速度。 因此,它们的工作方式很可能与上述算法不同。 通常,它们在内部用于优化事物。 复杂的正则表达式不容易优化,因此对于它们来说,搜索可以完全按照描述工作。)
An Alternative Approach
An Alternative Approach (替代方法)
Regular expressions usually provide several ways to do the same thing. In the example below, quoted strings are found without using the lazy mode, like this:
let regexp = /"[^"]+"/g;
let str = 'a "Javascript" and "Css" books';
console.log(str.match(regexp)); // "Javascript" and "Css"
The regexp “[^”]+" leads to correct results as it searches for a quote ‘"’ followed by a single or more non-quotes [^"], and, finally, the closing quote. Once the regexp engine searches for [^"]+, it stops the repetitions when it comes across the closing quote. (正则表达式“[^”] + ”在搜索引号“ ”时会导致正确的结果,后跟单个或多个非引号[^”] ,最后是右引号。 一旦正则表达式引擎搜索[^ “] + ,它就会在遇到右引号时停止重复。)
Please, take into account that this logic will not replace lazy quantifiers. In another example, the lazy quantifiers fail and this option works properly. For example, if you intend to find links of the form <a href=”…" class=“doc”> along with any href.
Let’s consider an example of using /<a href=".*" class=“doc”>/g:
let str = '...<a href="link" class="className">...';
let regexp = /<a href=".*" class="className">/g;
alert(str.match(regexp));
This example works. But let’s see what will happen if lots of links exist in the text:
let str = '...<a href="link1" class="className">... <a href="link2" class="className">...';
let regexp = /<a href=".*" class="className">/g;
// There're two links in a single match
alert(str.match(regexp)); // <a href="link1" class="className">... <a href="link2" class="className">
So, in this case, the quantifier .* again received too many characters. The match will be as follows:
<a href="....................................." class="className">
<a href="link1" class="className">... <a href="link2" class="className">
Let’s see what will happen if we try to make the quantifier .*? lazy:
let str = '...<a href="link1" class="className">... <a href="link2" class="className">...';
let regexp = /<a href=".*?" class="className">/g;
alert(str.match(regexp)); // <a href="link1" class="className">, <a href="link2" class="className">
Now, two matches appeared:
<a href="....." class="className"> <a href="....." class="className">
<a href="link1" class="className">... <a href="link2" class="className">
Summary
Summary (概要)
Two modes of work exist for quantifiers: greedy and lazy. In the greedy mode, the regular expression engine attempts to repeat the quantifier as many times as possible. For example, \d+ will consume all the possible digits. When consuming more is impossible, it keeps on matching the rest of the pattern. If no match is found, it decreases the number of repetitions. The lazy mode is activated by the ? mark after the quantifier. The regular expression engine attempts to match the rest of the pattern before every repetition.