Sets and ranges [...]

Let’s get deeper into the details of regular expressions. In this chapter, we will show you how to use sets and ranges in JavaScript.

Putting several characters or character classes inside square brackets […] means "search for any one of the given characters".

For example, [lam] means any of the three characters 'l', 'a', or 'm'. This is called a "set". Sets can be used alongside regular characters in a regexp, like this:

// find [w], and then "3docs" (case-insensitive because of the i flag)
console.log("Welcome to W3Docs".match(/[w]3docs/gi)); // ["W3Docs"]

Although a set lists multiple characters, it corresponds to exactly one character in the match.

So, there are no matches in the example below:

// find "W", then [3 or D], then "ocs"
console.log("W3Docs".match(/W[3D]ocs/)); // null, no matches

The pattern looks for W, then exactly one of the characters in [3D], and, finally, ocs.

So, it would match W3ocs or WDocs, but not W3Docs, where two characters stand between W and ocs.
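As a quick illustration of the one-character rule, here is a minimal sketch. The sample strings and variable names are made up for this example.

```javascript
// A set stands for exactly one character chosen from the listed ones.
const setMatches = "Mop top hop".match(/[tm]op/gi);
console.log(setMatches); // ["Mop", "top"]

// Two set characters in a row cannot fit into a single [tm] position.
const doubled = "tmop".match(/^[tm]op$/);
console.log(doubled); // null

// Repeating the set (or adding a quantifier) allows more characters.
const repeated = "tmop".match(/^[tm][tm]op$/);
console.log(repeated !== null); // true
```

To allow more than one character from a set, the set has to be repeated or followed by a quantifier.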

Ranges

Square brackets can also include the so-called character ranges.

For example, [a-m] is a character in range from “a” to “m”, and [0-7] is a digit from “0” to “7”.

Let's see an example where "x" is followed by two characters, each a digit or a letter from A to F:

console.log("Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g)); // xAF

So, in the example above, [0-9A-F] includes two ranges: it looks for a character that is either a digit from "0" to "9" or a letter from "A" to "F".

If you also want to match lowercase letters, you can either add the a-f range, giving [0-9A-Fa-f], or add the i flag.
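The two options can be compared side by side. This is only a sketch, and the sample text and variable names are assumptions chosen for illustration.

```javascript
const text = "Colors: 0xAF, 0xaf, 0xZZ";

// Uppercase range only: the lowercase "xaf" is skipped.
const upperOnly = text.match(/x[0-9A-F][0-9A-F]/g);
console.log(upperOnly); // ["xAF"]

// Option 1: add the a-f range to the set.
const withRange = text.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);
console.log(withRange); // ["xAF", "xaf"]

// Option 2: keep the set as it is and add the i flag.
const withFlag = text.match(/x[0-9A-F][0-9A-F]/gi);
console.log(withFlag); // ["xAF", "xaf"]
```

Note that the i flag makes the whole pattern case-insensitive, including the leading x, while the extra range affects only the set.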

Inside […], you can also use character classes.

For example, if you want to search for a word character \w or a hyphen -, the set is [\w-]. You can also combine different classes, such as [\s\d], which means "a whitespace character or a digit".
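A short sketch of both sets in use follows. The input strings and variable names are hypothetical and serve only as an illustration.

```javascript
// [\w-] accepts word characters (letters, digits, underscore) and the hyphen.
const slugParts = "my-post_2024 title".match(/[\w-]+/g);
console.log(slugParts); // ["my-post_2024", "title"]

// [\s\d] accepts either a whitespace character or a digit.
const spacesAndDigits = "a 1b2".match(/[\s\d]/g);
console.log(spacesAndDigits); // [" ", "1", "2"]
```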

Multilanguage \w

As \w is a shorthand for [a-zA-Z0-9_], it can't find Cyrillic letters, Chinese characters, and so on.

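A more universal pattern can be written that searches for word characters in any language. With Unicode properties, it's quite easy:

[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]

Let's decipher it. Like \w, it covers several kinds of characters, each described by a Unicode property: Alphabetic (Alpha) for letters, Mark (M) for accents, Decimal_Number (Nd) for digits, Connector_Punctuation (Pc) for the underscore '_' and similar characters, and Join_Control (Join_C) for the two special code points U+200C and U+200D used in ligatures. Here is an example: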

let regexp = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
let str = `Welcome 你好 123`;
// finds all the digits and letters:
console.log(str.match(regexp));
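To make the difference from \w visible, the following sketch runs both patterns over the same mixed-language string. The sample string and variable names are assumptions for illustration. The \p{…} syntax works only together with the u flag.

```javascript
const mixed = "Привет 你好 hi_42";

// \w covers only [a-zA-Z0-9_], so the Cyrillic and Chinese words are skipped.
const asciiWords = mixed.match(/\w+/g);
console.log(asciiWords); // ["hi_42"]

// The Unicode-aware set treats letters and digits of any script as word characters.
const unicodeWord = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;
const allWords = mixed.match(unicodeWord);
console.log(allWords); // ["Привет", "你好", "hi_42"]
```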

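Ranges and the u flag

If a set contains surrogate pairs, that is, characters outside the Basic Multilingual Plane such as 𝒳 and 𝒴, the u flag is required for them to work correctly.

For example, let's search for [𝒳𝒴] in the string 𝒳:

console.log('𝒳'.match(/[𝒳𝒴]/)); // shows a strange character, like [?]

The result is wrong because, without the u flag, the regular expression engine knows nothing about surrogate pairs. It treats [𝒳𝒴] as a set of four characters rather than two: the left half of 𝒳 (1), the right half of 𝒳 (2), the left half of 𝒴 (3), and the right half of 𝒴 (4).

Their codes can be seen as follows:

for (let i = 0; i < '𝒳𝒴'.length; i++) {
  console.log('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}

So, the search above finds and shows only the left half of 𝒳.

Adding the u flag makes it work properly:

console.log('𝒳'.match(/[𝒳𝒴]/u)); // 𝒳

The same thing happens when searching for a range like [𝒳-𝒴].

Forgetting to add the u flag leads to an error:

console.log('𝒳'.match(/[𝒳-𝒴]/)); // SyntaxError: Invalid regular expression

The error happens because, without the u flag, each surrogate pair is treated as two characters, so [𝒳-𝒴] is read as [<55349><56499>-<55349><56500>]. The range 56499-55349 in the middle is invalid, because its starting code is greater than its end code.

With the u flag, the pattern works correctly:

// search for characters from 𝒳 to 𝒴
console.log('𝒴'.match(/[𝒳-𝒴]/u)); // 𝒴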
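The effect of surrogate pairs can be checked with a small sketch. It assumes the characters 𝒳 (U+1D4B3) and 𝒴 (U+1D4B4), chosen here because they lie outside the Basic Multilingual Plane and are therefore stored as two UTF-16 code units each. The variable names are hypothetical. The patterns are built with the RegExp constructor so that the invalid range can be caught at runtime.

```javascript
const X = "\u{1D4B3}"; // 𝒳, stored as a surrogate pair
const Y = "\u{1D4B4}"; // 𝒴, stored as a surrogate pair
console.log(X.length); // 2

// Without the u flag, the set contains four separate halves,
// so the match is a single lone surrogate.
const halfMatch = X.match(new RegExp(`[${X}${Y}]`))[0];
console.log(halfMatch.length); // 1

// With the u flag, the whole character is matched.
const fullMatch = X.match(new RegExp(`[${X}${Y}]`, "u"))[0];
console.log(fullMatch === X); // true

// A range of such characters is rejected without the u flag.
let rangeError = null;
try {
  new RegExp(`[${X}-${Y}]`);
} catch (err) {
  rangeError = err;
}
console.log(rangeError instanceof SyntaxError); // true

// With the u flag, the same range is valid.
const rangeOk = new RegExp(`[${X}-${Y}]`, "u").test(Y);
console.log(rangeOk); // true
```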


