Sets and ranges
Let’s get deeper into the details of regular expressions. This chapter shows how to use sets and ranges in JavaScript.
Putting several characters or character classes inside square brackets means “search for any one character among the given ones”.
To be precise, let’s consider an example. Here, [lam] means any one of the three characters “l”, “a”, or “m”. This is known as a “set”. Sets can be combined with regular characters in a regexp like this:
// find [w], and then "3cdoc"
console.log("Welcome to w3cdoc".match(/[w]3cdoc/gi)); // ["w3cdoc"]
Although a set lists multiple characters, it corresponds to exactly one character in the match.
So, there are no matches in the example below:
// find "W", then [3 or D], then "ocs"
console.log("W3Docs".match(/W[3D]ocs/)); // null, no matches
The pattern looks for W, then one of the characters in [3D], and, finally, ocs.
So it could match W3ocs or WDocs, but not W3Docs, where two characters sit between W and ocs.
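The point that a set occupies exactly one position can be checked against several candidate strings at once. The following is a minimal sketch; the helper name `matchesSet` and the anchored pattern are illustrative assumptions, not part of the original example.

```javascript
// Hypothetical helper: true when the whole string is "W", one character from [3D], then "ocs".
const setPattern = /^W[3D]ocs$/;
const matchesSet = (str) => setPattern.test(str);

console.log(matchesSet("W3ocs"));  // true: "3" fills the single set position
console.log(matchesSet("WDocs"));  // true: "D" fills the single set position
console.log(matchesSet("W3Docs")); // false: "3D" is two characters, the set matches only one
```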
Ranges
Square brackets can also include the so-called character ranges.
For example, [a-m] is a character in range from “a” to “m”, and [0-7] is a digit from “0” to “7”.
Let’s see an example where “x” is followed by two characters, each a digit or a letter from A to F.
console.log("Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g)); // xAF
So, in the example above, [0-9A-F] includes two ranges: it looks for a character that is either a digit from “0” to “9” or a letter from “A” to “F”.
If you also want to match lowercase letters, you can either add the a-f range to the set, as in [0-9A-Fa-f], or add the i flag.
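The two ways of accepting lowercase hex digits can be compared side by side. This is a small sketch; the sample string and variable names are assumptions chosen for illustration.

```javascript
// Sample input containing one lowercase and one uppercase hex value.
const hexText = "Values: 0xaf and 0xAF";

// Uppercase range only: the lowercase value is skipped.
const upperOnly = hexText.match(/x[0-9A-F][0-9A-F]/g);
// Extra a-f range inside the set.
const withRange = hexText.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);
// Case-insensitive i flag instead of the extra range.
const withFlag = hexText.match(/x[0-9A-F][0-9A-F]/gi);

console.log(upperOnly); // ["xAF"]
console.log(withRange); // ["xaf", "xAF"]
console.log(withFlag);  // ["xaf", "xAF"]
```

Note that the i flag also makes the literal x case-insensitive, which the extra a-f range does not.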
Inside […], you can also use character classes.
For example, to search for a word character \w or a hyphen -, the set is [\w-]. You can also combine different classes, such as [\s\d], which means “a whitespace character or a digit”.
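Both sets can be tried on short inputs. The strings below are made-up samples for illustration only.

```javascript
// [\w-]+ : runs of word characters and hyphens
const words = "user-name_42 joined at 10 30".match(/[\w-]+/g);
console.log(words); // ["user-name_42", "joined", "at", "10", "30"]

// [\s\d] : a single whitespace character or digit
const spacesAndDigits = "a1 b2".match(/[\s\d]/g);
console.log(spacesAndDigits); // ["1", " ", "2"]
```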
Multilanguage \w
As \w is a shorthand for [a-zA-Z0-9_], it cannot find Cyrillic letters, Chinese characters, and so on.
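A more universal pattern can be written to match word characters in any language. With Unicode properties, it’s quite easy:
[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}].
Let’s interpret it. Like \w, it covers several kinds of characters, each selected by a Unicode property: Alphabetic (Alpha) for letters, Mark (M) for accents, Decimal_Number (Nd) for digits, Connector_Punctuation (Pc) for the underscore and similar characters, and Join_Control (Join_C) for the two special code points 200c and 200d used in ligatures. Here is an example: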
let regexp = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
let str = `Welcome 你好 123`;
// finds all the digits and letters:
console.log(str.match(regexp));
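To see the difference from plain \w, the same text can be matched with both patterns. This is a sketch under the assumption of a runtime that supports Unicode property escapes (the u flag with \p{…}); the sample string and variable names are illustrative.

```javascript
// Unicode-aware "word character" set, repeated with + to collect whole words.
const wordLike = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;
const sample = "Привет 你好 hello_1";

const asciiWords = sample.match(/\w+/g);
const unicodeWords = sample.match(wordLike);

console.log(asciiWords);   // ["hello_1"]
console.log(unicodeWords); // ["Привет", "你好", "hello_1"]
```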
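Ranges and the u flag
If a set contains surrogate pairs, the u flag is required for it to work correctly. Characters such as 𝒳 and 𝒴 are stored as surrogate pairs, that is, two 16-bit code units each.
For example, let’s search for [𝒳𝒴] in the string 𝒳:
console.log('𝒳'.match(/[𝒳𝒴]/)); // shows a strange character, like [?]
The result is wrong because, by default, regular expressions do not know about surrogate pairs. The engine treats [𝒳𝒴] as four characters rather than two: the left half of 𝒳 (1), the right half of 𝒳 (2), the left half of 𝒴 (3), the right half of 𝒴 (4).
Their codes can be seen as follows:
for (let i = 0; i < '𝒳𝒴'.length; i++) {
  console.log('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}
So, the left half of 𝒳 is found and shown.
Adding the u flag makes the search work properly:
console.log('𝒳'.match(/[𝒳𝒴]/u)); // 𝒳
The same thing happens while searching for a range like [𝒳-𝒴].
Forgetting to add the u flag will lead to an error, like this:
console.log('𝒳'.match(/[𝒳-𝒴]/)); // Error: Invalid regular expression
With the u flag, the pattern works correctly:
// search for characters from 𝒳 to 𝒴
console.log('𝒴'.match(/[𝒳-𝒴]/u)); // 𝒴
The error happens because without the u flag surrogate pairs are treated as two characters each, so [𝒳-𝒴] is read as [<55349><56499>-<55349><56500>]. The range in the middle, 56499-55349, runs from a larger code to a smaller one, which is invalid.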
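The effect of surrogate pairs on ranges can be reproduced programmatically. The sketch below assumes two characters outside the Basic Multilingual Plane, U+1D4B3 and U+1D4B4, written with \u{…} escapes to avoid encoding issues; the variable names are illustrative.

```javascript
const first = "\u{1D4B3}"; // mathematical script capital X
const last = "\u{1D4B4}";  // mathematical script capital Y

// One visible character, but two UTF-16 code units.
console.log(first.length);      // 2
console.log([...first].length); // 1

// Without the u flag the range is built from surrogate halves and is rejected.
let errorName = null;
try {
  new RegExp(`[${first}-${last}]`);
} catch (e) {
  errorName = e.name;
}
console.log(errorName); // "SyntaxError"

// With the u flag the range is built from whole code points.
const inRange = new RegExp(`[${first}-${last}]`, "u").test(last);
console.log(inRange); // true
```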