Regular Expressions
A regular expression is an object that describes a textual pattern. In JavaScript, regular expressions are represented by the RegExp class. You can create one with the RegExp() constructor or with a regular expression literal, which encloses the expression within a pair of slashes: /.../
let regExpLiteral = /JavaScript/;
let regExpConstructor = new RegExp("ECMAScript");
All alphabetic characters and digits match themselves literally. JavaScript also supports certain non-alphabetic characters through escape sequences that begin with a backslash (\).
Sequence | Match |
---|---|
\0 | NUL character (\u0000) |
\t | Tab (\u0009) |
\n | Newline (\u000A) |
\v | Vertical tab (\u000B) |
\f | Form feed (\u000C) |
\r | Carriage return (\u000D) |
\xnn | Latin character specified by the hexadecimal number nn |
\uxxxx | Unicode character specified by the hexadecimal number xxxx |
\u{n} | Unicode character specified by the codepoint n (works only with u flag) |
The punctuation characters ^ $ . * + ? = ! : | \ / ( ) [ ] { } have special meanings in regular expressions. To match one of them literally, escape it with a backslash (\).
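To illustrate escaping, here is a minimal sketch of a helper that puts a backslash in front of every special character in a piece of text, so that the text can be matched literally. The name escapeRegExp and the sample strings are assumptions made for this example; they are not part of any API described here.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a pattern.
function escapeRegExp(text) {
  // In the replacement string, $& stands for the matched character itself.
  return text.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

const price = '$9.99 (approx.)';
const literalPrice = new RegExp(escapeRegExp(price));

literalPrice.test('Total: $9.99 (approx.)'); // => true
literalPrice.test('Total: $9a99 (approx.)'); // => false: the dot is no longer a wildcard
```

Without the escaping step, the $ in the sample text would act as an end-of-string anchor and the parentheses would form a group, so the pattern would not match the text it was built from.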
Individual characters can be combined into character classes by enclosing them in square brackets. For example, /[abc]/ matches any one of the letters a, b, or c. A negated character class is defined with a caret ^ as the first character after the left bracket; for example, /[^abc]/ matches any one character other than a, b, or c. A hyphen indicates a range of characters, so /[a-zA-Z0-9]/ matches any letter or digit from the Latin alphabet. The most common character classes have their own escape sequences, and you can also define your own Unicode character classes. For example, /[\u0400-\u04FF]/ matches any one Cyrillic character.
Sequence | Match |
---|---|
[...] | Any one character between the brackets |
[^...] | Any one character not between the brackets |
. | Any character except newline; if s flag is used, then match any character including newline |
\w | Any ASCII word character; same as [a-zA-Z0-9_] |
\W | Any character that is not an ASCII word character; same as [^a-zA-Z0-9_] |
\s | Any Unicode whitespace character |
\S | Any character that is not Unicode whitespace |
\d | Any ASCII digit; same as [0-9] |
\D | Any character other than an ASCII digit; same as [^0-9] |
[\b] | A literal backspace character |
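The following sketch exercises a few of these character classes. The variable names and sample strings are chosen only for illustration.

```javascript
const hexDigit = /[0-9a-fA-F]/;     // a class built from ranges
const notVowel = /[^aeiou]/;        // a negated class
const cyrillic = /[\u0400-\u04FF]/; // a class built from a Unicode range

hexDigit.test('c');      // => true
hexDigit.test('g');      // => false
notVowel.test('aei');    // => false: every character is a vowel
notVowel.test('aeix');   // => true: 'x' is not a vowel
cyrillic.test('Привет'); // => true
cyrillic.test('Hello');  // => false

'a1_b2'.match(/\w/g).length;   // => 5: letters, digits, and underscore are word characters
'a b\tc\n'.replace(/\s/g, ''); // => 'abc': space, tab, and newline are whitespace
'Room 101'.replace(/\D/g, ''); // => '101': everything except digits is removed
```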
When the u flag is used, the Unicode character class \p{...} and its negation \P{...} are also supported.
Sequence | Match |
---|---|
\p{Decimal_Number} | Any decimal digit from any of the world's writing systems |
\P{Decimal_Number} | Any character that is not a decimal digit in any writing system |
\p{Number} | Any number-like character, including fractions and Roman numerals |
\p{Alphabetic} | Any alphabet character, such as Latin, Cyrillic, Greek, Chinese, and others |
\p{Script=Greek} | Any character that belongs to the Greek alphabet |
\p{Script=Cyrillic} | Any character that belongs to the Cyrillic alphabet |
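A short sketch of these Unicode classes in use follows. The sample characters are written as escapes and were picked for illustration. Property names are case-sensitive, so the spelling must be exactly Decimal_Number, and the u flag is required.

```javascript
const arabicIndicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE
const romanEight = '\u2167';       // ROMAN NUMERAL EIGHT
const lambda = '\u03BB';           // GREEK SMALL LETTER LAMDA

/\d/.test(arabicIndicThree);                  // => false: \d covers ASCII digits only
/\p{Decimal_Number}/u.test(arabicIndicThree); // => true
/\p{Decimal_Number}/u.test(romanEight);       // => false: a numeral, but not a decimal digit
/\p{Number}/u.test(romanEight);               // => true
/\p{Script=Greek}/u.test(lambda);             // => true
/\p{Script=Cyrillic}/u.test(lambda);          // => false
/\P{Alphabetic}/u.test('42');                 // => true: digits are not alphabetic
```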
Repetition in regular expressions is expressed with special characters. These repetition characters are "greedy", meaning they match as many times as possible while still allowing any following parts of the pattern to match. To make them non-greedy, follow the repetition character with a question mark: ??, +?, *?, or {1,5}?
Sequence | Match |
---|---|
{n,m} | Match the previous item at least n times but no more than m times |
{n,} | Match the previous item n or more times |
{n} | Match exactly n occurrences of the previous item |
? | Match zero or one occurrences of the previous item; same as {0,1} |
+ | Match one or more occurrences of the previous item; same as {1,} |
* | Match zero or more occurrences of the previous item; same as {0,} |
/\d{2,4}/; // Match between two and four digits
/\w{3}\d?/; // Match exactly three word characters and an optional digit
/\s+java\s+/; // Match 'java' with one or more whitespace characters before and after
/[^(]*/; // Match zero or more characters that are not open parentheses
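The difference between greedy and non-greedy repetition is easiest to see side by side. This sketch uses a made-up HTML fragment as input.

```javascript
const html = '<b>bold</b> and <i>italic</i>';

// Greedy: .+ runs to the last '>' in the string.
const greedy = html.match(/<.+>/)[0]; // => '<b>bold</b> and <i>italic</i>'

// Non-greedy: .+? stops at the first '>' that lets the pattern succeed.
const lazy = html.match(/<.+?>/)[0];  // => '<b>'

'aaaa'.match(/a{2,3}/)[0];  // => 'aaa': as many as allowed
'aaaa'.match(/a{2,3}?/)[0]; // => 'aa': as few as allowed
'aaaa'.match(/a*?/)[0];     // => '': zero repetitions already satisfy the pattern
```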
The | character separates alternatives. Alternatives are tried from left to right until a match is found.
/ab|cd|ef/; // Matches string 'ab' or 'cd' or 'ef'
/\d{3}|[a-z]{4}/; // Matches either three digits or four lowercase letters
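Because alternatives are tried from left to right, their order can change the result. A small sketch with illustrative inputs:

```javascript
'ab'.match(/a|ab/)[0]; // => 'a': the first alternative succeeds, so 'ab' is never tried
'ab'.match(/ab|a/)[0]; // => 'ab'

// All alternatives are tried at one position before the search moves right,
// so the earliest position in the string wins over the order of alternatives.
'room 1234'.match(/\d{3}|[a-z]{4}/)[0]; // => 'room'
```

When one alternative is a prefix of another, putting the longer one first is a common way to avoid surprises.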
Parentheses group separate items into a sub-expression, so that items within it can be treated as a single unit.
/java(script)?/; // Matches 'java' followed by the optional 'script'
/[a-z]+(\d+)/; // Focus on extracting the digits that follow letters
Parentheses also capture the portion of the target string that matched the sub-pattern inside them, so you can extract it later or refer back to it within the same pattern with a backslash followed by the group number, such as \1. If you do not need a reference to the sub-expression, begin the group with (?: and end it with )
/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/; // ([Ss]cript) is referred to as \2
/(['"])[^'"]*\1/; // Match zero or more characters within matched quotes
/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/; // (?:[Ss]cript) does not produce reference
Named capture groups allow you to reference a group by name rather than by position. To name a group, start it with (?<...> instead of ( and put the name between the angle brackets.
/(?<quote>['"])[^'"]*\k<quote>/; // \k<quote> is a named backreference
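To see what each kind of group captures, the patterns above can be run against sample strings. The inputs and variable names in this sketch are illustrative.

```javascript
const numbered = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;
const byNumber = 'JavaScript is funny'.match(numbered);
byNumber[1]; // => 'JavaScript'
byNumber[2]; // => 'Script'
byNumber[3]; // => 'funny'

const nonCapturing = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/;
const fewerGroups = 'JavaScript is funny'.match(nonCapturing);
fewerGroups[2]; // => 'funny': (?:...) did not take a group number

const quoted = /(?<quote>['"])[^'"]*\k<quote>/;
quoted.test('"hello"');             // => true
quoted.test('"hello\'');            // => false: the closing quote differs from the opening one
"'hi'".match(quoted).groups.quote;  // => "'"
```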
Regular expressions also support anchors and assertions, which match a position rather than characters: a word boundary \b, the start of the string ^, the end of the string $, a look-ahead assertion (?=...), and a look-behind assertion (?<=...)
Sequence | Match |
---|---|
^ | Match the beginning of the string (or the beginning of line with m flag) |
$ | Match the end of the string (or the end of line with the m flag) |
\b | Match a word boundary |
\B | Match a position that is not a word boundary |
(?=p) | A positive look-ahead assertion: require that the following characters match the pattern p, but do not include those characters in the match |
(?!p) | A negative look-ahead assertion: require that the following characters do not match the pattern p |
(?<=...) | A positive look-behind assertion |
(?<!...) | A negative look-behind assertion |
/^JavaScript$/; // Match the word 'JavaScript' on a line by itself
/\bJava\b/; // Match the word 'Java' standing as a separate word
/[Jj]ava(?=[Ss]cript)/; // Match 'Java' when it is followed by 'Script'
// Match a 5-digit zip code, but only when it follows a two-letter state abbreviation
/(?<= [A-Z]{2} )\d{5}/;
// Match a string of digits that is not preceded by a Unicode currency symbol
/(?<![\p{Currency_Symbol}\d.])\d+(\.\d+)?/u;
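The following sketch runs these anchors and assertions against sample inputs. The addresses and prices are made up for the example.

```javascript
/\bJava\b/.test('I like Java'); // => true
/\bJava\b/.test('JavaScript');  // => false: no word boundary after 'Java'

const javaBeforeScript = /[Jj]ava(?=[Ss]cript)/;
'JavaScript'.match(javaBeforeScript)[0]; // => 'Java': 'Script' is required but not consumed
javaBeforeScript.test('Java Beans');     // => false

const zip = /(?<= [A-Z]{2} )\d{5}/;
'Boston, MA 02134'.match(zip)[0]; // => '02134'
zip.test('Order 02134');          // => false: no state abbreviation before the digits

const unpriced = /(?<![\p{Currency_Symbol}\d.])\d+(\.\d+)?/u;
'$12.50 for 3 items'.match(unpriced)[0]; // => '3': '12.50' follows a currency symbol
```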
Flags modify the matching behavior of a regular expression. They appear after the second / character of a regular expression literal, or are passed as a string in the second argument to the RegExp() constructor.
Flag | Meaning |
---|---|
g | The global flag: find all matches within a string rather than find the first match. |
i | Case-insensitive matching |
m | Multi-line mode: ^ and $ anchors match both the beginning and end of the string as well as the beginning and end of individual lines within the string. |
s | . will match any character, including line terminators. |
u | Unicode flag: match full Unicode codepoints rather than 16-bit values. Without the u flag, . matches 1 UTF-16 16-bit value, but with u flag, it matches one Unicode codepoint, including those that take more than 16 bits. |
y | "Sticky" flag: match at the beginning of a string or the first character following the previous match. |
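A brief sketch of how individual flags change the outcome of a match, using illustrative inputs:

```javascript
/java/i.test('JAVA'); // => true: case is ignored

const lines = 'Java\nScript';
/^Script$/.test(lines);  // => false: ^ and $ anchor to the whole string
/^Script$/m.test(lines); // => true: ^ and $ also anchor to individual lines

/a.b/.test('a\nb');  // => false: . does not match a newline
/a.b/s.test('a\nb'); // => true

const emoji = '\u{1F600}';   // one code point stored as two UTF-16 code units
emoji.match(/./)[0].length;  // => 1: only the first 16-bit unit is matched
emoji.match(/./u)[0].length; // => 2: the whole code point is matched

const viaConstructor = new RegExp('java', 'gi');
viaConstructor.flags;                            // => 'gi'
'Java, JAVA, java'.match(viaConstructor).length; // => 3
```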
String methods for pattern matching
The search() method of a string takes a regular expression as an argument and returns either the character position of the start of the first matching substring or -1 if there is no match. If the argument to search() is not a regular expression, it is first converted to one by passing it to the RegExp() constructor. The method does not support global searches: it ignores the g flag of its argument.
"JavaScript".search(/script/ui); // => 4
'Python'.search(/script/ui); // => -1
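Two details of search() are easy to miss: a string argument is interpreted as a pattern, and the search always starts at the beginning of the string. The following sketch illustrates both; the variable name letterB is an assumption for this example.

```javascript
'a.b.c'.search('.');  // => 0: the string '.' becomes /./, which matches any character
'a.b.c'.search(/\./); // => 1: an escaped dot matches a literal dot

const letterB = /b/g;
letterB.lastIndex = 3;
'abcabc'.search(letterB); // => 1: the search starts at index 0 regardless of lastIndex
letterB.lastIndex;        // => 3: search() leaves lastIndex as it found it
```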
The replace() method performs a search-and-replace operation. It takes a regular expression as its first argument and a replacement string as its second argument. If the regular expression has the global flag g, it replaces all matches in the string with the replacement string; otherwise, it replaces only the first match it finds. The method returns a new string with the replacements applied and leaves the original string unchanged. If a $ followed by a digit appears in the replacement string, replace() substitutes the text matched by the corresponding parenthesized sub-expression. The same substitution works for named capture groups with the $<name> syntax. You can also pass a function as the second argument. In this case, the function is invoked for each match, and its return value is used as the replacement string.
'JAVAScript'.replace(/javascript/gi, 'JavaScript');
// => 'JavaScript'
'He said "stop"'.replace(/"([^"]*)"/g, '«$1»');
// => 'He said «stop»'
'He said "stop"'.replace(/"(?<quote>[^"]*)"/g, '«$<quote>»');
// => 'He said «stop»'
'15 times 15 is 225'.replace(/\d+/gu, n => parseInt(n).toString(16));
// => 'f times f is e1'
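As a further sketch, named groups can reorder parts of a match, and a replacement function receives the captured substrings as extra arguments after the full match. The date and CSS strings are made up for the example.

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
const reordered = 'Due 2024-03-15, then 2025-01-02'
  .replace(isoDate, '$<day>/$<month>/$<year>');
// reordered => 'Due 15/03/2024, then 02/01/2025'

// The function is called once per match: (fullMatch, group1, ...).
const doubled = 'width: 10px; height: 20px'
  .replace(/(\d+)px/g, (fullMatch, digits) => `${Number(digits) * 2}px`);
// doubled => 'width: 20px; height: 40px'
```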
The match() method takes a regular expression as its only argument (or converts its argument to one by passing it to the RegExp() constructor) and returns an array that contains the results of the match, or null if no match is found. If the regular expression has the global flag g, the method returns an array of all matches that appear in the string.
'78 divided by 3 equals 26'.match(/\d+/g);
// => ['78', '3', '26']
Without the global flag g, the first element of the returned array is the matching string, and any remaining elements are the substrings matching the capturing groups of the regular expression. The returned array also has index, input, and groups properties.
let url = /(?<protocol>\w+):\/\/(?<host>[\w.]+)\/(?<path>\S*)/;
let text = 'Learn JS at https://romanakchurin.com/javascript/basics/';
let match = text.match(url);
match[0]; // => 'https://romanakchurin.com/javascript/basics/'
match.input; // => 'Learn JS at https://romanakchurin.com/javascript/basics/'
match.index; // => 12
match.groups.protocol; // => 'https'
match.groups.host; // => 'romanakchurin.com'
match.groups.path; // => 'javascript/basics/'
If the match() method is used with a regular expression that has the sticky flag y, the match must begin exactly at the position given by the lastIndex property of the regular expression. That property is initially 0, so by default the match is constrained to the start of the string. You can change this by setting lastIndex to another index.
let vowel = /[aeiou]/y;
'hello'.match(vowel); // => null: 'hello' does not begin with a vowel
vowel.lastIndex = 1; // Specify a new matching position
'hello'.match(vowel)[0]; // => 'e': found a vowel at position 1
vowel.lastIndex; // => 2: lastIndex is automatically updated
'hello'.match(vowel); // => null: no vowel at position 2
vowel.lastIndex; // => 0: reset to 0 after failed match
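One use of the sticky flag is reading consecutive tokens from a string, because every match has to begin exactly where the previous one ended. The following is only a sketch: the token pattern and the tokenize function are hypothetical and handle nothing beyond integers and four operators.

```javascript
// Hypothetical token pattern: an integer or a single arithmetic operator.
const token = /\d+|[-+*\/]/y;

function tokenize(input) {
  const tokens = [];
  token.lastIndex = 0; // always start at the beginning of the input
  while (token.lastIndex < input.length) {
    const position = token.lastIndex; // saved because a failed match resets lastIndex to 0
    const found = input.match(token); // must match exactly at lastIndex
    if (found === null) {
      throw new SyntaxError(`Unexpected character at index ${position}`);
    }
    tokens.push(found[0]);
  }
  return tokens;
}

tokenize('12+34*5'); // => ['12', '+', '34', '*', '5']
```

With this pattern, an input such as '1 + 2' throws at index 1, because the space is neither a digit nor an operator and a sticky match cannot skip over it.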
The matchAll() method expects a regular expression with the global flag g and returns an iterator that yields match objects similar to those match() returns when used without the global flag g.
const words = /\b\p{Alphabetic}+\b/gu;
const example = 'matchAll() method test string';
for (let word of example.matchAll(words)) {
  console.log(`Found '${word[0]}' at index ${word.index}.`);
}
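Since matchAll() returns an iterator, its results can also be collected into an array. The sketch below does that for the same sample sentence and shows that omitting the g flag raises a TypeError. The variable names are illustrative.

```javascript
const wordPattern = /\b\p{Alphabetic}+\b/gu;
const sample = 'matchAll() method test string';

// Array.from accepts any iterable plus a mapping function.
const located = Array.from(sample.matchAll(wordPattern), m => `${m[0]}@${m.index}`);
// located => ['matchAll@0', 'method@11', 'test@18', 'string@23']

let rejected = false;
try {
  sample.matchAll(/\p{Alphabetic}+/u); // no g flag
} catch (error) {
  rejected = error instanceof TypeError;
}
// rejected => true
```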
The split() method breaks a string into an array of substrings. It can take a regular expression as its argument to specify a general separator. If capturing groups are used with the regular expression, then the text that matches the capturing groups will also be included in the returned array.
'1, 2, 3,\n4, 5'.split(', '); // => ['1', '2', '3,\n4', '5']
'1, 2, 3,\n4, 5'.split(/\s*,\s*/); // => ['1', '2', '3', '4', '5']
const htmlTag = /<([^>]+)>/;
'Test<br />1,2,3'.split(htmlTag); // => ['Test', 'br /', '1,2,3']
The RegExp class
The RegExp() constructor expects a string that contains the body of a regular expression as its first argument and, optionally, a string of regular expression flags as its second argument. Note that both string literals and regular expressions use the backslash \ for escape sequences, so you must write each backslash of the regular expression body as \\ when you provide the body as a string literal.
let zipcode = new RegExp('\\d{5}', 'g');
zipcode.source; // => '\\d{5}'
zipcode.flags; // => 'g'
zipcode.global; // => true
zipcode.ignoreCase; // => false
zipcode.multiline; // => false
zipcode.dotAll; // => false
zipcode.unicode; // => false
zipcode.sticky; // => false
zipcode.lastIndex; // => 0
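The constructor is mainly useful when the pattern is not known until run time. In the sketch below, wholeWord is a hypothetical helper and assumes its argument contains only letters and digits, so that nothing in it needs escaping.

```javascript
// Hypothetical helper: build a case-insensitive whole-word pattern at run time.
function wholeWord(word) {
  return new RegExp('\\b' + word + '\\b', 'i');
}

wholeWord('java').source;              // => '\\bjava\\b', that is, the pattern \bjava\b
wholeWord('java').test('I like Java'); // => true
wholeWord('java').test('JavaScript');  // => false

// Forgetting to double the backslash silently changes the pattern:
new RegExp('\d{5}').source;        // => 'd{5}': the string literal '\d' is just 'd'
new RegExp('\d{5}').test('12345'); // => false
```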
The exec() method takes a single string argument and returns an array just like the one match() returns for non-global searches, or null if there is no match: the first element corresponds to the matched part of the string, and any subsequent array elements contain the substrings that matched any capturing groups. The index property contains the character position at which the match occurred. The input property specifies the string that was searched. The groups property refers to an object that holds the substrings matching named capturing groups. When the regular expression has the g flag, each call to exec() resumes the search at lastIndex, so you can call it repeatedly to loop through all matches.
let pattern = /Java/g;
let txt = 'JavaScript > Java';
let bingo;
while ((bingo = pattern.exec(txt)) !== null) {
  console.log(`Matched ${bingo[0]} at ${bingo.index}`);
  console.log(`Next search begins at ${pattern.lastIndex}`);
}
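The same exec() loop can be wrapped in a function that collects results instead of logging them. This is a sketch under stated assumptions: findAllIndexes is a hypothetical name, and the two guards are additions that keep the loop from running forever when the pattern lacks the g flag or matches an empty string.

```javascript
// Hypothetical helper: collect the start index of every match of a global pattern.
function findAllIndexes(globalPattern, text) {
  if (!globalPattern.global) {
    throw new TypeError('findAllIndexes requires a pattern with the g flag');
  }
  const indexes = [];
  globalPattern.lastIndex = 0; // start from the beginning of the text
  let found;
  while ((found = globalPattern.exec(text)) !== null) {
    indexes.push(found.index);
    if (found[0] === '') {
      globalPattern.lastIndex++; // an empty match does not advance lastIndex by itself
    }
  }
  return indexes;
}

findAllIndexes(/Java/g, 'JavaScript > Java'); // => [0, 13]
```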
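The test() method takes a single string argument and returns true if the string matches the pattern or false otherwise. When the test() and exec() methods are used with the global flag g or the sticky flag y, their behavior depends on the value of the lastIndex property of the RegExp object: the search starts at lastIndex, which is updated after each successful match and reset to 0 after a failed one. Reusing such a regular expression on different strings can therefore skip matches.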
let dictionary = ['apple', 'book', 'coffee'];
let doubleLetterWords = [];
let doubleLetter = /(\w)\1/g;
for (let word of dictionary) {
  if (doubleLetter.test(word)) {
    doubleLetterWords.push(word);
  }
}
doubleLetterWords; // => ['apple', 'coffee']: 'book' is missing!
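The word 'book' is skipped because of the g flag: after the match in 'apple', lastIndex is 3, so the search in 'book' starts at index 3 and never sees 'oo'. Two possible fixes are sketched below; the variable names are illustrative.

```javascript
const wordList = ['apple', 'book', 'coffee'];

// Fix 1: drop the g flag, so test() always starts at index 0.
const doubleLetterOnce = /(\w)\1/;
const withoutGlobal = wordList.filter(word => doubleLetterOnce.test(word));
// withoutGlobal => ['apple', 'book', 'coffee']

// Fix 2: keep the g flag but reset lastIndex before every test.
const doubleLetterGlobal = /(\w)\1/g;
const withReset = [];
for (const word of wordList) {
  doubleLetterGlobal.lastIndex = 0;
  if (doubleLetterGlobal.test(word)) {
    withReset.push(word);
  }
}
// withReset => ['apple', 'book', 'coffee']
```

The first fix is usually simpler, since test() only answers whether a match exists and gains nothing from the g flag.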