regular-expressions.md 77332 bytes | 2007-06-12 00:00

Regular Expressions: Pattern Matching Guide

Regular expressions (regex) are a compact way to find patterns in strings. They are useful for validation (checking that data is in a particular format) and for text processing. Compilers use them during lexical analysis to split source code into tokens, and web developers use them for everything from email validation to URL parsing.

What Are Regular Expressions?

At their core, regular expressions are a mini-language for describing text patterns. Instead of searching for one exact string, you describe the shape of the text you want, and the regex engine finds every part of the input that fits.

Character Classes

Character classes let you match specific types of characters:

Predefined Character Classes

  • \d - Matches any digit (0-9)
  • \D - Matches any non-digit
  • \w - Matches any word character (letters, digits, underscore)
  • \W - Matches any non-word character
  • \s - Matches any whitespace (spaces, tabs, newlines)
  • \S - Matches any non-whitespace

Custom Character Classes

  • [aeiou] - Matches any lowercase vowel
  • [0-9] - Matches any digit (same as \d)
  • [a-z] - Matches any lowercase letter
  • [A-Z] - Matches any uppercase letter
  • [0-35-9] - Matches digits 0-3 or 5-9 (excludes 4)

Negated Character Classes

  • [^4] - Matches any character except 4
  • [^aeiou] - Matches any character that is not a lowercase vowel (consonants, but also digits, spaces, and punctuation)
  • [^0-9] - Matches any non-digit (same as \D)
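To see the character classes above in action, here is a minimal sketch using Python's built-in re module. The sample strings and variable names are made up for illustration.

```python
import re

text = "Room 42, Floor 7"

# \d+ finds runs of digits; [aeiou] finds lowercase vowels only.
digits = re.findall(r"\d+", text)
vowels = re.findall(r"[aeiou]", text)

# [0-35-9] accepts every digit except 4.
no_fours = re.findall(r"[0-35-9]", "3456")

# A negated class matches anything not listed, including spaces and punctuation.
non_digits = re.findall(r"[^0-9]+", "a1-b2 c3")

print(digits)      # ['42', '7']
print(vowels)      # ['o', 'o', 'o', 'o']
print(no_fours)    # ['3', '5', '6']
print(non_digits)  # ['a', '-b', ' c']
```

Notice that the negated class happily picks up the hyphen and the space, which is why [^aeiou] is broader than "consonants".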

Quantifiers

Quantifiers specify how many times a pattern should match:

  • * - Matches zero or more occurrences
  • + - Matches one or more occurrences
  • ? - Matches zero or one occurrence (optional)
  • {n} - Matches exactly n occurrences
  • {n,} - Matches at least n occurrences
  • {n,m} - Matches between n and m occurrences (inclusive)

Quantifier Examples

  • A* - Matches “”, “A”, “AA”, “AAA”, etc.
  • A+ - Matches “A”, “AA”, “AAA”, etc. (but not empty string)
  • A? - Matches “” or “A”
  • A{3} - Matches exactly “AAA”
  • A{2,4} - Matches “AA”, “AAA”, or “AAAA”
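The quantifier examples above can be checked with a short Python sketch. The helper name matches_whole is hypothetical; it uses re.fullmatch so that the pattern has to cover the entire string rather than just part of it.

```python
import re


def matches_whole(pattern: str, s: str) -> bool:
    """Return True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


print(matches_whole(r"A*", ""))          # True: zero occurrences is allowed
print(matches_whole(r"A+", ""))          # False: at least one A is required
print(matches_whole(r"A?", "AA"))        # False: at most one A
print(matches_whole(r"A{3}", "AAA"))     # True
print(matches_whole(r"A{2,4}", "AAAAA")) # False: five is more than four
```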

Special Characters

The Dot (.)

  • . - Matches any single character except newline
  • .* - Matches any number of characters (except newlines), including none
  • .+ - Matches one or more of any character (except newlines)

Anchors

  • ^ - Matches the beginning of a string (or of each line, in multiline mode)
  • $ - Matches the end of a string (or of each line, in multiline mode)
  • ^A - String must start with “A”
  • Z$ - String must end with “Z”
  • ^A.*Z$ - String starts with “A” and ends with “Z”
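A small Python sketch of the anchor patterns above, assuming the default (non-multiline) mode. The compiled pattern names are only for illustration.

```python
import re

starts_with_a = re.compile(r"^A")
ends_with_z = re.compile(r"Z$")
a_to_z = re.compile(r"^A.*Z$")

# search() scans the string, so the anchors are what pin the match in place.
print(starts_with_a.search("ABC") is not None)   # True
print(starts_with_a.search("BAC") is not None)   # False: the A is not at the start
print(ends_with_z.search("XYZ") is not None)     # True
print(a_to_z.search("A to Z") is not None)       # True
print(a_to_z.search("A to Z!") is not None)      # False: the string ends with "!"
```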

Practical Examples

Email Validation (Basic)

\w+@\w+\.\w+

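Matches: user@domain.com. Because \w excludes dots and hyphens, the pattern rejects valid addresses such as first.last@domain.com when the whole string must match.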

Phone Number (US Format)

\d{3}-\d{3}-\d{4}

Matches: 555-123-4567

Postal Code (Canadian)

[A-Z]\d[A-Z] \d[A-Z]\d

Matches: T2P 1J9, V6B 2W9

Postal Code (Canadian, Flexible)

[A-Z]\d[A-Z]\s?\d[A-Z]\d

Matches: T2P1J9 or T2P 1J9
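Here is one way the flexible postal code pattern might be wrapped in a validation function in Python. The function name is_postal_code is hypothetical, and the check only covers the shape of the code, not whether the code actually exists.

```python
import re

POSTAL_CODE = re.compile(r"[A-Z]\d[A-Z]\s?\d[A-Z]\d")


def is_postal_code(s: str) -> bool:
    """Check that the whole string has the letter-digit-letter digit-letter-digit shape."""
    return POSTAL_CODE.fullmatch(s) is not None


print(is_postal_code("T2P 1J9"))   # True
print(is_postal_code("T2P1J9"))    # True: the space is optional
print(is_postal_code("t2p 1j9"))   # False: [A-Z] is uppercase only
print(is_postal_code("T2P  1J9"))  # False: \s? allows at most one whitespace character
```

Using fullmatch instead of search keeps something like "xxT2P 1J9xx" from slipping through.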

Finding Words

\b\w+\b

Matches individual words (using word boundaries)
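A quick Python sketch of the word pattern, with a made-up sentence. It also shows a wrinkle worth knowing about: an apostrophe is not a word character, so contractions get split.

```python
import re

sentence = "Regex isn't magic, it's practice."

# \b marks the edge between a word character and a non-word character.
words = re.findall(r"\b\w+\b", sentence)

print(words)  # ['Regex', 'isn', 't', 'magic', 'it', 's', 'practice']
```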

IP Address (Simple)

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Matches: 192.168.1.1 (and also out-of-range values such as 999.999.999.999, since the pattern only checks the shape)
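Since the simple pattern only checks for four groups of digits, one possible approach is to let the regex check the shape and plain code check the numeric range. This is a sketch, and the function name is_ipv4 is hypothetical.

```python
import re

IPV4_SHAPE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")


def is_ipv4(s: str) -> bool:
    """Regex checks the dotted shape; plain code checks each part is 0-255."""
    match = IPV4_SHAPE.fullmatch(s)
    if match is None:
        return False
    return all(int(part) <= 255 for part in match.groups())


print(is_ipv4("192.168.1.1"))  # True
print(is_ipv4("999.1.1.1"))    # False: right shape, but 999 is out of range
print(is_ipv4("1.2.3"))        # False: only three groups
```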

Hexadecimal Colors

#[0-9A-Fa-f]{6}

Matches: #FF5733, #a1b2c3

Date Format (MM/DD/YYYY)

\d{2}/\d{2}/\d{4}

Matches: 12/25/2007 (and also impossible dates such as 99/99/9999, since the pattern only checks the shape)
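A sketch of how the date pattern could be paired with a real date parser in Python, so that the regex handles the format and datetime handles the calendar. The function name parse_us_date is hypothetical.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{2}/\d{2}/\d{4}")


def parse_us_date(s: str):
    """Return a datetime for a valid MM/DD/YYYY string, or None otherwise."""
    if DATE_SHAPE.fullmatch(s) is None:
        return None
    try:
        return datetime.strptime(s, "%m/%d/%Y")
    except ValueError:
        # Right shape, but not a real calendar date.
        return None


print(parse_us_date("12/25/2007"))  # 2007-12-25 00:00:00
print(parse_us_date("13/45/2007"))  # None
print(parse_us_date("1/5/2007"))    # None: the pattern requires two digits
```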

Username Validation

^[a-zA-Z0-9_]{3,16}$

Matches usernames 3-16 characters, letters/numbers/underscore only

Tips for Writing Better Regex

Start Simple

Begin with basic patterns and build complexity gradually:

  1. \d (any digit)
  2. \d+ (one or more digits)
  3. \d{3} (exactly three digits)
  4. \d{3}-\d{3}-\d{4} (phone number pattern)

Test Your Patterns

Always test regex patterns with various inputs, including:

  • Valid examples that should match
  • Invalid examples that shouldn’t match
  • Edge cases (empty strings, very long strings)
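One lightweight way to follow this advice is a small table of inputs and expected results. The sketch below uses the phone number pattern from earlier; the test cases themselves are made up for illustration.

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

# (input, should it match?)
cases = [
    ("555-123-4567", True),   # valid example
    ("5551234567", False),    # missing dashes
    ("555-123-456", False),   # too short
    ("", False),              # empty string
    ("5" * 10_000, False),    # very long string
]

# Collect every case where the pattern disagrees with the expectation.
failures = [
    (s[:20], expected)
    for s, expected in cases
    if (PHONE.fullmatch(s) is not None) != expected
]

print(failures)  # []
```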

Be Specific

  • Use \d instead of [0-9] for digits
  • Use ^ and $ anchors to match entire strings
  • Consider word boundaries \b when matching whole words

Common Pattern Building Blocks

  • Start of string: ^pattern
  • End of string: pattern$
  • Entire string: ^pattern$
  • Optional group: (pattern)?
  • Either/or: (pattern1|pattern2)
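These building blocks combine naturally. The Python sketch below puts anchors, an either/or group, and an optional group into one made-up pattern; the pattern and test words are purely illustrative.

```python
import re

# ^ and $ pin the whole string, (cat|dog) is either/or, (s)? is an optional group.
PET = re.compile(r"^(cat|dog)(s)?$")

print(PET.search("cat") is not None)    # True
print(PET.search("dogs") is not None)   # True
print(PET.search("cats!") is not None)  # False: "!" sits between the "s" and the end
print(PET.search("bird") is not None)   # False: neither alternative matches
```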

Common Pitfalls

Greedy vs. Non-Greedy

  • .* is greedy (matches as much as possible)
  • .*? is non-greedy (matches as little as possible)
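The difference is easiest to see side by side. This Python sketch uses a made-up snippet of markup as the input.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b> it can find.
greedy = re.findall(r"<b>.*</b>", html)

# Non-greedy: .*? stops at the first </b> that lets the match succeed.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```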

Escaping Special Characters

To match literal special characters, escape them:

  • \. - Matches literal period
  • \* - Matches literal asterisk
  • \? - Matches literal question mark
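A short Python sketch of what goes wrong without escaping. It also shows re.escape, a standard library helper that escapes a literal string for you; the sample strings are made up.

```python
import re

# Unescaped, the dot matches any character, so "3x14" slips through.
unescaped_hit = re.search(r"3.14", "3x14") is not None

# Escaped, only a literal period will do.
escaped_hit = re.search(r"3\.14", "3x14") is not None

# re.escape builds a pattern that matches the given text literally.
literal = "1+1=2?"
round_trip = re.fullmatch(re.escape(literal), literal) is not None

print(unescaped_hit)  # True
print(escaped_hit)    # False
print(round_trip)     # True
```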

Case Sensitivity

Most regex engines are case-sensitive by default:

  • [a-z] - Only lowercase letters
  • [A-Za-z] - Both upper and lowercase
  • Use case-insensitive flags when available
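In Python, for example, the case-insensitive flag is re.IGNORECASE. A minimal sketch with a made-up input:

```python
import re

word = "Hello"

lower_only = re.fullmatch(r"[a-z]+", word) is not None
both_cases = re.fullmatch(r"[A-Za-z]+", word) is not None
with_flag = re.fullmatch(r"[a-z]+", word, re.IGNORECASE) is not None

print(lower_only)  # False: the capital H is not in [a-z]
print(both_cases)  # True
print(with_flag)   # True: the flag makes [a-z] accept uppercase too
```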

Security Considerations

ReDoS (Regular Expression Denial of Service)

Certain regex patterns can cause exponential backtracking, leading to performance issues or denial of service attacks.

Vulnerable Patterns

These patterns can be exploited with malicious input:

(a+)+
(a|a)*
(a|b)*a
^(a+)+$

How ReDoS Works

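When a backtracking engine runs a pattern such as ^(a+)+$ against input like aaaaaaaaaaaaaaaaaaaaX, it tries every way of splitting the run of a’s between the inner and outer quantifiers before it can report failure. Each extra a roughly doubles the work, consuming excessive CPU time.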

Safe Alternatives

  • Vulnerable: (a+)+
  • Safe: a+

  • Vulnerable: (a|b)*a
  • Safe: [ab]*a
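Before swapping a pattern for a simpler one, it is worth checking that both accept the same strings. The Python sketch below brute-forces that check over short inputs only (so nothing here runs long enough to hurt); the helper name same_language and the length limit are assumptions for illustration, and passing this check is evidence rather than proof.

```python
import re
from itertools import product

# (original pattern, simpler rewrite) pairs from the list above.
PAIRS = [
    (r"(a+)+", r"a+"),
    (r"(a|b)*a", r"[ab]*a"),
]


def same_language(p1: str, p2: str, alphabet: str = "ab", max_len: int = 6) -> bool:
    """Compare two patterns on every string over the alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (re.fullmatch(p1, s) is None) != (re.fullmatch(p2, s) is None):
                return False
    return True


all_equivalent = all(same_language(a, b) for a, b in PAIRS)
print(all_equivalent)  # True
```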

Input Validation Bypass

Regex for validation can sometimes be bypassed with unexpected input.

Common Bypass Techniques

  • Newline injection: Many regex engines treat ^ and $ as line boundaries, not string boundaries
  • Case sensitivity: [a-z] doesn’t match uppercase letters
  • Unicode issues: \w might not handle international characters as expected

Safer Validation Practices

  • Use \A and \z for true string start/end (language dependent)
  • Consider case-insensitive matching when appropriate
  • Test with various character encodings and special characters
  • Validate both format AND content length limits
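Here is a sketch of the newline problem in Python. One detail to be aware of: Python spells the strict end-of-string anchor \Z rather than \z, and re.fullmatch is another way to insist on the whole string. The payload strings are made up.

```python
import re

# In Python, $ also matches just before a trailing newline.
trailing_newline = re.search(r"^\d+$", "12345\n") is not None

# In multiline mode, ^ and $ match at every line, so one good line is enough.
multiline_bypass = re.search(r"^\d+$", "DROP TABLE\n12345", re.MULTILINE) is not None

# \A and \Z always mean the real start and end of the string.
strict_anchors = re.search(r"\A\d+\Z", "DROP TABLE\n12345", re.MULTILINE) is not None

# fullmatch requires the pattern to cover the entire input.
strict_fullmatch = re.fullmatch(r"\d+", "12345\n") is not None

print(trailing_newline)  # True
print(multiline_bypass)  # True
print(strict_anchors)    # False
print(strict_fullmatch)  # False
```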

Best Practices for Security

  1. Avoid complex nested quantifiers
  2. Test with long, malformed input
  3. Set timeouts for regex operations
  4. Use specific character classes instead of broad ones
  5. Validate input length before applying regex
  6. Consider using dedicated parsers for complex formats
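As a rough illustration of item 5, the Python sketch below rejects oversized input before the regex ever runs. The limit of 254 characters and the function name looks_like_email are assumptions chosen for the example, and the pattern is the basic email pattern from earlier.

```python
import re

MAX_INPUT_LENGTH = 254  # hypothetical limit; pick one that suits your data
EMAIL_SHAPE = re.compile(r"\w+@\w+\.\w+")


def looks_like_email(s: str) -> bool:
    """Cheap length check first, then the regex on input of bounded size."""
    if len(s) > MAX_INPUT_LENGTH:
        return False
    return EMAIL_SHAPE.fullmatch(s) is not None


print(looks_like_email("user@domain.com"))              # True
print(looks_like_email("a" * 10_000 + "@domain.com"))   # False: too long
print(looks_like_email("not an email"))                 # False
```

The length check costs almost nothing and puts an upper bound on how much work any pattern can be asked to do.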

When NOT to Use Regex

Regular expressions aren’t always the best tool:

  • Complex parsing (use proper parsers for HTML, XML, JSON)
  • Simple string operations (use built-in string methods)
  • Performance-critical code (regex can be slow on large inputs)
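For comparison, here is a Python sketch of the first two cases handled without any regex, using built-in string methods and the standard json module. The file name and JSON document are made up.

```python
import json

filename = "report.final.csv"

# Simple string operations: built-in methods are clearer than a pattern.
is_csv = filename.endswith(".csv")
stem, _, extension = filename.rpartition(".")

# Complex formats: a real parser handles nesting and escaping for you.
config = json.loads('{"retries": 3, "hosts": ["a", "b"]}')
retries = config["retries"]

print(is_csv)     # True
print(stem)       # report.final
print(extension)  # csv
print(retries)    # 3
```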

Remember: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” Use regex when appropriate, but don’t force it where simpler solutions exist.

Resources for Learning and Testing

Online Regex Testing

Rubular (https://rubular.com/) is an excellent online regex tester that provides:

  • Real-time pattern testing as you type
  • Clear highlighting of matches in your test string
  • Ruby-based regex engine (basic patterns carry over to most other languages, though some syntax differs between engines)
  • Instant feedback on pattern syntax errors
  • Ability to save and share regex patterns

Using Rubular Effectively

  1. Start with simple test strings - Enter basic examples of what you want to match
  2. Build patterns incrementally - Add one piece at a time and watch the matches update
  3. Test edge cases - Add test strings that should NOT match to verify your pattern
  4. Use the quick reference - Rubular provides a handy cheat sheet on the right side
  5. Save useful patterns - Bookmark or save patterns you’ll use again

Other Testing Resources

  • Online regex testers with different engines
  • Language-specific regex documentation
  • Practice with real-world examples
  • Start with simple patterns and gradually increase complexity

Regular expressions are incredibly powerful once you understand the basics. Practicing with real examples in a tool like Rubular makes learning much easier. Don’t be afraid to start simple and build up to more complex patterns.