Regex - Nejati Notes

### **Basic Characters**

- **`.`** (Dot): Matches any single character except a newline.
  - Example: `a.b` matches "acb" and "a+b", but not "ab" or "a\nb".
- **`\`** (Backslash): Escapes the next character, so that a character with special meaning in regex is matched literally.
  - Example: `\.` matches a literal ".", and `\\` matches a literal "\".

---

### **Character Classes**

- `\d`: Matches any digit (0-9). Equivalent to `[0-9]` in ASCII-only matching; Unicode-aware engines may also match other digit characters.
- `\D`: Matches any non-digit. Equivalent to `[^0-9]`.
- `\w`: Matches any word character (alphanumeric characters plus underscore). Equivalent to `[a-zA-Z0-9_]` in ASCII-only matching.
- `\W`: Matches any non-word character. Equivalent to `[^a-zA-Z0-9_]`.
- `\s`: Matches any whitespace character (spaces, tabs, newlines).
- `\S`: Matches any non-whitespace character.
- `[abc]`: Matches any single character within the brackets (a, b, or c).
- `[^abc]`: Matches any single character **not** within the brackets.
- `[a-z]`: Matches any lowercase letter from 'a' to 'z'.
- `[A-Z]`: Matches any uppercase letter from 'A' to 'Z'.
- `[0-9]`: Matches any digit from '0' to '9'.

---

### **Quantifiers**

Quantifiers specify how many times a character, group, or character class must be present in the input for a match to be found.

- **`*`**: Matches the previous element **zero or more** times.
  - Example: `ab*c` matches "ac", "abc", "abbc", "abbbc", etc.
- **`+`**: Matches the previous element **one or more** times.
  - Example: `ab+c` matches "abc" and "abbc", but not "ac".
- **`?`**: Matches the previous element **zero or one** time (makes it optional).
  - Example: `colou?r` matches "color" and "colour".
- **`{n}`**: Matches the previous element **exactly _n_** times.
  - Example: `a{3}` matches "aaa".
- **`{n,}`**: Matches the previous element **_n_ or more** times.
  - Example: `a{2,}` matches "aa", "aaa", "aaaa", etc.
- **`{n,m}`**: Matches the previous element **at least _n_** and **at most _m_** times.
  - Example: `a{2,4}` matches "aa", "aaa", "aaaa", but not "a" or "aaaaa" as a whole.

**Greedy vs. Lazy Quantifiers:**

By default, quantifiers are **greedy**, meaning they match as much text as possible. To make a quantifier **lazy** (match as little text as possible), add a `?` after it.

- `*?`: Matches zero or more times (lazy).
- `+?`: Matches one or more times (lazy).
- `??`: Matches zero or one time (lazy).
- `{n,}?`: Matches _n_ or more times (lazy).
- `{n,m}?`: Matches between _n_ and _m_ times (lazy).
  - Example: `<.+>` (greedy) on `<a><b>` matches `<a><b>`.
  - Example: `<.+?>` (lazy) on `<a><b>` matches `<a>` (and then `<b>` if searched again).

---

### **Anchors and Boundaries**

Anchors assert something about the position in the string rather than matching characters.

- **`^`**: Matches the beginning of the string (or the beginning of a line if the multiline flag is enabled).
  - Example: `^abc` matches "abc" only if it's at the start of the string.
- **`$`**: Matches the end of the string (or the end of a line if the multiline flag is enabled).
  - Example: `xyz$` matches "xyz" only if it's at the end of the string.
- **`\b`**: Matches a word boundary (the position between a word character and a non-word character, or at the start/end of a string if the first/last character is a word character).
  - Example: `\bcat\b` matches "cat" in "the cat sat" but not in "caterpillar".
- **`\B`**: Matches a non-word boundary (any position that is not a word boundary).
  - Example: `\Bcat\B` matches "cat" in "concatenate" but not in "the cat sat" or "caterpillar", where "cat" starts at a word boundary.

---

### **Grouping and Capturing**

- **`( )`**: Groups multiple tokens together and creates a capturing group. The matched content can be referred to later.
  - Example: `(abc)+` matches "abc", "abcabc", etc. The captured group would be "abc".
- **`\1`, `\2`, etc.**: Backreferences. Match the text captured by the Nth capturing group.
  - Example: `(a)b\1` matches "aba".
- **`(?: )`**: Non-capturing group. Groups tokens but does not create a capturing group. This is useful for applying quantifiers to a group of characters without needing to capture the result.
  - Example: `(?:abc)+` matches "abc", "abcabc", but "abc" is not captured.
- **`|`**: Alternation (OR operator). Matches either the expression before or the expression after the pipe.
  - Example: `cat|dog` matches "cat" or "dog".

---

### **Lookarounds**

Lookarounds are zero-width assertions; they check for a pattern but don't include it in the match.

- **`(?=...)`**: Positive lookahead. Asserts that the characters following the current position match the pattern inside the lookahead, but doesn't consume those characters.
  - Example: `Windows(?=95|98|NT|2000)` matches "Windows" only if it's followed by "95", "98", "NT", or "2000".
- **`(?!...)`**: Negative lookahead. Asserts that the characters following the current position do **not** match the pattern inside the lookahead.
  - Example: `Windows(?!XP|Vista)` matches "Windows" only if it's **not** followed by "XP" or "Vista".
- **`(?<=...)`**: Positive lookbehind.
  Asserts that the characters preceding the current position match the pattern inside the lookbehind. (Note: Many regex engines require lookbehind patterns to have a fixed length.)
  - Example: `(?<=USD)\d+` matches numbers that are preceded by "USD".
- **`(?<!...)`**: Negative lookbehind. Asserts that the characters preceding the current position do **not** match the pattern inside the lookbehind. (Note: A fixed length is often required.)
  - Example: `(?<!EUR)\d+` matches runs of digits that do **not** start immediately after "EUR". Note that on "EUR100" it still matches "00", because only the first digit is directly preceded by "EUR".

---

### **Flags (Modifiers)**

Flags change how the regex engine interprets the pattern. The way to specify flags varies between programming languages and tools (e.g., `/pattern/flags` in JavaScript or Perl, `re.compile(pattern, flags)` in Python).

- **`i`**: Case-insensitive matching.
- **`g`**: Global search (find all matches rather than stopping after the first match). Not every language has this flag; Python, for example, uses `re.findall` or `re.finditer` instead.
- **`m`**: Multiline mode. `^` and `$` match the start/end of each line, not just the start/end of the entire string.
- **`s`** (dotall): Dotall mode. The dot (`.`) matches any character, _including_ newlines. (Support and naming for this flag vary between engines.)
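
To see several of these rules in action, here is a minimal sketch using Python's standard `re` module. The sample strings and variable names are made up for illustration, and other engines may differ in details such as flag spelling or lookbehind support.

```python
import re

# Greedy vs. lazy: the same input, with and without the trailing "?".
greedy = re.findall(r"<.+>", "<a><b>")    # one long match
lazy = re.findall(r"<.+?>", "<a><b>")     # two short matches

# Word boundaries: \b needs a word/non-word transition, \B forbids one.
standalone = re.findall(r"\bcat\b", "the cat sat on a caterpillar")
inner = re.findall(r"\Bcat\B", "concatenate caterpillar cat")

# Backreference: \1 must repeat exactly what group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

# Lookarounds check context without consuming it.
ahead = re.findall(r"Windows(?=95|98|NT|2000)", "Windows95 WindowsXP Windows2000")
behind = re.findall(r"(?<=USD)\d+", "USD100 EUR200")
# The negative lookbehind only blocks a match that STARTS right after "EUR",
# so the tail of "200" is still found.
not_behind = re.findall(r"(?<!EUR)\d+", "USD100 EUR200")

# Flags: Python passes them as constants rather than /pattern/flags.
text = "first line\nSecond LINE"
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
any_case = re.findall(r"line", text, flags=re.IGNORECASE)
dot_default = re.search(r"first.*LINE", text)               # "." stops at "\n"
dot_all = re.search(r"first.*LINE", text, flags=re.DOTALL)  # "." crosses "\n"

print(greedy, lazy)
print(standalone, inner, doubled.group(1))
print(ahead, behind, not_behind)
print(line_starts, any_case, dot_default, dot_all.group())
```

Python has no `g` flag, so `re.findall` plays that role here; `re.search` corresponds to stopping at the first match.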