Concepts
Mechanics
- `.` any character (newlines excluded by default)
- `\` escapes special characters
- characters (`\d` digits, `\w` word (i.e. letter/digit/underscore), `\s` whitespace)
- `[]` character classes (define rules for which characters are accepted, unlike the `.` wildcard)
- `[3-7]` a hyphen inside the `[]` brackets specifies a range, here meaning the same as `[34567]`
- `[^ ...]` is the mirror of it, excluding the listed characters
- `|` choices (think of it as OR)
- Complement (i.e. everything but) versions are capitalized, such as `\D`, which is everything that is not a `\d`
- whitespaces (`\n` newline, `\t` tab)
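The building blocks above can be tried out directly. The following is a minimal sketch using Python's `re` module; the sample strings and variable names are made up for illustration and are not from the original notes.

```python
import re

text = "Room 42, floor 7\tcode: a_b"

# '.' matches any single character, so "4." picks up "42".
any_char = re.findall(r"4.", text)

# A backslash turns the special '.' into a literal dot.
literal_dot = re.findall(r"\.", "v1.2")

# Shorthand classes and their capitalized complements.
digits = re.findall(r"\d", text)
non_digit_runs = re.findall(r"\D+", "12ab34")
words = re.findall(r"\w+", "a_b c-d")
pieces = re.split(r"\s+", "a \t b\nc")

# Character classes: a range, and its negation with '^'.
in_range = re.findall(r"[3-7]", "0123456789")
not_in_range = re.findall(r"[^3-7]", "0123456789")

# '|' chooses between alternatives.
either = re.findall(r"cat|dog", "cat, dog, bird")

print(any_char, literal_dot, digits, non_digit_runs)
print(words, pieces, in_range, not_in_range, either)
```

Note that `\w` treats the underscore as a word character, which is why `a_b` stays in one piece while `c-d` is split.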
Modifiers
- repetition quantifiers (`?` 0 or 1 times, `+` at least once, `*` any number of times, `{how many times to match}`, e.g. `{3}`)
- `(? ...)` inline modifiers alter behaviors such as how newlines and case sensitivity are handled, whether `(...)` captures or just groups, and how comments within patterns are handled
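A short sketch of the quantifiers and two inline modifiers, again assuming Python's `re` module. The sample inputs are invented; `(?i)` (ignore case) and `(?x)` (verbose mode, which allows comments inside the pattern) are used here as examples of inline modifiers.

```python
import re

# '?' makes the preceding 'u' optional (0 or 1 times).
optional = re.findall(r"colou?r", "color colour colouur")

# '+' needs at least one 'b'; '*' accepts none at all.
one_or_more = re.findall(r"ab+", "a ab abbb")
zero_or_more = re.findall(r"ab*", "a ab abbb")

# '{3}' asks for exactly three digits in a row.
exactly_three = re.findall(r"\d{3}", "12 345 6789")

# '(?i)' at the start of the pattern ignores case.
any_case = re.findall(r"(?i)hello", "Hello HELLO hello")

# '(?x)' ignores whitespace in the pattern and allows '#' comments.
commented = re.findall(r"(?x) \d+  # one or more digits", "a1 b22")

print(optional, one_or_more, zero_or_more)
print(exactly_three, any_case, commented)
```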
Positioning rules
- anchors (`^` begins with, `$` ends with)
- `\b` word boundary
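To see how positioning rules constrain where a match may occur, here is a small sketch in Python's `re` module. The sample strings are hypothetical.

```python
import re

line = "Error: disk full"

# '^' only matches at the beginning, '$' only at the end.
starts_with_error = re.search(r"^Error", line) is not None
ends_with_full = re.search(r"full$", line) is not None
starts_with_disk = re.search(r"^disk", line) is not None

# '\b' keeps "cat" from matching inside "concat" or "cats".
sentence = "cat concat cats cat."
whole_words = re.findall(r"\bcat\b", sentence)
anywhere = re.findall(r"cat", sentence)

print(starts_with_error, ends_with_full, starts_with_disk)
print(len(whole_words), len(anywhere))
```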
Output behavior
- `(...)` capturing group
- `(?: ...)` non-capturing group
- `\(index)` content of previously matched groups/chunks, referred to by index. Used in a replacement string, this feature generates derived new content instead of just extracting.
- `(?( = | <= | ! | <! ) ...assertion...)` lookarounds check for the contents of `...assertion...` before/after the pattern without making them part of the match, so the asserted text stays out of your capture results.
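The group behaviors above can be sketched as follows with Python's `re` module. The date string, the title pattern, and the variable names are illustrative assumptions.

```python
import re

date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Each '(...)' becomes a numbered capture.
parts = re.search(date_pattern, "Released 2024-03-15").groups()

# '(?:...)' groups the alternatives without creating a capture.
name = re.search(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace").groups()

# '\1' inside a pattern must repeat what group 1 matched (doubled words).
doubled = re.findall(r"\b(\w+) \1\b", "it is is the the end")

# In a replacement string, '\3/\2/\1' builds new text from the captures.
reordered = re.sub(date_pattern, r"\3/\2/\1", "2024-03-15")

print(parts, name, doubled, reordered)
```

The last call is the "derived new content" case: nothing is extracted, the captured pieces are rearranged into a new string.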
`(?s)` Also match newline characters (‘single-line’ or DOTALL mode)
Starting a pattern with the `(?s)` flag (also called an inline modifier) expands the `.` (dot) single-character pattern to ALSO match newline characters, so a match can span multiple lines (it cannot by default).
Useful for extracting the contents of HTML blocks blindly and post-processing them elsewhere.
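A minimal sketch of the difference, assuming Python's `re` module and a made-up two-line HTML snippet:

```python
import re

html = "<p>first line\nsecond line</p>"

# Without (?s) the dot stops at the newline, so nothing matches.
without_flag = re.search(r"<p>(.*)</p>", html)

# With (?s) the dot also consumes the newline.
with_flag = re.search(r"(?s)<p>(.*)</p>", html)
paragraph = with_flag.group(1)

print(without_flag)
print(repr(paragraph))
```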
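`(?m)` Pattern starts over as a new string for each line (‘multi-line’ mode)
Starting a pattern with the `(?m)` flag tells the anchors `^` (begins with) and `$` (ends with) to match at the start and end of each line, not only at the start and end of the whole string.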
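As an illustration of multi-line mode, here is a sketch in Python's `re` module that pulls error lines out of a hypothetical log string:

```python
import re

log = "INFO start\nERROR disk full\nINFO done\nERROR timeout"

# Without (?m), '^' only matches at the very start of the string,
# which begins with "INFO", so no line is found.
without_flag = re.findall(r"^ERROR.*$", log)

# With (?m), '^' and '$' match at the start and end of every line.
with_flag = re.findall(r"(?m)^ERROR.*$", log)

print(without_flag)
print(with_flag)
```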
Assertions: use lookarounds to skip (not capture) patterns
`(?( = | <= | ! | <! ) assertion pattern)`
- `<` is lookbehind, no prefix character is lookahead. `-ahead`/`-behind` refers to WHERE the assertion pattern sits relative to what you want TO CAPTURE: a lookahead checks the text that follows the capture, a lookbehind checks the text that precedes it. The text you assert (match and throw away) goes inside the `(? ...)`.
- `=` (positive) asserts that the pattern inside the lookaround bracket matches; `!` (negative) asserts that the pattern inside the lookaround bracket MUST NOT match.
Assertions are very useful for getting straight to the meat you really want to capture, rather than sifting through patterns that you introduced solely for making assertions and intended to throw away.
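The four lookaround forms can be sketched like this, assuming Python's `re` module (which requires lookbehind patterns to have a fixed width). The sample strings are invented for illustration.

```python
import re

text = "cost: $40, weight: 40kg, tax: $5"

# Positive lookbehind: digits that come right after a '$'.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Positive lookahead: digits that are followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", text)

# Negative lookbehind: whole numbers NOT preceded by '$'.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Negative lookahead: whole words NOT followed by ':'.
not_before_colon = re.findall(r"\b[a-z]+\b(?!:)", "status: ok")

print(after_dollar, before_kg, not_after_dollar, not_before_colon)
```

In every case the `$`, `kg`, or `:` is checked but never shows up in the results, which is the point of using an assertion instead of a group.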
Extract HTML block
`(?ms)(?<= starting tag pattern) body pattern (?= terminating tag pattern)`
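One way to fill in this template is sketched below with Python's `re` module. The sample page, the `<div>` tags, and the lazy `.*?` body pattern are assumptions chosen for the example. Python only accepts fixed-width lookbehind patterns, so the starting tag here is a literal string.

```python
import re

page = """<html>
<div>
  <p>Hello</p>
  <p>World</p>
</div>
</html>"""

# (?ms)          : '.' matches newlines, '^' matches at each line start
# (?<=<div>\n)   : starting tag, asserted but not captured
# .*?            : body, as little as possible (assumed body pattern)
# (?=^</div>)    : terminating tag at the start of a line, not captured
block_pattern = r"(?ms)(?<=<div>\n).*?(?=^</div>)"

body = re.search(block_pattern, page).group(0)
print(body)
```

The result contains only the lines between the tags, ready to be post-processed elsewhere. A proper HTML parser is the safer choice for anything beyond blind extraction of simple, non-nested blocks.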