How I'm writing & maintaining complex regular expressions (RegEx)

New idea

Posted on April 18, 2022

RegEx is notoriously hard¹ to write and maintain once it passes a certain threshold of complexity.

While building a regular expression to parse WordPress shortcodes, I went way beyond the simplicity boundary and had to come up with a few tricks to make the code easier to write and simpler to maintain:

Use named capture groups
Split the expression into smaller sub-expressions
Use extensive, recipe-like documentation of how each sub-part works

Note

How to use these tricks depends on each language's RegEx implementation. Not all of them have named capture groups.

Take the raw RegEx below:

// Good luck debugging this 10+ min after you write it
const myMagicalRegex =
  /\[([a-z][\w\d_\*-]{1,})(?:\s{1,}([^\]]{0,})){0,1}\](?:((?:(?!\[\/\1\]).)*)\[\/\1\]){0,1}/gmu;

I'm convinced that even those who work daily with complex regular expressions will have a hard time reading / parsing this. Here's the alternative, implementing the 3 principles above:

/**
 * Shortcode names are made up of lowercase words/digits/asterisks, separated by hyphens and underscores.
 *
 * Rules:
 *  - no whitespace
 *  - can include words, digits, hyphens, underscores & asterisks
 *  - must start with a lowercased letter
 *
 * Examples: popup, wcf7-contact-form,  vc_columns,  text*
 */
const nameExp = /(?<name>[a-z][\w\d_\*-]{1,})/; // Group #1
// 👆👆👆 Example of a named capture group

/**
 * Attributes come in all shapes and sizes, but as a group they must always be preceded by a whitespace.
 * Right now I'm accepting anything that isn't a closing bracket "]".
 * @TODO: how to handle closing brackets in attributes' values?
 *
 * Examples:
 *  - Strings: title="Not" padding_bottom="0px"
 *  - Numbers: width=100 id=30
 *  - Numbers as strings: width="100" id="30"
 *  - Attribute name IS the value: [textarea mensagem] [text* your-name]
 */
const attributesExp = /(?:\s{1,}(?<attributes>[^\]]{0,})){0,1}/; // Group #2

/**
 * Include everything inside the shortcode, if any.
 *
 * Breaking it down:
 * 1. we wrap it in a non-capturing group (?:) to check how many times the full match happens.
 *  - (?:ACTUAL_CONTENT_REGEX){0,1}
 *  - Content can occur either 0 or 1 times as not all shortcodes have content.
 * 2. The actual content is wrapped in a named capture group (?<content>)
 *  - The content will be whatever (`.`) isn't the closing shortcode bracket (see 3. below)
 *  - The usage of negative lookahead (?!) is explained here: https://stackoverflow.com/a/8057827
 *  - @TODO: better understand this portion
 * 3. The content *must* finish with the closing shortcode bracket
 *  - we backreference the captured "name" group (group #1 above -> \1)
 *  - as we know the bracket will close like [/shortcode-name], we can test for \[\/\1\]
 */
const contentExp = /(?:(?<content>(?:(?!\[\/\1\]).)*)\[\/\1\]){0,1}/; // Group #3

// 👇👇👇 The final shortcode is built from 3 manageable sub-expressions
const shortcodeRegExp = new RegExp(
  `\\[${nameExp.source}${attributesExp.source}\\]${contentExp.source}`,
  "gum"
);
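
To see what the named groups give you, here is a minimal sketch that runs the assembled expression over a made-up string. The sample input and the `found` variable are my own assumptions for illustration; the sub-expressions are repeated (without their comments) only so the snippet runs on its own.

```javascript
// Same sub-expressions as above, with the documentation comments trimmed.
const nameExp = /(?<name>[a-z][\w\d_\*-]{1,})/;
const attributesExp = /(?:\s{1,}(?<attributes>[^\]]{0,})){0,1}/;
const contentExp = /(?:(?<content>(?:(?!\[\/\1\]).)*)\[\/\1\]){0,1}/;

const shortcodeRegExp = new RegExp(
  `\\[${nameExp.source}${attributesExp.source}\\]${contentExp.source}`,
  "gum"
);

// Hypothetical sample input, not taken from a real WordPress export.
const sample = 'Intro [popup title="Hi"]Hello[/popup] and [text* your-name]';

// Each match exposes its named groups on `match.groups`.
const found = [...sample.matchAll(shortcodeRegExp)].map((m) => ({ ...m.groups }));

console.log(found);
// First entry:  name "popup", attributes 'title="Hi"', content "Hello"
// Second entry: name "text*", attributes "your-name", content undefined
```

Groups that did not take part in the match (here, the content of a self-closing shortcode) come back as `undefined`, so consumers should be ready for that.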

Sure, it's extensive and isn't perfect. RegEx remains a big challenge for me, but now I feel more capable of tackling harder problems with it (and this confidence is bearing fruit!).

I ran into a couple of bugs with this expression a few days/weeks after the first implementation, and was able to fix them with relative ease. At least I know that, years from now, it'll be clear where to start debugging, and that's a huge win!
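
The negative lookahead inside the content sub-expression (the part flagged with a @TODO in its comment) is easier to reason about in isolation. The sketch below is my own simplification: it hardcodes a hypothetical `[/popup]` closing tag instead of the `\1` backreference, and compares the lookahead pattern with a plain greedy `.*`.

```javascript
// Hypothetical input with a nested tag and a second, stray closing tag.
const input = "Hello [b]world[/b][/popup] trailing [/popup]";

// At every position, (?!\[\/popup\]) first checks that the closing tag does NOT
// start here; only then does `.` consume one character. The repetition therefore
// stops right before the first closing tag.
const tempered = /^(?<content>(?:(?!\[\/popup\]).)*)\[\/popup\]/u.exec(input);

// A plain greedy .* consumes everything and backtracks to the LAST closing tag.
const greedy = /^(?<content>.*)\[\/popup\]/u.exec(input);

console.log(tempered.groups.content); // "Hello [b]world[/b]"
console.log(greedy.groups.content); // "Hello [b]world[/b][/popup] trailing "
```

One thing worth keeping in mind: `.` does not match line breaks unless the `s` (dotAll) flag is set, so shortcode content that spans several lines would likely need that flag added.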

Bonus tip for JS: when consuming the RegEx matches, use array destructuring to name the matches just like your capture groups:

"...".replace(
  shortcodeRegExp,
  (match, ...groups) => {
    // Keeping in mind the capture group order in the expression,
    // we know what each group is in the final match 👇👇👇
    const [name, rawAttributes, content] = groups;

    // Then we can use it freely
    const attributes = parseShortcodeAttributes(rawAttributes);
    const shortcode = { name, attributes, content };

    if (IsShortcode?.(shortcode) === false) return match;
    
    // ...
  }
);
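
An alternative that doesn't depend on group order: when the expression contains named groups, JavaScript passes a `groups` object as the last argument of the replacer function. Here is a rough sketch of that approach. The `renderShortcodes` function and its placeholder output format are hypothetical, purely for illustration.

```javascript
// The assembled expression from above, written as one literal so the snippet runs on its own.
const shortcodeRegExp =
  /\[(?<name>[a-z][\w\d_\*-]{1,})(?:\s{1,}(?<attributes>[^\]]{0,})){0,1}\](?:(?<content>(?:(?!\[\/\1\]).)*)\[\/\1\]){0,1}/gmu;

// Hypothetical renderer: swaps each shortcode for a small HTML-like placeholder.
function renderShortcodes(text) {
  return text.replace(shortcodeRegExp, (match, ...args) => {
    // With named groups present, the last replacer argument is the groups object.
    // Groups that didn't participate are undefined, hence the defaults.
    const { name, attributes = "", content = "" } = args[args.length - 1];
    return `<${name} data-attrs='${attributes}'>${content}</${name}>`;
  });
}

console.log(renderShortcodes('[popup title="Hi"]Hello[/popup]'));
// <popup data-attrs='title="Hi"'>Hello</popup>
```

Destructuring by name keeps the replacer working even if a new capture group is later inserted in the middle of the expression.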

If you have other tricks for writing, understanding and maintaining RegEx, do reach out!

Sidenote: are you migrating content off WordPress?

I'm building a tool to help with that, and would love to get together to speed up your work. If you're interested, get in touch: meet@hdoro.dev or hdorodev 😉

¹ I think I've seen a few jokes about that floating around, but right now I can only resort to common sense to drive the point home 😬