How I'm writing & maintaining complex regular expressions (RegEx)
RegEx is notoriously hard [1] to write and maintain after a certain threshold of complexity.
In building a regular expression to parse WordPress shortcodes, I went way beyond the simplicity boundary, and had to come up with a few tricks to make the code easier to write and simpler to maintain:
- Use named capture groups
- Split the expression into smaller sub-expressions
- Use extensive, recipe-like documentation of how each sub-part works
How to use these tricks depends on each language's RegEx implementation. Not all of them have named capture groups.
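For JavaScript (ES2018 and later), here is a minimal sketch of what a named capture group looks like in practice. The date pattern and the variable names are hypothetical and unrelated to the shortcode parser:

```javascript
// Hypothetical example: a date pattern, not part of the shortcode parser.
const dateExp = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

const match = dateExp.exec("Published on 2021-03-14");

// Named groups are exposed on `match.groups`...
const { year, month, day } = match.groups;

// ...while the same values remain available by position (index 0 is the full match).
const [, yearByIndex] = match;

console.log(year, month, day); // 2021 03 14
```

The names cost a few extra characters in the pattern, but they stay correct even if the order of the groups changes later.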
Take the raw RegEx below:
    // Good luck debugging this 10+ min after you write it
    const myMagicalRegex = /\[([a-z][\w\d_\*-]{1,})(?:\s{1,}([^\]]{0,})){0,1}\](?:((?:(?!\[\/\1\]).)*)\[\/\1\]){0,1}/gmu;
I'm convinced that even those who work daily with complex regular expressions will have a hard time reading / parsing this. Here's the alternative, implementing the 3 principles above:
    /**
     * Shortcode names are made up of lowercase words/digits/asterisks, separated by hyphens and underscores.
     *
     * Rules:
     * - no whitespace
     * - can include words, digits, hyphens, underscores & asterisks
     * - must start with a lowercased letter
     *
     * Examples: popup, wcf7-contact-form, vc_columns, text*
     */
    const nameExp = /(?<name>[a-z][\w\d_\*-]{1,})/; // Group #1
    // 👆👆👆 Example of a named capture group

    /**
     * Attributes come in all shapes and sizes, but as a group they must always be preceded by a whitespace.
     * Right now I'm accepting anything that isn't a closing bracket "]".
     * @TODO: how to handle closing brackets in attributes' values?
     *
     * Examples:
     * - Strings: title="Not" padding_bottom="0px"
     * - Numbers: width=100 id=30
     * - Numbers as strings: width="100" id="30"
     * - Attribute name IS the value: [textarea mensagem] [text* your-name]
     */
    const attributesExp = /(?:\s{1,}(?<attributes>[^\]]{0,})){0,1}/; // Group #2

    /**
     * Include everything inside the shortcode, if any.
     *
     * Breaking it down:
     * 1. we wrap it in a non-capturing group (?:) to check how many times the full match happens.
     *   - (?:ACTUAL_CONTENT_REGEX){0,1}
     *   - Content can occur either 0 or 1 times as not all shortcodes have content.
     * 2. The actual content is wrapped in a named capture group (?<content>)
     *   - The content will be whatever (`.`) isn't the closing shortcode bracket (see 3. below)
     *   - The usage of negative lookahead (?!) is explained here: https://stackoverflow.com/a/8057827
     *   - @TODO: better understand this portion
     * 3. The content *must* finish with the closing shortcode bracket
     *   - we backreference the captured "name" group (group #1 above -> \1)
     *   - as we know the bracket will close like [/shortcode-name], we can test for \[\/\1\]
     */
    const contentExp = /(?:(?<content>(?:(?!\[\/\1\]).)*)\[\/\1\]){0,1}/; // Group #3

    // 👇👇👇 The final shortcode is built from 3 manageable sub-expressions
    const shortcodeRegExp = new RegExp(
      `\\[${nameExp.source}${attributesExp.source}\\]${contentExp.source}`,
      "gum"
    );
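To see the composed expression at work, here is a small sketch that runs it over a sample string with `matchAll` and reads the named groups. The three sub-expressions are copied from the listing above with their comments stripped; the sample text and the `found` variable are made up for illustration:

```javascript
// Sub-expressions copied from the article, comments omitted for brevity.
const nameExp = /(?<name>[a-z][\w\d_\*-]{1,})/;
const attributesExp = /(?:\s{1,}(?<attributes>[^\]]{0,})){0,1}/;
const contentExp = /(?:(?<content>(?:(?!\[\/\1\]).)*)\[\/\1\]){0,1}/;

const shortcodeRegExp = new RegExp(
  `\\[${nameExp.source}${attributesExp.source}\\]${contentExp.source}`,
  "gum"
);

// Hypothetical sample input: one shortcode with content, one self-contained.
const sample = "Intro [popup id=30]Hello[/popup] and [text* your-name]";

// matchAll needs the "g" flag and yields one match object per shortcode.
const found = [...sample.matchAll(shortcodeRegExp)].map((m) => ({ ...m.groups }));

console.log(found);
// First entry:  name "popup", attributes "id=30",     content "Hello"
// Second entry: name "text*", attributes "your-name", content undefined
```

Note that `\1` only points at the `name` group once the pieces are joined, which is why `contentExp` is meaningful as part of the final expression rather than on its own.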
Sure, it's extensive and isn't perfect. RegEx remains a big challenge for me, but now I feel more capable of tackling harder problems with it (and this confidence is bearing fruit!).
I had a couple of bugs with this expression a few days/weeks after the first implementation, and was able to fix them with relative ease. At least I know that, years from now, it'll be clear where to start debugging, and that's a huge win!
Bonus tip for JS: when consuming the RegEx matches, use array destructuring to name matches just like your capture groups:
"...".replace( shortcodeRegExp, (match, ...groups) => { // Keeping in mind the capture group order in the expression, // we know what's each group in the final match 👇👇👇 const [name, rawAttributes, content] = groups; // Then we can use it freely const attributes = parseShortcodeAttributes(rawAttributes); const shortcode = { name, attributes, content }; if (IsShortcode?.(shortcode) === false) return match; // ... } );
If you have other tricks for writing, understanding and maintaining RegEx, do reach out!
Sidenote: are you migrating content off WordPress?
I'm building a tool to help with that, and would love to get together to speed up your work. Get in touch if you're interested: meet@hdoro.dev or hdorodev 😉
[1] I think I've seen a few jokes about that floating around, but right now I can only resort to common sense to drive the point home 😬