Writing plugins for remark and gatsby-transformer-remark (part 2)
Welcome to part two of my three-part tutorial on writing plugins for remark and gatsby-transformer-remark.
In part one, we created the well-tested functionality to fetch content from GitHub.
In this part, we’ll look at Abstract Syntax Trees (ASTs) and explore some of the fun things we can do with them.
Markdown AST
Practically speaking, an AST represents a piece of source code written in a particular language as a tree structure in which each node represents a language feature (variable definition, function call etc). This tree structure is produced by a parser based on that language’s grammar. A language can have multiple different AST formats and, conversely, multiple languages can share the same AST format. Here are a few examples:
- The transpiler Babel uses the babylon parser to parse JavaScript into a babylon AST.
- The linter ESLint uses the espree parser to parse JS into an ESTree AST.
- The minifier UglifyJS, interestingly enough, has its own parser and AST format. Is this the reason it is the fastest JS minifier out there?
- Flow and TypeScript, which are supersets of JavaScript that add static type annotations, can be parsed into either an ESTree AST (by flow-parser and typescript-eslint-parser) or a Babylon AST by the babylon parser. In addition, TypeScript has its own AST format.
- postcss (Babel for CSS) parses CSS into its own AST format.
- styled-components uses stylis to parse CSS with interpolated JS.
- Markdown has lots of different parsers. One that is popular in the JavaScript ecosystem is remark, which we will work with in this tutorial. remark’s AST format is MDAST.
If you think the list above is weighted heavily towards front-end development languages, the reason is that, in my experience, software developers in other languages don’t manipulate ASTs as often as we front-end developers do. In the vast majority of cases, browsers can only consume programs written in plain-text sources [1] instead of compiled byte code or machine code. As such, most, if not all, optimizations in front-end development must be done by transforming plain-text source code into more optimized plain-text source code [2] (which in turn requires manipulating ASTs more often) instead of making compilers produce more efficient byte code or machine code from an AST.
Exploring a simple AST
To get the hang of ASTs, I’ll show you a simple AST using a fabulous tool called AST Explorer.
Paste some code (JavaScript, TypeScript, Markdown etc) into the left panel and it’ll parse that code and show you the corresponding AST in the right panel.
You can click on any text in the left panel and the corresponding AST node will be highlighted in the right panel (and vice versa).
Coincidentally, because AST Explorer uses remark to parse Markdown, its output will help us keep a mental model while manipulating Markdown ASTs later.
Let’s look at the AST for a simple Markdown snippet
In the corresponding AST, we can see that the free-form text has been parsed into a tree structure that captures our intuitive understanding of how various Markdown formatting features translate to visual output. Here are a few examples:
We expect *italicized text* to be rendered as the italicized phrase “italicized text.”
Indeed, the corresponding AST node’s type is emphasis and its content is a single text node with the value of italicized text:

```json
{
  "type": "emphasis",
  "children": [
    {"type": "text", "value": "italicized text"}
  ]
}
```

You’ll see that most formatted texts have text nodes as their terminal children.
text nodes are leaves in the tree, meaning they have no children.
We expect # Hello to be rendered as an h1 heading. Indeed, its corresponding AST node has a type of heading and a depth (or level) of 1, with the heading text stored in a child text node:

```json
{
  "type": "heading",
  "depth": 1,
  "children": [
    {"type": "text", "value": "Hello"}
  ]
}
```
We expect the code snippet to be rendered as a JavaScript code block and, indeed, its AST node has a type of code and a lang (language) of javascript:

```json
{
  "type": "code",
  "lang": "javascript",
  "value": "console.log('!');"
}
```
Feel free to play around more with this tool.
If you want to dig deeper, the MDAST specification describes all the node types that remark understands natively and the relationships between them.
Remark plugin
Let’s take a short detour and talk about the structure of a remark plugin.
remark is really just the scaffolding on which plugins do their jobs.
remark’s core handles conversion between plain text Markdown sources and ASTs while all AST manipulations are performed by plugins.
The top-level export of a remark plugin must be a function, called an attacher, that can accept configuration options for that plugin from the user.
The attacher can perform some initialization based on these options and then return another function, called the transformer, which will perform all the heavy lifting.
During program execution, the transformer will receive the Markdown AST and mutate it (e.g. add/remove nodes, change node types etc) to achieve the desired output.
Although we’ll only examine how plugins can transform ASTs, they can also add new syntactic constructs to Markdown or new types of output (e.g. HTML).
For example, here is a bare-bones attacher:

```ts
// https://github.com/huy-nguyen/remark-github-plugin/blob/dcffd535/src/index.ts
import {transform} from './transform';
const attacher = () => {
  return transform;
};

export default attacher;
```

and a no-op transformer:

```ts
// https://github.com/huy-nguyen/remark-github-plugin/blob/dcffd535/src/transform.ts
export const transform = () => {

};
```
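To make the attacher/transformer split more concrete, here is a minimal, self-contained sketch that does not depend on remark at all. The names SketchNode, SketchOptions, sketchAttacher and the replacement option are made up for this illustration and are not part of the plugin we are building. It only shows the shape: options go into the attacher, and the returned transformer mutates the tree in place.

```typescript
// Minimal stand-in for a unist node; not the official type definitions.
interface SketchNode {
  type: string;
  value?: string;
  children?: SketchNode[];
}

// Hypothetical options for this illustration only.
interface SketchOptions {
  replacement: string;
}

// The attacher receives the user's options and returns the transformer.
const sketchAttacher = (options: SketchOptions) => {
  const {replacement} = options;

  // The transformer receives the tree and mutates it in place.
  const transformer = (tree: SketchNode): void => {
    for (const child of tree.children || []) {
      if (child.type === 'text') {
        child.value = replacement;
      }
    }
  };
  return transformer;
};

const sketchTree: SketchNode = {
  type: 'root',
  children: [
    {type: 'text', value: 'before'},
    {type: 'code', value: 'untouched'},
  ],
};

// Configure the plugin once, then run the transformer on a tree.
sketchAttacher({replacement: 'after'})(sketchTree);
console.log(JSON.stringify(sketchTree.children));
```

Only the top-level text node is rewritten here; the code node is left alone because the transformer checks the node type before mutating.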
First AST manipulation
To warm up, let’s perform a simple transformation: replacing every paragraph that consists solely of the marker GITHUB-EMBED with a short JavaScript code block containing const a = 1;.
Because ASTs can be a bit difficult to think about, a useful trick I usually employ when working with them is just to copy-paste the Markdown input and desired Markdown output into AST Explorer and compare them to determine a reasonably simple way to change the former into the latter. By doing this, I can see that I need to transform this AST node:
```json
{
  "type": "paragraph",
  "children": [
    {
      "type": "text",
      "value": "GITHUB-EMBED"
    }
  ]
}
```

into this AST node:

```json
{
  "type": "code",
  "lang": "js",
  "value": "const a = 1;"
}
```
Thus, our plan is to visit every node in the AST and, whenever we encounter a paragraph node whose content is the marker GITHUB-EMBED, change the type of that node into code, unset the children key and add two new keys: lang with the value js and value with the value const a = 1;.
The following transformer accomplishes that goal:
```ts
// https://github.com/huy-nguyen/remark-github-plugin/blob/c187c72dded0b57179648776e9b887c5fbcbc5da/src/transform.ts
import visit from 'unist-util-visit';

export interface IOptions {
  marker: string;
}
export const transform = ({marker}: IOptions) => (tree: any) => {
  const visitor = (node: any) => {
    const {children} = node;
    if (children.length >= 1 && children[0].value === marker) {
      node.type = 'code';
      node.children = undefined;
      node.lang = 'js';
      node.value = `const a = 1;`;
    }
  };

  visit(tree, 'paragraph', visitor);
};
```
Note that instead of hard-coding the embedding marker to be GITHUB-EMBED, we’ve made it a configurable option, marker.
Instead of traversing the AST manually, we use the utility package unist-util-visit (part of the unist ecosystem that remark builds on).
Its main export (the visit function) takes three arguments: an AST to traverse, a condition (such as the node type paragraph in this case) and a callback to invoke if a node matches that condition.
This is the visitor design pattern in action.
All AST-parsing libraries I’ve seen so far (remark, eslint, babel etc) provide utility packages to traverse the ASTs they produce, e.g. babel-traverse by Babel.
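If the visitor pattern is new to you, the following hand-rolled traversal may help demystify it. This is only a sketch of the idea under simplifying assumptions (a plain depth-first walk that matches on node type); it is not how unist-util-visit is implemented, and TreeNode and simpleVisit are names invented for this example.

```typescript
// Simplified node shape, assumed for this illustration.
interface TreeNode {
  type: string;
  value?: string;
  children?: TreeNode[];
}

// Walk the tree depth-first and call `visitor` on every node of the given type.
const simpleVisit = (
  node: TreeNode,
  type: string,
  visitor: (matched: TreeNode) => void,
): void => {
  if (node.type === type) {
    visitor(node);
  }
  // Read `children` after the visitor runs so that in-place edits are respected.
  for (const child of node.children || []) {
    simpleVisit(child, type, visitor);
  }
};

const sampleTree: TreeNode = {
  type: 'root',
  children: [
    {type: 'heading', children: [{type: 'text', value: 'Hello'}]},
    {
      type: 'paragraph',
      children: [
        {type: 'text', value: 'a'},
        {type: 'emphasis', children: [{type: 'text', value: 'b'}]},
      ],
    },
  ],
};

// Collect the value of every text node, however deeply it is nested.
const collected: string[] = [];
simpleVisit(sampleTree, 'text', (textNode) => {
  collected.push(textNode.value || '');
});
console.log(collected.join(','));
```

The caller never writes the recursion itself: it only states which nodes it cares about and what to do with them, which is exactly the convenience the real utility provides.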
Side note on testing
One common way to test code transformers, such as the one we’re writing, is to store each pair of input and expected output as separate files within a directory (called a test fixture) and then programmatically generate tests from each directory.
For example, to test the transformer we have created so far, we create the following simple-example directory inside __fixtures__:

```
src
└── __fixtures__
    └── simple-example
        ├── input.md
        ├── expected.md
        └── options.js
```

where input.md and expected.md are the Markdown input and expected Markdown output, respectively, taken from above.
options.js contains the configuration for the transformer (in this case, setting marker to GITHUB-EMBED) and “simple example” is the name of this test fixture.
The plumbing to convert these fixtures into tests is in the file src/__tests__/transform.js if you’re interested.
After this step, the repo should look like this.
When you git checkout that commit and run npm run test, the tests should pass, indicating that the actual output of the plugin matches the expected output.
Feel free to play around with the input, expected output, options and transformer code.
For example, can you try a new marker phrase or make the plugin transform the marker into a different code snippet while still keeping the tests passing?
If you set marker to be GITHUB_EMBED, what happens?
What constraint does that put on possible values for marker?
Recognizing embedding markers
Now that we’ve gotten the hang of transforming ASTs and testing those transformations, let’s apply those skills to our actual use case. We want to target our transformation at URLs sandwiched between two embedding markers, as in GITHUB-EMBED https://github.com/huy-nguyen/squarify/blob/d7074c2/.babelrc GITHUB-EMBED, and replace them with the toy JavaScript snippet above (const a = 1;) while avoiding false positives.
For example, in a sample input with three paragraphs that each contain GITHUB-EMBED, only the first paragraph should be replaced by the code snippet while the latter two should be left alone because one of them doesn’t contain a URL and the other contains only one marker.
We will again use AST Explorer for guidance. After pasting the sample input into AST Explorer, we can see the difference between our target:
```json
{
  "type": "paragraph",
  "children": [
    {"type": "text", "value": "GITHUB-EMBED "},
    {
      "type": "link", "title": null, "url": "https://github.com/huy-nguyen/squarify/blob/d7074c2/.babelrc",
      "children": [
        {"type": "text", "value": "https://github.com/huy-nguyen/squarify/blob/d7074c2/.babelrc"}
      ]
    },
    {"type": "text", "value": " GITHUB-EMBED"}
  ]
}
```

and the two potential false positives:

```json
{
  "type": "paragraph",
  "children": [
    {"type": "text", "value": "GITHUB-EMBED GITHUB-EMBED"}
  ]
}
```

```json
{
  "type": "paragraph",
  "children": [
    {"type": "text", "value": "GITHUB-EMBED"}
  ]
}
```
From this exercise in compare-and-contrast, we can reasonably conclude that we need to transform paragraph nodes that have three children, of which:
- The first is a text node whose value contains the embedding marker (GITHUB-EMBED).
- The second is a link to the desired GitHub file.
- The last is another text node whose value also contains the embedding marker (GITHUB-EMBED).
Based on the above three conditions, we can write a function checkNode to check whether a paragraph node is a candidate for transformation:

```ts
// https://github.com/huy-nguyen/remark-github-plugin/blob/0784899e/src/transform.ts
type CheckResult = {
  isCandidate: true;
  link: string;
} | {
  isCandidate: false;
};

const checkNode = (embedMarker: string, node: any): CheckResult => {
  const {children} = node;
  const numChildren = children.length;
  if (numChildren < 3) {
    return {
      isCandidate: false,
    };
  } else {
    const firstChild = children[0];
    const firstChildContent = firstChild.value.trim();

    const lastChild = children[numChildren - 1];
    const lastChildContent = lastChild.value.trim();

    const [linkChild ] = children.slice(1, numChildren - 1);

    if (firstChild.type === 'text' &&
        firstChildContent === embedMarker &&
        lastChild.type === 'text' &&
        lastChildContent.includes(embedMarker) &&
        linkChild.type === 'link') {

      return {
        isCandidate: true,
        link: linkChild.url,
      };
    } else {
      return {
        isCandidate: false,
      };
    }

  }

};
```
and use this checker to guard against false positives in our transformer:

```ts
// https://github.com/huy-nguyen/remark-github-plugin/blob/0784899e/src/transform.ts
export const transform = ({marker}: IOptions) => (tree: any) => {
  const visitor = (node: any) => {
    const checkResult = checkNode(marker, node);
    if (checkResult.isCandidate === true) {
      // ...
    }
  };

  visit(tree, 'paragraph', visitor);
};
```
After this step, the repo should look like this. The tests should pass, indicating that our detection works as expected.
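One thing worth experimenting with is what happens when the first or last child of a paragraph is not a text node (for example, a paragraph that starts with emphasized text), because such nodes have no value to call trim() on. The sketch below is a hypothetical, more defensive variant, not the code from the repo: it requires exactly three children, requires both text nodes to equal the marker after trimming, and checks each node’s type before reading its value. MdNode, Check and checkNodeSafely are names I made up for this illustration.

```typescript
// Simplified MDAST-like node shape, assumed for this illustration.
interface MdNode {
  type: string;
  value?: string;
  url?: string;
  children?: MdNode[];
}

type Check = {isCandidate: true; link: string} | {isCandidate: false};

// Hypothetical defensive variant: never reads `value` from a non-text node.
const checkNodeSafely = (embedMarker: string, node: MdNode): Check => {
  const children = node.children || [];
  if (children.length !== 3) {
    return {isCandidate: false};
  }
  const [first, link, last] = children;
  const isMarkerText = (candidate: MdNode): boolean =>
    candidate.type === 'text' &&
    typeof candidate.value === 'string' &&
    candidate.value.trim() === embedMarker;

  if (isMarkerText(first) && isMarkerText(last) &&
      link.type === 'link' && typeof link.url === 'string') {
    return {isCandidate: true, link: link.url};
  }
  return {isCandidate: false};
};

const fileUrl = 'https://github.com/huy-nguyen/squarify/blob/d7074c2/.babelrc';

// The target paragraph from the sample AST above.
const target: MdNode = {
  type: 'paragraph',
  children: [
    {type: 'text', value: 'GITHUB-EMBED '},
    {type: 'link', url: fileUrl, children: [{type: 'text', value: fileUrl}]},
    {type: 'text', value: ' GITHUB-EMBED'},
  ],
};

// A false positive: two markers but no link.
const noLink: MdNode = {
  type: 'paragraph',
  children: [{type: 'text', value: 'GITHUB-EMBED GITHUB-EMBED'}],
};

// A paragraph whose first child is not a text node.
const startsWithEmphasis: MdNode = {
  type: 'paragraph',
  children: [
    {type: 'emphasis', children: [{type: 'text', value: 'GITHUB-EMBED'}]},
    {type: 'link', url: fileUrl, children: []},
    {type: 'text', value: ' GITHUB-EMBED'},
  ],
};

console.log(checkNodeSafely('GITHUB-EMBED', target));
console.log(checkNodeSafely('GITHUB-EMBED', noLink));
console.log(checkNodeSafely('GITHUB-EMBED', startsWithEmphasis));
```

Requiring an exact match on the trailing text is stricter than the includes() check above, so this variant would have to be relaxed again before extra settings can follow the URL.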
Allow specifying language and line range
I think we all want syntax highlighting for our new embedded code blocks. Additionally, it would be nice to be able to embed only a subset of lines from a GitHub file. After some consideration, I decided to make it as simple as possible to specify the language for an embedded code block. The language name comes after the URL (but still stays within the two embedding markers) and is separated from the URL by whitespace.
The user can additionally specify that only a subset of lines from the GitHub file should be embedded. For example, an embedding with the line range 1,3-5 will only insert line 1 and lines 3 through 5 into the output code block.
I chose this numeric range notation because it’s used to specify which pages should be printed from the print dialog of many operating systems and software, thus making it immediately familiar to a large number of users.
Additionally, there’s already an npm package to parse this notation for us: parse-numeric-range.
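To illustrate what this notation means, here is a small hand-written sketch that expands a range such as 1,3-5 into individual line numbers and then keeps only those lines of a piece of text. It is not the implementation of parse-numeric-range, and expandRange and pickLines are hypothetical helpers written only for this example.

```typescript
// Hypothetical helper: expands "1,3-5" into [1, 3, 4, 5].
// Illustration only; this is not how parse-numeric-range is implemented.
const expandRange = (notation: string): number[] => {
  const result: number[] = [];
  for (const part of notation.split(',')) {
    const match = part.trim().match(/^(\d+)(?:-(\d+))?$/);
    if (match === null) {
      continue; // Skip anything that is not "N" or "N-M".
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    for (let line = start; line <= end; line++) {
      result.push(line);
    }
  }
  return result;
};

// Hypothetical helper: keeps only the requested 1-based lines of `content`.
const pickLines = (content: string, notation: string): string => {
  const wanted = new Set(expandRange(notation));
  return content
    .split('\n')
    .filter((_line, index) => wanted.has(index + 1))
    .join('\n');
};

console.log(expandRange('1,3-5')); // [1, 3, 4, 5]
console.log(pickLines('a\nb\nc\nd\ne\nf', '1,3-5')); // lines a, c, d and e
```

This sketch silently ignores malformed parts; a real plugin would probably want to surface them to the user instead.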
Like the language name, I again decided to let the line range just follow the language name, separated by whitespace but still within the two embedding markers. This does raise a potential conflict: if only one whitespace-delimited “word” appears between the URL and the closing embedding marker, should that “word” be interpreted as a language name or a line range? After some more consideration, I decided that because a user is more likely to specify a language name than a line range, that ambiguous “word” should be interpreted as a language name.
We can now incorporate these new requirements into our test input and expected output.
Note that we include the expected line range inside the expected output code blocks (e.g. const range = '1,3-5') to visually demonstrate that if the tests pass, we have correctly extracted the line range from within the embedding markers.
With the test in place, we can update the checkNode function to handle these extra use cases by counting the whitespace-delimited entities that follow the URL:
```ts
// https://github.com/huy-nguyen/remark-github-plugin/blob/061fddea/src/transform.ts
// ...
type CheckResult = {
  isCandidate: true;
  link: string;
  language: string | undefined;
  range: string | undefined;
} | {
  isCandidate: false;
};
// ...
const checkNode = (embedMarker: string, node: any): CheckResult => {
  // ...
  if (firstChild.type === 'text' &&
      firstChildContent.includes(embedMarker) &&
      lastChild.type === 'text' &&
      lastChildContent.includes(embedMarker) &&
      linkChild.type === 'link') {

    // Ref https://stackoverflow.com/a/14912552/7075699
    const matched = lastChildContent.match(/\S+/g);
    let range: string | undefined, language: string | undefined;
    if (matched.length === 3) {
      // If there are 2 settings, the first is the language and the second the
      // range:
      language = matched[0];
      range = matched[1];
    } else if (matched.length === 2) {
      // If there's only one option provided, it's the language:
      language = matched[0];
      range = undefined;
    } else {
      range = undefined;
      language = undefined;
    }

    return {
      isCandidate: true,
      link: linkChild.url,
      range,
      language,
    };
  } else {
    // ...
  };
// ...
```
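To see the “a lone word is a language” rule in isolation, here is a small standalone sketch that only looks at the text following the URL. It is a simplification rather than the repo’s code: parseSettings and EmbedSettings are hypothetical names, and unlike the excerpt above it simply ignores any words beyond the first two instead of discarding all settings.

```typescript
interface EmbedSettings {
  language: string | undefined;
  range: string | undefined;
}

// Hypothetical helper: `trailingText` is the text that follows the URL,
// including the closing marker, e.g. " javascript 1,3-5 GITHUB-EMBED".
const parseSettings = (trailingText: string): EmbedSettings => {
  const words: string[] = trailingText.match(/\S+/g) || [];
  // The last word is the closing marker, so drop it before reading settings.
  const [language, range] = words.slice(0, -1);
  return {language, range};
};

// Both settings present: language first, then range.
console.log(parseSettings(' javascript 1,3-5 GITHUB-EMBED'));
// A single word is always treated as the language...
console.log(parseSettings(' javascript GITHUB-EMBED'));
// ...even when it looks like a line range.
console.log(parseSettings(' 1,3-5 GITHUB-EMBED'));
// No settings at all.
console.log(parseSettings(' GITHUB-EMBED'));
```

The third call shows the cost of the chosen rule: a user who wants a line range without syntax highlighting still has to supply a language name first.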
Once a node satisfies the checkNode function, we need to set the lang property on the code block and insert the line range into the code block:

```ts
/* From https://github.com/huy-nguyen/remark-github-plugin/blob/061fddea/src/transform.ts */

export const transform = ({marker}: IOptions) => (tree: any) => {
  // ...

  if (checkResult.isCandidate === true) {
    const {language, link, range} = checkResult;
    node.type = 'code';
    node.children = undefined;
    node.lang = (language === undefined) ? null : language;
    node.value = `const link = '${link}';\nconst range = '${range}';`;
  }
  // ...
};
```
After this step, the repo should look like this.
Running npm run test should show all tests passing.
So far our tool is pretty rudimentary but has correctly performed the tasks we asked of it. In my experience with writing code transformers, it’s best to start simple and avoid over-engineering, then gradually add more complex test cases.
This is the end of part two of my tutorial. Click here for part three.
[1] The notable (and probably only) exception is WebAssembly byte code.
[2] For example, the transform-react-constant-elements Babel plugin “factors out” constant React elements to avoid calling React.createElement more than once for those elements. A more extreme example is the Prepack “compiler” by Facebook, which actually executes JavaScript source code to eliminate all computations that can be done at compile-time. For example, it can turn const a = 1; const b = 2; const c = a + b; into const c = 3.