Regex - extract all headers from markdown string - javascript

I am using gray-matter in order to parse .MD files from the file system into a string. The result the parser produces is a string like this:
\n# Clean-er ReactJS Code - Conditional Rendering\n\n## TL;DR\n\nMove render conditions into appropriately named variables. Abstract the condition logic into a function. This makes the render function code a lot easier to understand, refactor, reuse, test, and think about.\n\n## Introduction\n\nConditional rendering is when a logical operator determines what will be rendered. The following code is from the examples in the official ReactJS documentation. It is one of the simplest examples of conditional rendering that I can think of.\n\n
I am now trying to write a regular expression that would extract all the heading text from the string. Headers in markdown start with a # (there can be from 1-6), and in my case always end with a new line.
I've tried using the following regular expression but calling it on my test string returns nothing:
const testString = "\n# Clean-er ReactJS Code - Conditional Rendering\n\n## TL;DR\n\nMove render conditions into appropriately named variables. Abstract the condition logic into a function. This makes the render function code a lot easier to understand, refactor, reuse, test, and think about.\n\n## Introduction\n\nConditional rendering is when a logical operator determines what will be rendered. The following code is from the examples in the official ReactJS documentation. It is one of the simplest examples of conditional rendering that I can think of.\n\n"
const HEADING_R = /(?<!#)#{1,6} (.*?)(\\r(?:\\n)?|\\n)/gm;
const headings = HEADING_R.exec(testString);
console.log('headings: ', headings);
This console logs headings as null (no matches found). The result that I am looking for would be: ["# Clean-er ReactJS Code - Conditional Rendering", "## TL;DR", "## Introduction"].
I believe the regular expression is wrong, but have no idea why.

/#{1,6}.+(?=\n)/g
#{1,6} matches the character # between one and six times in a row.
.+ matches any character (except line terminators) one or more times, as many times as possible (greedy).
(?=\n) is a positive lookahead: it requires a newline (line feed) to follow, without making it part of the match.
The global flag g makes the regex find every match instead of stopping at the first one.
Edit
Since . already "matches any character (except for line terminators)", the positive lookahead is not needed. A regex like /#{1,6}.+/g should already do the job for the OP's use case, which is:
"Headers in markdown start with a # (there can be from 1-6), and in my case always end with a new line."
"The result that I am looking for would be: ["# Clean-er ReactJS Code - Conditional Rendering", "## TL;DR", "## Introduction"]."
const testString = `\n# Clean-er ReactJS Code - Conditional Rendering\n\n## TL;DR\n\nMove render conditions into appropriately named variables. Abstract the condition logic into a function. This makes the render function code a lot easier to understand, refactor, reuse, test, and think about.\n\n## Introduction\n\nConditional rendering is when a logical operator determines what will be rendered. The following code is from the examples in the official ReactJS documentation. It is one of the simplest examples of conditional rendering that I can think of.\n\n`;
// see...[https://regex101.com/r/n6XQub/2]
const regXHeader = /#{1,6}.+/g
console.log(
testString.match(regXHeader)
);
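One caveat with /#{1,6}.+/g: it also matches a # that appears in the middle of a line, which the OP's sample input happens not to contain. The following is a minimal sketch of a stricter variant, assuming headings always start at the beginning of a line and have a space after the hashes. The names sample and headingAtLineStart are made up for illustration.

```javascript
// Hypothetical input with a "#" inside ordinary body text.
const sample = "\n# Title\n\nUse C# or F# here.\n\n## Section\n\nText\n";

// ^ combined with the m flag anchors the match at the start of each line;
// the literal space keeps text like "#hashtag" from counting as a heading.
const headingAtLineStart = /^#{1,6} .+/gm;

console.log(sample.match(/#{1,6}.+/g));
// [ '# Title', '# or F# here.', '## Section' ]

console.log(sample.match(headingAtLineStart));
// [ '# Title', '## Section' ]
```

For input like the OP's, where # only ever starts a heading, both patterns return the same list.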
Bonus
Refactoring the regex above into e.g. /(?<flag>#{1,6})\s+(?<content>.+)/g, which uses named capturing groups, and combining it with matchAll and a mapping step yields a structured result, as the following example code shows:
const testString = `\n# Clean-er ReactJS Code - Conditional Rendering\n\n## TL;DR\n\nMove render conditions into appropriately named variables. Abstract the condition logic into a function. This makes the render function code a lot easier to understand, refactor, reuse, test, and think about.\n\n## Introduction\n\nConditional rendering is when a logical operator determines what will be rendered. The following code is from the examples in the official ReactJS documentation. It is one of the simplest examples of conditional rendering that I can think of.\n\n`;
// see...[https://regex101.com/r/n6XQub/4]
const regXHeader = /(?<flag>#{1,6})\s+(?<content>.+)/g
console.log(
  Array
    .from(testString.matchAll(regXHeader))
    .map(({ groups: { flag, content } }) => ({
      heading: `h${ flag.length }`,
      content,
    }))
);

The issue is that you are using a literal for the regex and should not double escape the backslash, so you can write it as (?<!#)#{1,6} (.*?)(\r(?:\n)?|\n)
You can shorten the pattern capturing what you want and match the trailing newline instead of using a lookbehind assertion.
(#{1,6} .*)\r?\n
Retrieving all capture group 1 values:
const testString = "\n# Clean-er ReactJS Code - Conditional Rendering\n\n## TL;DR\n\nMove render conditions into appropriately named variables. Abstract the condition logic into a function. This makes the render function code a lot easier to understand, refactor, reuse, test, and think about.\n\n## Introduction\n\nConditional rendering is when a logical operator determines what will be rendered. The following code is from the examples in the official ReactJS documentation. It is one of the simplest examples of conditional rendering that I can think of.\n\n"
const HEADING_R = /(#{1,6} .*)\r?\n/g;
const headings = Array.from(testString.matchAll(HEADING_R), m => m[1]);
console.log('headings: ', headings);
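To make the double-escaping point concrete, here is a small sketch showing what each spelling matches. The variable names are made up for illustration.

```javascript
const text = "# Title\nbody";

// In a regex literal, \\n means "a backslash followed by the letter n".
const doubleEscaped = /\\n/;
// A single backslash gives the line-feed escape that was intended.
const singleEscaped = /\n/;

const doubleResult = doubleEscaped.test(text);
const singleResult = singleEscaped.test(text);
console.log(doubleResult); // false
console.log(singleResult); // true

// Double escaping is only needed when the pattern is built from a string,
// because the string literal consumes one level of backslashes first.
const fromString = new RegExp("\\n");
const fromStringResult = fromString.test(text);
console.log(fromStringResult); // true
```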

Related

How can I remove empty strings in template literals?

I'm creating a script that loops through an array of objects and creates .edn file that we're using to migrate some of our client's data.
As we have to use .edn file which uses Clojure, I generate a template literal string and populate the data in the format that we need.
In the generated string I have many conditional operations, where I check if some data exists before returning some string. If the data doesn't exist I cannot use null, so I use an empty string. Now when my file gets created, those empty strings create an extra line that I want to remove. Investigating further, it seems the string adds \n, which is what creates those extra lines from an empty string.
What is the best way to do it?
This is an example of what I'm doing:
arrayOfObjects.map(a => {
return `
; ...other code that doesn't depend on data
; Code that depends on data, if it's false I want to remove it completely
; and avoid passing empty string as it creates extra space in the generated file
${a.stripe_connected_account
? `[[:im.stripeAccount/id]
#:im.stripeAccount{:stripeAccountId "${a.stripe_connected_account}"
:user #im/ref :user/${a.user}}]`
: ""}
`;
});
Appreciate any help, thanks!
An empty string doesn't do that. T${''}h${''}i${''}s is no different from This. The "extra space" is the whitespace that you (unconditionally) include in your template string around the ${ } part.
If you start the ${ } expression earlier and end it later, and put the whitespace (if you even want it) inside the if-true expression of the ternary, you will get what you want.
For example:
arrayOfObjects.map(a => {
return `
; ...other code that doesn't depend on data
; Code that depends on data, if it's false I want to remove it completely
; and avoid passing empty string as it creates extra space in the generated file${
a.stripe_connected_account ? `
[[:im.stripeAccount/id]
#:im.stripeAccount{:stripeAccountId "${a.stripe_connected_account}"
:user #im/ref :user/${a.user}}]`
: ""
}`;
});
(Yes, the strange indentation is a result of merging two different indentation levels, the code's and the output's, and preventing line breaks in the output unless desired. I tried several ways to format it but none of them looked good - there are only ugly ways, no way to indent it nicely and have the result you want, unless you were to write a tagged template function that would smartly detect when a ${ } expression would need surrounding whitespace stripped and when it wouldn't.)
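A stripped-down sketch of the same effect, using made-up data (withValue, without) and made-up helper names (outside, inside), shows where the extra line comes from. JSON.stringify is only used here to make the newlines visible.

```javascript
const withValue = { id: "acct_1" }; // hypothetical record with the optional field
const without = {};                 // hypothetical record without it

// Line breaks written outside ${ } are always emitted,
// even when the expression itself evaluates to "".
const outside = a => `head
${a.id ? `[id "${a.id}"]` : ""}
tail`;

// Moving the line break into the truthy branch makes it conditional.
const inside = a => `head${a.id ? `
[id "${a.id}"]` : ""}
tail`;

console.log(JSON.stringify(outside(without)));  // "head\n\ntail"
console.log(JSON.stringify(inside(without)));   // "head\ntail"
console.log(JSON.stringify(inside(withValue))); // "head\n[id \"acct_1\"]\ntail"
```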

Unexpected behavior with function composition

I'm writing a little utility function to convert strings from one word separation scheme to another. The overall project is using lodash, which I know comes with stuff like _.camelCase, but I felt it was more extensible to not leverage those scheme-conversion helpers.
The idea is that other developers can easily add their own scheme definition to the ones I already have:
const CASES = [
{name: 'lower_kebab', pattern: /^[a-z]+(_[a-z]+)*$/g,
to_arr: w=> w.split('_'),
to_str: a=> a.map(w=>w.toLowerCase()).join('_')},
{name: 'UpperCamel', pattern: /^([A-Z][a-z]*)+$/g,
to_arr: w=> w.match(/[A-Z][a-z]*/g),
to_str: a=> a.map(_.capitalize).join('')},
//...
];
So each Case needs a pattern to determine if a string is of that scheme, a to_arr to split the string appropriately, and a to_str to join an array of words into a string of that scheme (name is optional, but it's good to be descriptive). I've included those two because it's in the conversion from lower_kebab to UpperCamel where I'm getting some unexpected behavior.
I've implemented the actual conversion function like so:
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdn.jsdelivr.net/lodash/4.17.3/lodash.min.js"></script>
<script src="https://cdn.jsdelivr.net/lodash/4.17.3/lodash.fp.min.js"></script>
<script>
$(document).ready(()=>{
var CASES = [
{ name: 'lower_kebab', pattern: /^[a-z]+(_[a-z]+)*$/g,
to_arr: w=> w.split('_'),
to_str: a=> a.map(w=>w.toLowerCase()).join('_')
},
{ name: 'UpperCamel', pattern: /^([A-Z][a-z]*)+$/g,
to_arr: w=> w.match(/[A-Z][a-z]*/g),
to_str: a=> a.map(_.capitalize).join('')
},
//...
];
function convert_to(target_scheme_example){
return _.compose(
CASES.find(c=>c.pattern.test(target_scheme_example)).to_str
, str=> CASES.find(c=>c.pattern.test(str)).to_arr(str) );
}
$('#go').on('click', ()=> $('#result').text(
convert_to( $('#dst').val() )( $('#src').val() )
));
});
</script>
<p>Try "<strong>UpperCamel</strong>" to "<strong>lower_kebab</strong>" and vice-versa.</p>
<input id="dst" value="UpperCamel" placeholder="Example of target scheme">
<input id="src" value="lower_kebab" placeholder="String to convert">
<button id="go">Convert</button>
<div>
<p><strong>Result:</strong></p>
<p id="result"></p>
</div>
The "real" version lives strictly in server-side code, so all that DOM-related stuff in the snippet is purely for demonstration purposes (the "real" version also does a little error checking using _.get which I excluded here for brevity).
Here's where things get weird.
On the server side, the problem manifests as convert_to('UpCa')('activity_template') evaluating to things like "Activity_template" and "activity template". In the demo snippet, I believe the same issue manifests as only being able to click "Convert" once before an exception is thrown.
Any thoughts? Are my RegExs a little off? Have I misunderstood how to use _.compose? If the tool were just broken, that'd be one thing, but it's really throwing me off how it works for many cases, but not all.
From the documentation on RegExp#test:
test() called multiple times on the same global regular expression instance will advance past the previous match.
This is the reason why it only works the first time: the regular expression object (pattern) maintains state resulting from the previous execution of the test method on it.
To avoid this behaviour, you could do one of the following:
Remove the g modifiers from the pattern regular expressions, since they are not necessary for the kind of matching you are trying to do, or
Use the String#match method instead, swapping the position of the string and the regular expression:
function convert_to(target_scheme_example){
  return _.compose(
    CASES.find(c=>target_scheme_example.match(c.pattern)).to_str
    , str=> CASES.find(c=>str.match(c.pattern)).to_arr(str) );
}
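The statefulness described above can be observed directly through the regex's lastIndex property. This is a minimal sketch that reuses the lower_kebab pattern from the question; the variable names are made up for illustration.

```javascript
const globalPattern = /^[a-z]+(_[a-z]+)*$/g;

const firstCall = globalPattern.test("lower_kebab");
const indexAfterFirst = globalPattern.lastIndex;
const secondCall = globalPattern.test("lower_kebab");
const indexAfterSecond = globalPattern.lastIndex;

console.log(firstCall);        // true
console.log(indexAfterFirst);  // 11 (the next search starts after the match)
console.log(secondCall);       // false (searching from index 11 finds nothing)
console.log(indexAfterSecond); // 0 (a failed match resets lastIndex)

// Without the g flag, test() keeps no state between calls.
const plainPattern = /^[a-z]+(_[a-z]+)*$/;
const plainResults = [plainPattern.test("lower_kebab"), plainPattern.test("lower_kebab")];
console.log(plainResults); // [ true, true ]
```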

Intellij Javascript multiline structural search and replace

In our project, a lot of Angular unit tests contain the following syntax:
inject(['dependency1', 'dependency2', function(_dependency1_, _dependency2_) {
dependency1 = _dependency1_;
dependency2 = _dependency2_;
}]);
In tests the array which lists the dependencies with string values is obsolete, since this is only useful when using minification. So we issued a coding convention to change this syntax to:
inject(function(_dependency1_, _dependency2_) {
dependency1 = _dependency1_;
dependency2 = _dependency2_;
});
Now I've been replacing a couple of these in existing code when I came across them, but I've gotten really tired of doing this manually. So I'm trying to solve this in IntelliJ by using structural search and replace. This is my search template so far:
inject([$injection$, function($argument$) {
$statement$;
}]);
with occurrences:
$injection$: 1 to infinite
$argument$: 1 to infinite
$statement$: 1 to infinite
The replace template is defined as follows:
inject(function($argument$) {
$statement$;
});
However, this does not work for the example I gave at the beginning. It only matches and replaces correctly when the function body contains a single statement, so the following example is replaced correctly:
inject(['dependency1', 'dependency2', function(_dependency1_, _dependency2_) {
dependency1 = _dependency1_;
}]);
Am I missing something? When I check out the simple if-else example on the Jetbrains website I get the feeling that this should work.
I have tried removing the semicolon after the $statement$ variable, but this didn't match multiple lines and resulted in the semicolons being removed after replacement. I've also tried applying regular expressions to the $statement$ variable, but these didn't help either.
((.*)=(.*);\n)+
didn't match, probably because the semicolon is filtered out by the IntelliJ structural search before the actual regex matching is performed.
(.*)=(.*)
matched, but it replaced with the same behaviour as without the regex.
Matching multiple statements with a variable in JavaScript is currently broken because of a bug.

Translating conditional statements in string format to actual logic

I have a good knowledge of real time graphics programming and web development, and I've started a project that requires me to take a user-created conditional string and actually use those conditions in code. This is an entirely new kind of programming problem for me.
I've tried a few experiments using loops and slicing up the conditional string...but I feel like I am missing some kind of technique that would make this more efficient and straightforward. I have a feeling regular expressions may be useful here, but perhaps not.
Here is an example string:
"IF#VAR#>=2AND$VAR2$==1OR#VAR3#<=3"
The values for those actual variables will come from an array of objects. Also, the different marker symbols around the variables denote different object arrays where the actual value can be found (variable name is an index).
I have complete control over how the conditional string is formatted (adding symbols around IF/ELSE/ELSEIF AND/OR as well as special symbols around the different operands), so my options are fairly open. How would you approach such a programming problem?
The problem you're facing is called parsing and there are numerous solutions to it. First, you can write your own "interpreter" for your mini-language, including lexer (which splits the string into tokens), parser (which builds a tree structure from a stream of tokens) and executor, which walks the tree and computes the final value. Or you can use a parser generator like PEG and have the whole thing built for you automatically - you just provide the rules of your language. Finally, you can utilize javascript built-in parser/evaluator eval. This is by far the simplest option, but eval only understands javascript syntax - so you'll have to translate your language to javascript before eval'ing it. And since eval can run arbitrary code, it's not for use in untrusted environments.
Here's an example of how to use eval with your sample input:
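let expr = "#VAR#>=2AND$VAR2$==1OR#VAR3#<=3";
// One entry per marker symbol. Repeating the "#" key in the object literal
// would silently overwrite the first entry, so both "#" variables live together.
const vars = {
  "#": {"VAR": 5, "VAR3": 7},
  "$": {"VAR2": 1}
};
// Turn each marked variable into a lookup in vars, then map AND/OR to JavaScript operators.
expr = expr.replace(/([#$])(\w+)\1/g, function($0, $1, $2) {
  return "vars['" + $1 + "']." + $2;
}).replace(/OR/g, "||").replace(/AND/g, "&&");
const result = eval(expr); // returns true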
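For comparison, the hand-written route mentioned above (lexer, parser, executor) can be sketched without eval. This is only an illustration under several assumptions that are not in the question: variables are wrapped in # or $, operands are variables or plain numbers, the comparison operators are >=, <=, ==, !=, > and <, and AND binds tighter than OR. The names tables, tokenize, compare and evaluate are made up.

```javascript
// Hypothetical variable tables, keyed by marker symbol.
const tables = {
  "#": { VAR: 5, VAR3: 7 },
  "$": { VAR2: 1 },
};

// Lexer: split the source into variable, number, comparison and logic tokens.
function tokenize(source) {
  const tokenPattern = /([#$])(\w+)\1|(\d+(?:\.\d+)?)|(>=|<=|==|!=|>|<)|(AND|OR)/y;
  const tokens = [];
  let position = 0;
  while (position < source.length) {
    tokenPattern.lastIndex = position; // sticky flag: match exactly here
    const m = tokenPattern.exec(source);
    if (!m) throw new SyntaxError(`Unexpected input at index ${position}`);
    if (m[1]) tokens.push({ type: "var", table: m[1], name: m[2] });
    else if (m[3]) tokens.push({ type: "num", value: Number(m[3]) });
    else if (m[4]) tokens.push({ type: "cmp", op: m[4] });
    else tokens.push({ type: "logic", op: m[5] });
    position = tokenPattern.lastIndex;
  }
  return tokens;
}

const compare = {
  ">=": (a, b) => a >= b, "<=": (a, b) => a <= b,
  "==": (a, b) => a === b, "!=": (a, b) => a !== b,
  ">": (a, b) => a > b, "<": (a, b) => a < b,
};

// Parser and executor combined: recursive descent that evaluates as it goes.
function evaluate(source, vars) {
  const tokens = tokenize(source);
  let i = 0;

  const operand = () => {
    const t = tokens[i++];
    if (t && t.type === "num") return t.value;
    if (t && t.type === "var") return vars[t.table][t.name];
    throw new SyntaxError("Expected a variable or number");
  };
  const comparison = () => {
    const left = operand();
    const t = tokens[i++];
    if (!t || t.type !== "cmp") throw new SyntaxError("Expected a comparison operator");
    return compare[t.op](left, operand());
  };
  const nextIs = op => i < tokens.length && tokens[i].type === "logic" && tokens[i].op === op;
  const conjunction = () => {
    let result = comparison();
    while (nextIs("AND")) { i++; const right = comparison(); result = result && right; }
    return result;
  };
  const disjunction = () => {
    let result = conjunction();
    while (nextIs("OR")) { i++; const right = conjunction(); result = result || right; }
    return result;
  };

  const value = disjunction();
  if (i < tokens.length) throw new SyntaxError("Unexpected trailing tokens");
  return value;
}

console.log(evaluate("#VAR#>=2AND$VAR2$==1OR#VAR3#<=3", tables)); // true
```

Unlike eval, this only accepts the tokens listed in the lexer, so a user-supplied string cannot run arbitrary code.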

zen-coding: ability to ascend the DOM tree using ^

I forked the excellent zen-coding project, with an idea to implement DOM ascension using a ^ - so you can do:
html>head>title^body>h1 rather than html>(head>title)+body>h1
Initially I implemented with rather shoddy regex methods. I have now implemented using @Jordan's excellent answer. My fork is here
What I still want to know
Are there any scenarios where my function returns the wrong value?
Disclaimer: I have never used zen-coding and this is only my second time hearing about it, so I have no idea what the likely gotchas are. That said, this seems to be a working solution, or at least very close to one.
I am using Zen Coding for textarea v0.7.1 for this. If you are using a different version of the codebase you will need to adapt these instructions accordingly.
A couple of commenters have suggested that this is not a job for regular expressions, and I agree. Fortunately, zen-coding has its own parser implementation, and it's really easy to build on! There are two places where you need to add code to make this work:
Add the ^ character to the special_chars variable in the isAllowedChar function (starts circa line 1694):
function isAllowedChar(ch) {
...
special_chars = '#.>+*:$-_!@[]()|^'; // Added ascension operator "^"
Handle the new operator in the switch statement of the parse function (starts circa line 1541):
parse: function(abbr) {
...
while (i < il) {
ch = abbr.charAt(i);
prev_ch = i ? abbr.charAt(i - 1) : '';
switch (ch) {
...
// YOUR CODE BELOW
case '^': // Ascension operator
if (!text_lvl && !attr_lvl) {
dumpToken();
context = context.parent.parent.addChild();
} else {
token += ch;
}
break;
Here's a line-by-line breakdown of what the new code does:
case '^': // Current character is ascension operator.
if (!text_lvl && !attr_lvl) { // Don't apply in text/attributes.
dumpToken(); // Operator signifies end of current token.
// Shift context up two levels.
context = context.parent.parent.addChild();
} else {
token += ch; // Add char to token in text/attribute.
}
break;
The implementation above works as expected for e.g.:
html>head>title^body
html:5>div#first>div.inner^div#second>div.inner
html:5>div>(div>div>div^div)^div*2
html:5>div>div>div^^div
You will doubtless want to try some more advanced, real-world test cases. Here's my modified source if you want a kick-start; replace your zen_textarea.min.js with this for some quick-and-dirty testing.
Note that this merely ascends the DOM by two levels and does not treat the preceding elements as a group, so e.g. div>div^*3 will not work like (div>div)*3. If this is something you want then look at the logic for the closing parenthesis character, which uses a lookahead to check for multiplication. (Personally, I suggest not doing this, since even for an abbreviated syntax it is horribly unreadable.)
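To illustrate the idea independently of the zen-coding source, here is a toy tree builder that understands only >, + and ^. It is not zen-coding's parser and ignores attributes, text, grouping and multiplication; parseAbbreviation and outline are made-up names, and each ^ simply moves the insertion point up one level.

```javascript
// Toy sketch: builds a tree from an abbreviation using ">", "+" and "^" only.
function parseAbbreviation(abbr) {
  const root = { name: "root", parent: null, children: [] };
  let parent = root; // node that will receive the next element
  let last = null;   // most recently created element
  let name = "";

  // Turn the characters collected so far into an element under `parent`.
  const flush = () => {
    if (!name) return;
    last = { name, parent, children: [] };
    parent.children.push(last);
    name = "";
  };

  for (const ch of abbr) {
    if (ch === ">") {          // descend: next element is a child of the last one
      flush();
      if (last) parent = last;
    } else if (ch === "+") {   // sibling: keep the same parent
      flush();
    } else if (ch === "^") {   // ascend: next element goes one level up
      flush();
      if (parent.parent) parent = parent.parent;
    } else {
      name += ch;
    }
  }
  flush();
  return root;
}

// Render a node as name(child,child,...) for easy inspection.
const outline = node =>
  node.children.length
    ? `${node.name}(${node.children.map(outline).join(",")})`
    : node.name;

const show = abbr => parseAbbreviation(abbr).children.map(outline).join("+");

console.log(show("html>head>title^body"));  // html(head(title),body)
console.log(show("html>(head>title)+body".replace(/[()]/g, "").replace("title+", "title^")));
// html(head(title),body)
console.log(show("div>div>div^^div"));      // div(div(div))+div
```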
You should look for Perl's Text::Balanced alternative in the language that you're using.
