Regex get all quoted words that are not also single quoted - javascript

Would it be possible to get all quoted text with a single regex?
Example text from regexr:
Edit the "Expression" & Text to see matches. Roll over "matches" or the expression for details.
Undo mistakes with ctrl-z.
Save 'Favorites & "Share" expressions' with friends or the Community. "Explore" your results with Tools. A full Reference & Help is available in the Library, or watch the video Tutorial.
In this case I would like to capture Expression, matches and Explore but not Share since 'Favorites & "Share" expressions' is single quoted.

In JavaScript you can't build a regex that matches only the parts you want. However, you can build a pattern that consumes the whole string without gaps and use a capture group to extract the part you want:
/[^"']*(?:'[^']*'[^"']*)*"([^"]*)"/g
#^----------------------^ all that isn't content between double quotes
Since your string may end with something like abcd 'efgh "ijkl" mnop' qrst (in short, without the part you want but with a double-quoted part inside a single-quoted substring), it's safer to change the pattern to:
/[^"']*(?:'[^']*(?:'[^"']*|$))*(?:"([^"]*)"|$)/g
and to discard the last match.
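A minimal sketch of how the second pattern might be applied. Instead of dropping the last match by position, it keeps only the matches in which the capture group took part, which has the same effect. The function name is hypothetical, and String.prototype.matchAll (ES2020) is assumed to be available.

```javascript
// Collect double-quoted text that is not inside a single-quoted span.
// Hypothetical helper name; the pattern is the second one from the answer.
function doubleQuotedOutsideSingleQuotes(text) {
  const pattern = /[^"']*(?:'[^']*(?:'[^"']*|$))*(?:"([^"]*)"|$)/g;
  const found = [];
  for (const match of text.matchAll(pattern)) {
    // Trailing matches that end on `$` leave group 1 undefined; skip them.
    if (match[1] !== undefined) {
      found.push(match[1]);
    }
  }
  return found;
}

const sampleText =
  'Edit the "Expression" & Text to see matches. Roll over "matches" or the expression for details. ' +
  "Save 'Favorites & \"Share\" expressions' with friends. \"Explore\" your results with Tools.";

console.log(doubleQuotedOutsideSingleQuotes(sampleText));
```

With the shortened sample above, the logged array should contain Expression, matches and Explore, but not Share.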

Without a special regex pattern:
var mystr = "Edit the \"Expression\" & Text to see matches. Roll over \"matches\" or the expression for details. Undo mistakes with ctrl-z. Save 'Favorites & \"Share\" expressions' with friends or the Community. \"Explore\" your results with Tools. A full Reference & Help is available in the Library, or watch the video Tutorial.";
var myarr = mystr.split(/"/g);
var opening = false; // true while we are inside a single-quoted span
for (var i = 1; i < myarr.length; i = i + 2) {
  // An odd number of ' in the preceding segment toggles the single-quote state.
  if ((myarr[i - 1].length - myarr[i - 1].replace(/'/g, "").length) % 2 === 1) { opening = !opening; }
  // Log the current double-quoted item (index i, not index 1).
  if (!opening) { console.log(myarr[i]); }
}
How it works:
Split the text on ".
Each odd index holds a string that was wrapped in ".
If an odd number of ' precedes that index in total, the item sits inside a '...' pair and is skipped.

Related

How to search for double spaces within <p> elements in Atom Editor?

I'm creating a web page by copy-pasting paragraphs of a Word document. When I copy-paste, the beginning of sentences is preceded by two spaces, which I'd like to replace globally by a single space. However, I would not like this to affect my indentation, which is also by two spaces.
I'm trying to do this by using Atom's 'search with regex' feature. I'm trying to enter the expression
<p>.+  .+</p>
However, this does not produce any matches:
There should be matches, though, as illustrated by the highlighted areas within the p elements in a regular search for ' ':
Also, from a little experiment in the Python shell, this regular expression seems to match what I'm looking for:
In [8]: re.findall(r'<p>.+  .+</p>', '  <p>Foo  Bar</p>')
Out[8]: ['<p>Foo  Bar</p>']
It seems from Atom's documentation, though, that the search term should be a Javascript RegExp expression. I tried from the docs (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) to adapt this to Javascript, but so far no success. How can I make this search in Atom work?
In the case of having <p>Foo  Bar</p> as the sample input, the following find and replace will work:
"Find in current buffer" = <p>(.+)  (.+)</p>
"Replace in current buffer" = <p>$1 $2</p>
The highlighted spaces in your image are within a <p class="flow-text">, which has a class attribute, and you are searching for <p>, which doesn't allow for attributes. You should do this:
<p[^>]*>.+  .+</p>
However, this produces a single match per paragraph, so you have to use capturing groups to replace the two spaces with one. You may want to try a more precise approach. Search for:
[ ]{2,}(?=(?:(?!<p[^>]*>).)*<\/p>)
and replace with a single space.
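Since Atom's search uses JavaScript regular expressions, the same pattern can be tried outside the editor. The snippet below is only a sketch: the function name and the sample line are made up for illustration.

```javascript
// Runs of two or more spaces that are followed, on the same line, by a
// closing </p> with no opening <p ...> in between.
const doubleSpaceInP = /[ ]{2,}(?=(?:(?!<p[^>]*>).)*<\/p>)/g;

// Hypothetical helper: collapse such runs to a single space.
function collapseSpacesInParagraphs(html) {
  return html.replace(doubleSpaceInP, ' ');
}

const line = '  <p class="flow-text">Foo.  Bar.   Baz.</p>';
console.log(collapseSpacesInParagraphs(line));
```

The two-space indentation before the opening tag is left alone because the lookahead refuses to cross an opening <p ...> tag. Note that . does not match newlines, so this only handles paragraphs that sit on a single line.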

Regex appears to ignore multiple piped characters

Apologies for the awkward question title. I have the following JavaScript:
var wordRe = new RegExp('\\b(?:(?![<^>"])fox|hello(?![<\/">]))\\b', 'g'); // Words regex
console.log('<span>hello</span> <hello>fox</hello> fox link hello my name is fox'.replace(wordRe, 'foo'));
What I'm trying to do is replace any word that isn't nested in an HTML tag or part of an HTML tag itself; that is, I want to match only "plain" text. The expression seems to ignore the rule for the first alternative, "fox", and replaces it when it shouldn't.
Can anyone point out why this is? I think I might have organised the expression incorrectly (at least the negative lookahead).
Here is the JSFiddle.
I'd also like to add that I am aware of the implications of using regex with HTML :)
For your regex to work, you would need lookbehind. However, as of this writing, this feature is not supported in JavaScript.
Here is a workaround:
Instead of matching what we want, we will match what we don't want and remove it from our input string. Later, we can perform the replace on the cleaned input string.
var nonWordRe = new RegExp('<([^>]+).*?>[^<]+?</\\1>', 'g');
var test = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
var cleanedTest = test.replace(nonWordRe, '');
var final = cleanedTest.replace(/fox|hello/g, 'foo'); // once trimmed, final === 'foo link foo my name is foo'
Note:
I built this workaround based on your sample. Here are some points you may need to explore if you run into them:
you may need to remove self-closing tags (<([^>]+).*?/\>) from the test string
you may need to trim the final string (final)
you may need a decent HTML parser if tags can contain other tags, as HTML allows this.
JavaScript doesn't, again as of this writing, support recursive patterns.
Demo
http://jsfiddle.net/yXd82/2/
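If the tags need to survive in the output, a different technique from the workaround above is to match either a whole simple element, a lone tag, or a target word, and to substitute only in the last case. This is a sketch under the same assumption as the workaround (no nested tags); the function name is hypothetical.

```javascript
// Alternatives, tried in order at each position:
//   1. a simple element such as <span>hello</span> (no nested tags)
//   2. any other lone tag
//   3. one of the target words in plain text (captured in group 2)
const tagOrWord = /<([a-z][\w-]*)\b[^>]*>[^<]*<\/\1>|<[^>]+>|\b(fox|hello)\b/gi;

// Hypothetical helper: replace the target words only where they are plain text.
function replacePlainWords(html, replacement) {
  return html.replace(tagOrWord, (match, tagName, word) =>
    word === undefined ? match : replacement
  );
}

const input = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
console.log(replacePlainWords(input, 'foo'));
```

Because elements and tags are consumed by the first two alternatives and returned unchanged, the words inside them are never seen by the third alternative.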

Match attribute value of XML string in JS

I've searched Stack Overflow and found similar questions, but none is quite what I want.
Given an XML string: "<a b=\"c\"></a>" in a JavaScript context, I want to create a regex that will capture the attribute value including the quotation marks.
NOTE: the same applies if you're using single quotation marks.
Currently I have a regular expression tailored to the XML specification:
[_A-Za-z][\w\.\-]*(?:=\"[^\"]*\")?
[_A-Za-z][\w\.\-]* //This will match the attribute name.
(?:=\"[^\"]*\")? //This will match the attribute value.
\"[^\"]*\" //This part concerns me.
My question now is, what if the xml string looks like this:
<shout statement="Hi! \"Richeve\"."></shout>
I know this may be a dumb question, but I want to handle the rare cases where this scenario might happen. I know the coder could use single quotes here, but there are cases where we don't know the attribute's value in advance because it changes dynamically at runtime.
So to make this clearer, the result of that using the correct regex should be:
"Hi! \"Richeve\"."
I hope my question is clear. Thanks for all the help!
PS: Note that the language context is JavaScript. I know it is tempting to use lookbehinds, but they are currently not supported.
PS: I know it is really hard to parse XML, but I have an elegant solution for that :) so I just need this small problem solved. The only focus of this question is capturing quoted string tokens that contain quotation marks inside the token.
The standard pattern for content with matching delimiters and embedded escaped delimiters goes like this:
"[^"\\]*(?:\\.[^"\\]*)*"
Ignoring the obvious first and last characters in the pattern, here's how the rest of the pattern works:
[^"\\]*: Consume all characters until a delimiter OR backslash (matching Hi! in your example)
(?:\\.[^"\\]*)*: Try to consume a single escaped character (\\.) followed by a series of non-delimiter, non-backslash characters, repeatedly (matching \"Richeve first and then \". next in your example)
That's it.
You can try to use a more generic delimiter approach using (['"]) and back references, or you can just allow for an alternate pattern with single quotes like so:
("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')
Here's another description of this technique that might also help (see the section called Strings): http://www.regular-expressions.info/examplesprogrammer.html
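As a quick check, the combined double-quote and single-quote pattern can be run against the example from the question. This is a minimal sketch: the helper name and the extra mood attribute are invented for illustration, and String.raw is used so the backslashes reach the regex unchanged.

```javascript
// Double- or single-quoted token that may contain backslash-escaped characters.
const quotedToken = /"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'/g;

// Hypothetical helper: return every quoted token, quotes included.
function quotedTokens(text) {
  return text.match(quotedToken) || [];
}

// The `mood` attribute is an added example for the single-quote branch.
const xml = String.raw`<shout statement="Hi! \"Richeve\"." mood='it\'s fine'></shout>`;
console.log(quotedTokens(xml));
```

The first token returned is "Hi! \"Richeve\"." with its surrounding quotes, which is the result the question asks for.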
Description
I'm pretty sure embedding double quotes inside a double-quoted attribute value is not legal XML. You could use the entity &quot; (or the character reference &#x22;) inside the value instead.
However to answer the question, this expression will:
allow escaped quotes inside attribute values
capture the statement attribute's value
allow attributes to appear in any order inside the tag
avoid many of the edge cases that trip up pattern matching inside HTML text
doesn't use lookbehinds
<shout\b(?=\s)(?=(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*?\sstatement=(['"])((?:\\['"]|.)*?)\1(?:\s|\/>|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/shout>
Example
Pretty Rubular
Ugly RegexPlanet set to Javascript
Sample Text
Note the difficult edge case in the first attribute :)
<shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
Matches
Group 0 gets the entire tag from open to close
Group 1 gets the quote surrounding the statement attribute value, this is used to match the closing quote correctly
Group 2 gets the statement attribute value which may include escaped quotes like \" but not including the surrounding quotes
[0][0] = <shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
[0][1] = "
[0][2] = Hi! \"Richeve\".

Javascript - parse formatted text and extract values in order?

I have a field with wiki style rendering on it that I'd like to bust up in Javascript.
The text I'm trying to parse looks like this:
{color:#47B}_name1_{color}
{color:#555}description1{color}
---
{color:#47B}_name2_{color}
{color:#555}description2{color}
---
{color:#47B}_name3_{color}
{color:#555}description3{color}
---
etc
Where name1 and description1 belong together, name2 and description2 belong together, and so forth. The values for name and description are user supplied values, with description potentially spanning multiple lines.
My end goal is to be able to extract the values of each name and each description from the text (and be able to reliably associate name1 with description1, etc).
My question is: If I use a regex to match all the names into an array and all the descriptions into an array, can I be sure that the items in each array are in the correct order? That is, will names[0] always be the first name in the parsed text (assuming I did a JavaScript regex match into the names array)? Also, is this bad practice, and should I do this another way?
The regular expression I'm trying to use to match names is:
/^(\{color\:#47B\})(_)(\s*?)(.*?)(\s*?)(_)(\{color\})$/
And the regular expression I'm using to match descriptions is:
/(\{color\:#555\})(.*?)(\{color\})/
A regex search will always return matches in source order (i.e. in the order in which they occur in the source text.)
I assume you are asking this question because you're hoping to do two regex matches (one for name, one for description) and then get two result arrays, and guarantee that namesmatch[i] always goes with descriptionmatch[i]. However, this will only be true if your source text is always exactly perfect.
In this case it may be better or safer either to use a single regex that matches both at once, or to split your source up by those --- delimiters and then match within each block. It may be safer because your source text may contain errors, and this way you can detect them and still recover as much good data as possible.
A note about your regexes. The . does not match newlines, so if the text between your {color} braces might have a newline you need to include newlines explicitly. [\s\S] and [^] are common idioms for this. Alternatively, if all . in a regex should match newlines, set the dotAll flag (s).
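The split-then-match approach might look like the sketch below. The function name and the returned object shape are assumptions, and [\s\S] is used so that descriptions can span several lines.

```javascript
// Hypothetical helper: split on --- lines, then pull one name/description
// pair out of each block so the two always stay together.
function parseEntries(text) {
  const entryRe =
    /\{color:#47B\}_\s*([\s\S]*?)\s*_\{color\}\s*\{color:#555\}([\s\S]*?)\{color\}/;
  return text
    .split(/^---\s*$/m)                 // one block per entry
    .map((block) => block.match(entryRe))
    .filter((m) => m !== null)          // skip blocks that are empty or malformed
    .map((m) => ({ name: m[1], description: m[2] }));
}

const wikiText = [
  '{color:#47B}_name1_{color}',
  '{color:#555}description1{color}',
  '---',
  '{color:#47B}_name2_{color}',
  '{color:#555}line one',
  'line two{color}',
  '---',
].join('\n');

console.log(parseEntries(wikiText));
```

A block that fails to match is dropped here; logging it instead would be one way to detect errors in the source text.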

find regex for validating terms (keyword) input

Unfortunately, I'm poor at regex! Can you guide me in writing a JavaScript regex that validates my terms input box? A user should input terms in this format:
#(all alphanumeric chars + blank + dash + quotation )
for example:
#keyword1#key word2#keyword3#key-word4#key'word5
and these inputs should be illegal:
#####
##keyword1#key2#
# #keyword
#!%^&
As you wrote, a term is specified by:
/#[a-zA-Z0-9 '-]+/
Repeat that pattern, and anchor it to the start and end of the string with ^ and $.
/^(#[a-zA-Z0-9 '-]+)+$/
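A short sketch that runs the anchored pattern against the examples from the question. Because a space is an allowed character, "# #keyword" passes this pattern, so the snippet also includes a stricter variant that requires a letter or digit right after each #. That variant and the function name are assumptions, not part of the answer.

```javascript
// Anchored pattern from the answer: one or more "#term" groups.
const looseTerms = /^(#[a-zA-Z0-9 '-]+)+$/;
// Assumed stricter variant: the first character after # must be a letter or digit.
const strictTerms = /^(#[a-zA-Z0-9][a-zA-Z0-9 '-]*)+$/;

// Hypothetical helper: validate the whole input string.
function isValidTerms(input, pattern = strictTerms) {
  return pattern.test(input);
}

const samples = [
  "#keyword1#key word2#keyword3#key-word4#key'word5",
  '#####',
  '##keyword1#key2#',
  '# #keyword',
  '#!%^&',
];
for (const sample of samples) {
  console.log(JSON.stringify(sample), looseTerms.test(sample), strictTerms.test(sample));
}
```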
/#[a-zA-Z0-9][a-zA-Z0-9 '-]+/
When you said "# #keyword" should be invalid, I've assumed you mean "# " should be invalid and "#keyword" should be extracted from that string. The first character class means a keyword will always begin with a lowercase letter, an uppercase letter, or a digit. If that's too restrictive and you want to allow, for example, "#-keyword", just add a dash before the first closing square bracket, like so:
/#[a-zA-Z0-9-][a-zA-Z0-9 '-]+/
And to return an array of results in javascript, apply it to the string using the "global" modifier ('g' after the second slash):
arrayOfKeywords = keywordString.match(/#[a-zA-Z0-9-][a-zA-Z0-9 '-]+/g);
You may wish to see this code at my test page. Regular-expressions.info is a useful site to learn more about regular expressions. They also have an interactive page to test regexes on, which can be useful when playing around.
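For reference, the extraction step can be wrapped as in the sketch below. The function name is hypothetical, and the empty-array fallback is an addition, since match returns null when nothing matches.

```javascript
// Hypothetical helper: return every "#keyword" found in the input.
function extractKeywords(keywordString) {
  const keywordRe = /#[a-zA-Z0-9-][a-zA-Z0-9 '-]+/g;
  // String.prototype.match with the g flag returns null on no match.
  return keywordString.match(keywordRe) || [];
}

console.log(extractKeywords("#keyword1#key word2#keyword3#key-word4#key'word5"));
console.log(extractKeywords('# #keyword'));
console.log(extractKeywords('#####'));
```

Because the second character class is followed by +, a keyword needs at least two characters after the #, so single-character keywords such as "#a" are skipped.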
