how to extract body contents using regexp [duplicate]

This question already has answers here:
Regular Expression to Extract HTML Body Content
(6 answers)
Closed 8 years ago.
I have this code in a var.
<html>
<head>
.
.
anything
.
.
</head>
<body anything="">
content
</body>
</html>
or
<html>
<head>
.
.
anything
.
.
</head>
<body>
content
</body>
</html>
result should be
content

Note that the string-based answers supplied above should work in most cases. The one major advantage offered by a regex solution is that you can more easily provide for a case-insensitive match on the open/close body tags. If that is not a concern to you, then there's no major reason to use regex here.
And for the people who see HTML and regex together and throw a fit: since you are not actually trying to parse HTML here, this is something regular expressions can do. One caveat is that the match is greedy, so if the string contained more than one </body>, the capture would run to the last one. Aside from that, the scenario is specific enough for a regular expression to do what you want:
const strVal = yourStringValue; //obviously, this line can be omitted - just assign your string to the name strVal or put your string var in the pattern.exec call below
const pattern = /<body[^>]*>((.|[\n\r])*)<\/body>/im;
const array_matches = pattern.exec(strVal);
After the above executes, array_matches[1] will hold whatever came between the <body and </body> tags.
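A minimal sketch that wraps the pattern above in a function and handles the no-match case, since exec returns null when the string has no body element. The name extractBody and the sample string are assumptions for illustration, and [\s\S] stands in for (.|[\n\r]).

```javascript
// Hypothetical helper: returns the text between <body ...> and </body>,
// or null when no body element is found.
function extractBody(html) {
  // [^>]* skips any attributes on the opening tag; the i flag makes the
  // tag names case-insensitive; [\s\S]* matches any character, including
  // line breaks, greedily up to the last </body>.
  const match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(html);
  return match ? match[1] : null;
}

const sample = '<html><head></head><BODY class="x">\ncontent\n</BODY></html>';
console.log(extractBody(sample).trim()); // content
console.log(extractBody("no markup here")); // null
```

Checking for null before reading index 1 avoids a TypeError when the input has no body element.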

var matched = xhr.responseText.match(/<body[^>]*>([\w\W]*)<\/body>/im); // xhr is assumed to be your XMLHttpRequest instance
alert(matched[1]);

I believe you can load your HTML document into the .NET HTMLDocument object and then simply read HTMLDocument.body.innerHTML.
I am sure there is an even easier way with the newer XDocument as well.
And just to echo some of the comments above: regex is not the best tool for this, because HTML is not a regular language and there are edge cases that are difficult to handle.
https://en.wikipedia.org/wiki/Regular_language
Enjoy!

Related

How to safely prepare a string for JavaScript using PHP? [duplicate]

This question already has answers here:
How do I pass variables and data from PHP to JavaScript?
(19 answers)
Closed 2 years ago.
Considering the following piece of code rendered via PHP:
<script type="application/javascript">
const str = '<?= $str ?>';
</script>
How to prepare the $str so it can be rendered safely?
P.S. Safely is defined as "JavaScript gets the same Unicode contents as $str has in PHP".

You could use PHP's json_encode:
<script type="application/javascript">
const str = <?=json_encode($str)?>;
</script>
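This will return a JSON representation of your string, meaning that it will:
be wrapped in double quotes
have any double quotes inside the string escaped
have non-ASCII Unicode characters escaped as \uXXXX sequences
With this code alone (const str = ...), quotes in $str cannot break out of the literal, and because json_encode escapes / as \/ by default, a </script> in $str cannot end the script element either. Passing the JSON_HEX_TAG flag additionally escapes < and >, which guards against sequences such as <!-- inside the script. The string is then safe by itself, and can be manipulated by JS.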
However, it can become an XSS hazard if you use that String like this:
eval(str); for obvious reasons
elem.innerHTML = str; for example if str === '<button onclick="sendCookiesToEvilServer()">Click me</button>', or '<style>body { display: none; }</style>'
... that list is not exhaustive
If you want to display that String to the user, prefer .innerText to .innerHTML, or maybe look at strip_tags.
If you do need to use innerHTML because the string is allowed to contain some HTML, you need to properly sanitize it against XSS. And it's a little harder, because then, you need a parser to allow/remove only some HTML tags and/or attributes. strip_tags will allow you to do so to a certain extent (i.e. you could only allow <b> for example, but that won't prevent that <b> from having an attribute onclick="...").
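When the string has to be embedded as plain text inside markup that you build yourself, one option is to escape it on the JavaScript side first. The following is a minimal sketch, not a sanitizer for strings that are allowed to contain HTML; the helper name escapeHtml is an assumption for illustration.

```javascript
// Hypothetical helper: replaces the five characters that are significant
// in HTML text and attribute values with character references.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

const str = '<button onclick="sendCookiesToEvilServer()">Click me</button>';
console.log(escapeHtml(str));
// &lt;button onclick=&quot;sendCookiesToEvilServer()&quot;&gt;Click me&lt;/button&gt;
```

The escaped result renders as visible text when assigned to innerHTML, so the button from the example above is never created.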

why this small piece of JavaScript breaks? [duplicate]

This question already has answers here:
Why does <!--<script> cause a DOM tree break on the browser?
(2 answers)
Closed 6 years ago.
Why does this code break:
<script>
var test = "<!-- <script ";
</script>
<h1>
If you can see this it means the page didn't break
</h1>
https://jsfiddle.net/y3w7ugaw/
and this doesn't
<script>
var test = "<!-- <script";
</script>
<h1>
If you can see this it means the page didn't break
</h1>
https://jsfiddle.net/mL1xxygo/
It should not break, since the test variable is just a string.

Good question. The two examples are not the same in that the first has a space between <script and the following closing double quote while the second does not. Both examples have the character sequence <!--, used to introduce comments in HTML source, inside the javascript string.
The first example does not show the header, which can be made to reappear by either
removing the <!-- characters, OR
by removing the space after <script in the string value.
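The question alluded to in the comment states that the HTML is invalid, although reading the HTML parsing spec does not make the reason particularly obvious. In short, inside a script element the tokenizer treats <!-- as the start of an escaped section, and a following <script that is terminated by whitespace, / or > moves it into a "double escaped" state in which </script> no longer closes the element. The trailing space in the first example is what triggers this; in the second example <script is followed by the closing quote, so it does not.
A JavaScript solution is to escape the characters that confuse the HTML parser with a backslash, even though they do not normally need escaping. JavaScript ignores a backslash before an ordinary character, whilst the HTML parser no longer sees the original sequence.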
Hence either
var test = "<\!-- <script ";
or
var test = "<\!-- <script";
both successfully create a string containing the HTML start comment sequence without confusing the parser.
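A quick check, written as a sketch, that the escaped forms really produce the same string value. The variable names are illustrative, and the \x3C variant is an additional spelling not mentioned above.

```javascript
// The backslash before "!" is dropped by the JavaScript parser, so all
// three literals produce the same string value.
const plainComment = "<!-- <script ";       // risky when written inline in HTML
const backslashComment = "<\!-- <script ";  // the HTML parser no longer sees <!--
const hexComment = "\x3C!-- \x3Cscript ";   // \x3C is the hex escape for "<"

console.log(plainComment === backslashComment); // true
console.log(plainComment === hexComment);       // true
```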

Prevent JS from parsing string [duplicate]

This question already has answers here:
Escaping </script> tag inside javascript
(3 answers)
Closed 8 years ago.
Was playing around with some code and just realized you can't write a script tag in a string without the browser trying to display:
<html>
<head>
<script>
var code = "<script></script>";
</script>
</head>
This prints to the screen. Weird - why this behavior?

This has nothing to do with JavaScript "string parsing". Rather it's about HTML parsing.
It is simply not valid in HTML for a <script> element to contain the sequence </script> in its content (strictly, any </, although browsers are lenient about that): any such sequence will always be treated as the closing tag.
See Escaping </script> tag inside javascript for lots of the details.
A common solution is thus to split the sequence using string concatenation:
var code = "<script><"+"/script>";
It is also valid to use a backslash escape ("<script><\/script>") or a hex escape sequence ("<script><\x2fscript>").
The CDATA approach should not be used with HTML, as it's only for XML.
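When the string is generated by code rather than typed by hand, the same idea can be applied programmatically. This is a minimal sketch under the assumption that the value is JSON-serialisable; the helper name toInlineScriptLiteral is hypothetical.

```javascript
// Hypothetical helper: serialises a value as a JavaScript literal that is
// safe to place inside an inline <script> element.
function toInlineScriptLiteral(value) {
  // JSON.stringify adds quotes and escapes; replacing "<" with its
  // \u003C escape removes </script>, <script and <!-- from the source text.
  return JSON.stringify(value).replace(/</g, "\\u003C");
}

const literal = toInlineScriptLiteral("<script></script>");
console.log(literal);             // "\u003Cscript>\u003C/script>"
console.log(JSON.parse(literal)); // <script></script>
```

The emitted text contains no < at all, yet evaluates to the original string.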

Wordpress & Javascript: String variable having html tags being read by browser with newline character

I have gone crazy trying to resolve this issue.
In my JavaScript code I am defining a string variable that holds an HTML table as a string, i.e.:
var tData="<table><tbody><tr><a><th>Type</th><th>Score</th><th>Percentile</th></a></tr><tr><td><a>Overall</a></td><td>2.4</td><td>50%</td></tr><tr><td><a>Best 100</a></td><td>2.3</td><td>70%</td></tr></tbody></table>";
Now this variable assignment through the string is being read by my browser (both chrome and firefox) as an HTML code with line breaks. Take a look at the image below for more clarity.
The code works fine if I remove the HTML tags and write a simple string, so I can assure you there are no earlier quote errors (I checked multiple times) and no bogus characters.
I have spent too many hours on this issue. Please please help me on this.
EDIT
Added Wordpress in title and Tags as this is a wordpress issue.

Since your document is XHTML, you have to enclose your code into a CDATA section:
<script>
<![CDATA[
// code here
]]>
</script>
This prevents the browser from interpreting <...> sequences in the content as tags.

If you want multiline strings in JavaScript, you have to escape the newline, i.e.
var str = "abc\
de";
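A short sketch comparing the backslash continuation with two alternatives, a template literal and an array join; the variable names are illustrative. Note that the continuation does not put a line break into the value, whereas the other two do.

```javascript
// Backslash-newline continues the literal; no line break ends up in the value.
const continued = "abc\
de";

// A template literal keeps the line break as part of the value.
const template = `abc
de`;

// Joining an array of lines makes the separator explicit.
const joined = ["abc", "de"].join("\n");

console.log(continued === "abcde"); // true
console.log(template === joined);   // true
```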

Ok. Eureka!!!
I found a workaround. I broke the following string:
var tData="<table><tbody><tr><a><th>Type</th><th>Score</th><th>Percentile</th></a></tr><tr><td><a>Overall</a></td><td>2.4</td><td>50%</td></tr><tr><td><a>Best 100</a></td><td>2.3</td><td>70%</td></tr></tbody></table>";
into
var tData = "<tab"+"le><tb"+"ody><t"+"r><a><t"+"h>Type</t"+"h><t"+"h>Score</t"+"h><t"+"h>Percentile</t"+"h></a></t"+"r><t"+"r><t"+"d><a>Overall</a></t"+"d><t"+"d>2.4</t"+"d><t"+"d>50%</t"+"d></t"+"r><t"+"r><t"+"d><a>Best 100</a></t"+"d><t"+"d>2.3</t"+"d><t"+"d>70%</t"+"d></t"+"r></tbo"+"dy></ta"+"ble>";
to fool the browser. I am still hoping for a better answer please.

Delete all invisible characters (whitespace) around that area,
then give it another try.
Try this:
var tData="<table><tbody>";
tData+="<tr><th><a>Type</a></th><th>Score</th><th>Percentile</th></tr>";
tData+="<tr><td><a>Overall</a></td><td>2.4</td><td>50%</td></tr>";
tData+="<tr><td><a>Best 100</a></td><td>2.3</td><td>70%</td></tr>";
tData+="</tbody></table>";
Possible Duplicate No visible cause for "Unexpected token ILLEGAL"
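Another way to keep long markup strings out of the source is to build them from data. The sketch below assumes the script lives in an external .js file, where the page's markup parser never sees it; the function name buildTable and the rows array are hypothetical, and the <a> wrappers from the original string are left out for brevity.

```javascript
// Hypothetical data: one inner array per table row.
const rows = [
  ["Overall", "2.4", "50%"],
  ["Best 100", "2.3", "70%"],
];

// Hypothetical helper: builds the table markup from header labels and rows.
function buildTable(headers, rows) {
  const head = "<tr>" + headers.map((h) => "<th>" + h + "</th>").join("") + "</tr>";
  const body = rows
    .map((cells) => "<tr>" + cells.map((c) => "<td>" + c + "</td>").join("") + "</tr>")
    .join("");
  return "<table><tbody>" + head + body + "</tbody></table>";
}

const tData = buildTable(["Type", "Score", "Percentile"], rows);
console.log(tData);
```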

Why the backward slash in the parameter element of the JavaScript object?

I was inspecting this site in firebug. Inside the third <script/> tag in the head section of the page , I found an object variable declared in the following way ( truncated here however by me) :
var EM={
"ajaxurl":"http:\/\/ipsos.com.au\/wp-admin\/admin-ajax.php",
"bookingajaxurl":"http:\/\/ipsos.com.au\/wp-admin\/admin-ajax.php",
"locationajaxurl":"http:\/\/ipsos.com.au\/wp-admin\/admin-ajax.php?action=locations_search",
"firstDay":"1","locale":"en"};
The utility of the variable is unknown to me. What struck me is the 3 urls presented there. Why are the backward slashes present there? Couldn't it be something like :
"ajaxurl" : "http://ipsos.com.au/wp-admin/admin-ajax.php"
?

In a script element there are various character sequences (depending on the version of HTML) that will terminate the element. </script> will always do this.
<\/script> will not.
Escaping / characters will not change the meaning of the JS, but will prevent any such HTML from ending the script.
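A small check, as a sketch, that the escaped and unescaped forms are the same value to JavaScript and to a JSON parser. The URL is the one from the question; the variable names are illustrative.

```javascript
// Inside a JavaScript string literal, "\/" is just "/".
const escapedUrl = "http:\/\/ipsos.com.au\/wp-admin\/admin-ajax.php";
const plainUrl = "http://ipsos.com.au/wp-admin/admin-ajax.php";
console.log(escapedUrl === plainUrl); // true

// The same holds for JSON text, where \/ is a defined escape for "/".
const parsed = JSON.parse('{"ajaxurl":"http:\\/\\/ipsos.com.au\\/wp-admin\\/admin-ajax.php"}');
console.log(parsed.ajaxurl === plainUrl); // true
```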

The \/\/ is there to avoid the following scenario:
the URL looks something like "ajaxurl" : "http://google.com/search?q=</script>"
Try pasting that URL into the browser's address bar; it is handled correctly there. Inside a script, without the escaping, you might end up with script errors and the page might not work as you expect.
Imagine DOM manipulators copying the value as-is into the src attribute of a script tag, and the JavaScript engine then reporting multiple errors because the referenced script could not be loaded due to the incorrectly defined src value.
Hope this helps.
Life would be hectic without these lil things

It is used to escape characters.
The backslash (\) can be used to insert apostrophes, new lines, quotes, and other special characters into a string.
var str = " Hello "World" !! ";
alert(str)
This won't work.
You have to escape the quotes first:
var str = " Hello \"World\" !! ";
alert(str); // This works
In terms of JavaScript, / and \/ are identical inside a string. As far as HTML is concerned, </ starts an end tag but <\/ does not.
