How to solve error while parsing HTML


I'm trying to get the elements from a web page in a Google spreadsheet using:
function pegarAsCoisas() {
var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
var elements = XmlService.parse(html);
}
However, I keep getting the error:
Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")
How do I solve this? I want to get the H1 text from this site, but for other sites I'll have to select other elements.
I know the method XmlService.parse(html) works for other sites, like Wikipedia, as you can see here.

HTML isn't XML, so you shouldn't expect XmlService to parse it. For a simple extraction like this one, you can use string methods instead:
function pegarAsCoisas() {
var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
var html = urlFetchReturn.getContentText();
Logger.log('html.length: ' + html.length);
var index_OfH1 = html.indexOf('<h1');
var endingH1 = html.indexOf('</h1>');
Logger.log('index_OfH1: ' + index_OfH1);
Logger.log('endingH1: ' + endingH1);
var h1Element = html.slice(index_OfH1, endingH1);
var h1Content = h1Element.slice(h1Element.indexOf(">")+1);
Logger.log('h1Content: ' + h1Content);
};
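The same string-method idea can be wrapped in a small reusable function. This is only a sketch: extractFirstTagText is a hypothetical helper name, not part of Apps Script, and it assumes lowercase tag names and a single, non-nested occurrence of the tag. It returns null rather than a wrong slice when the tag is missing.

```javascript
// Hypothetical helper (not an Apps Script API): returns the text between
// the first <tagName ...> and the next </tagName>, or null if either is missing.
function extractFirstTagText(html, tagName) {
  var openStart = html.indexOf('<' + tagName);
  if (openStart === -1) return null;
  // End of the opening tag, skipping over any attributes.
  var openEnd = html.indexOf('>', openStart);
  if (openEnd === -1) return null;
  var closeStart = html.indexOf('</' + tagName + '>', openEnd);
  if (closeStart === -1) return null;
  return html.slice(openEnd + 1, closeStart).trim();
}
```

With the page from the question this would be called as extractFirstTagText(html, 'h1'). Any markup nested inside the element is returned as-is, so further cleanup may be needed on other sites.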

XmlService only works with well-formed XML content; it is not error tolerant. Google Apps Script used to have a more tolerant service called Xml, but it was deprecated. When this answer was written it still worked, and you could use it instead as explained here: GAS-XML

Technically, HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?
Regarding the OP's code, the following works just fine:
function pegarAsCoisas() {
var html = UrlFetchApp
.fetch('http://www.saosilvestre.com.br')
.getContentText();
Logger.log(html);
}
As the previous answers said, XmlService should not be applied directly to the text returned by UrlFetchApp. Instead, you could first convert the page source from HTML to XHTML so that XmlService can parse it, use the older Xml service, which could work directly with HTML pages, or handle the page source directly as text.
Related questions:
How to parse an HTML string in Google Apps Script without using XmlService?
What is the best way to parse html in google apps script

Try replacing itemscope with itemscope = '':
function pegarAsCoisas() {
var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
html = html.replace("itemscope", "itemscope = ''");
var elements = XmlService.parse(html);
}
For more information, look here.
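A string replace like the one above only touches the first occurrence. If the attribute appears several times, a regular expression can give every bare occurrence an empty value. The sketch below is an illustration, not a general HTML-to-XML converter: addEmptyValue is a hypothetical helper name, it assumes the attribute name contains no regex special characters, and it would also rewrite the same word appearing in ordinary page text. Other HTML-only constructs (unclosed tags, undefined entities) can still make XmlService.parse fail.

```javascript
// Hypothetical helper: turns a bare attribute such as `itemscope`
// into `itemscope=""` everywhere it appears without a value.
function addEmptyValue(html, attrName) {
  var bare = new RegExp(
    '(\\s' + attrName + ')' + // the attribute, preceded by whitespace
    '(?!\\s*=)' +             // not already followed by "="
    '(?=[\\s/>])',            // followed by whitespace, "/" or ">"
    'g'
  );
  return html.replace(bare, '$1=""');
}
```

In the question's function this would be used as html = addEmptyValue(html, 'itemscope') before calling XmlService.parse(html).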

Related

node_xslt documentation or examples not available

Hello,
I am new to XML and XSL, but I want to use node_xslt to transform XML documents.
I have seen the npm page for it, https://www.npmjs.com/package/node_xslt, and https://developer.mozilla.org/en-US/docs/XSLT/PI_Parameters, but neither contains any examples. Can someone point me to links or examples that would help me understand it?

Thank you very much, I have found the answer to my question.
The link http://www.tutorialspoint.com/xslt/xslt_syntax.htm is very useful.
I only needed to do the following:
var xslt = require('node_xslt');
var document = xslt.readXmlFile('path of xml file');
var stylesheet = xslt.readXsltFile('path of xslt file');
var transformedString = xslt.transform(stylesheet, document, []);
This produces the HTML from the XML and the XSLT. You can check the result by printing it:
console.log("transformedString>>>>" + transformedString);

How to scrape embedded JSON using PhantomJS

I need to get a particular piece of data from a JSON string encoded within a script tag within a returned HTML document using phantomjs. The HTML looks basically like this:
... [preamble html tags etc.]
....
<script id="ine-data" type="application/json">
{"userData": {"account_owner": "Grib"},
"skey":"b207ff1f8d5a394c2f7af1681ad3470c",
"location": "EU"
</script>
<script id="notification-data" type="application/json">
... [other stuff including html body]
What I need to get to is the value for skey within the JSON. I am unable to use the selectors to even get to the script. For instance,
page.open('https://www.site1.com/dash', function(status) {
var ine_data = document.querySelectorAll('script').item(0);
console.log(ine_data); phantom.exit();
});
This returns null. Can anyone point me in the right direction please?

The PhantomJS function you're looking for is called page.evaluate (documentation). It allows you to run javascript sandboxed within the javascript environment of the browser itself.
So following your example:
page.open('https://www.site1.com/dash', function(status) {
var ske = page.evaluate(function() {
var json_text = document.querySelector("#ine-data").innerHTML,
json_values = JSON.parse(json_text);
return json_values.skey;
});
console.log(ske)
phantom.exit();
});
Though I'd note that the JSON in your example is invalid (missing a trailing }), so my example won't work without fixing that first!
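Because the embedded JSON may be malformed, as in the question's sample, it can help to guard the parse step. The following is a minimal sketch: readSkey is a hypothetical helper name, and it returns null instead of throwing when the text is not valid JSON or has no string skey. Since page.evaluate runs its callback in the page's sandbox, a helper like this would have to be defined inside that callback rather than in the outer PhantomJS script.

```javascript
// Hypothetical helper: returns the "skey" value from a JSON string,
// or null if the text is not valid JSON or has no string "skey".
function readSkey(jsonText) {
  try {
    var data = JSON.parse(jsonText);
    return (data && typeof data.skey === 'string') ? data.skey : null;
  } catch (e) {
    // JSON.parse throws a SyntaxError on malformed input.
    return null;
  }
}
```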

Workaround for xml data islands

I recently inherited a huge webapp which is a combination of JSP, JavaScript and Java. It works only on IE because it was coded using XML data islands and other things that prevent it from working smoothly in other browsers. Everything was fine until a few days ago, when Windows 7 boxes were rolled out to a few users and IE9/10 turned out to have trouble with some of the JavaScript in the application. For example, the following data island is a snippet from my HTML page.
<xml id = "underlyingdata" ondatasetcomplete="window.dialogArguments.parent.repopulateDropDown(this, underlyingdd)">
</xml>
<xml id="termdata" ondatasetcomplete="window.dialogArguments.parent.repopulateDropDown(this, termdd)">
</xml>
On this page there is another line of code
window.dialogArguments.parent.request(underlyingdata, "CONTRACT.LIST.WB", "PULP AND PAPER|" + instrumentdd.options[instrumentdd.selectedIndex].text);
that calls a function which is as follows
function request(xmldataisland, requestmethod, parameters)
{
var screwcache=Math.round(Math.random()*10000);
xmldataisland.value=null;
xmldataisland.load("/webaccess/Request?sessionid=" + sessionid + "&request=" + requestmethod + "&parameters=" + parameters+"&screwcache="+screwcache);
}
This fails in IE9/10 with the error that 'load' is not a valid method (Script 438 error) on the 'xmldataisland' object, whereas it works perfectly fine on IE 5 to IE 8.
I believe the xmldataisland object in the above function is of type XMLDocument. Why does the load method fail? What is the workaround for this? I read and hear from many sources that using data islands is a terrible idea. What would be the correct alternative for this in that case?

From IE10 onwards, XML data islands are no longer supported - the browser parses them as HTML. The Mozilla Developer Network have produced an article which gives a cross-browser alternative to XML data islands, namely an HTML5 "data block". The article demonstrates that a <script> element can be used as a data block if the src attribute is omitted and the type attribute does not specify an executable script type. You must also ensure that the embedded XML content doesn't include a </script> tag.
Source: https://developer.mozilla.org/en/docs/Using_XML_Data_Islands_in_Mozilla
Here's the HTML example they give...
<!DOCTYPE html>
<html>
<head>
<title>XML Data Block Demo</title>
<!-- this is the data block which contains the XML data -->
<script id="purchase-order" type="application/xml">
<purchaseOrder xmlns="http://example.mozilla.org/PurchaseOrderML">
<lineItem>
<name>Line Item 1</name>
<price>1.25</price>
</lineItem>
<lineItem>
<name>Line Item 2</name>
<price>2.48</price>
</lineItem>
</purchaseOrder>
</script>
<script>
function runDemo() {
// the raw XML data can be retrieved using this...
var orderSource = document.getElementById("purchase-order").textContent;
// the XML data can be parsed into a DOM tree using the DOMParser API...
var parser = new DOMParser();
var doc = parser.parseFromString(orderSource, "application/xml");
var lineItems = doc.getElementsByTagNameNS("http://example.mozilla.org/PurchaseOrderML", "lineItem");
var firstPrice = lineItems[0].getElementsByTagNameNS("http://example.mozilla.org/PurchaseOrderML", "price")[0].textContent;
document.body.textContent = "The purchase order contains " + lineItems.length + " line items. The price of the first line item is " + firstPrice + ".";
}
</script>
</head>
<body onload="runDemo()">
Demo did not run
</body>
</html>
You will need to write your own method to load data into that block (perhaps using jQuery.get or .load methods).
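As a starting point for such a loader, here is a rough sketch using plain XMLHttpRequest instead of jQuery. The function names buildRequestUrl and loadXml are hypothetical, the URL layout is copied from the question's request function, and the sketch assumes the server decodes standard percent-encoding and sends the response with an XML content type (otherwise responseXML is null).

```javascript
// Hypothetical helper: builds the request URL from the question,
// percent-encoding each value so spaces and "|" survive the trip.
function buildRequestUrl(sessionId, requestMethod, parameters, cacheBuster) {
  return '/webaccess/Request' +
    '?sessionid=' + encodeURIComponent(sessionId) +
    '&request=' + encodeURIComponent(requestMethod) +
    '&parameters=' + encodeURIComponent(parameters) +
    '&screwcache=' + encodeURIComponent(cacheBuster);
}

// Hypothetical browser-only helper: fetches the URL and passes the
// parsed XML document (or null) to the callback.
function loadXml(url, onLoaded) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) {
      onLoaded(xhr.status === 200 ? xhr.responseXML : null);
    }
  };
  xhr.send();
}
```

The callback would then take over the role of the old ondatasetcomplete handler, for example by calling repopulateDropDown with the returned document.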
Hope that helps!

4.5 years after the accepted answer was given, an update is possible using HTML5. The script tag is not required. HTML5 parsers accept element names they do not recognize and treat them as generic elements, so you can make up your own tag names. You can create an "xml island" for all browsers like so:
<parms style="display:none;">
<op>
encrypt
</op>
<msg>
Hello World!
</msg>
</parms>
And do something like this to use it.
var xml = document.querySelector("parms");
var op = xml.querySelector("op").textContent;
var msg = xml.querySelector("msg").textContent;
You could also use a div (display:none) as a container for XML that is fetched using XHR. Return the XML from the server with MIME type text/xml and use .innerHTML to put the response text into the div. Note that the content is then parsed by the HTML parser rather than as XML.

JavaScript runtime error: Unable to add dynamic content

I'm making a javascript metro app and have some code like this:
<script>
document.writeln(foo());//this line is trouble
</script>
and when I tried to run, it gave me a rather long error:
Unhandled exception at line 20, column 9 in ms-appx://a375ffac-3b69-475a-bd53-ee3c1ccf4c4e/default.html
0x800c001c - JavaScript runtime error: Unable to add dynamic content.
A script attempted to inject dynamic content, or elements previously
modified dynamically, that might be unsafe. For example, using the
innerHTML property to add script or malformed HTML will generate this
exception. Use the toStaticHTML method to filter dynamic content, or
explicitly create elements and attributes with a method such as
createElement. For more information, see
http://go.microsoft.com/fwlink/?LinkID=247104.
How can I get around this?

Windows 8 restricts the content you can set through innerHTML and writeln, because it's considered unsafe.
The correct way to add content is:
// The untrusted data contains unsafe dynamic content
var unTrustedData = "<img src='http://www.contoso.com/logo.jpg' on-click='calltoUnsafeCode();'/>";
// Safe dynamic content can be added to the DOM without introducing errors
var safeData = window.toStaticHTML(unTrustedData);
// The content of the data is now
// "<img src='http://www.contoso.com/logo.jpg'/>"
// and is safe to add because it was filtered
document.write(safeData);
If your content includes JavaScript, you can use this function (but Microsoft doesn't recommend it):
MSApp.execUnsafeLocalFunction(function() {
var body = document.getElementsByTagName('body')[0];
body.innerHTML = '<div style="color:' + textColor + '">example</div>';
});
See:
http://msdn.microsoft.com/en-us/library/windows/apps/Hh767331.aspx
For your case:
MSApp.execUnsafeLocalFunction(function() {
document.writeln(foo());
});
Note that you should only do this if you understand your content is safe; if you don't, I would recommend using the toStaticHTML method.

According to the docs, I would try:
document.writeln(window.toStaticHTML(foo()));

Windows 8 Store apps restrict placing dynamic content inside the innerHTML property. To fix this, include the winstore-jscompat.js file in your page as the first script reference. Please see this link to learn more about winstore-jscompat.
This file is not required on Windows 10.

PhantomJS create page from string

Is it possible to create a page from a string?
example:
html = '<html><body>blah blah blah</body></html>'
page.open(html, function(status) {
// do something
});
I have already tried the above with no luck....
Also, I think it's worth mentioning that I'm using nodejs with phantomjs-node(https://github.com/sgentle/phantomjs-node)
Thanks!

It's very simple, take a look at the colorwheel.js example.
var page = require('webpage').create();
page.content = '<html><body><p>Hello world</p></body></html>';
That's all! Then you can manipulate the page, e.g. render it as an image.

To do this you need to set the page content to your string.
phantom.create(function (ph) {
ph.createPage(function (page) {
page.set('viewportSize', {width:1440,height:900})
//like this
page.set('content', html);
page.render(path_to_pdf, function() {
//now pdf is written to disk.
ph.exit();
});
});
});
You need to use page.set() to set the HTML content, as per https://github.com/sgentle/phantomjs-node#functionality-details:
Properties can't be get/set directly.
Instead use page.get('version', callback) or page.set('viewportSize', {width:640,height:480}), etc.
Nested objects can be accessed by including dots in keys, such as
page.set('settings.loadImages', false)

Looking at the PhantomJS API, page.open requires a URL as the first argument, not an HTML string. This is why what you tried does not work.
However, one way you might be able to achieve the effect of creating a page from a string is to host an empty "skeleton page" somewhere with a URL (could be localhost), and then include JavaScript (using includeJs) in the empty page. The JavaScript that you include in the blank page can use document.write("<p>blah blah blah</p>") to dynamically add content to the webpage.
I've never done this, but AFAIK this should work.
Sample skeleton page:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head></head>
<body></body>
</html>

Just wanted to mention that I recently had a similar need and discovered that I could pass file:// style references as the URL parameter. So I dumped my HTML string into a local file, then passed the full path to my capture script (django_screamshot), which basically uses casperjs and phantomjs plus a capture.js script.
Anyway, it just works and it's reasonably fast.

I got the following to work in PhantomJS version 2.0.0. Whereas before, I was using page.open() to open a page from the filesystem and set a callback:
page.open("bench.html", pageLoadCallback);
Now, I accomplish the same thing from a string variable with the HTML page. The page.setContent() method requires a URL as the second argument, and this uses fs.absolute() to construct a file:// URL.
page.onLoadFinished = pageLoadCallback;
page.setContent(bench_str, "file://" + fs.absolute(".") + "/bench.html");
