I'm using NSXMLParser in an iPhone app to parse HTML files for an RSS or Atom feed link.
Everything works fine until the parser finds a <script> element that contains JavaScript code without a CDATA declaration, which causes a parse error.
Is it possible to tell the parser to skip all elements named <script>?
Why not just implement parser:parseErrorOccurred: and tell it to fail gracefully? I don't believe there's a way to say 'skip this element'.
It's not possible, to my knowledge, to just skip an element. However, you may be able to use a regex replacement to filter out the invalid content before parsing.
Another possibility would be to use Tidy to try to clean the markup up before parsing.
I am sorry to create a topic about this, but this little thing has been boggling my brain for the past two hours. Chrome returns the right element by XPath as well as by JavaScript, but Selenium tells me that the very code Chrome runs perfectly fine contains an error:
javascript error: missing ) after argument list
This is the code I am currently trying:
driver.execute_script('let clickable = document.evaluate("//a[contains(@onclick,\"openFbLWin\")]", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue; clickable.click();')
I know it's a bit messy, but the most important part is //a[contains(@onclick,\"openFbLWin\")] as the XPath selector.
I think this is because when you run this in Python, it converts "//a[contains(@onclick,\"openFbLWin\")]" into "//a[contains(@onclick,"openFbLWin")]", without the backslashes. Then, when this is run in JavaScript, the string can't be parsed because there is a double quote inside another double quote. To fix this, change your XPath to "//a[contains(@onclick,'openFbLWin')]".
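For reference, here is the JavaScript that ends up executing once the quoting is fixed (openFbLWin is the handler name taken from the question):

let clickable = document.evaluate(
    "//a[contains(@onclick,'openFbLWin')]",   // single quotes inside, double quotes outside
    document,
    null,
    XPathResult.FIRST_ORDERED_NODE_TYPE,
    null
).singleNodeValue;
clickable.click();

Passed to execute_script in Python, this version needs no backslash escaping at all.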
You should use Selenium's built-in search and click tools instead, though; they are more readable and faster (use find_element_by_xpath and click).
I want to integrate LinkedIn sharing on my page.
Reading the documentation that LinkedIn provides here:
https://developer.linkedin.com/docs/getting-started-js-sdk
.. I was surprised to see they require this script tag in the head section of my page:
<script type="text/javascript" src="//platform.linkedin.com/in.js">
api_key: [API_KEY]
onLoad: [ONLOAD]
authorize: [AUTHORIZE]
lang: [LANG_LOCALE]
</script>
I don't get what is happening here. First of all, W3Schools says: "Note: If the 'src' attribute is present, the element must be empty." (https://www.w3schools.com/tags/tag_script.asp)
I also went here: https://html.spec.whatwg.org/multipage/scripting.html#the-script-element
(I'm not 100% sure how authoritative this is... but it looks authoritative based on the format and length :P). It also says there that if there's a src attribute, the body should basically be empty. In any case, LinkedIn's script syntax is not explained by either of these two resources.
So does anyone know what's up with the script body syntax? Are those JS labels? And if so, I don't get how they're used. I thought labels are used with continue/break statements, to get out of loops. I don't understand how LinkedIn's API can get information from me if I provide it in that syntax.
Is the script body somehow fed to the script, and it parses it itself?
Can someone please explain to me what's going on?
Thanks!
What you said is correct. When the src attribute is added, the body of the script is not executed. There is a way to get around this, however: retrieve the script tag, extract its innerHTML, and use eval on it. You will, of course, need to do that on document ready.
I don't know how LinkedIn does it exactly, but HTML standards didn't change for them, nor did the order of loading, so either they use something similar to what I explained or some cleverer way of parsing the body of the script.
Other notes to consider: instead of using the document ready event, you could build that into your library. That is, retrieve the last available script tag and extract its body at the time your library loads; it will be the last element loaded at that point either way, so you should be able to retrieve the code without using any events. (That would need testing, but DOM elements are parsed synchronously, top-down.)
Obviously using eval is not recommended (it's quite slow), but it definitely provides the functionality you're looking for.
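A rough sketch of that idea, assuming the script body holds simple "key: value" lines (this is a guess at the mechanism, not LinkedIn's actual code; in.js may well do something different):

// Read the body of the currently executing <script> tag and parse it
// as configuration lines instead of eval-ing it.
var scripts = document.getElementsByTagName('script');
var current = scripts[scripts.length - 1];   // at load time, the last script tag is the running one
var config = {};
current.innerHTML.split('\n').forEach(function (line) {
    var sep = line.indexOf(':');
    if (sep > -1) {
        config[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
    }
});
// config.api_key, config.onLoad, etc. are now available to the library code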
PS. Forgive any formatting errors. I'm typing this from my mobile, 2k miles away from home. Otherwise I'd be more than happy to even provide some sample code snippets and do the above testing myself.
I'm trying to make simple templating for users on a site. I have a test line like this:
<div id="test">Test</div>
It will alert the HTML properly with the following JS in all browsers except FF:
alert( document.getElementById( 'test' ).innerHTML );
In FF it will change the curly braces to their URL-encoded versions. I don't want to just URL-decode, in case the user enters HTML containing an actual URL instead of one of the templated ones. Any ideas to solve this, other than regexing the return value?
My fiddle is here: http://jsfiddle.net/davestein/ppWkT/
EDIT
Since it's seemingly impossible to avoid the difference in FF, and we're still early in development, we are just going to switch to using [] instead of {}. Marking @Quentin as the correct answer since it's what I'm going by.
When you get the innerHTML of something, you get a serialised representation of the DOM, which will include any error recovery or replacement of constructs with equivalents that the browser performs.
There is no way to get the original source from the DOM.
If your code won't contain %xx elsewhere, you can just run it through unescape().
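For example, with a hypothetical template link like <a href="{url}"> inside the test div:

var html = document.getElementById('test').innerHTML;
// Firefox serialises the href as "%7Burl%7D"; decode the braces back:
var restored = unescape(html);
// Only safe if genuine user-entered URLs never contain literal %xx sequences.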
I wish to develop some kind of external API that will involve users putting nonstandard tags on their pages (which I will then replace with the correct HTML). For example:
<body>
...
...
<LMS:comments></LMS:comments>
...
...
...
</body>
How can I target and replace the <LMS:comments></LMS:comments> part?
Just use getElementsByTagName as usual to get the element.
You cannot change a tag's name; you will have to replace the entire element.
See http://jsfiddle.net/2vcjm/
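Something along these lines (the div wrapper and class name are placeholders, not anything the question prescribes):

var custom = document.getElementsByTagName('LMS:comments');
var list = Array.prototype.slice.call(custom);   // static copy; the live list shrinks as we replace
for (var i = 0; i < list.length; i++) {
    var el = list[i];
    var div = document.createElement('div');
    div.className = 'lms-comments';              // placeholder class name
    div.innerHTML = el.innerHTML;                // keep any existing content
    el.parentNode.replaceChild(div, el);
}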
You want to use regular expressions.
Take a look at this page to get started:
http://www.regular-expressions.info/brackets.html
That whole website is a great reference.
If your document is valid XHTML (as opposed to just HTML), you can use XSLT to parse it.
There are JavaScript XSLT libraries, such as Google's AJAXSLT.
Barring that, you will need to extract the relevant part of the DOM, take the value of innerHTML for the contents, and replace the custom tags using a JavaScript regex and the replace() function.
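A minimal sketch of that regex-and-replace() route (the replacement markup is invented for illustration; note that rewriting innerHTML wholesale discards any event handlers bound inside it):

document.body.innerHTML = document.body.innerHTML.replace(
    /<LMS:comments>([\s\S]*?)<\/LMS:comments>/gi,
    '<div class="lms-comments">$1</div>'   // whatever the "correct HTML" should be
);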
However, this sort of processing is usually done server-side, by passing your custom "HTML+" through some sort of templating/enrichment engine (which will also use XSLT, HTML parsers or, worst case, regexes).
I want to write a web application that allows users to enter any HTML that can occur inside a <div> element. This HTML will then end up being displayed to other users, so I want to make sure that the site doesn't open people up to XSS attacks.
Is there a nice library in Python that will clean out all the event handler attributes, <script> elements, and other JavaScript cruft from HTML or a DOM tree?
I intend to use Beautiful Soup to regularize the HTML and make sure it doesn't contain unclosed tags and such. But, as far as I can tell, it has no pre-packaged way to strip all JavaScript.
If there is a nice library in some other language, that might also work, but I would really prefer Python.
I've done a bunch of Google searching and hunted around on pypi, but haven't been able to find anything obvious.
Related: Sanitising user input using Python
As Klaus mentions, the clear consensus in the community is to use BeautifulSoup for these tasks:
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3; in bs4 this is: from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
for script_elt in soup.findAll('script'):
    script_elt.extract()   # remove the <script> element from the tree entirely
html = str(soup)
A whitelist approach to allowed tags, attributes, and their values is the only reliable way. Take a look at Recipe 496942: Cross-site scripting (XSS) defense.
What is wrong with existing markup languages such as the one used on this very site?
You could use BeautifulSoup. It allows you to traverse the markup structure fairly easily, even if it's not well-formed. I don't know of anything made to order that works only on script tags.
I would honestly look at using something like BBCode or some other alternative markup with it.
Eric,
Have you thought about using a SAX-type parser for the HTML? I'm really not sure, though, that it would ignore the events properly. It would also be a bit harder to construct than using something like Beautiful Soup. Handling syntax errors may be a problem with SAX as well.
What I like to do in situations like this is construct Python objects (subclassed from an XML_Element class) from the parsed HTML, then remove any undesired objects from the tree, and finally re-serialize the objects back to HTML. It's not all that hard in Python.
Regards,