I'm using Mechanize, although I'm open to Nokogiri if Mechanize can't do it.
I'd like to scrape the page after all the scripts have loaded as opposed to beforehand.
How might I do this?
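I think a good option is something like this with Nokogiri, Watir, and PhantomJS:
require 'watir'
require 'nokogiri'
b = Watir::Browser.new(:phantomjs)  # headless browser that runs the page's JavaScript
b.goto URL                          # URL is the address of the page to scrape
doc = Nokogiri::HTML(b.html)        # parse the DOM as it stands after the scripts have run
The resulting doc reflects the page after its scripts have run. PhantomJS is a good fit because it is headless, so no browser window has to open.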
Nokogiri and Mechanize are not full web browsers and do not run JavaScript in a browser-model DOM. You want to use something like Watir or Selenium which allow you to use Ruby to control an actual web browser.
In addition to watir-webdriver and capybara-webkit, Celerity is a good option, although it is JRuby-only.
I don't know anything about Mechanize or Nokogiri, so I can't comment specifically on them. However, getting the HTML after JavaScript has modified it is a problem that I believe can only be solved with more JavaScript. To get the newly generated HTML you would need to read the .innerHTML of the document element. This can be tricky, since you would have to inject JavaScript into the page.
The only way I know of to accomplish this is to write a Firefox plugin. With a plugin you can run JavaScript on a page even though it's not your page. Sorry I'm not more help; I hope this puts you on the right path.
If you're interested in plug-ins, this is one place to start: http://anthonystechblog.wordpress.com/category/internet/firefox/
I need to make changes to an existing project that uses iFrames to dynamically load external html files. However, the html files are part of the same project, not external sites. If I'm not mistaken, iFrames are considered a terrible way of loading html content unless they are used to actually display external sites.
I have looked into web components but apparently, browser support is still spotty and unfortunately, I need to support IE9.
I know that the jQuery load() method can accomplish this, but in my online research it doesn't often come up as a proper way of loading external HTML in general, or as a proper replacement for iframes in particular.
Is there a reason why jQuery shouldn't be used here, and are there better, established ways of doing this? For example, I once saw a framework that dynamically built the interface out of separate "partials", but I don't remember which framework that was.
It depends on the HTML.
If it's built like a full page, then iframes are actually a decent solution. Iframes with the same origin also give the parent full control over the content while still isolating CSS and JS variables, which is pretty convenient.
If not, jQuery.load() will do the trick. You can also do it manually, of course, but if you already have jQuery in your project, just use it.
The load() function is almost always the best way to go. If you are encountering a specific issue with it, maybe you can share it?
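For reference, here is a minimal sketch of what using load() to pull a same-project HTML fragment into the page might look like. The helper name loadPartial, the selector, and the file path are all hypothetical; the only jQuery API assumed is .load(url, complete), whose callback receives the response text and a status string. It is written in plain ES5 so it would also run in IE9.

```javascript
// Hypothetical helper: load an HTML fragment into the element matched by
// `selector`. `$` is the jQuery function, passed in explicitly.
function loadPartial($, selector, url, onDone) {
  $(selector).load(url, function (responseText, textStatus) {
    // jQuery reports "error" in textStatus when the request fails.
    var failed = textStatus === 'error';
    if (onDone) {
      onDone(failed ? new Error('Could not load ' + url) : null);
    }
  });
}
```

On a real page this would be called as something like loadPartial(jQuery, '#content', 'partials/menu.html', callback). Unlike an iframe, the fragment becomes part of the parent document, so it shares the parent's CSS and script scope.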
The story is this: I'd like to make some changes to my Blogger template, but I don't want my readers to see what I'm doing while I work on it.
So I was thinking about using Greasemonkey or some kind of Firefox add-on to replace all the code in the template just for the current session and my browser, since I'm the one using the script/add-on.
Is there a way to replace all the HTML visible in Blogger's Template Edit HTML with HTML I provide in a script?
Thanks!
It is always better to create a test blog for experimenting with designs. This is what most webmasters do. You may also use an offline editor such as Adobe Dreamweaver.
I want to develop an extension that works on scripts coming from an HTTP response. I know that the whole HTML document first goes to the rendering engine inside the browser, where it is parsed to create a DOM tree. Any script embedded inside is passed to the JavaScript engine. (Correct me if I am wrong. :) )
So I want to intercept the JavaScript code before it is sent to the JavaScript engine in order to modify it accordingly.
Are there any APIs for Mozilla Firefox which would allow me to do this? How can I do it?
While doing some stuff I stumbled across this:
https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/NsITraceableChannel?redirectlocale=en-US&redirectslug=NsITraceableChannel
This allows you to modify content before it is parsed. See this topic:
http://forums.mozillazine.org/viewtopic.php?f=19&t=2800541
Here is a working example of getting the content before it is shown to the user. It doesn't change it, though; that's what I'm asking about in the mozillazine topic. The writeBytes call should modify it. Once you figure it out, please share, as I'm interested as well.
https://github.com/Noitidart/demo-nsITraceableChannel
You can follow this answer on how to intercept each request and modify before sending it to the page itself. You can do transpilation or whatever you'd like there.
Take a look at this guy's add-on code. He does exactly what you are looking for:
https://addons.mozilla.org/en-US/firefox/addon/javascript-deminifier/
You can try to step in before the HTML is parsed, take all the tags, work with them, and put them back.
...I wanted to intercept these javascript code before Javascript Engine and modify them accordingly. Is there any APIs for mozilla firefox? How can I do it?
You can use page-mod from the Add-on SDK with contentScriptWhen: "start".
Then, after completely preventing the document from being parsed, you can fetch the same document on the side, make any modifications, and inject the resulting document into the page. Here is an answer which does just that: https://stackoverflow.com/a/36097573/6085033
I will try to summarize as best I can what I need and what is blocking me from doing it.
What I need
I need to append script tags to the head of an HTML file, BUT during my "build" process. I'm using Ant as an automation build tool, and I would like to avoid placing tokens in my HTML file and then replacing them with Ant. I would also like to avoid any midway solution using regular expression matching. What I would really like is to use plain JavaScript running in the Rhino JavaScript interpreter, execute it easily from an Ant task, and add the script tag dynamically.
What is blocking me?
I really don't know any way to load an HTML file without issuing an HTTP GET or POST. Because I'm building my code from source, I don't have it under an HTTP server, so I wish I could find some way to load the HTML DOM into a JavaScript variable and then write it back out with the new script tag that I need.
I need all the DOM manipulation features without having a browser that renders the HTML file.
Best!
Demian
From what I understand you would like to have a valid DOM object from an HTML file, as if you were running in a browser, but do it "offline"? e.g. be able to do a jQuery selector on the DOM and edit it?
You could start by looking into an embedded open-source browser (http://www.chromium.org/).
But I would look into node.js; see this question: Can I use jQuery with Node.js?
As far as I understand, this will allow you to traverse and modify the DOM without a browser.
How do I hide my JavaScript/jQuery scripts in an HTML page (from View Source on right click)? Please suggest a way to achieve this.
Thanks.
You can't hide the code. JavaScript is interpreted in the browser, so the browser must be able to download, parse, and execute it.
You may want to obfuscate/minify your code.
Recommended resources:
CompressorRater
YUI Compressor
JSMin
Keep in mind that the goal of JavaScript minification is to reduce the download size by removing comments and unnecessary whitespace from your code. Obfuscation also minifies, but it changes identifier names as well, making your code much harder to understand. In the end, though, obfuscation gives you only an illusion of privacy.
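To make the difference concrete, here is a small hand-written illustration (not the output of any of the tools listed above). The function calculateTotal and its hand-minified and hand-obfuscated variants are made up for this example; all three behave identically.

```javascript
// Original, readable source.
function calculateTotal(prices, taxRate) {
  // Sum the prices, then apply the tax rate.
  var subtotal = 0;
  for (var i = 0; i < prices.length; i++) {
    subtotal += prices[i];
  }
  return subtotal * (1 + taxRate);
}

// Minified by hand: comments and whitespace removed, names kept.
function calculateTotalMin(prices,taxRate){var subtotal=0;for(var i=0;i<prices.length;i++){subtotal+=prices[i]}return subtotal*(1+taxRate)}

// Obfuscated by hand: same logic, identifiers renamed to meaningless ones.
function a(b,c){var d=0;for(var e=0;e<b.length;e++){d+=b[e]}return d*(1+c)}
```

Even the last version can be run through a code formatter and read with a little patience, which is why obfuscation slows a reader down rather than stopping them.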
Your best bet is to either immediately delete the script tags after the dom tree is loaded, or dynamically create the script tag in your javascript.
Either way, if someone wants to use the Web developer tool or Firebug they will still see the javascript. If it is in the browser it will be seen.
One advantage of dynamically creating the script tag is that the JavaScript will not be loaded at all if JavaScript is turned off.
With the deletion approach, if I turned off JavaScript I could still see everything in the HTML, because you won't have been able to delete the script tags.
Update: If you put in <script src='...' /> then the JavaScript itself isn't shown, but the URL of the JavaScript file is, so it is just a matter of pasting that into the address bar to download the JavaScript. If you dynamically delete the script tags, they will still be in the View Source output but not in Firebug's HTML view, and if you dynamically create the tag, Firebug can see it but View Source cannot.
Unfortunately, as I mentioned Firebug can always see the javascript, so it isn't hidden from there.
The only approach I haven't tried, so I don't know what would happen, is downloading the JavaScript with an Ajax call and then running it with eval. I don't know if that would show up anywhere.
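The two DOM techniques described in this answer could be sketched roughly as follows. The function names injectScript and removeScriptTags are made up for illustration, and the document object is passed in as a parameter; in a page you would pass the global document. As the answer says, neither technique hides the code from developer tools or from anyone who fetches the file directly.

```javascript
// Create a <script> element at runtime, so the tag is absent from the
// static markup shown by "View Source".
function injectScript(doc, src) {
  var script = doc.createElement('script');
  script.src = src;
  doc.head.appendChild(script);
  return script;
}

// Remove every <script> element from the live DOM once it has run.
function removeScriptTags(doc) {
  var scripts = doc.getElementsByTagName('script');
  // The collection is live in browsers, so walk it backwards.
  for (var i = scripts.length - 1; i >= 0; i--) {
    scripts[i].parentNode.removeChild(scripts[i]);
  }
}
```

Removing a script element does not undo anything the script has already done; functions and variables it defined remain available.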
It's virtually impossible. If someone wants your source, and you include it in a page, they will get it.
You can try trapping right click and all sorts of other hokey ways, but in the end if you are running it, anyone with Firefox and a 100k download (firebug) can look at it.
You can't, sorry. No matter what you do, even if you could keep people from being able to view source, users can always use curl or a similar tool to fetch the JavaScript manually.
Try a JavaScript minifier or obfuscator if you want to make it harder for people to read your code. A minifier is a good idea anyhow, since it will make your download smaller and your page load faster. An obfuscator might provide a little bit more obfuscation, but probably isn't worth it in the end.
Firebug can display obfuscated code, curl can fetch DOM elements that were removed after load, and referrer checks can be defeated with a faked header.
The moral? Why even try to hide JavaScript? Include a short copyright notice and author information. If you want to hide it so that, say, an authentication system cannot be hacked, consider strengthening the server side so that there are no holes in the server that are closed merely through JavaScript. Headers and requests can easily be faked with curl or other tools.
If you really want to hide the JavaScript... don't use JavaScript. Use a compiled language of some sort (Java applets, Flash, ActiveX, etc.). (I wouldn't do this, though, because it is not a very good option compared to native JavaScript.)
Not possible.
If you just want to hide your business logic from the user, and not the manipulation of HTML controls on the client side, then you can use server-side programming with Ajax.
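As a rough sketch of that split, assuming a Node.js server: the pricing rule, the /quote path, and the function names below are all hypothetical. The rule runs only on the server, and the browser receives nothing but the JSON result.

```javascript
// Server-side only: this (hypothetical) pricing rule is never sent to the browser.
function quoteTotal(quantity) {
  var unitPrice = 20;
  var discount = quantity >= 10 ? 0.25 : 0;
  return quantity * unitPrice * (1 - discount);
}

// Request listener in the shape Node's http.createServer expects.
function handleQuote(req, res) {
  var url = new URL(req.url, 'http://localhost');
  var quantity = parseInt(url.searchParams.get('quantity'), 10);
  if (isNaN(quantity) || quantity < 0) {
    res.writeHead(400, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'quantity must be a non-negative integer' }));
    return;
  }
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ total: quoteTotal(quantity) }));
}
```

The listener could be served with require('http').createServer(handleQuote).listen(3000), and the page would request /quote?quantity=10 via Ajax and display the returned total. View Source then shows only the request code, not the rule.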