How do I automate opening and downloading a webpage? - javascript

There is a website that has an HTML <video> element in it when loaded; however, this element isn't present if I just download the page with wget, so I guess it gets loaded by a script that's only run when the page is opened in a browser. I need the video's direct link, in an automated fashion.
Could you please tell me if I have the right idea, and if there is a possible solution? Could I for example run a browser from the command line, let it load the page and all of the referenced content, then save the .html file?

You could use headless Chrome, potentially with Puppeteer scripting for that.
Though, depending on the details, there may be easier options that would get you what you need. It sounds like you're currently trying to scrape a third-party website using wget. Instead of, or in addition to, requesting the .html content with wget, you could request the relevant JavaScript file and then extract the video URL from there.
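For illustration, a minimal Puppeteer sketch of the headless-browser route could look roughly like this (the URL is a placeholder, and it assumes the video's src ends up directly on the <video> element rather than, say, a blob: URL):

// Load the page in headless Chrome, wait for the script-injected <video>, then read its src.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Placeholder URL; wait until network activity quiets down so the page's scripts can run
  await page.goto('https://example.com/the-page', { waitUntil: 'networkidle2' });

  await page.waitForSelector('video');
  const videoSrc = await page.$eval('video', el => el.src || el.querySelector('source')?.src);

  console.log(videoSrc);
  await browser.close();
})();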

Related

What is the best way to display webpage content for a given URL using javascript?

I am developing a small labeling tool that, given a URL, should display the document hosted at that URL and allow a user to choose a label for that document.
I want to display the contents of the URL for this purpose. As far as I know, I can either fetch the URL's content, parse it, and display it, or use an iframe.
Without using a parser
Iframes are not enabled for the target URL whose contents I want to display. Is there any other way to do this with JavaScript without using a parser?
Using a parser
I can crawl the contents of the URL, take everything between the <body> tags, and dump it into the page area.
I'm new to JavaScript and front-end development, so I am not sure whether these are the only options.
Are there other options to do this?
If a parser is the only option, can I dump the HTML that I get from the remote URL? I understand that images and other media on the remote URL won't be displayed. Are there any other caveats to this method? More importantly, is this the best way to do it?
Most sites, like CodePen, do it via an iframe as you mentioned.
Also, you can use Puppeteer (a headless-browser automation library) to do this sort of thing: get the contents via web scraping, take a screenshot, or print a PDF. Pretty nifty library.
Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:
Generate screenshots and PDFs of pages.
Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
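For example, a rough sketch of grabbing the rendered HTML, a screenshot and a PDF (the URL and output file names are just placeholders):

// Render a URL in headless Chrome, then capture what the browser actually built.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  const renderedHtml = await page.content();          // the DOM after scripts have run
  await page.screenshot({ path: 'page.png', fullPage: true });
  await page.pdf({ path: 'page.pdf' });               // PDF capture works in headless mode

  console.log(renderedHtml.length, 'characters of rendered HTML');
  await browser.close();
})();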
Hope this helps!

How do I get the data of a website as shown in INSPECT ELEMENT and not in VIEW PAGE SOURCE?

I want to get the INSPECT ELEMENT data of a website. Let's say Truecaller, so that I can get the name of the person whose mobile number I searched.
But whenever I make a Python script, it gives me the PAGE SOURCE, which does not contain the required information.
Kindly help me. I am a beginner, so kindly excuse any mistakes in the question.
TL;DR: Use Selenium (and PhantomJS)
View page source will give you the HTML that was loaded when you made the request for the page (which is most likely what you are getting when you make a request from Python).
Since nowadays a lot of pages load things and modify the DOM after the initial HTML is loaded, you will not get most of the information you want just by looking at that initial response.
To get the inspect-element information you will need some sort of web browser to actually go to the page, wait for the information you want to load, and then use it. However, you still want to do this from your Python script.
Enter Selenium, a tool for browser automation (mostly used for testing web pages). You can create a Python script that opens a browser page and executes whatever code you write for it to do (it can even wait a while and search for a DOM element that only appears after load!). Your script will still open a visible browser window, though, which is a bit awkward.
Enter PhantomJS, another tool you can use to get a headless browser and do all your web testing without having to rely on the actual browser UI.
Using Selenium alone you might achieve your goals, but with PhantomJS you can do it in an even cleaner way. Good luck!
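As a sketch of the idea, here it is with the Node.js selenium-webdriver bindings (the asker was using Python, where the API is very similar); headless Chrome stands in for PhantomJS here, since PhantomJS is no longer maintained, and the URL and selector are placeholders:

// Drive a headless browser, wait for the script-generated element, then read the live DOM.
const { Builder, By, until } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

(async () => {
  const options = new chrome.Options().addArguments('--headless=new');
  const driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();
  try {
    await driver.get('https://example.com/search');   // placeholder URL
    // Wait up to 10s for the element that only appears after the page's scripts have run
    const el = await driver.wait(until.elementLocated(By.css('.result-name')), 10000);
    console.log(await el.getText());
    // The page source as the driver sees it now; with Chrome this reflects the modified DOM
    console.log(await driver.getPageSource());
  } finally {
    await driver.quit();
  }
})();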
INSPECT ELEMENT and VIEW PAGE SOURCE are not the same.
View source shows you the original HTML source of the page. When you view source from the browser, you get the HTML as it was delivered by the server, not after javascript does its thing.
The inspector shows you the DOM as it was interpreted by the browser. This includes for example changes made by javascript which cannot be seen in the HTML source.
What you see in the element inspector is not the source code anymore.
You see a JavaScript-manipulated version.
Instead of trying to execute all the scripts on your own, which may lead to multiple problems like cross-origin security and so on, search the network tab for the actual search request and its parameters. Then request the data from there; that is the trick.
Also, it seems like you need to be logged in to search on the URL you provided, so you may need to adapt cookies/session/headers and so on, just like a request from your browser would.
So what I want to say is: if the data is not in the source, it is better to analyse where the data you are looking for comes from.
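For example, once you have located the request in the network tab, you can replay it directly. A rough Node.js sketch (Node 18+ has fetch built in), where the URL, query parameter and cookie value are placeholders you would copy from the network tab:

// Replay the underlying API request instead of scraping the rendered page.
(async () => {
  const response = await fetch('https://example.com/api/search?number=1234567890', {
    headers: {
      // Copied from a logged-in browser session; a Node script (unlike a browser page)
      // is allowed to set the Cookie header itself.
      'Cookie': 'sessionid=PLACEHOLDER',
      'Accept': 'application/json',
    },
  });
  const data = await response.json();
  console.log(data);
})();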

Block JS from loading on certain domains

I have a web service that works by giving users JavaScript to embed in their code. Users can also place that code on other sites to make it work there. However, I also need to allow users to create a blacklist of sites that the JS should not function on. For example, a competitor or an inappropriate site.
Is there a way to check where our JS files are being loaded from, and block loading or break functionality on a per-account basis?
Edit: The JavaScript loads an iframe on the site, so another solution would be to somehow block certain domains from loading an iframe from our server, or serve different content to that iframe.
Edit 2: We're also trying to avoid doing this from within the JS, because it could be downloaded and modified to get past the block.
Inspecting the URL of the page
Yes, the JavaScript file, when it starts executing, can inspect window.location.href and see whether the URL of the main document is OK.
To see where the script was loaded from
It can also go through the DOM, looking for the script node that brought in the JavaScript file itself, and see where the JS was loaded from.
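A small sketch of both checks, where the blacklist is a stand-in for whatever per-account list your service actually loads:

// Runs at the top of the embedded file, before any widget code.
(function () {
  var blockedHosts = ['competitor.example', 'inappropriate.example']; // placeholder list

  // 1. Inspect the URL of the embedding page
  var host = window.location.hostname;
  var blocked = blockedHosts.some(function (h) { return host === h || host.endsWith('.' + h); });
  if (blocked) {
    return; // refuse to run on blacklisted domains
  }

  // 2. Find the <script> tag that brought this file in (available while the script executes)
  var self = document.currentScript;
  if (self) {
    console.log('This widget was loaded from:', self.src);
  }

  // ...the normal widget code would continue here...
})();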
However
Anyone can load the JavaScript into a text editor, change it to eliminate the tests, and then host the modified JS on their own server. Obfuscating or minifying the JS can slow someone down, but obscurity is not security.
One thing you could do is have the JavaScript load another JavaScript file that you serve from your server at a given URL. The trick here is that the URL does not point to a static file but to a server endpoint that returns a JavaScript file. You then have that endpoint check the rules for that user and decide whether to return the JavaScript you want to run or an error JavaScript of some kind.
This blog post shows how to do it in PHP: dynamic-javascript-with-php
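The blog post is PHP, but the same idea sketched in Node/Express (endpoint path, query parameter and blacklist lookup are all placeholders) would be along these lines:

// The embed URL points at this endpoint instead of a static .js file,
// so the server decides per request which JavaScript to return.
const express = require('express');
const app = express();

// Stand-in for your real per-account blacklist lookup
const blacklist = { 'acct-123': ['competitor.example'] };
const isBlocked = (account, referer) =>
  (blacklist[account] || []).some(host => referer.includes(host));

app.get('/widget.js', (req, res) => {
  const account = req.query.account;
  const referer = req.get('Referer') || ''; // the page that embedded the script tag
  res.type('application/javascript');

  if (isBlocked(account, referer)) {
    res.send('console.warn("Widget disabled on this domain.");');
  } else {
    res.send('/* ...the real widget code for this account... */');
  }
});

app.listen(3000);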

Open local html file in current window with Javascript Bookmarklet

I'm trying to build a sample bookmarklet to grab the current webpage's source code and pass it to a validator. The validator is not an online website, but a folder with a bunch of JavaScript and HTML files. I'm trying to open the file:///C:/Users/Electrifyings/Desktop/Validator/Main.html file with the help of the JavaScript bookmarklet code and put the source code in the textarea in the newly opened window, but it is not working, for reasons I'm not aware of.
Here is the sample code with algorithm:
javascript:(function(){var t = document.body.innerHTML;window.open('file:///C:/Users/RandomHero/Desktop/test.html',_self);document.getElementById("validator_textarea")=t;})()
Here are the steps:
Grab current web page source code in a variable.
Open locally stored HTML web page in current or new window or new tab (either way is fine with me, but no luck)
Put the source code from the variable into the validator textarea of the newly opened HTML file.
I have tried the above code with a lot of variations, but got stuck on the part where it opens the new window. Either it does not open the new window at all, or it opens a blank window without loading the file.
Would love to get some help with this issue, thanks a lot.
Oh and btw,
Windows 7 x64; tried IE, Firefox and Chrome, all latest stable builds. I guess it's not a browser-side issue, but something related to the JavaScript code not opening a URI with the file:/// protocol. Let me know if any more details are needed. :)
You wouldn't want a webpage you visit to be able to open up file://c:/Program Files/Quicken/YourSensitiveTaxInfo right? Because then if you make a mistake and go to a "bad" website (either a sleazy one or a good one that's been compromised by hackers), evil people on the intarweb would suddenly have access to your private info. That would suck.
Browser makers know this, and for that reason they put VERY strict limits to prevent Javascript code from accessing files on a user's local computer. This is what is getting in the way of your plan.
Solutions?
build the whole validator in to the bookmarklet (not likely to work unless it's really small)
put your validator code up on the web somewhere
write a plug-in (because the user has to choose to install a plug-in, they get much more freedom than webpages ... even though for Firefox, Chrome, etc. plug-ins are basically just Javascript)
Edit:
Extra bonus solution, if you don't limit yourself to a purely-client-side implementation:
Have your bookmarklet add a normal (HTML) form to the page.
Also add an iframe to the page (it's ok if you hide it with CSS styling)
Set the form's target attribute to point to the iframe. This will make it so that, when the user submits the form and the server replies back to that submission, the server's reply will go to the (hidden) iframe, instead of replacing the page as it normally would.
Add a file input to your form - you won't be able to access the file within that input using Javascript, but that's ok because your server will be doing the accessing, not your bookmarklet.
Write a server-side script which takes the form submissions, reads the file that came with it, and then parrots that file back as the response. In other words, you'll have a URL that you can POST to, and when it sees a file in the POST's contents, it will respond back with the contents of that file.
Now that you've got all that, the user can pick their validator file using the file input and upload it to your server; your server will respond with the file it just got, and that file will appear as the contents of the iframe.
And now that you finally have the file you worked so hard to get (inside your iframe), you can do $('#thatIframe').html() and voilà, you have access to your file. You can save the current page's source and then replace the whole page with that uploaded file (and pass the saved page source on to the new validator page), or you can do whatever else you want with the contents of the uploaded validator file.
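A rough sketch of those steps (as a bookmarklet it would be collapsed onto one line; the echo endpoint URL is a placeholder for your own server):

// Add a hidden iframe plus a file-upload form that posts to an echo endpoint.
(function () {
  // Hidden iframe that will receive the server's reply
  var frame = document.createElement('iframe');
  frame.name = 'validatorFrame';
  frame.style.display = 'none';
  document.body.appendChild(frame);

  // Form that posts a chosen file to the server, targeting the iframe
  var form = document.createElement('form');
  form.action = 'https://your-server.example/echo'; // hypothetical echo endpoint
  form.method = 'POST';
  form.enctype = 'multipart/form-data';
  form.target = 'validatorFrame';

  var fileInput = document.createElement('input');
  fileInput.type = 'file';
  fileInput.name = 'validator';
  form.appendChild(fileInput);

  var submit = document.createElement('input');
  submit.type = 'submit';
  submit.value = 'Upload validator';
  form.appendChild(submit);

  document.body.appendChild(form);
})();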
Of course, if the file doesn't vary from computer to computer, you can make all of that much simpler by just having a server that sends the validator file back; this could be a pure Apache server with no logic whatsoever, as all it would have to do is serve a static file.
Either way though, if you go with this approach and your new file upload script is not on the same server as your starting webpage, you will have a new security problem: cross-domain script limitations. However, these limitations are much less strict than local file access ones, so there are ways to work around them (JSONP, cross-site policy files, etc.). There are already tons of great Stack Overflow posts explaining these techniques, so I won't bother repeating them here.
Hope that helps.

How can I test JavaScript without placing it on the website?

I'm talking about something like GreaseMonkey but that would accept the script just as it would be on the website. Adding external scripts to Greasemonkey has been a pain for me so far.
So, I have a client who wanted me to write a specific script for him. Because the script reads the URL of the page visited by a user, I can only test it on the website, but I don't have access to the source code of the website. I'd like to make sure I deliver the client a 100% working script, so I would love to test it first.
How can I do that? Any plugins that would just allow me to copy the script and would run it every time I load a page of the website?
Obviously, if you can, you want to set up a copy of the page on which the script operates, on a local web server where you can play around with things.
If that isn't possible for whatever reason, you can inject your script directly into their site when you're looking at it with your browser using a bookmarklet. The code to do it is roughly:
var script = document.createElement('script');
script.src = "...the path to your script file, ideally on a local web server rather than a file:// path...";
document.body.appendChild(script);
Once you've tweaked the above (pretty much just supplying the src value), you can turn it into a bookmarklet via the Crunchinator. Once you have your bookmarklet, just visit the site you're developing this for and click it, and your script will be added to the page (just for you, obviously, and just for that visit to the page).
Then your develop/test cycle becomes:
Modify the script file (for instance, to fix a bug)
Open their site
Click your bookmarklet to add your script file to the page
Using something like GreaseMonkey can lead to unexpected results, since GM runs outside of the browser's sandbox and GM scripts always run after everything else has loaded.
My solution for this would probably be:
Set up a local web server
Use "Save page..." to get the page contents, then put them on your localhost
Now add your script to the page, etc., and make it work
That gives you A) a flexible development environment, B) more "real-world" results (you can even edit your hosts file to use the same URL as your client's page, though of course you need to re-edit the file if you want to visit the original page), and C) you can test in IE etc. too.
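If it helps, a bare-bones local server in Node.js could be as small as this (saved-page.html is a placeholder for whatever "Save page..." produced; any static server, Apache included, does the same job):

// Minimal static file server for local testing; no dependencies beyond Node itself.
const http = require('http');
const fs = require('fs');
const path = require('path');

http.createServer((req, res) => {
  // Serve the saved page at "/", and anything else relative to the current folder
  const file = req.url === '/' ? 'saved-page.html' : '.' + req.url.split('?')[0];
  fs.readFile(path.join(process.cwd(), file), (err, data) => {
    if (err) { res.writeHead(404); res.end('Not found'); return; }
    res.end(data);
  });
}).listen(8080, () => console.log('Serving on http://localhost:8080'));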
