Can I wget the result from a JavaScript-generated webpage? - javascript

The URL https://live.eservice-hk.net/viutv
returns a text result (a single line) and shows it in the browser.
I want to get that result via wget, but I can't.
I then looked at the page source and discovered that the page is generated by JavaScript.
How do I get the result instead of the JavaScript?

No! You cannot wget (or even curl) the dynamically generated JavaScript result from the page. You need a WebDriver such as Selenium for that, or perhaps Chrome in headless mode.
But for that particular page (and more specifically for that particular text result), you can use curl to get the text link:
curl -X POST -d '{"channelno":"099","deviceId":"0000anonymous_user","format":"HLS"}' https://api.viu.now.com/p8/1/getLiveURL | jq '.asset.hls.adaptive[0]'
Note: The POST data and link are taken from the page's source. jq is a nice little command-line utility for handling JSON on the command line.
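If you do need the fully rendered page rather than the underlying API call, a minimal sketch with Selenium and headless Chrome might look like this (assumes the selenium package and a matching ChromeDriver are installed; grabbing the <body> text is an assumption about where the one-line result ends up):
# Minimal sketch: render a JavaScript-generated page with headless Chrome.
# The choice of the <body> element is an assumption, not taken from the page.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://live.eservice-hk.net/viutv")
    print(driver.find_element(By.TAG_NAME, "body").text)  # the one-line result
finally:
    driver.quit()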

Related

How to run a file on a server on HTML button click

How do I execute a Python file or a bash script on a server by pressing an HTML button, without using a framework?
I have tried several versions of the AJAX calls suggested in answers to similar questions, but none of them seem to work for me.
Note: I'm using an Apache server on a Raspberry Pi.
According to the Apache documentation, you should configure Apache to run CGI scripts, define a URL prefix that maps to the directory containing your script, and finally make sure your script returns its output in a particular way, or Apache will return an error message.
So, for example, if you define a URL prefix like:
ScriptAlias "/cgi-bin/" "/path/to/your/script/cgi-bin/"
you can create a simple anchor to execute your script; you don't need AJAX for this:
<a href="/cgi-bin/yourscript.py">click me</a>
To successfully run your script, follow the Apache directives in "Apache Tutorial: Dynamic Content with CGI".
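As a rough sketch (the file name and output are placeholders, not taken from the question), a minimal Python CGI script placed in the aliased directory only has to print a header block, at minimum a Content-Type line followed by a blank line, before any body output, and it must be executable (chmod +x):
#!/usr/bin/env python3
# Placeholder script: /path/to/your/script/cgi-bin/yourscript.py
# Apache runs it when /cgi-bin/yourscript.py is requested; the Content-Type
# header plus the blank line are mandatory, or Apache returns an error.
print("Content-Type: text/html")
print()
print("<html><body><p>The script ran on the server.</p></body></html>")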

How to get the whole HTML source code of a webpage that is generated by JavaScript, using Java / WebDriver?

I am a newbie in programming and I have a task here I need to solve. I am trying to get the HTML source code of a webpage using Java / WebDriver and the getPageSource() method. The problem is that the page is somehow generated, probably by JavaScript, so the result I get is HTML code containing just the page skeleton: a table that is empty, not filled with data. But there is a tag like <script type="text/javascript" src="/x/js/main.c0e805a3.js"></script> at the very bottom of that HTML code.
The question is, how can I force WebDriver to run that JavaScript and give me the result, i.e. the whole HTML source with data? I already tried using this (js.executeScript("window.location = '/x/js/main.c0e805a3.js'");) before calling getPageSource(), but without success.
Any help will be appreciated, thanks!
There are quite a few setups now that can run the JavaScript on a web page. The best known is likely Selenium, since it has been around for a while. Others include Karate, Puppeteer, and even an old tool called Rhino. Puppeteer is a Google project that uses server-side JavaScript (Node.js). Comparing and contrasting libraries is discouraged here, though.
I haven't had time to dig into Selenium yet, but I write HTML parsing, search, and update code all the time. If your only goal is to load a page whose contents are dynamically filled in by AJAX calls (that is, you only want the HTML you would normally see when you visit the site's web page, and you are not concerned with button presses), then the tool I have been using for that is called Splash. It does let you invoke JavaScript, but if all you want is to let the JS on a page dynamically load the table, then literally all you have to do is start the tool and add one line to your program.
On Google Cloud Platform, the two commands below will start a Splash proxy server. If you are writing your code on AWS (Amazon) or Azure (Microsoft), it would likely be similar. If you are running your code on a local machine, you would have to look up how to start it there.
Install Docker. Make sure Docker version >= 17 is installed.
Pull the image:
$ sudo docker pull scrapinghub/splash
Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
Then, in your code, all you have to do is the following:
// If your original code looked like this:
URL url = new URL("https://en.wikipedia.org/wiki/Christopher_Columbus");
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", USER_AGENT);
return new BufferedReader(new InputStreamReader(con.getInputStream()));
Change the first line of code in this example as follows, and (theoretically) any dynamically loaded HTML tables that are completed by the page's onload events will be loaded automatically before the HTML page is returned.
// Add this line to your methods
String splashProxy = "http://localhost:8050/render.html?url=";
URL url = new URL(splashProxy + "https://en.wikipedia.org/wiki/Christopher_Columbus");
For most websites, any initial tables that are filled in by JS/jQuery/AJAX will be populated. If you are willing to learn the Lua programming language, you can also start invoking Splash's methods from there. It has been pretty convenient for my purposes, since I am not writing web-page testing code (code that simulates user button presses). If that is what you are doing, Selenium is likely worth the time spent learning and studying its API.

Query selector all in rvest package

I am trying to execute this JavaScript command:
document.querySelectorAll('div.Dashboard-section div.pure-u-1-1 span.ng-scope')[0].innerText
in R using the rvest package, with the following code:
library(rvest)
url <- read_html("")
url %>%
html_nodes("div.Dashboard-section div.pure-u-1-1 span.ng-scope") %>%
html_text()
but the result I get is:
character(0)
while I expected:
"Displaying results 1-25 of 10,897"
What can I do?
In a nutshell, the rvest package can fetch HTML, but it cannot execute JavaScript. The page you tried to fetch loads its data via AJAX, i.e. JavaScript.
As a workaround you could use the RSelenium package, as user neoFox suggested. Selenium WebDriver would start Firefox or Chrome for you, navigate to the page, wait until it is loaded, and get the data fragment from the HTML DOM.
Or use the much smaller PhantomJS headless browser, which would download the rendered page to an HTML file without popping up a browser GUI; then read in and parse the downloaded HTML file with R.
Both need some serious configuration. Selenium is Java-based.
PhantomJS requires at least reading its documentation.
You could also inspect the page, find the POST request the site is making, and send that POST yourself. Then fetch the JSON it returns and count the result items yourself; a rough sketch of that idea follows below.
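As a sketch of that last approach, shown in Python with the requests library purely for illustration (the endpoint, payload, and response field are placeholders, not the site's real API; substitute the request you find in your browser's network tab):
# Sketch only: URL, payload, and response field are placeholders, not the
# real endpoint of the site in the question.
import requests

resp = requests.post(
    "https://example.com/api/search",   # placeholder endpoint
    json={"page": 1, "pageSize": 25},   # placeholder payload
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print("Total results:", data.get("totalCount"))  # placeholder field name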

What does "curl" mean?

I'm developing Facebook JavaScript apps on a daily basis, but I keep stumbling over code snippets I don't understand in the Facebook documentation and on other websites I visit.
I searched Google for curl and found some descriptions of it, but I can't figure out how Facebook wants me to use it.
curl -F "title=Example Title" -F "description=Description" \
-F "start_time=1329417443" \
"https://graph.facebook.com/PAGE_ID/milestones?access_token=_"
It's nonsense to me. Can you help me understand in what context I can use it for Facebook, and maybe in general, and point me in the right direction to find more on the subject?
curl is a command-line utility that lets you send HTTP requests. It can be very useful when developing against web service APIs. I believe it comes pre-installed with most Linux distros, but you would need to download and install it for Windows. (It probably comes with Cygwin but can be installed on its own as well.)
I would suggest making sure its directory is added to your PATH environment variable. Again, probably not a problem on Linux, but you will need to do this manually on Windows.
curl is a command for making HTTP requests. The -F (--form) argument is used to specify form POST parameters; a rough code equivalent of the snippet above is sketched after the man page excerpt below.
Citation from man curl:
-F/--form <name=content>
(HTTP) This lets curl emulate a filled-in form in which a user
has pressed the submit button. This causes curl to POST data
using the Content-Type multipart/form-data according to RFC
2388. This enables uploading of binary files etc. To force the
'content' part to be a file, prefix the file name with an #
sign. To just get the content part from a file, prefix the file
name with the symbol <. The difference between # and < is then
that # makes a file get attached in the post as a file upload,
while the < makes a text field and just get the contents for
that text field from a file.
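To make the snippet from the question more concrete, here is a rough equivalent of that curl -F call in Python with the requests library; it is only a sketch (PAGE_ID and the underscore access token are the placeholders from the question), and passing (None, value) tuples via files= is just one way to force a multipart/form-data body the way -F does:
# Sketch of the same request as the curl -F example in the question.
# PAGE_ID and the access token are placeholders taken from the question.
import requests

resp = requests.post(
    "https://graph.facebook.com/PAGE_ID/milestones",
    params={"access_token": "_"},
    # (None, value) tuples make requests build a multipart/form-data body,
    # mirroring curl's -F form fields.
    files={
        "title": (None, "Example Title"),
        "description": (None, "Description"),
        "start_time": (None, "1329417443"),
    },
)
print(resp.status_code, resp.text)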
curl is a way of fetching items. The -F is one of many parameters...
http://curl.haxx.se/docs/manpage.html
Also:
Have you seen http://developers.facebook.com/docs/reference/api/batch/
and it could be useful for something like:
http://chaolam.wordpress.com/2010/06/07/implementing-facebook-real-time-updates-api-with-curl-examples/
Of course, the FB docs use curl to show a common, basic way to perform the request; the actual way to perform the Graph HTTP request depends on which platform and language libraries you are using.
So if you are a Facebook JavaScript developer, you would use XMLHttpRequest (or, I suppose, the Facebook JS library calls).

Input URL, output contents of "view page source", i.e. after JavaScript etc.; library or command line

I need a scalable, automated method of dumping the contents of "view page source", after manipulation, to a file. This non-interactive method would be (more or less) identical to an army of humans navigating my list of URLs and dumping "view page source" to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but they do not execute JavaScript or any of that 'fancy stuff'.
My ideal solution looks like any of the following (fantasy solutions):
cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source
(fantasy command line, no idea if flags like these exist)
or
cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"
As a secondary concern, I also need:
dump all included javascript source to file (a la firebug)
dump pdf/image of page to file (print to file)
HtmlUnit does execute JavaScript. I'm not sure whether you can obtain the HTML code after DOM manipulation, but give it a try.
You could write a little Java program that fits your requirements and execute it from the command line, as in your examples.
I haven't tried the code below, I just had a look at the Javadoc:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.nio.file.Files;
import java.nio.file.Paths;
public class DumpPage { // wrapper class added so the snippet compiles
    public static void main(String[] args) throws Exception {
        String pageURL = args[0]; // first command-line argument (was args[1])
        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage(pageURL);
        // asXml() returns the HTML after JavaScript has run; asText() gives only the visible text
        String pageContents = page.asXml();
        Files.write(Paths.get("page.html"), pageContents.getBytes()); // save the resulting page to a file
    }
}
EDIT:
It seems Selenium (another web testing framework) can take page screenshots.
Search for selenium.captureScreenshot.
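For what it's worth, with the current Selenium WebDriver bindings (Python is used here purely as an illustration, not the old selenium.captureScreenshot RC call mentioned above, and the URL is a placeholder), dumping an image of a rendered page looks roughly like this:
# Illustration with current Selenium bindings; assumes selenium and a
# matching ChromeDriver are installed. The URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")   # placeholder URL
    driver.save_screenshot("page.png")  # image of the rendered page
finally:
    driver.quit()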
You can use the IRobotSoft web scraper to automate this. The page source is in the UpdatedPage variable; you only need to save that variable to a file.
It also has a CapturePage() function to capture the web page to an image file.
