Request module wait for document ready - javascript

I'm doing web scraping with the request module in Node. Modern sites built with newer frameworks like Angular or Ember generate their HTML with JavaScript, so when I load a page with request the document is not ready and I get only the JavaScript code, not the rendered HTML.
Is it possible to set a timeout and then load the page?

The request module is just an HTTP client; it will only get you the text returned from a particular URL. A straightforward way to achieve what you are trying to do is to open the URL with a headless browser like PhantomJS (https://github.com/sgentle/phantomjs-node) and actually execute the page before evaluating its content.
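For illustration, a rough sketch of that approach using the phantom package linked above (the URL and the fixed delay are placeholders, not part of the original answer):

// minimal sketch: load a page, let its JavaScript run, then read the rendered HTML
const phantom = require('phantom');

(async () => {
  const instance = await phantom.create();
  const page = await instance.createPage();

  await page.open('https://example.com/angular-app'); // placeholder URL
  // crude: give the client-side framework time to render; waiting for a
  // specific selector would be more robust than a fixed delay
  await new Promise(resolve => setTimeout(resolve, 2000));

  const html = await page.property('content'); // rendered HTML, not the raw source
  console.log(html);

  await instance.exit();
})();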

Related

ReferenceError: document is not defined, Node.js does not understand the DOM

I'm trying to interact with the Datamuse API using a fetch() GET request and display the result in the DOM.
But when I run node index.js I get this error: ReferenceError: document is not defined.
const submitButton = document.querySelector('#submit');
I googled it and learned that Node.js does not understand the DOM the way the browser does.
I tried fixing it:
with ESLint, setting env { browser: true }
installing the jsdom package, then getting the error jsdom is not defined
I could not figure it out, please help.
Node.js is a completely different JavaScript environment from the browser. It has a different library of functions available to it than the browser does. For example, the browser can parse HTML, exposes the DOM API, and has the window object. Node.js isn't a browser at all, so it doesn't have those features. Instead, it has TCP and HTTP networking, file system access, and so on: the kinds of things you would typically use in a server implementation.
If, from Node.js, you are trying to fetch a web page from some other server, parse its HTML, and then inspect or manipulate the DOM elements in that page, you need a library for that. Cheerio and Puppeteer are popular tools for the job. Cheerio parses the HTML but does not run the JavaScript in the page, and offers a jQuery-like API for accessing the DOM. Puppeteer actually runs the Chromium browser engine to parse the page and execute its JavaScript, giving you a fully representative DOM, and it can even do things like take screenshots of the rendered page.
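For a rough idea of the difference, here are minimal sketches of both (the URL and selectors are placeholders):

// Cheerio: you fetch or supply the HTML yourself, then query it jQuery-style
const cheerio = require('cheerio');
const $ = cheerio.load('<h1 class="title">Hello</h1>');
console.log($('.title').text()); // "Hello"

// Puppeteer: drive a real Chromium instance so the page's own scripts run
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  const heading = await page.$eval('h1', el => el.textContent);
  console.log(heading);
  await browser.close();
})();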

How to run a PHP script with nested JavaScript from the terminal (Linux)

I have a few scripts which scrape HTML data from websites.
I use PHP to fetch the webpage and then JS to extract the information I need, eventually using AJAX to send the data back to PHP and on to the database. This all works fine, but I have to open a browser and launch the scripts manually because of the JavaScript.
Is there a way in which I can run some sort of 'cron' to run this script?
Cheers
Yes, you can use Node.js to run your script, unless you are relying on browser-specific features (e.g., references to document or window).
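As a rough sketch (the URL, regex, and schedule are placeholders), a standalone Node script like the one below could replace the browser step and be run from crontab, e.g. */15 * * * * node /path/to/scrape.js:

// scrape.js - fetch the page and extract the data without a browser
const https = require('https');

https.get('https://example.com/latest-posts', (res) => { // placeholder URL
  let html = '';
  res.on('data', (chunk) => { html += chunk; });
  res.on('end', () => {
    // pull out whatever you need from the raw HTML (regex, cheerio, ...)
    const match = html.match(/<title>(.*?)<\/title>/i);
    console.log('Extracted:', match && match[1]);
    // ...then write it to the database directly instead of going through AJAX and PHP
  });
}).on('error', (err) => console.error(err));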

How to extract an HTML page from a response using Mozilla Add-on SDK APIs?

I am creating an add-on for Mozilla Firefox using the Mozilla Add-on SDK. I need to parse the HTML page that I get as a response when I request a web page, so that after parsing the whole page I can run a segmentation process on it and redisplay it on screen, edited as required. Please suggest a way to store or parse the HTML page so that I can edit it dynamically and redisplay it. How do I retrieve only the HTML page from the response?
If by "response" you mean the response of XMLHttpRequest, then you get the 'responseText' and use DOMParser to covert it into DOM.
Then you can make the changes and display.
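A minimal sketch of that first case (the URL and selector are placeholders):

var xhr = new XMLHttpRequest();
xhr.open('GET', 'https://example.com/some-page'); // placeholder URL
xhr.onload = function () {
  // convert the raw responseText into a DOM you can query and edit
  var doc = new DOMParser().parseFromString(xhr.responseText, 'text/html');
  console.log(doc.querySelectorAll('p').length);
  // ...run your segmentation on doc, then redisplay the edited markup
};
xhr.send();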
If by "response" you mean when a page is loading, then you can run the code of the addon, before, as soon as, or after DOM is loaded and make the changes to the display as required.
More information is required for a more comprehensive reply.
Update on new info
You can run a script in the add-on, based on the URL, by using PageMod; see the sketch after the links below.
PageMod
Modifying Web Pages Based on URL
util/match-pattern
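A minimal PageMod sketch (the match pattern and content script are placeholders):

var pageMod = require("sdk/page-mod");

pageMod.PageMod({
  include: "*.example.com",              // placeholder match pattern
  contentScriptWhen: "ready",            // run once the DOM is loaded
  contentScript: 'document.body.style.border = "3px solid red";'
});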

How to load JavaScript with HTML fragments into an existing web app

I'm developing a small app designed to embed HTML + JavaScript (the JavaScript manages the behavior of the HTML) into existing websites. My small app is an ASP.NET MVC 3 app. What's the best approach for delivering the JavaScript to the web client? I don't have access to the web clients except for giving them the URL to retrieve the HTML/JavaScript. The web clients will retrieve the HTML/JavaScript using jQuery, which will then load the result of the ASP.NET MVC 3 app into the DOM. Should the JavaScript that's needed to manage the behavior of the embedded HTML simply be a <script> at the end of the HTML fragment? Thanks for your time.
If the loading mechanism is in place, and simply inserts the payload of the HTTP request into the DOM somewhere, then including a <script> as the last tag in the payload is probably the best way to go.
Any DOM elements the script depends on should be ready for use when it is executed, and there isn't anything wrong with that technique that I know of.
You could get more sophisticated, but not without complicating your jQuery loading mechanism.
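A sketch of what such a fragment might look like (the element IDs are made up; this assumes the fragment is inserted with jQuery's .html() or .load() without a selector, which executes inline scripts in the inserted markup):

<div id="embedded-widget">
  <button id="embedded-action">Do something</button>
</div>
<script type="text/javascript">
  // by the time this runs, the markup above is already in the DOM
  $('#embedded-action').click(function () {
    alert('widget clicked');
  });
</script>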

Is it possible to access other webpages from within another page

Basically, what I'm trying to do is make a small script that finds the most recent post in a forum and pulls some text or an image out of it. I have this working in Python, using the htmllib module and some regex, but the script still isn't very convenient as is; it would be much nicer if I could somehow put it into an HTML document. It appears that simply embedding Python scripts is not possible, so I'm looking for something similar to Python's htmllib that can be used to access some other webpage and extract some information from it.
(Essentially, if I could get this script going in the form of an html document, I could just open one html document, rather than navigate to several different pages to get the information I want to check)
I'm pretty sure plain JavaScript doesn't have the functionality I need, but I was wondering about other options such as jQuery, or even something like AJAX?
As Greg mentions, an Ajax solution will not work "out of the box" when trying to load from remote servers.
If, however, you are trying to load from the same server, it should be fairly straightforward. I'm presenting this answer to show how this could be done using jQuery in just a few lines of code.
<div id="placeholder">Please wait, loading...</div>
<script type="text/javascript" src="/path/to/jquery.js">
</script>
<script type="text/javascript>
$(document).ready(function() {
$('#placeholder').load('/path/to/my/locally-served/page.html');
});
</script>
If you are trying to load a resource from a different server than the one you're on, one way around the security limitations would be to offer a proxy script, which could fetch the remote content on the server, and make it seem like it's coming from your own domain.
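As an illustration of the proxy idea (not part of the original answer; a Node sketch, though a short PHP or Python script on your own server would do the same job):

// hypothetical same-origin proxy: the page requests /proxy?url=... from its own
// server, which fetches the remote page and relays it back
const http = require('http');
const https = require('https');

http.createServer(function (req, res) {
  const target = new URL(req.url, 'http://localhost').searchParams.get('url');
  if (!target) { res.writeHead(400); return res.end('missing url parameter'); }
  https.get(target, function (remote) {
    res.writeHead(remote.statusCode, { 'Content-Type': 'text/html' });
    remote.pipe(res); // stream the remote response straight through
  }).on('error', function () { res.writeHead(502); res.end('proxy error'); });
}).listen(8080);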
Here are the docs on jQuery's load method: http://docs.jquery.com/Ajax/load
There is one other nice feature to note: partial page loading. For example, let's say the remote page is a full HTML document, but you only want the content of a single div in that page. You can pass a selector to the load method by appending it to the URL, which will further simplify your task. For example,
$('#placeholder').load('/path/to/my/locally-served/page.html #someTargetDiv');
Best of luck! -Mike
There are two general approaches:
Modify your Python code so that it runs as a CGI (or WSGI or whatever) module and generate the page of interest by running some server side code.
Use Javascript with jQuery to load the content of interest by running some client side code.
The difference between these two approaches is where the third party server sees the requests coming from. In the first case, it's from your web server. In the second case, it's from the browser of the user accessing your page.
Some browsers may not handle loading content from third party servers very gracefully (that is, they might pop up warning boxes or something).
You can embed Python. The most straightforward way would be to use the cgi module. If the script will be run often and you're using Apache it would be more efficient to use mod_python or mod_wsgi. You could even use a Python framework like Django and code the entire site in Python.
You could also code this in JavaScript, but it would be much trickier. There are a lot of security concerns with cross-site requests (ah, the unsafe internet), so it tends to be a tricky domain when you try to do it through the browser.
