I am trying to scrape the content of a website that must be dynamic, because I don't get the content with a regular fetch. The suggested libraries for this (that I could find) were PhantomJS, which is not maintained anymore, and Playwright. With Playwright I do get the content, but Playwright opens the browser:
const playwright = require("playwright");

const browser = await playwright.chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto("https://www.example.com");
The problem with this is that I would run it from a cron job or something like that to periodically fetch content, so the browser couldn't be opened and no content would be fetched.
Is there any other way I could do it? Thanks!
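For what it's worth, here is a minimal sketch of the same fetch running headless, which is what a cron job needs (assuming the playwright package is installed; headless: true is also Playwright's default):

const playwright = require("playwright");

(async () => {
  // headless: true (the default) keeps Chromium from opening a visible
  // window, so the same script can run from a cron job.
  const browser = await playwright.chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://www.example.com");
  console.log(await page.content()); // the rendered DOM, after the page's JS ran
  await browser.close(); // close, or cron runs will leak browser processes
})();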
Related
I'm trying to do some web scraping with Puppeteer, and I need to retrieve a value into a website I'm building.
I have tried to load the Puppeteer file in the HTML file as if it were a JavaScript file, but I keep getting an error. However, if I run it in a cmd window, it works well.
Scraper.js:
getPrice();

function getPrice() {
  const puppeteer = require('puppeteer');
  void (async () => {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('http://example.com');
      await page.setViewport({ width: 1920, height: 938 });
      await page.waitForSelector('.m-hotel-info > .l-container > .l-header-section > .l-m-col-2 > .m-button');
      await page.click('.m-hotel-info > .l-container > .l-header-section > .l-m-col-2 > .m-button');
      await page.waitForSelector('.modal-content');
      await page.click('.tile-hsearch-hws > .m-search-tabs > #edit-search-panel > .l-em-reset > .m-field-wrap > .l-xs-col-4 > .analytics-click');
      await page.waitForNavigation();
      await page.waitForSelector('.tile-search-filter > .l-display-none');
      const innerText = await page.evaluate(() => document.querySelector('.tile-search-filter > .l-display-none').innerText);
      console.log(innerText);
    } catch (error) {
      console.log(error);
    }
  })();
}
index.html:
<html>
  <head></head>
  <body>
    <script src="../js/scraper.js" type="text/javascript"></script>
  </body>
</html>
The expected result should be the scraped value logged in the Chrome console (screenshot omitted), but I'm getting an error instead (screenshot omitted).
What am I doing wrong?
EDIT: Since Puppeteer removed support for puppeteer-web, I moved it out of the repo and tried to patch it a bit.
It does work in the browser. The package is called puppeteer-web, made specifically for such cases.
But the main point is that there must be some instance of Chrome running on some server; only then can you connect to it.
You can use it later on in your web page to drive another browser instance through its WS Endpoint:
<script src="https://unpkg.com/puppeteer-web"></script>
<script>
  // Top-level await isn't available in a classic script tag,
  // so wrap the calls in an async IIFE.
  (async () => {
    const browser = await puppeteer.connect({
      browserWSEndpoint: `ws://0.0.0.0:8080`, // <-- connect to a server running somewhere
      ignoreHTTPSErrors: true
    });
    const pagesCount = (await browser.pages()).length;
    const browserWSEndpoint = browser.wsEndpoint(); // synchronous, no await needed
    console.log({ browserWSEndpoint, pagesCount });
  })();
</script>
I had some fun with puppeteer and webpack:
playground-react-puppeteer
playground-electron-react-puppeteer-example
See these answers for a full understanding of creating the server and more:
Official link to puppeteer-web
Puppeteer with docker
Puppeteer with chrome extension
Puppeteer with local wsEndpoint
Instead, if your main goal is to web scrape and get the data into the frontend, use Puppeteer in the backend and make an API to interface your frontend with it.
Puppeteer runs on the server in Node.js. For the common case, rather than using puppeteer-web to allow the client to write Puppeteer code to control the browser, it's better to create an HTTP or websocket API that lets clients indirectly trigger Puppeteer code.
Reasons to prefer a REST API over the puppeteer-web approach:
better support for arbitrary client codebases: clients that aren't written in JS (desktop, command-line and mobile apps, for example) can use the API just as easily as the browser can
no dependency on puppeteer-web
lower client-side complexity; for many use cases JS won't be required at all if HTML forms suffice
better control of client behavior: running a browser on the server is a heavy load and has powerful capabilities that are easy to exploit
easier integration with other backend code and resources like the file system
seamless integration with an existing API as just another set of routes
hiding Puppeteer as an implementation detail lets you switch to, say, Playwright in the future without the client code being affected
Similarly, rather than exposing a mock fs object to read and write files on the server, we expose REST API endpoints to accomplish these tasks. This is a useful layer of abstraction.
Since there are many use cases for Puppeteer in the context of an API (usually Express), it's hard to offer a general example, but here are a few case studies you can use as starting points:
Puppeteer unable to run on Heroku
Puppeteer doesn't close browser
Parallelism of Puppeteer with Express Router Node JS. How to pass page between routes while maintaining concurrency
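That said, here is a bare-bones sketch of the pattern (assuming express and puppeteer are installed; the /scrape route, the url query parameter, and the h1 selector are made up for illustration):

const express = require('express');
const puppeteer = require('puppeteer');

const app = express();

// Hypothetical route: the client asks for a page's <h1> text and
// never touches Puppeteer directly.
app.get('/scrape', async (req, res) => {
  let browser;
  try {
    browser = await puppeteer.launch(); // headless by default
    const page = await browser.newPage();
    await page.goto(req.query.url); // e.g. /scrape?url=https://example.com
    const heading = await page.evaluate(() => document.querySelector('h1').textContent);
    res.json({ heading });
  } catch (err) {
    res.status(500).json({ error: err.message });
  } finally {
    if (browser) await browser.close(); // always release the browser
  }
});

app.listen(3000);

In practice you'd validate the url parameter and probably reuse one browser across requests rather than launching per request, but the shape is the same: Puppeteer stays behind the route.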
I'm creating a Chrome extension that helps people log in to one website, and I need Puppeteer to connect to the URL the user is on. Can I connect to an already-open website page to manipulate it?
I've tried:
const browserURL = "http://127.0.0.1:21222";
const browser = await puppeteer.connect({ browserURL });
and I tried starting Chrome with:
chrome.exe --remote-debugging-port=21222
I need to connect to one specific URL, for example facebook.com. I tried:
example:
const browserURL = "http://facebook.com:21222";
and without the ":21222"...
I'm using Windows 10
node v16.16.0
thanks for helping!
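For reference, the debugging port belongs to the browser, not to any one site, so a sketch of what might work (assuming Chrome really was started with --remote-debugging-port=21222) is to connect to the browser and then pick out the already-open tab by its URL:

const puppeteer = require('puppeteer');

(async () => {
  // Connect to the Chrome instance started with --remote-debugging-port=21222.
  // This is the browser's debugging address, not the site's address.
  const browser = await puppeteer.connect({
    browserURL: 'http://127.0.0.1:21222',
  });

  // Find the tab the user already has open on the target site.
  const pages = await browser.pages();
  const page = pages.find((p) => p.url().includes('facebook.com'));

  if (page) {
    // Manipulate the existing page, e.g. read its title.
    console.log(await page.title());
  }

  browser.disconnect(); // detach without closing the user's browser
})();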
So what's happening to me is: when I open 1 Puppeteer instance it goes fast, but the more I open, the more time it needs to load the URL and fill in information. Is that a normal thing?
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
Answer
Performance of multiple Puppeteer instances running on the same machine and testing a single application is highly dependent on the performance of your machine (mine: 4 cores, 8 threads, Core i7-7700HQ).
On my local setup I could not run more than 10 parallel instances, and the performance drop was noticeable with each additional instance I launched.
My Story
I have faced similar challenge, when I was trying to simulate multiple users using the same application in parallel.
I know Puppeteer and similar UI test automation tools are not good tools for stress testing your application, and that there are better tools for that.
Nevertheless, my case was:
Run "user-like" behavior
From the other end of the world
Collect HAR files that represent the network timings of the browser interacting with 10-20 different systems
Analyze the behavior
My approach was (maybe this helps you):
Create a puppeteer test
Enable headless running
Make it triggerable via curl (a sketch of this step follows the list)
Dockerize it
Run the Docker image on 10 different machines (5-10 dockerized Puppeteer tests per machine)
Trigger them all at once via curl
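Here is a minimal sketch of the curl-trigger step (the port is an assumption, and the actual test body is elided):

const http = require('http');
const puppeteer = require('puppeteer');

// Hypothetical trigger server: each incoming curl request runs one headless test.
http.createServer(async (req, res) => {
  let browser;
  try {
    browser = await puppeteer.launch(); // headless by default
    const page = await browser.newPage();
    await page.goto('https://example.com'); // stand-in for the real user flow
    // ... the actual user-like behavior and HAR collection would go here ...
    res.end('test finished\n');
  } catch (err) {
    res.statusCode = 500;
    res.end(err.message + '\n');
  } finally {
    if (browser) await browser.close();
  }
}).listen(8080); // then trigger a run with: curl http://localhost:8080/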
I'm new to JavaScript. I'm trying to make a bot that interacts with a web page and automates some steps, such as clicking some buttons. I wanted to use, for example, document.getElementById('button').click(), but the command should refer to a specific URL of a page. If you can, please give some code examples.
To do this you can write some code using Selenium or Puppeteer.
These let a web page run on the backend, where you can then do things like clicking buttons. It's hard to say exactly how you want to do that, but this is an example with Puppeteer in JavaScript (Node.js):
const puppeteer = require("puppeteer"); // the semicolon matters here, or the next line is parsed as a function call

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://myurl');
  await page.evaluate(() => {
    // do something on the page, e.g. document.getElementById('button').click()
  });
  await browser.close(); // close the browser so the process can exit
})();
Don't forget to install Puppeteer with npm install puppeteer, and to wait if the page takes too long to render (the page can be available while its JS hasn't finished its work, so you need to wait).
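Rather than fixed sleeps, here is a sketch of waiting for the element itself (reusing the hypothetical #button id from the question):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://myurl');
  // Wait until the button actually exists before clicking it,
  // instead of sleeping for a fixed amount of time.
  await page.waitForSelector('#button');
  await page.click('#button');
  await browser.close();
})();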
I am currently trying to figure out how to render my React app within the Puppeteer browser so that I can take a screenshot and perform visual regression testing in the future. Most of the ways I've seen this achieved were by starting up an express server and serving the HTML page when a certain endpoint is requested. However, I'd like to avoid starting up a server if possible, and would like to perhaps use something like ReactDOM.render within the Puppeteer browser to render it like React would.
So far, I've tried creating a build of the React app and then using puppeteer's page.setContent api to render out the build's index.html to the puppeteer browser directly. However, when I do this, the browser doesn't render anything on screen and does not make network requests for the static files.
This is what it looks like in Puppeteer (screenshot of the blank browser window omitted).
This is the method with page.setContent:
const fs = require('fs');

const buildHTML = fs.readFileSync('build/index.html', 'utf8'); // readFileSync is synchronous, no await needed

// create a new browser tab
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });

// set page's html content with build's index.html
await page.setContent(buildHTML);

// take a screenshot
const contentLoadingPageScreenshot = await page.screenshot({ fullPage: true });
Also, if anyone knows of a better way to run an image snapshot test without having to start up a server, please let me know. Thank you.
Per the Puppeteer docs, you can pass options to setContent to wait for network requests:
await page.setContent(buildHTML, {
  waitUntil: ["load", "networkidle0"]
});
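One caveat worth knowing: setContent leaves the page at about:blank, so relative asset URLs in the build's index.html have nothing to resolve against, which can explain the missing network requests. A possible alternative, sketched under the assumption that the build output sits in build/ on disk, is to navigate straight to the file:

const path = require('path');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  // Going to the file directly gives the page a real base URL, so
  // relative asset paths in index.html can resolve (absolute /static
  // paths would still need a server or a relative-path build config).
  await page.goto('file://' + path.resolve('build/index.html'), {
    waitUntil: 'networkidle0',
  });
  await page.screenshot({ path: 'app.png', fullPage: true });
  await browser.close();
})();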