This question already has answers here:
How can I dump the entire Web DOM in its current state in Chrome?
(4 answers)
Closed 3 years ago.
I’m trying to fetch a table from a site that needs to be rendered, which causes my fetched data to be incomplete. The body is empty, I guess because the scripts haven't been run yet.
Initially I wanted to fetch everything in the browser but I can’t do that since the CORS header isn't set and I don’t have access to the server.
Then I tried a server-side approach using Node.js together with node-fetch and jsdom. I read the documentation and found the option { pretendToBeVisual: true }, but that didn't change anything. My simple test code is posted below:
const fetch = require('node-fetch');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
let tableHTML = fetch('https://www.travsport.se/uppfodare/visa/200336/starter')
  .then(res => res.text())
  .then(body => {
    console.log(body)
    const dom = new JSDOM(body, { pretendToBeVisual: true })
    return dom.window.document.querySelector('.sportinfo_tab table').innerHTML
  })
  .then(table => console.log(table))
I expect the output to be the HTML of the table, but as of now I only get the metadata and scripts in the response, which makes the code crash when extracting innerHTML.
Why not use headless Chrome?
I think the site you quote does not work with --dump-dom, but you can activate --remote-debugging-port=9222 and do whatever you want, as described in https://developers.google.com/web/updates/2017/04/headless-chrome
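As a minimal sketch of that idea (my assumption, not part of the original answer): drive headless Chrome with puppeteer so the page's scripts actually run, then read the rendered table. The '.sportinfo_tab table' selector is taken from the question.
const puppeteer = require('puppeteer');

(async () => {
  // Launch headless Chrome and let the page's scripts render the table
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.travsport.se/uppfodare/visa/200336/starter', { waitUntil: 'networkidle0' });
  // Wait for the table to appear, then read its rendered HTML
  await page.waitForSelector('.sportinfo_tab table');
  const tableHTML = await page.$eval('.sportinfo_tab table', el => el.innerHTML);
  console.log(tableHTML);
  await browser.close();
})();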
Another useful reference:
How can I dump the entire Web DOM in its current state in Chrome?
Related
My question is similar to this one about Python, but, unlike it, mine is about JavaScript.
1. The problem
I have a large list of Web Page URLs (about 10k) in plain text;
For each page URL (or for the majority of them) I need to find some metadata and a title;
I want to NOT load full pages, only load everything before the </head> closing tag.
2. The questions
Is it possible to open a stream, load some bytes and, upon getting to the </head>, close stream and connection? If so, how?
Py's urllib.request.Request.read() has a "size" argument in number of bytes, but JS's ReadableStreamDefaultReader.read() does not. What should I use in JS then as an alternative?
Will this approach reduce network traffic, bandwidth usage, CPU and memory usage?
Answer for question 2:
Try using node-fetch's fetch(url, { size: 200 })
https://github.com/node-fetch/node-fetch#fetchurl-options
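Regarding question 1, here is a minimal sketch (my own assumption, not part of the original answer) that streams the body with node-fetch v2 and stops reading once </head> has arrived:
const fetch = require('node-fetch');

// Read the response as a stream and stop as soon as </head> has been received.
// Assumes node-fetch v2, where res.body is a Node.js readable stream.
async function fetchHeadOnly(url) {
  const res = await fetch(url);
  let html = '';
  for await (const chunk of res.body) {
    html += chunk.toString('utf8');
    const end = html.indexOf('</head>');
    if (end !== -1) {
      // returning out of for-await closes the stream, so the rest of the
      // page is never downloaded
      return html.slice(0, end + '</head>'.length);
    }
  }
  return html; // no </head> found; return whatever arrived
}

fetchHeadOnly('https://stackoverflow.com')
  .then(head => console.log(head))
  .catch(console.error);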
I don't know of a method that lets you fetch only the head element of a response, but you can load the entire HTML document and then parse the head out of it, even though that might not be as efficient as other approaches. I made a basic app using axios and cheerio to get the head element from an array of URLs. I hope this might help someone.
const axios = require("axios")
const cheerio = require("cheerio")
const URLs = ["https://stackoverflow.com/questions/73191546/get-only-html-head-from-url"]
for (let i = 0; i < URLs.length; i++) {
  axios.get(URLs[i])
    .then(html => {
      const document = html.data
      // get the start index and the end index of the head
      const startHead = document.indexOf("<head>")
      const endHead = document.indexOf("</head>") + 7
      // get the head as a string
      const head = document.slice(startHead, endHead)
      // load the head string into cheerio
      const $ = cheerio.load(head)
      // get the title from the head
      console.log($("title").html())
    })
    .catch(e => console.log(e))
}
This question already has answers here:
DOM manipulation inside web worker
(3 answers)
Is there a way to create out of DOM elements in Web Worker?
(10 answers)
Parsing XML in a Web Worker
(2 answers)
Closed last year.
I have a service worker; if the network is down, it serves up a cached response.
The HTML of the body is in response.body. I wish to alter it to add a banner to alert the user that this page is in fact not live data, but is a cached page.
Is there a way to alter the page with DOM manipulation, e.g.
const response = await cache.match(request);
const body = await response.text();
document = buildDocument(body); // <- !!! Imaginary desired function
const banner = document.getElementById('banner');
banner.innerHTML = "This is a cached page!";
response.body = document.innerHTML;
return response;
At the moment I guess one might have to settle on regex/replace of text:
adjusted = body.replace('<span id="banner"></span>', '<span id="banner">This is a cached page!</span>')
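A minimal sketch of how that fallback might be wired up (my assumption of the surrounding handler, since a service worker has no DOM access; the banner span and replacement text are taken from the question):
self.addEventListener('fetch', event => {
  event.respondWith((async () => {
    try {
      return await fetch(event.request);
    } catch (err) {
      // Network is down: serve the cached page with a banner injected
      const cached = await caches.match(event.request);
      if (!cached) throw err;
      const body = await cached.text();
      const adjusted = body.replace(
        '<span id="banner"></span>',
        '<span id="banner">This is a cached page!</span>'
      );
      // Responses are immutable, so build a new one from the adjusted text,
      // copying over the cached response's status and headers
      return new Response(adjusted, {
        status: cached.status,
        statusText: cached.statusText,
        headers: cached.headers
      });
    }
  })());
});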
I need to find the sizes/metadata of externally hosted images in a document (e.g., markdown documents that have image tags in them), but I need to do it without actually downloading the image.
Is there any way to do this easily in Node.js/Express using JavaScript? Some of the solutions I've found are many years old, and I'm not sure if there are better methods now.
You can do what was suggested in the comments by grabbing only the HEAD instead of using a GET when you request the image.
Using got or whatever you like (http, axios, etc.), you set the method to HEAD and look for content-length in the response headers.
My example program that grabs a twitter favicon, headers only, looks like this:
const got = require('got');
(async () => {
  try {
    const response = await got('https://abs.twimg.com/favicons/twitter.ico', { method: 'HEAD' });
    console.log(response.headers);
  } catch (error) {
    console.log('something is broken. that would be a new and different question.');
  }
})();
and in the response I see the line I need:
'content-length': '912'
If the server doesn't respect HEAD or doesn't return a content-length header, you are probably out of luck.
This question already has answers here:
Load and execute external js file in node.js with access to local variables?
(6 answers)
Closed 4 years ago.
I'm writing integration tests for PureScript FFI bindings to Google's Maps API.
The problem: Google's code is meant to be loaded externally with a <script> tag in the browser, not downloaded and run in a Node process. What I've got now will download the relevant file as gmaps.js, but I don't know what to do to actually run the file.
exports.setupApiMap = function() {
  require('dotenv').config();
  const apiKey = process.env.MAPS_API_KEY;
  const gmaps = "https://maps.googleapis.com/maps/api/js?key=" + apiKey;
  require('download')(gmaps, "gmaps.js");
  // what now???
  return;
};
For my unit tests, I must later be able to run new google.maps.Marker(...). Then I can check that my setTitle, getTitle etc. bindings are working correctly.
This is a duplicate question of this one. The correct code was:
exports.setupApiMap = async function() {
  require('dotenv').config();
  const apiKey = process.env.MAPS_API_KEY;
  const gmaps = "https://maps.googleapis.com/maps/api/js?key=" + apiKey;
  await require('download')(gmaps, __dirname);
  const google = require('./js');
  return;
};
The key was to download to __dirname before using require. That said, my specific use case didn't work, since Google's Maps API code just can't be run in a Node process. It must be run in a browser.
I am trying to build a module that does some basic scraping on an official NBA box score page (e.g. https://stats.nba.com/game/0021800083) using request-promise and cheerio. I wrote the following test code:
const rp = require("request-promise");
const co = require("cheerio");
// the object to be exported
var stats = {};
const test = (gameId) => {
  rp(`http://stats.nba.com/game/${gameId}`)
    .then(response => {
      const $ = co.load(response);
      $('td.player').each((i, element) => {
        console.log(element);
      });
    });
};
// TESTING
test("0021800083");
module.exports = stats;
When I inspect the test webpage, there are multiple instances of td tags with class="player", but for some reason selecting them with cheerio doesn't work.
But cheerio does successfully select some elements, including a, script and div tags.
Help would be appreciated!
Using a scraper like request-promise will not work for a site built with AngularJS. The response does not consist of the rendered HTML, as you are probably expecting; you can confirm this by console-logging the response. In order to properly scrape this site you could use PhantomJS, Selenium WebDriver, or the like.
An easier approach is to identify the AJAX call that provides the data you are after and scrape that instead. To do this, go to the site, open the Network tab in the developer tools, look at the list of requests, and identify which one has the data you are after.
Assuming you are after the player stats in the tables, the one I believe you are looking for is "0021800083_gamedetail.json"
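As a rough sketch of that approach (the URL below is a placeholder, not the real endpoint; copy the exact address from the Network tab), you could fetch the JSON directly with request-promise instead of scraping HTML:
const rp = require("request-promise");

// Placeholder URL: replace with the exact request URL shown in developer tools
const JSON_URL = "https://example.com/data/0021800083_gamedetail.json";

rp({ uri: JSON_URL, json: true })
  .then(data => {
    // data is already parsed JSON, so no HTML scraping is needed
    console.log(Object.keys(data));
  })
  .catch(err => console.error(err));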
Further reading:
Can you scrape a Angular JS website
Scraping Data from AngularJS loaded page
Best of luck!