Get only HTML <head> from URL - javascript

My question is similar to this one about Python but, unlike it, mine is about JavaScript.
1. The problem
I have a large list of web page URLs (about 10k) in plain text;
For each page URL (or at least for the majority of them) I need to find some metadata and the title;
I want to NOT load full pages, only everything up to the closing </head> tag.
2. The questions
Is it possible to open a stream, load some bytes and, upon getting to the </head>, close the stream and the connection? If so, how?
Python's urllib.request.Request.read() has a "size" argument in bytes, but JS's ReadableStreamDefaultReader.read() does not. What should I use in JS as an alternative?
Will this approach reduce network traffic, bandwidth usage, and CPU and memory usage?

Answer for question 2:
Try using node-fetch's fetch(url, { size: 200 })
https://github.com/node-fetch/node-fetch#fetchurl-options
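For question 1, the buffering logic can be sketched without any library: read the response stream chunk by chunk and stop as soon as </head> has arrived. (One caveat worth knowing: node-fetch's size option rejects bodies over the limit with an error rather than truncating them, so it works as a safety cap, not a "read only N bytes" switch.) The collectHead helper below is a hypothetical name; wiring it to a real stream is described in the comments:

```javascript
// Minimal sketch (no library): buffer decoded chunks until "</head>" shows up,
// then stop reading. In Node 18+ or the browser you would feed this from
// response.body.getReader() with a TextDecoder, and abort once it returns.
function collectHead(chunks) {
  let buffered = "";
  for (const chunk of chunks) {
    buffered += chunk;
    // searching the whole buffer also catches a tag split across two chunks
    const end = buffered.indexOf("</head>");
    if (end !== -1) {
      return buffered.slice(0, end + "</head>".length);
    }
  }
  return null; // page had no </head> (or the stream ended early)
}

// Simulated network chunks (illustration only):
console.log(collectHead([
  "<html><head><title>Hi",
  "</title></head><body>…",
]));
// → <html><head><title>Hi</title></head>
```

With a real fetch you would pass an AbortController's signal to fetch(url, { signal }), loop over reader.read(), and call controller.abort() as soon as the helper returns a value; that closes the connection early, which should also answer question 3 for large pages: less bandwidth and memory used.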

I don't know whether there is a method that returns only the head element of a response, but you can load the entire HTML document and then parse the head out of it, even though that might be less efficient than other approaches. I made a basic app using axios and cheerio to get the head element from an array of URLs. I hope this helps someone.
const axios = require("axios")
const cheerio = require("cheerio")

const URLs = ["https://stackoverflow.com/questions/73191546/get-only-html-head-from-url"]

for (let i = 0; i < URLs.length; i++) {
  axios.get(URLs[i])
    .then(html => {
      const document = html.data
      // get the start index and the end index of the head
      const startHead = document.indexOf("<head>")
      const endHead = document.indexOf("</head>") + "</head>".length
      // get the head as a string
      const head = document.slice(startHead, endHead)
      // load the head string into cheerio
      const $ = cheerio.load(head)
      // get the title from the head
      console.log($("title").html())
    })
    .catch(e => console.log(e))
}

Related

document.documentElement.outerHTML.length does not equal the document size in the Chrome Network tab

I want to get the document size when the page is ready (right after the server response):
document.addEventListener("DOMContentLoaded", function (event) {
  var pageSizeLength = document.documentElement.outerHTML.length;
});
This does not match the size that Chrome DevTools shows for the document entry in the Network section.
For example, the document size is shown there as 1.5 MB, but
document.documentElement.outerHTML.length
returns 1.8 MB.
If this is not the proper way, how can I get the document size listed in the Network section?
If you can help me, it is much appreciated.
As has been said in comments, the outerHTML is a serialization of the DOM representation after parsing of the original markup. This may be completely different than the original markup and will most likely not match in size at all:
const markup = "<!doctype html><div><p><div>foo";
// DOMParser helps parsing markup into a DOM tree, like loading an HTML page does
const doc = (new DOMParser()).parseFromString(markup, "text/html");
const serialized = doc.documentElement.outerHTML;
console.log({ serialized });
console.log({ markup: markup.length, serialized: serialized.length });
To get the size of the original markup you can call the performance.getEntriesByType("navigation") method, which will return an array of PerformanceNavigationTiming objects, where you'll find the one of the initial document (generally at the first position).
These PerformanceNavigationTiming objects have fields that let you know the decoded size of the resource (decodedBodySize), its encoded size as compressed over the wire (encodedBodySize), and its transfer size, which can vary if the resource was cached (transferSize).
Unfortunately, this API doesn't work well for iframes (moreover the ones that are fetched through POST requests like StackSnippets), so I have to outsource the live demo to this Glitch project.
The main source is:
const entry = performance.getEntriesByType("navigation")
  // That will probably be the first entry here,
  // but it's safer to check the actual 'name' of the entry
  .find(({ name }) => name === location.href);

const {
  decodedBodySize,
  encodedBodySize,
  transferSize
} = entry;

log({
  decodedBodySize: bToKB(decodedBodySize),
  encodedBodySize: bToKB(encodedBodySize),
  transferSize: bToKB(transferSize)
});

function bToKB(b) { return (Math.round(b / 1024 * 100) / 100) + " KB"; }

Fetch elements from a website which load after 3 seconds

When I fetch from a website using fetch, I need to figure out a way to get an element which loads after 3-4 seconds. How should I attempt to do this? My code currently is
const body = await fetch('websiteurl')
let html = await body.text();
const parser = new DOMParser();
html = parser.parseFromString(html, 'text/html');
html.getElementById('pde-ff'); // undefined
I can assure you this element exists: if I go to the website and run the last line with html replaced by document, it works. But I need to wait for the website to load. Any ideas?
Pretty much, your content isn't loading by the time you look for it, so you need to wait for the page to finish loading. There are a lot of ways to do this:
use jQuery: $(function() { alert("It's loaded!"); });
use vanilla JS: window.addEventListener('load', function () { alert("It's loaded!"); });

Get the rendered HTML from a fetch in javascript [duplicate]

This question already has answers here:
How can I dump the entire Web DOM in its current state in Chrome?
(4 answers)
Closed 3 years ago.
I’m trying to fetch a table from a site that needs to be rendered, which causes my fetched data to be incomplete. The body is empty, as the scripts haven't been run yet, I guess.
Initially I wanted to fetch everything in the browser but I can’t do that since the CORS header isn't set and I don’t have access to the server.
Then I tried a server approach using node.js together with node-fetch and JSDom. I read the documentation and found the option {pretendToBeVisual: true } but that didn't change anything. I have a simple code posted below:
const fetch = require('node-fetch');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
let tableHTML = fetch('https://www.travsport.se/uppfodare/visa/200336/starter')
  .then(res => res.text())
  .then(body => {
    console.log(body)
    const dom = new JSDOM(body, { pretendToBeVisual: true })
    return dom.window.document.querySelector('.sportinfo_tab table').innerHTML
  })
  .then(table => console.log(table))
I expect the output to be the html of the table but as of now I only get the metadata and scripts in the response making the code crash when extracting innerHTML.
Why not use headless Chrome?
I think the site you quote does not work with --dump-dom, but you can enable --remote-debugging-port=9222 and do whatever you want, as described in https://developers.google.com/web/updates/2017/04/headless-chrome
Another useful reference:
How can I dump the entire Web DOM in its current state in Chrome?

Can i scrape this site using just node?

I'm very new to JavaScript, so be patient.
I've been trying to scrape a site and get all the product URLs into a list that I will use later in another function, like this:
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio");

var url = 'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx';

function getURLS(url) {
  request(url, function (err, resp, body) {
    var linklist = [];
    var $ = cheerio.load(body);
    var links = $('#productResults a');
    for (var valor in links) {
      if (links[valor].attribs && links[valor].attribs.href && linklist.indexOf(links[valor].attribs.href) == -1) {
        linklist.push(links[valor].attribs.href);
      }
    }
    var extended_links = [];
    linklist.forEach(function (link) {
      var extended_link = 'https://www.fromuthtennis.com/frm/' + link;
      extended_links.push(extended_link);
    });
    console.log(extended_links);
  });
}
This does work unless you go to the second page of items like this:
url='https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]'
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio"); //etc...
As far as I know this happens because the content on the page is loaded dynamically.
To get the contents of the page I believe I need to use PhantomJS, because that would let me get the HTML code after the page has fully loaded, so I installed the phantomjs-node module. I want to use Node.js to get the URL list because the rest of my code is written in it.
I've been reading a lot about PhantomJS, but using phantomjs-node is tricky, and I still don't understand how I could get the URL list with it, because I'm very new to JavaScript and to coding in general.
If someone could guide me a little bit I'd appreciate it a lot.
Yes, you can. That page looks like it implements Google's Ajax Crawling URL.
Basically it allows websites to generate crawler friendly content for Google. Whenever you see a URL like this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]
You need to convert it to this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D
The conversion is simple: take the base path https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx and add a query param _escaped_fragment_ whose value is the URL fragment Filter=[pagenum=2*ava=1] encoded into Filter%3D%5Bpagenum%3D2*ava%3D1%5D using standard URI encoding.
You can read the full specification here: https://developers.google.com/webmasters/ajax-crawling/docs/specification
Note: This does not apply to all websites, only websites that implement Google's Ajax Crawling URL. But you're in luck in this case
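As a sketch, the conversion described above can be written as a small helper (toEscapedFragmentURL is a hypothetical name, not part of any library):

```javascript
// Convert a hash-fragment URL into Google's _escaped_fragment_ form:
// strip the fragment, re-attach it as a URI-encoded query parameter.
function toEscapedFragmentURL(url) {
  const hashIndex = url.indexOf("#");
  if (hashIndex === -1) return url; // no fragment: nothing to convert
  const base = url.slice(0, hashIndex);
  const fragment = url.slice(hashIndex + 1);
  const sep = base.includes("?") ? "&" : "?";
  return base + sep + "_escaped_fragment_=" + encodeURIComponent(fragment);
}

console.log(toEscapedFragmentURL(
  "https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]"
));
// → https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D
```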
You can see any product you want without relying on dynamic content by using this URL:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID={product_id}
For example to see product 37023:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=37023
All you have to do then is loop: for (var productId = 0; productId < 40000; productId++) { request... }.
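That brute-force loop can be sketched like this (the 40 000 upper bound comes from the answer above; awaiting each request sequentially is an assumption, chosen to avoid hammering the server):

```javascript
// Build a product page URL from its numeric ID (pattern from the answer above)
function productURL(id) {
  return "https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=" + id;
}

// fetchPage is whatever HTTP client you already use (request, http-get, ...)
async function crawlProducts(fetchPage, maxId = 40000) {
  for (let productId = 0; productId < maxId; productId++) {
    await fetchPage(productURL(productId)); // one request at a time
  }
}

console.log(productURL(37023));
// → https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=37023
```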
Another approach is to use the phantom module (https://www.npmjs.com/package/phantom). It lets you run PhantomJS commands directly from your Node.js app.

Scraping authenticated website in node.js

I want to scrape my college website (moodle) with node.js but I haven't found a headless browser able to do it. I have done it in python in just 10 lines of code using RoboBrowser:
from robobrowser import RoboBrowser
url = "https://cas.upc.edu/login?service=https%3A%2F%2Fatenea.upc.edu%2Fmoodle%2Flogin%2Findex.php%3FauthCAS%3DCAS"
browser = RoboBrowser()
browser.open(url)
form = browser.get_form()
form['username'] = 'myUserName'
form['password'] = 'myPassword'
browser.submit_form(form)
browser.open("http://atenea.upc.edu/moodle/")
print browser.parsed
The problem is that the website requires authentication. Can you help me? Thanks!
PS: I think this can be useful: https://www.npmjs.com/package/form-scraper but I can't get it working.
Assuming you want to read a 3rd party website, and 'scrape' particular pieces of information, you could use a library such as cheerio to achieve this in Node.
Cheerio is a "lean implementation of core jQuery designed specifically for the server". This means that given a String representation of a DOM (or part thereof), cheerio can traverse it in much the same way as jQuery can.
An example from Max Ogden shows how you can use the request module to grab HTML from a remote server and then pass it to cheerio:
var $ = require('cheerio')
var request = require('request')

function gotHTML(err, resp, html) {
  if (err) return console.error(err)
  var parsedHTML = $.load(html)

  // get all anchor tags and loop over them
  var imageURLs = []
  parsedHTML('a').map(function (i, link) {
    var href = $(link).attr('href')
    if (!href.match('.png')) return
    imageURLs.push(domain + href)
  })
}

var domain = 'http://substack.net/images/'
request(domain, gotHTML)
Selenium supports multiple languages, platforms, and browsers.
