I am making a request to this URL to translate text from English to Spanish:
URL: https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=Hello
and I'm effectively getting the text translated to Spanish. Now I want to dynamically get all the innerText in the document body and then put the translated text back in. How can I do this?
In simple words, I want to dynamically translate the website with a button click.
This is my example code to start:
let textToBeTranslate = ["hello", "thanks", "for", "help me"];
var url = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=" + textToBeTranslate;
fetch(url)
  .then(data => data.json())
  .then(data => {
    // Text translated to Spanish
    var textTranslated = data[0][0][0].split(", ");
    console.log(textTranslated);
    // output: ["hola gracias por ayudarme"]
    // Now I want to dynamically put the translated text back into the body tag
  })
  .catch(error => {
    console.error(error);
  });
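Worth noting: concatenating the array straight into the URL relies on JavaScript coercing it to a comma-separated string, and none of the text is URL-encoded. A small sketch (the helper name is mine) of building the request URL safely:

```javascript
// Joining the phrases explicitly and encoding the query avoids broken URLs
// when the text contains spaces, '&', '?' or '#'.
const buildTranslateUrl = (phrases, sl = "en", tl = "es") => {
  const base = "https://translate.googleapis.com/translate_a/single";
  const q = encodeURIComponent(phrases.join(", "));
  return `${base}?client=gtx&sl=${sl}&tl=${tl}&dt=t&q=${q}`;
};

console.log(buildTranslateUrl(["hello", "thanks", "for", "help me"]));
```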
Try this:
const translateElement = async element => {
  const elementNode = element.childNodes[0];
  const sourceText = elementNode && elementNode.nodeValue;
  if (sourceText) {
    try {
      const url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=' + encodeURIComponent(sourceText);
      const response = await fetch(url);
      const result = await response.json();
      elementNode.nodeValue = result[0][0][0];
    } catch (error) {
      console.error(error);
    }
  }
};
For a single element - Just call it, like this:
(async () => await translateElement(document.body))();
For all elements in the DOM - You will need to go over all elements starting from the desired parent tag (body, in your case) and call the above function for each one, like this:
(async () => {
const
parent = 'body',
selector = `${parent}, ${parent} *`,
elements = [...document.querySelectorAll(selector)],
promises = elements.map(translateElement);
await Promise.all(promises);
})();
Remarks:
I used childNodes[0].nodeValue instead of innerHTML or
innerText to keep the child elements.
Note that going over the entire DOM is not recommended and can lead to problems, such as changing the content of script and style tags.
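One way to soften that problem (a sketch of mine, not part of the answer above) is to filter out elements whose text should never be translated before calling translateElement:

```javascript
// Tags whose text content should be left alone when translating a page.
// This list is an assumption; extend it to fit your site.
const SKIP_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEXTAREA", "CODE", "PRE"]);

const shouldTranslate = (element) => !SKIP_TAGS.has(element.tagName);

// Usage inside the loop over elements:
// const promises = elements.filter(shouldTranslate).map(translateElement);
```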
This is the code:
const getHappyMovies = async () => {
try {
const movieData = [];
let title;
let description;
let imageUrl;
const response = await axios.get(happyUrl); //https://www.imdb.com/list/ls008985796/
const $ = load(response.data);
const movies = $(".lister-item");
movies.each(function () {
title = $(this).find("h3 a").text();
description = $(this).find("p").eq(1).text();
imageUrl = $(this).find("a img").attr("src");
movieData.push({ title, description, imageUrl });
});
console.log(movieData);
} catch (e) {
console.error(e);
}
};
Here's the output I'm receiving:
And this is the website I'm scraping from
Now I need to get the src of that image, but it's returning something else, as shown in the output image.
The golden rule of Cheerio is "it doesn't run JS". As a result, devtools is often inaccurate since it shows the state of the page after JS runs.
Instead, either look at view-source:, disable JS or look at the HTML response printed from your terminal to get a more accurate sense of what's actually on the page (or not).
Looking at the source:
<img alt="Tokyo Story"
class="loadlate"
loadlate="https://m.media-amazon.com/images/M/MV5BYWQ4ZTRiODktNjAzZC00Nzg1LTk1YWQtNDFmNDI0NmZiNGIwXkEyXkFqcGdeQXVyNzkwMjQ5NzM#._V1_UY209_CR2,0,140,209_AL_.jpg"
data-tconst="tt0046438"
height="209"
src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png"
width="140" />
You can see src= is a placeholder image but loadlate is the actual URL. When the image is scrolled into view, JS kicks in and lazily loads the loadlate URL into the src attribute, leading to your observed devtools state.
The solution is to use .attr("loadlate"):
const axios = require("axios");
const cheerio = require("cheerio"); // 1.0.0-rc.12
const url = "<Your URL>";
const getHappyMovies = () =>
axios.get(url).then(({data: html}) => {
const $ = cheerio.load(html);
return [...$(".lister-item")].map(e => ({
title: $(e).find(".lister-item-header a").text(),
description: $(e).find(".lister-item-content p").eq(1).text().trim(),
imageUrl: $(e).find(".lister-item-image img").attr("loadlate"),
}));
});
getHappyMovies().then(movies => console.log(movies));
Note that I'm using class selectors which are more specific than plain tags.
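The same pattern shows up with other lazy-loading libraries under different attribute names, so a small fallback helper can make the scraper more robust (attribute names other than loadlate are common conventions, not something IMDb specifically uses):

```javascript
// Prefer the lazy-load attribute when present, fall back to src otherwise.
const realImageUrl = (attrs) =>
  attrs.loadlate ?? attrs["data-src"] ?? attrs.src;

// e.g. with cheerio: realImageUrl($(e).find("img")[0].attribs)
console.log(realImageUrl({ loadlate: "real.jpg", src: "placeholder.png" }));
```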
I am using async to fetch table tags on a website. It works great; however, it is putting all of the \r\n sequences at the bottom of my table. I can't figure out how to get rid of them in my .match(). Does anyone have any answers?
var fetchCommand = "https://api.allorigins.win/get?url=" + encodeURIComponent("sampleurl");
(async () => {
const response = await fetch(fetchCommand);
const text = await response.text();
let result = text.match(/(?<=\<table>).*(?=\<\/table>)/);
console.log(result);
let html_content = document.getElementById("table");
html_content.innerHTML = result;
return html_content;
})()
This works for me by parsing the response JSON with DOMParser() and then using querySelector() to get the table. You might want to look for all tables with querySelectorAll(). I also use outerHTML on the table element when adding it to the DOM, because innerHTML would strip the surrounding table tags.
function fetchTable() {
  fetch(`https://api.allorigins.win/get?url=${encodeURIComponent('https://www.w3schools.com/html/html_tables.asp')}`)
    .then((res) => res.json())
    .then((data) => {
      const parser = new DOMParser();
      const htmlDoc = parser.parseFromString(data.contents, 'text/html');
      const firstTable = htmlDoc.querySelector('table');
      // Target container in the current document
      const html_content = document.getElementById('table');
      html_content.innerHTML = firstTable.outerHTML;
    });
}
Edit: Here is a working example: https://jsfiddle.net/q0dj7rvz/
If you keep having issues with the end of line characters, please post the URL with the table.
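For what it's worth, those end-of-line characters appear because the allorigins /get endpoint wraps the page in JSON: reading it with response.text() leaves literal \r\n escape sequences inside the markup. Parsing the JSON as above is the real fix, but here is a sketch of a fallback cleaner:

```javascript
// Remove literal backslash-r-backslash-n sequences left over from reading
// a JSON-wrapped response as plain text.
const stripLiteralNewlines = (s) => s.replace(/\\r\\n/g, "");

console.log(stripLiteralNewlines("<tr>\\r\\n<td>x</td>\\r\\n</tr>"));
```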
I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade count for PSA 9 and 10. If you look at the HTML of the page, you will notice that PSA does a very bad job (IMO) of displaying the data. Every TR is a player, and the first TD is a card number. Let's just say I want to get card number 1, which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
const tr = await page.evaluate(() => {
const tds = Array.from(document.querySelectorAll('table tr'))
return tds.map(td => td.innerHTML)
});
const getName = tr.map(name => {
//const thename = Array.from(name.querySelectorAll('td.card-num'))
console.log("\n\n"+name+"\n\n");
})
await browser.close();
})();
I can get each TR printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that but got an error. Right now I am not selecting by player dynamically. The easiest way, I think, would be a function that selects the TR whose TD.card-num equals 1 for Kevin.
Any help with this would be amazing.
Thanks
Short answer: You can just copy and paste that into Excel and it pastes perfectly.
Long answer: If I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map over its td elements. I use cheerio as a helper. To complete it with puppeteer, just do html = await page.content() and then pass html into the cleaner I've written below:
const cheerio = require("cheerio")
const fs = require("fs");
const test = (html) => {
  const $ = cheerio.load(html);
  const array = $("tr").map((index, element) => {
    const card_num = $(element).find(".card-num").text().trim();
    const player = $(element).find("strong").text();
    const mini_array = $(element).find("td").map((ind, elem) => {
      return $(elem).find("span").text().trim();
    });
    return {
      card_num,
      player,
      column_nine: mini_array[13],
      column_ten: mini_array[14],
      total: mini_array[15],
    };
  });
  console.log(array[2]);
};

// e.g. with a saved copy of the page:
const html = fs.readFileSync("./test.html").toString();
test(html);
The code above will output the following:
{
card_num: '1',
player: 'Kevin Garnett',
column_nine: '1-0',
column_ten: '0--',
total: '100'
}
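Once the rows are plain objects like the one above, narrowing to a single card is an ordinary array operation (cheerio's .map result can be turned into a real array with .get()). A sketch with hypothetical data shaped like the output:

```javascript
// Rows shaped like the scraper's output (hypothetical sample data).
const rows = [
  { card_num: "1", player: "Kevin Garnett", total: "100" },
  { card_num: "2", player: "Another Player", total: "57" },
];

// Pick one card by its number, e.g. card 1 for Kevin Garnett.
const card = rows.find((row) => row.card_num === "1");
console.log(card.player);
```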
I want to parse some html with htmlparser2 module for Node.js. My task is to find a precise element by its ID and extract its text content.
I have read the documentation (quite limited) and I know how to set up my parser with the onopentag function, but it only gives access to the tag name and its attributes (I cannot see the text). The ontext function extracts all text nodes from the given HTML string but ignores all markup.
So here's my code.
const htmlparser = require("htmlparser2");
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const parser = new htmlparser.Parser({
onopentag: function(name, attribs){
if (attribs.id === "heading1"){
console.log(/*how to extract text so I can get "Some heading" here*/);
}
},
ontext: function(text){
console.log(text); // Some heading \n Foobar
}
});
parser.parseComplete(file);
I expect the output of the function call to be 'Some heading'. I believe there is some obvious solution, but somehow it eludes me.
Thank you.
You can do it like this using the library you asked about:
const htmlparser = require('htmlparser2');
const domUtils = require('domutils');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
var handler = new htmlparser.DomHandler(function(error, dom) {
if (error) {
console.log('Parsing had an error');
return;
} else {
const item = domUtils.findOne(element => {
const matches = element.attribs.id === 'heading1';
return matches;
}, dom);
if (item) {
console.log(item.children[0].data);
}
}
});
var parser = new htmlparser.Parser(handler);
parser.write(file);
parser.end();
The output you will get is "Some heading". However, you will, in my opinion, find it easier to use a querying library meant for this, such as cheerio, or a querySelector-style API such as node-html-parser (https://www.npmjs.com/package/node-html-parser) if you prefer native query selectors. Compare the code above to something leaner using node-html-parser, which supports querying directly:
const { parse } = require('node-html-parser');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const root = parse(file);
const text = root.querySelector('#heading1').text;
console.log(text);
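For completeness, the streaming callbacks from the question can also capture the text directly with a state flag. The handlers below contain the whole pattern as plain functions; the calls at the end simulate the events htmlparser2's Parser would emit for the question's input:

```javascript
// Set a flag when the target element opens, collect text while it is set,
// and clear it when the element closes.
let capture = false;
let captured = "";

const handlers = {
  onopentag(name, attribs) {
    if (attribs.id === "heading1") capture = true;
  },
  ontext(text) {
    if (capture) captured += text;
  },
  onclosetag() {
    capture = false;
  },
};

// Wire into the parser with: new htmlparser.Parser(handlers)
// Simulated events for '<h1 id="heading1">Some heading</h1><p>Foobar</p>':
handlers.onopentag("h1", { id: "heading1" });
handlers.ontext("Some heading");
handlers.onclosetag("h1");
handlers.onopentag("p", {});
handlers.ontext("Foobar");
handlers.onclosetag("p");

console.log(captured); // "Some heading"
```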
I would like to get all the elements in my DOM with a specific css path:
var elements = await chromeless.evaluate(() => document.querySelectorAll('div a'))
console.log(elements[0].innerHTML)
console.log(elements[1].innerHTML)
but this code gives me the error "Object reference chain is too long" on the first line
This code works though:
var element = await chromeless.evaluate(() => document.querySelectorAll('div a')[0].innerHTML)
console.log(element)
and I could potentially use a loop to retrieve them all but I have no idea how many elements have this css in my DOM so I don't know how many times to loop.
What's the correct syntax to get all the desired elements?
const elements = await chromeless.evaluateHandle(() => {
const allOccurances = [...document.querySelectorAll("div a")];
const data = allOccurances.map((node) => node.innerHTML);
return data;
});
const response = await elements.jsonValue();
console.log(response);
Instead of chromeless, we can use a puppeteer page object, created as described in the puppeteer documentation: https://pptr.dev/#?product=Puppeteer&version=v13.1.3&show=api-class-page
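The underlying rule is that only JSON-serializable values can cross the browser/Node boundary, which is why mapping the nodes to strings inside the page context works while returning the NodeList itself fails. A sketch of the in-page mapping (plain objects stand in for real DOM nodes):

```javascript
// Inside evaluate(), reduce DOM nodes to plain strings before returning.
const extractInnerHtml = (nodes) => nodes.map((node) => node.innerHTML);

// Stand-ins for real anchors; in the page this would be
// [...document.querySelectorAll("div a")].
const fakeAnchors = [{ innerHTML: "first" }, { innerHTML: "second" }];
console.log(extractInnerHtml(fakeAnchors));
```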