Getting rid of \r\n text showing up in web scraper

I am using an async function with fetch to pull the table tags out of a website. It works, but literal \r\n text ends up at the bottom of my table, and I can't figure out how to get rid of it in my .match(). Does anyone have an answer?
var fetchCommand = "https://api.allorigins.win/get?url=" + encodeURIComponent("sampleurl");
(async () => {
    const response = await fetch(fetchCommand);
    const text = await response.text();
    let result = text.match(/(?<=\<table>).*(?=\<\/table>)/);
    console.log(result);
    let html_content = document.getElementById("table");
    html_content.innerHTML = result;
    return html_content;
})()

It works for me if I parse the response as JSON, pass its contents field to DOMParser(), and then use querySelector() to get the table. You might want querySelectorAll() if you need every table. I also use outerHTML on the table element when adding it to the DOM, because innerHTML returns only the table's contents without the <table> tag itself.
function fetchTable() {
    // Element that will receive the table, as in the question
    const html_content = document.getElementById('table');
    fetch(`https://api.allorigins.win/get?url=${encodeURIComponent('https://www.w3schools.com/html/html_tables.asp')}`)
        .then((res) => {
            // The proxy wraps the page in JSON; the HTML is in the "contents" field
            return res.json();
        })
        .then((data) => {
            const parser = new DOMParser();
            const htmlDoc = parser.parseFromString(data.contents, 'text/html');
            const firstTable = htmlDoc.querySelector('table');
            html_content.innerHTML = firstTable.outerHTML;
        });
}
Edit: Here is a working example: https://jsfiddle.net/q0dj7rvz/
If you keep having issues with the end of line characters, please post the URL with the table.
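A likely reason the \r\n text appears in the first place, assuming the proxy wraps the page in a JSON envelope with a contents field (as data.contents in the answer suggests): response.text() returns the raw JSON, where line breaks are still written as the escape sequence \r\n. Parsing the JSON decodes them into real line breaks. The sketch below uses a made-up payload instead of a live request, and all variable names are illustrative:

```javascript
// Hypothetical payload shaped like the proxy's JSON envelope: { contents: "<html>" }.
const rawBody = JSON.stringify({ contents: "<table>\r\n<tr><td>1</td></tr>\r\n</table>" });

// Read as plain text, the body still contains the escape sequences as literal characters.
const hasLiteralEscapes = rawBody.includes("\\r\\n");

// Parsing the JSON turns them into real line breaks, which HTML treats as whitespace.
const contents = JSON.parse(rawBody).contents;
const hasRealLineBreaks = contents.includes("\r\n");

// With real line breaks, "." no longer spans lines, so a regex needs [\s\S] instead.
const match = contents.match(/<table>([\s\S]*?)<\/table>/);
const tableInner = match ? match[1] : null;

console.log(hasLiteralEscapes, hasRealLineBreaks, tableInner !== null);
```

Parsing the decoded markup with DOMParser, as in the answer above, is usually more robust than a regex once the table has attributes or nested tables.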

Related

Parse html from JSON file

I'm trying to get HTML tags to work in a JSON file that I fetch via JS.
I want the <strong> tag in the returned value to take effect when it is rendered on the page. How would I do that?
Sample of the json:
{
"header_title": "<strong>test</strong>"
}
JS:
const newTranslations = await fetchTranslationsFor(
    newLocale,
);
async function fetchTranslationsFor(newLocale) {
    const response = await fetch('/lang/en.json');
    return await response.json();
}
To render it I do something like this (pseudocode):
element.innerText = json.myprop;

Change innerText to innerHTML. innerText treats the value as plain text, so the tags are displayed literally instead of being interpreted. innerHTML parses the string as HTML and renders it.
element.innerHTML = json.myprop;

Puppeteer returning undefined (JS) using xPath

I'm trying to scrape an element on this website.
My JS code:
const puppeteer = require("puppeteer");
const url = 'https://magicseaweed.com/Bore-Surf-Report/1886/'
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const title = await page.$x('/html/body/div[1]/div[2]/div[2]/div/div[2]/div[2]/div[2]/div/div/div[1]/div/header/h3/div[1]/span[1]')
let text = await page.evaluate(res => res.textContext, title[0])
console.log(text) // UNDEFINED
text is undefined. What is the problem here? Thanks.

I think there are two issues in your code:
The property is textContent, not textContext.
The XPath does not point at the content you want.
For the content you want, the XPath should be:
const title = await page.$x('/html/body/div[1]/div[2]/div[2]/div/div[2]/div[2]/div[2]/div/div/div[1]/div/div[1]/div[1]/div/div[2]/ul[1]/li[1]/text()')
And to get the content of this:
const text = await page.evaluate(el => {
    return el.textContent.trim()
}, title[0])
Note that you need to pass title[0] as an argument to the page function.
OR
If you don't need XPath, it seems you could find the element directly by class name:
const rating = await page.evaluate(() => {
    // Plain DOM lookup, so it does not depend on the page providing a $ helper
    return document.querySelector('.rating.rating-large.clearfix > li.rating-text').textContent.trim()
})

Xpath doesn't recognize anchor tag?

I'm running some Node.js code to scrape a website and return some text from this part of the html:
And here's the code I'm using to get it
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;
const axios = require('axios');
(async () => {
    const response = await axios.get('https://www.aritzia.com/en/product/sculpt-knit-tank-%28arjun-knit-top%29/66139.html?dwvar_66139_color=17388');
    const html = response.data;
    const document = parse5.parse(html.toString());
    const xhtml = xmlser.serializeToString(document);
    const doc = new dom().parseFromString(xhtml);
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:div[contains(@class, 'pdp-product-brand')]/*/text()", doc);
    console.log(nodes.length ? nodes[0].nodeValue : nodes.length)
})();
The code above works as expected -- it prints Babaton.
But when I swap out the xpath above for one that includes a instead of * (i.e. //x:div[contains(@class, 'pdp-product-brand')]/a/text()) it instead tells me that nodes.length === 0.
I would expect it to give the same result because the div that it's pointing to does in fact have a child anchor tag (see screenshot above). I'm just confused why it doesn't work with a and was wondering if anybody else knew the answer. Thanks!

Web Scrape with Puppeteer within a table

I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade count for PSA 9 and 10. If we look at the HTML of the page, you will notice that PSA does a very bad job (IMO) at displaying the data. Every TR is a player. And the first TD is a card number. Let's just say I want to get Card Number 1 which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
    const tr = await page.evaluate(() => {
        const tds = Array.from(document.querySelectorAll('table tr'))
        return tds.map(td => td.innerHTML)
    });
    const getName = tr.map(name => {
        //const thename = Array.from(name.querySelectorAll('td.card-num'))
        console.log("\n\n"+name+"\n\n");
    })
    await browser.close();
})();
Each TR gets printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that but got an error. As of right now, I am not selecting the player dynamically. The easiest way, I would think, is a function that gets a specific card by selecting the TR whose TD.card-num equals 1 for Kevin.
Any help with this would be amazing.
Thanks

Short answer: You can just copy and paste that into Excel and it pastes perfectly.
Long answer: If I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map each td. I use cheerio as a helper. To combine it with Puppeteer, get the markup with html = await page.content() and pass html into the cleaner I've written below:
const cheerio = require("cheerio")
const fs = require("fs");
const test = (html) => {
    const $ = cheerio.load(html);
    // Build one object per table row
    const array = $("tr").map((index, element) => {
        const card_num = $(element).find(".card-num").text().trim()
        const player = $(element).find("strong").text()
        // Text of the span inside each cell of this row
        const mini_array = $(element).find("td").map((ind, elem) => {
            return $(elem).find("span").text().trim()
        }).get()
        return {
            card_num,
            player,
            column_nine: mini_array[13],
            column_ten: mini_array[14],
            total: mini_array[15]
        }
    }).get()
    console.log(array[2])
}
// Pass in the page markup, here read from a saved copy (or use await page.content())
test(fs.readFileSync("./test.html").toString())
The code above will output the following:
{
card_num: '1',
player: 'Kevin Garnett',
column_nine: '1-0',
column_ten: '0--',
total: '100'
}
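To pick a card dynamically rather than by array position, the row objects can be searched by card number. This is a minimal sketch that assumes the rows are available as a plain array of objects shaped like the output above; findCard and the sample rows are hypothetical and only for illustration:

```javascript
// Hypothetical sample rows, shaped like the objects built by the cleaner above.
const rows = [
    { card_num: "", player: "", column_nine: "", column_ten: "", total: "" },
    { card_num: "1", player: "Kevin Garnett", column_nine: "1-0", column_ten: "0--", total: "100" },
];

// Return the first row whose card number matches, or undefined when there is none.
// The card number is compared as a string because it comes from element text.
function findCard(allRows, cardNum) {
    return allRows.find((row) => row.card_num === String(cardNum));
}

const kevin = findCard(rows, 1);
console.log(kevin ? kevin.player : "not found");
```

Header or spacer rows end up with an empty card_num, so they never match a real card number.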

Translate all innerText in document html with google translate api

I am making a request to this URL to translate text from English to Spanish:
URL: https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=Hello
I am indeed getting the text translated to Spanish. Now I want to dynamically collect all the innerText in the document body and then put the translated text back. How can I do this?
In simple words, I want to dynamically translate the website with a button click.
This is my example code to start:
let textToBeTranslate = ["hello", "thanks", "for", "help me"]
var url = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=" + textToBeTranslate;
fetch(url)
    .then(data => data.json()).then(data => {
        // Text translated to Spanish
        var textTranslated = data[0][0][0].split(", ");
        console.log(textTranslated)
        // output: ["hola gracias por ayudarme"]
        // Now I want to dynamically put the translated text back into the body tag
    }).catch(error => {
        console.error(error)
    });

Try this:
const translateElement = async element => {
    const
        elementNode = element.childNodes[0],
        sourceText = elementNode && elementNode.nodeValue;
    // Skip elements whose first child is not a non-empty text node
    if (sourceText && sourceText.trim()) {
        try {
            const
                // Encode the text so spaces, & and # survive in the query string
                url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=' + encodeURIComponent(sourceText),
                response = await fetch(url),
                result = await response.json();
            // result[0][0][0] is the translated text, as in the question
            elementNode.nodeValue = result[0][0][0];
        } catch (error) {
            console.error(error);
        }
    }
}
For a single element - Just call it, like this:
(async () => await translateElement(document.body))();
For all elements in the DOM - You will need to recursively go over all elements starting from the desired parent tag (body, in your case), and call the above function for each element, like this:
(async () => {
    const
        parent = 'body',
        selector = `${parent}, ${parent} *`,
        elements = [...document.querySelectorAll(selector)],
        promises = elements.map(translateElement);
    await Promise.all(promises);
})();
Remarks:
I used childNodes[0].nodeValue instead of innerHTML or innerText to keep the child elements.
Note that going over the entire DOM is not recommended and can lead to problems such as changing the contents of script and style tags.
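One way to reduce that risk is to filter the element list by tag name before translating. The sketch below is only an illustration: SKIPPED_TAGS and isTranslatable are hypothetical names, and the tag list is an assumption, not an exhaustive one.

```javascript
// Tag names whose text should not be sent for translation (assumed, non-exhaustive list).
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEXTAREA"]);

// Works on anything with a tagName property, so it can be checked outside a browser.
function isTranslatable(element) {
    return !SKIPPED_TAGS.has(String(element.tagName).toUpperCase());
}

// Stand-ins for DOM elements; in the browser the list would come from
// [...document.querySelectorAll("body, body *")].
const sample = [{ tagName: "P" }, { tagName: "SCRIPT" }, { tagName: "style" }];
const translatable = sample.filter(isTranslatable);
console.log(translatable.map((el) => el.tagName));
```

In the browser, the filtered list would then be mapped through translateElement instead of the full selector result.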
