I am trying to scrape a webpage, and after some searching I learned about the fetch API.
I have fetched a webpage from a URL using fetch(), then parsed the page into a DOM object, so I now have the whole webpage in a DOM object. Can I apply jQuery functions to it?
My code:
async function getProductData(url)
{
try {
const resp = await fetch(url);
var respText = await resp.text();
var parser = new DOMParser();
var doc = parser.parseFromString(respText, 'text/html')
// I am trying to do something like that. is it possible to do so ?
$(doc).ready( function(){
console.log( $( this) .find( $("#productTitle") ).text() );
});
}
catch (error) {
console.log(error);
}
}
.ready is not mandatory for me; I just need to extract some data from the doc object. If there is a better way to fetch data from a webpage, please let me know.
Thank you so much.
You do not need jQuery here:
const resp = await fetch(url);
const respText = await resp.text();
const parser = new DOMParser();
const doc = parser.parseFromString(respText, 'text/html');
console.log(doc.querySelector('#productTitle').innerText);
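One caveat with the snippet above: querySelector returns null when nothing matches, so reading .innerText directly would throw if the page has no #productTitle element. A minimal sketch of a null-safe lookup follows. textOf is a hypothetical helper name, and textContent is used because it also works on a parsed document that is never rendered.

```javascript
// Hypothetical helper: returns the trimmed text of the first match, or null.
function textOf(root, selector) {
  const el = root.querySelector(selector);
  return el ? el.textContent.trim() : null;
}

// In a browser, with `doc` produced by DOMParser as in the answer above:
// console.log(textOf(doc, '#productTitle'));
```

Returning null instead of throwing lets the caller decide what a missing element means, for example a changed page layout or a blocked request.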
Related
For a project I am working on, I want to extract all EAN numbers from a list of different URLs in Google Sheets.
I am using the UrlFetchApp fetch method to get the HTML of the link, but when I try to select the element I want to use in my script, I get the following error:
TypeError: html.getElementById is not a function
My code
function scrapeWebsite() {
var response = UrlFetchApp.fetch('https://www.bol.com/nl/nl/p/azul-tuin-van-de-koningin-bordspel/9300000094065315/?promo=main_860_product_4&bltgh=kj5mIYO78dnIY1vUCvxPbg.18_19.24.ProductTitle');
var html = response.getContentText();
var element = html.getElementById('test')
}
I think you need to await that fetch. Then take the response text and assign it to the innerHTML of a placeholder element, like below:
async function scrapeWebsite() {
  let response = await UrlFetchApp.fetch('https://www.bol.com/nl/nl/p/azul-tuin-van-de-koningin-bordspel/9300000094065315/?promo=main_860_product_4&bltgh=kj5mIYO78dnIY1vUCvxPbg.18_19.24.ProductTitle');
  // getContentText() returns the HTML as a plain string, which has no DOM methods
  let html = response.getContentText();
  // Parse the string by assigning it to a placeholder element (this needs a browser-style document object)
  let el = document.createElement( 'html' );
  el.innerHTML = html;
  let element = el.querySelector('#test');
}
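If no DOM is available where the script runs (Apps Script executes on Google's servers, not in a browser), another option is to search the HTML string directly. The sketch below is only an assumption about how that could look: it collects every standalone 13-digit run and keeps those whose EAN-13 check digit is valid. findEan13 and isValidEan13 are hypothetical names, and the question does not confirm that the page exposes the EAN as plain digits.

```javascript
// EAN-13 check: weights 1,3,1,3,... over the first 12 digits; the 13th is the check digit.
function isValidEan13(code) {
  if (!/^\d{13}$/.test(code)) return false;
  let sum = 0;
  for (let i = 0; i < 12; i++) {
    sum += Number(code[i]) * (i % 2 === 0 ? 1 : 3);
  }
  return (10 - (sum % 10)) % 10 === Number(code[12]);
}

// Returns the unique, checksum-valid 13-digit numbers found in an HTML string.
function findEan13(html) {
  // The lookarounds skip digit runs that are longer than 13 characters.
  const candidates = html.match(/(?<!\d)\d{13}(?!\d)/g) || [];
  return [...new Set(candidates)].filter(isValidEan13);
}

// In Apps Script this could be called as: findEan13(response.getContentText())
```

The checksum filter is there to drop unrelated 13-digit numbers such as timestamps or internal ids; it cannot prove that a match is the product's EAN.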
I have this web page response :
{"Status":"OK","RequestID":"xxxxxxxxxx","Results":[{"SubscriberKey":"teste132133","Client":null,"ListID":0,"CreatedDate":"0001-01-01T00:00:00.000","Status":"Active","PartnerKey":null,"PartnerProperties":null,"ModifiedDate":null,"ID":0,"ObjectID":null,"CustomerKey":null,"Owner":null,"CorrelationID":null,"ObjectState":null,"IsPlatformObject":false}],"HasMoreRows":false}
I would like to retrieve just the SubscriberKey, like: "SubscriberKey":"teste132133"
I'm trying to use ParseJSON, but I believe I'm doing something wrong that I can't identify.
Here is the code:
<script language="javascript" runat="server">
Platform.Load("Core","1");
var response = HTP.Get("https://xxxxxxxx.pub.sfmc-content.com/vjpsefyn1jp"); //web page link
var obj = Platform.Function.ParseJSON(response);
Write(obj.Results[0].SubscriberKey)
</script>
I only know client-side JavaScript, but maybe this will work for you. It uses fetch to get the response and then extracts the JSON value. It uses an asynchronous function so we can use await to make the code more readable.
<script type="module">
async function getKey() {
const response = await fetch("https://xxxxxxxx.pub.sfmc-content.com/vjpsefyn1jp")
const json = await response.json()
const Results = json.Results
const key = Results[0].SubscriberKey
return key;
}
const key = await getKey();
console.log(`The key is: ${key}`);
</script>
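If the response body is already available as a string, as in the question, the same extraction can be done with JSON.parse. The sketch below is a hedged illustration: getSubscriberKey is a hypothetical helper, and the sample string is a shortened version of the response shown in the question.

```javascript
// Hypothetical helper: returns the first SubscriberKey in the payload, or null.
function getSubscriberKey(jsonText) {
  const data = JSON.parse(jsonText);
  const results = Array.isArray(data.Results) ? data.Results : [];
  return results.length > 0 ? results[0].SubscriberKey ?? null : null;
}

const sample = '{"Status":"OK","Results":[{"SubscriberKey":"teste132133","Status":"Active"}],"HasMoreRows":false}';
console.log(getSubscriberKey(sample)); // teste132133
```

Checking that Results is a non-empty array first avoids a TypeError when the service returns no rows.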
I am trying to fetch a Google Shopping page with products, and I need to get the thumbnails. The problem is that the images are base64-encoded and the response contains truncated data in the src attribute of the images. Instead of the full data there is ///////
src=""
Here is my code
let title = "RockDove Men's Original Two-Tone Memory Foam Slipper";
let urlparse = "https://www.google.com/search?tbm=shop&tbs=vw:g&q=" +
encodeURIComponent(title);
fetch(urlparse)
.then(data => {
return data.text();
})
.then(htmlString => {
// parsing html string into DOM
let parser = new DOMParser();
let doc = parser.parseFromString(htmlString, "text/html");
// retrieve products data from DOM
let products = doc.querySelectorAll(".sh-pr__product-results > div");
let productsArr = Array.prototype.slice.call(products);
let productsData = productsArr.map(el => {
return el.querySelector(".sh-dgr__thumbnail").innerHTML;
});
console.log(productsData);
});
I also tried to use .blob() instead of .text() and then FileReader to read from the Blob object, but the result is the same.
I'm querying Solr 7.5 for some large objects and would like to render them in a browser UI as they are returned.
What are my options for reading the response bit by bit when using the select request handler?
I don't think there is anything native to Solr to do what you are asking.
One approach to handle this would be to return only the ID of the documents that match the criteria in your query (and not include the heavy part of the document) and then fetch the large part of the document asynchronously from the client.
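A rough sketch of that two-step approach, under stated assumptions: the unique key field is called id and the large field is called content (both hypothetical, so use the names from your schema), and the endpoint is a placeholder. fl is Solr's field-list parameter, and the response.docs path follows Solr's standard JSON response layout.

```javascript
const SOLR_SELECT = 'https://my-solr7/people/select'; // placeholder endpoint

// Step 1: ask only for the ids of matching documents (fl limits the returned fields).
function idQueryUrl(query, rows = 10) {
  const url = new URL(SOLR_SELECT);
  url.searchParams.set('q', query);
  url.searchParams.set('fl', 'id');
  url.searchParams.set('rows', String(rows));
  url.searchParams.set('wt', 'json');
  return url.toString();
}

// Step 2: fetch the heavy field of a single document by id.
// Assumes ids contain no double quotes.
function documentUrl(id) {
  const url = new URL(SOLR_SELECT);
  url.searchParams.set('q', `id:"${id}"`);
  url.searchParams.set('fl', 'id,content');
  url.searchParams.set('wt', 'json');
  return url.toString();
}

// Render each document as soon as its own request completes.
async function loadAndRender(query, render) {
  const idResponse = await fetch(idQueryUrl(query));
  const ids = (await idResponse.json()).response.docs.map((d) => d.id);
  await Promise.all(ids.map(async (id) => {
    const docResponse = await fetch(documentUrl(id));
    render((await docResponse.json()).response.docs[0]);
  }));
}
```

The first request stays small because it carries ids only, and the UI can show each large document as its own request finishes instead of waiting for the whole result set.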
I was looking in the wrong place. I just needed to read up on the Web API fetch().
response.json() reads the response to completion.
response.body.getReader() lets you grab the stream in chunks and decode it from there.
let test = 'https://my-solr7/people/select?q=something'
fetchStream(test);
function fetchStream(uri){
  const options = {
    method: 'GET',
  };
  const decoder = new TextDecoder();
  fetch(uri, options)
    .then( (response) => {
      const reader = response.body.getReader();
      // Handle one chunk, then ask the reader for the next one until the stream is done
      const read = (result) => {
        if (result.done) return;
        // stream: true lets multi-byte characters that span two chunks decode correctly
        const chunk = decoder.decode(result.value, {stream: true});
        console.log(chunk);
        return reader.read().then(read);
      };
      return reader.read().then(read);
    });
}
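The same read loop can also be written with async iteration, which some find easier to follow. This is a minimal sketch, not the only way to do it: textChunks is a hypothetical helper name, and it takes the stream (response.body) rather than the URL so it can be reused with any ReadableStream of bytes.

```javascript
// Hypothetical helper: yields decoded text chunks from a ReadableStream of bytes.
async function* textChunks(body) {
  const reader = body.getReader();
  const decoder = new TextDecoder(); // UTF-8 by default
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps an incomplete multi-byte sequence buffered until the next chunk
      const text = decoder.decode(value, { stream: true });
      if (text) yield text;
    }
    // Flush anything the decoder is still holding
    const tail = decoder.decode();
    if (tail) yield tail;
  } finally {
    reader.releaseLock();
  }
}

// Usage sketch, with the URL from the answer above and a hypothetical render function:
// const response = await fetch('https://my-solr7/people/select?q=something');
// for await (const chunk of textChunks(response.body)) render(chunk);
```

Note that the chunks are arbitrary slices of the response, so a single chunk is usually not valid JSON on its own; the caller still has to buffer or use an incremental parser before rendering objects.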
I'm having an issue rendering a PDF using EVOPdf from a WebAPI controller to an AngularJS app.
This is my code so far:
Angular call:
var url = 'api/form/build/' + id;
$http.get(url, null, { responseType: 'arraybuffer' })
.success(function (data) {
var file = new Blob([data], { type: 'application/pdf' });
if (window.navigator && window.navigator.msSaveOrOpenBlob) {
window.navigator.msSaveOrOpenBlob(file);
}
else {
var objectUrl = URL.createObjectURL(file);
window.open(objectUrl);
}
});
APIController method:
var url = "http://localhost/index.html#/form/build/" + id;
#region PDF Document Setup
HtmlToPdfConverter htmlToPdfConverter = new HtmlToPdfConverter();
htmlToPdfConverter.LicenseKey = "4W9+bn19bn5ue2B+bn1/YH98YHd3d3c=";
//htmlToPdfConverter.HtmlViewerWidth = 1024; //default
htmlToPdfConverter.PdfDocumentOptions.PdfPageSize = PdfPageSize.A4;
htmlToPdfConverter.PdfDocumentOptions.PdfPageOrientation = PdfPageOrientation.Portrait;
htmlToPdfConverter.ConversionDelay = 3;
htmlToPdfConverter.MediaType = "print";
htmlToPdfConverter.PdfDocumentOptions.LeftMargin = 10;
htmlToPdfConverter.PdfDocumentOptions.RightMargin = 10;
htmlToPdfConverter.PdfDocumentOptions.TopMargin = 10;
htmlToPdfConverter.PdfDocumentOptions.BottomMargin = 10;
htmlToPdfConverter.PdfDocumentOptions.TopSpacing = 10;
htmlToPdfConverter.PdfDocumentOptions.BottomSpacing = 10;
htmlToPdfConverter.PdfDocumentOptions.ColorSpace = ColorSpace.RGB;
// Set HTML content destination in PDF page
htmlToPdfConverter.PdfDocumentOptions.Width = 640;
htmlToPdfConverter.PdfDocumentOptions.FitWidth = true;
htmlToPdfConverter.PdfDocumentOptions.StretchToFit = true;
#endregion
byte[] outPdfBuffer = htmlToPdfConverter.ConvertUrl(url);
string outPdfFile = @"c:\temp\forms\" + id + ".pdf";
System.IO.File.WriteAllBytes(outPdfFile, outPdfBuffer);
HttpResponseMessage result = null;
result = Request.CreateResponse(HttpStatusCode.OK);
result.Content = new ByteArrayContent(outPdfBuffer.ToArray());
result.Content.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment");
result.Content.Headers.ContentDisposition.FileName = "filename.pdf";
result.Content.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");
return result;
When I check the PDF that I write out using WriteAllBytes, it renders perfectly but when it is returned via the Angular call and opened in Adobe Reader, I get an "Invalid Color Space" error message that pops up quite a few times, but the document is not opened. When I change the colorspace to GrayScale, the PDF opens but it's blank.
I have a feeling that it's the ByteArrayContent conversion that's causing the issue, seeing as that's the only thing that happens between the actual creation of the PDF and sending it back to the Angular call, but I've hit a brick wall and can't figure out what the problem is.
I'd really appreciate any help you guys can offer because I'm so close to sorting this out and I just need the document to "convert" properly when returned from the call.
Thanks in advance for any help.
Regards,
Johann.
The problem seems to lie on the client side: the characters are not properly parsed in the response. For anyone struggling with this, I found my solution here: SO Question
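To illustrate why decoding the response as text breaks the file: bytes that are not valid UTF-8 are replaced during decoding, so the original PDF bytes cannot be recovered afterwards. The sketch below uses a few made-up bytes standing in for a PDF; it is not the code from the linked answer. In AngularJS, the usual way to avoid the text decoding is to pass responseType: 'arraybuffer' in the config object, which for $http.get is the second argument.

```javascript
// Made-up sample: "%PDF" followed by four high bytes standing in for binary PDF content.
const pdfLikeBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xe2, 0xe3, 0xcf, 0xd3]);

// What happens when the body is treated as text: invalid sequences become U+FFFD.
const asText = new TextDecoder('utf-8').decode(pdfLikeBytes);
const reencoded = new TextEncoder().encode(asText);
console.log(pdfLikeBytes.length, reencoded.length); // the lengths differ, so the data is corrupted

// Keeping the data binary preserves every byte.
const blob = new Blob([pdfLikeBytes.buffer], { type: 'application/pdf' });
console.log(blob.size === pdfLikeBytes.length); // true

// AngularJS sketch (config is the second argument of $http.get):
// $http.get(url, { responseType: 'arraybuffer' })
```

This would also explain why the file written on the server with WriteAllBytes opens fine while the copy that went through the browser does not.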
Have you tried Headless Chrome? Here is a nice article about this topic. I was using https://github.com/puppeteer/puppeteer for this purpose and it was an easily integrated solution.
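Install the puppeteer npm package (the examples below require 'puppeteer', not 'puppeteer-core'):
npm i puppeteer
# or "yarn add puppeteer"
Example - navigate to https://example.com and save a screenshot as example.png.
Save file as example.js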
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
await page.screenshot({ path: 'example.png' });
await browser.close();
})();
Execute script on the command line
node example.js
Puppeteer sets an initial page size to 800×600px, which defines the screenshot size. The page size can be customized with Page.setViewport().
Example - create a PDF.
Save file as hn.js
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {
waitUntil: 'networkidle2',
});
await page.pdf({ path: 'hn.pdf', format: 'a4' });
await browser.close();
})();
Execute script on the command line
node hn.js
See Page.pdf() for more information about creating pdfs.
Example - evaluate script in the context of the page
Save file as get-dimensions.js
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
// Get the "viewport" of the page, as reported by the page.
const dimensions = await page.evaluate(() => {
return {
width: document.documentElement.clientWidth,
height: document.documentElement.clientHeight,
deviceScaleFactor: window.devicePixelRatio,
};
});
console.log('Dimensions:', dimensions);
await browser.close();
})();
Execute script on the command line
node get-dimensions.js
See Page.evaluate() for more information on evaluate and related methods like evaluateOnNewDocument and exposeFunction.