I have this link https://nfse.blumenau.sc.gov.br/contrib/app/nfse/rel/rp_nfse_v23.aspx?s=61154301&e=00165960000101&f=2BED3D1E8 (if you try to access it, it's going to ask you to solve a captcha, but since I already have the session, Playwright doesn't need to worry about it).
When I call page.goto I get:
page.goto: net::ERR_ABORTED at https://nfse.blumenau.sc.gov.br/contrib/app/nfse/rel/rp_nfse_v23.aspx?s=61154301&e=00165960000101&f=2BED3D1E8
Does anybody know why Playwright cannot access it? I need to download the PDF buffer of this link.
You can use Playwright's request API on the browser context. Something like this:
const fetchResponse = await browserContext.request.get('https://nfse.blumenau.sc.gov.br/contrib/app/nfse/rel/rp_nfse_v23.aspx?s=61154301&e=00165960000101&f=2BED3D1E8');
const pdfBuffer = await fetchResponse.body();
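If you then need to save it, a minimal sketch (the fs import and the output file name are assumptions):
const fs = require('fs');
fs.writeFileSync('nfse.pdf', pdfBuffer); // pdfBuffer from the request above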
Found a Solution:
First, I concatenated the session cookie and set it on the cookie jar for the final URL:
const cook = 'ASP.NET_SessionId=' + cookie;
await setCookie(cook, urlFinal);
Then I used the got module to send the cookie and get the buffer of the PDF:
response = await got(urlFinal, {cookieJar}).buffer();
Plus: sometimes it returned a blank PDF (I think because of a timeout while loading), so I added a loop that checks the size of the buffer and retries until it is longer than a threshold:
for (let j = 0; j <= 25; j++) {
    console.log('Loop attempt ==> ' + j);
    response = await got(urlFinal, { cookieJar }).buffer();
    if (response.length >= 10000) {
        break; // buffer is large enough to be a real PDF
    }
}
console.log('Buffer size ==> ' + response.length);
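The same retry can be packaged a bit more cleanly (a sketch; fetchPdfWithRetry is a hypothetical helper name, and the 10000-byte threshold is the one used above):
async function fetchPdfWithRetry(url, cookieJar, maxTries = 20) {
    let buffer;
    for (let attempt = 1; attempt <= maxTries; attempt++) {
        buffer = await got(url, { cookieJar }).buffer();
        if (buffer.length >= 10000) break; // large enough to be a real PDF
    }
    return buffer;
}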
I had already tried the following, but it didn't work either:
const cookie = (await page.context().cookies()).filter(cookie => cookie.name === 'ASP.NET_SessionId')
.map(cookie => cookie.value)[0];
console.log(cookie);
const cookieJar = new CookieJar();
const setCookie = promisify(cookieJar.setCookie.bind(cookieJar));
await setCookie('ASP.NET_SessionId=' + cookie, urlFinal);
const response = await got(urlFinal, {cookieJar}).buffer();
It's really a challenge, because if I don't go through page.goto I lose the session. The code below would solve the problem:
await page.goto(urlFinal);
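For what it's worth, net::ERR_ABORTED is also what Playwright reports when a navigation turns into a file download. In that case the download event is the supported way to capture the file; a minimal sketch (assuming the context was created with downloads accepted):
// goto is expected to abort when the response is a download,
// so swallow its error and wait for the download event instead.
const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.goto(urlFinal).catch(() => {}),
]);
const pdfPath = await download.path(); // path to the downloaded temp file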
I am a beginner at JavaScript.
I am trying to read data from Google Sheets using the google-spreadsheet API (https://www.npmjs.com/package/google-spreadsheet).
For some reason, console.log is not working in the code below.
Also, all the Google Sheet info is being printed to the console, which is so much output that I am not able to see the full log.
const { GoogleSpreadsheet } = require("google-spreadsheet");
const doc = new GoogleSpreadsheet("****************-*******-*******");
const serviceAccountCreds = require("./serviceAccountCredentials.json");
let expenseSheet;
// Load envs in process
process.env["GOOGLE_SERVICE_ACCOUNT_EMAIL"] = serviceAccountCreds.client_email;
process.env["GOOGLE_PRIVATE_KEY"] = serviceAccountCreds.private_key;
// authenticate to the google sheet
const authenticate = async () => {
    await doc.useServiceAccountAuth({
        client_email: process.env.GOOGLE_SERVICE_ACCOUNT_EMAIL,
        private_key: process.env.GOOGLE_PRIVATE_KEY,
    });
};
authenticate();
//load Spreadsheet doc info
const docInfo = async () => {
    await doc.loadInfo(); // loads document properties and worksheets
    console.log(doc.title); // This line does not seem to work
};
docInfo();
I need help with two things:
1. console.log does not seem to print anything.
2. The Google Sheet information is getting printed, which I don't want.
Thanks in advance.
To check the console.log output, I tried redirecting it to a file using the command below on Windows:
nodejs readSheet.js > path_to_file
Even though the file is generated, there is no log in it.
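One likely cause worth noting: authenticate() and docInfo() are both async but are called without await, so doc.loadInfo() can run before authentication has finished, and any rejection disappears silently. A minimal sketch that sequences the calls (main is a hypothetical wrapper name):
const main = async () => {
    await authenticate();   // finish auth first
    await doc.loadInfo();   // then load document properties
    console.log(doc.title); // the title should be available now
};
main().catch((err) => console.error(err)); // surface any hidden errors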
I use Puppeteer to get website HTML and then scrape data with Cheerio. Here is part of my code. It works fine almost every time, but sometimes I get undefined for companyAddress and companyIntro. At first I thought it might be due to differences between pages, but it happens even when I scrape the same page at different times (most of the time I get the data, but sometimes it is undefined). The page renders successfully, and the attributes and their values are confirmed to exist via DevTools.
I wonder what the reason behind this is. Could it be a problem with Puppeteer during fetching? The Cheerio code is synchronous, so I don't think Cheerio is the problem. I never get the error cannot read attr('profile') of undefined, which means the header element exists, but I do get the error substring() of undefined. That is why I put a condition in front of it as a check.
const puppeteer = require('puppeteer')
const cheerio = require('cheerio')
const baseUrl = 'https://www.104.com.tw'

const sleep = (millisecond) => {
    return new Promise((resolve) => setTimeout(resolve, millisecond))
}

const scrapeCompanyPage = async (dataList, page) => {
    for (let i = 0; i < dataList.length; i++) {
        await page.goto(dataList[i].companyUrl)
        const html = await page.content()
        const $ = cheerio.load(html)
        const header = $('div.header')
        // sometimes the company data below is undefined, even though header exists
        dataList[i].companyAddress = header.attr('address') ? header.attr('address') : null
        dataList[i].companyIntro = header.attr('profile') ? header.attr('profile').substring(0, 50) : null
        await sleep(1000)
    }
    return dataList
}
The website this section of code scrapes is this: https://www.104.com.tw/company/1a2x6bk72b?jobsource=2018indexpoc
The content is different for different companyUrl, but the structure is the same.
Below is the HTML tag I want to select.
<div data-v-690c5d70="" data-v-09405bf2="" class="header mb-4" productpictures="" custno="13000000010336" industrydesc="..." indcat="..." empno="30" capital="80" address="..." custlink="https://unnotech.com" profile="..." management="..." phone="..." fax="..." hrname="HR" lat="25.0755569" lon="121.5756586" news="" newslink="" linkmore="[object Object]" corpimage1="" corpimage3="" corplink2="" corplink1="" corplink3="" envpictures="[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]" historys="" addrnodesc="..." reporturl="//www.104.com.tw/question_admin/reaction.cfm?j=5070426e34463e6730323a632c2e365f2444a42252525256a47682e2987j48" postalcode="">...</div>
Press Ctrl-U and you will see that the source code of the main content is empty. The website is probably built with React, Vue, or another JavaScript library that renders on the client, so you need to wait for the element to appear.
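For example, a minimal sketch of waiting for the rendered header before reading page.content() (the div.header selector comes from the question; waiting on its profile attribute is an assumption):
await page.goto(dataList[i].companyUrl, { waitUntil: 'networkidle0' })
// Wait until the client-side framework has rendered the header
// with its profile attribute before grabbing the HTML.
await page.waitForSelector('div.header[profile]')
const html = await page.content()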
But if you inspect the website in Developer Tools > Network tab > XHR filter and reload the page, you will see the API call they make to fetch this metadata (address, profile, etc.), so you might not need to scrape the HTML at all.
The page at that link had profile and address attributes, so that wouldn't happen there.
If the attribute is missing you will get undefined, for example from $(div).attr('foo').
For Node 14+ you can use the optional chaining operator ?. to avoid problems with those:
dataList[i].companyIntro = header.attr('profile')?.substring(0, 50)
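If you want null rather than undefined, matching the original ternaries, you can combine it with nullish coalescing:
dataList[i].companyAddress = header.attr('address') ?? null
dataList[i].companyIntro = header.attr('profile')?.substring(0, 50) ?? null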
I am currently trying to build a data pipeline using Node.js.
Of course, it's not the best way to do it, but I want to try implementing it anyway before making improvements.
This is the situation:
I have multiple gzip-compressed CSV files on AWS S3. I fetch these objects using the AWS SDK like the following and turn them into a read stream:
const { createGunzip } = require('zlib')

const unzip = createGunzip()
const input = s3.getObject(parameterWithBucketandKey)
    .createReadStream()
    .pipe(unzip)
Using the stream above, I create a readline interface:
const { createWriteStream } = require('fs');
const { createInterface } = require('readline');

const targetFile = createWriteStream('path to target file');
const rl = createInterface({
    input: input
});

let first = true;
rl.on('line', (line) => {
    if (first) {
        first = false; // skip the CSV header row
        return;
    }
    targetFile.write(line);
    await getstats_and_fetch_filesize();
    if (filesize > allowed_size) {
        changed_file_name = change_the_name_of_file();
        compress(changed_file_name);
    }
});
This is wrapped as a promise, and I have an array of file names to be retrieved from AWS S3, which I map over like this:
const arrayOfFileNames = [name1, name2, name3 ... and 5000 more];
const arrayOfPromiseFileProcesses = arrayOfFileNames.map((filename) => promiseFileProcess(filename));
await Promise.all(arrayOfPromiseFileProcesses);
// the result should be multiple gzip files that are compressed again
Sorry I wrote it in pseudocode; if more is needed to provide context I will write more, but I thought this would give the general context of my problem.
My problem is that it writes to a file fine, but when I change the file name it doesn't create a new one afterwards. I am lost in this synchronous and asynchronous world...
Please give me a hint/reference to read up on. Thank you.
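As a side note before the main fix: kicking off 5000 S3 downloads at once through Promise.all can exhaust sockets and memory. A common mitigation (a sketch, assuming the third-party p-limit package, v3 of which supports require) is to cap concurrency:
const pLimit = require('p-limit');

const limit = pLimit(10); // at most 10 files in flight at a time
const arrayOfPromiseFileProcesses = arrayOfFileNames.map(
    (filename) => limit(() => promiseFileProcess(filename))
);
await Promise.all(arrayOfPromiseFileProcesses);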
The line event handler must be an async function, since it invokes await:
rl.on('line', async (line) => {
    if (first) {
        first = false;
        return;
    }
    targetFile.write(line);
    await getstats_and_fetch_filesize();
    if (filesize > allowed_size) {
        changed_file_name = change_the_name_of_file();
        compress(changed_file_name);
    }
});
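One caveat worth adding: readline does not wait for async listeners, so new line events can fire while a previous await is still pending. A common pattern (a sketch, untested against this exact pipeline) is to pause the interface during the async work:
rl.on('line', async (line) => {
    rl.pause(); // hold further 'line' events while we do async work
    try {
        targetFile.write(line);
        await getstats_and_fetch_filesize();
        if (filesize > allowed_size) {
            compress(change_the_name_of_file());
        }
    } finally {
        rl.resume();
    }
});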
I have been struggling for hours trying to get to the iframe, but I just can't type in this box for some reason. The HTML does not show an input on the page or in the iframe. The code below is what I tried and got closest with, but it still doesn't really reach the box to type in. This is the part of the HTML I am trying to get into:
(screenshot: the element inspected in Chrome DevTools)
And here is the code I am using:
const iframeHandle = await page.$$('iframe');              // all iframes on the page
const contentFrame = await iframeHandle[2].contentFrame(); // frame object of the third iframe
const tester = await contentFrame.$$('#rte');              // the #rte element inside it
When I run
console.log(tester.length);
I get 1, so I am getting into the iframe, but I don't know how to type within it; as far as I can see there is only an empty tag in it.
Maybe I am just missing something small. Any help will be most appreciated.
You can utilize the frame call.
So from your code
const iframeHandle = await page.$$('iframe');
await this.browser.frame(iframeHandle);
Or something of the sort, depending on your code, should get you into that iframe.
Try focusing on the input and then typing:
const cardElement = await paymentFrame.$('#cardNumber');
// Focus the input, then type into it.
await cardElement.focus();
await cardElement.type('4242424242424242'); // placeholder value for illustration
Or this should work:
const frames = page.frames(); // page.frames() is synchronous
const iframe = frames.find(f => f.name() === 'any_iframe');
const textInput = await iframe.$('#textInput');
await textInput.click(); // this focuses the element
await textInput.type('description text');
I am trying to get the list of images in a directory from Firebase Storage.
For example, I want to get all images in users/userid/images, but it does not work and pops up an error saying the function is undefined.
const listRef = storageRef.child('users/userid/images');
listRef.listAll().then(res => {
    res.items.forEach(itemRef => {
        // console.log(itemRef);
    });
}).catch(e => {}); // note: errors are silently swallowed here
The ability to list files in a storage bucket wasn't added until version 6.1.0 of the JavaScript SDK. So make sure your SDK is up to date.
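With an up-to-date web SDK the question's code should work; to also resolve download URLs, something like this (a sketch reusing the storageRef from the question, inside an async function):
const listRef = storageRef.child('users/userid/images');
const res = await listRef.listAll();
const urls = await Promise.all(res.items.map(itemRef => itemRef.getDownloadURL()));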
And now we can do it as well; thank you to the react-native-firebase team:
const reference = storage().ref('images');
const getListOfImages = async () => {
    const res = await reference.child('profileimages').list();
    return Promise.all(res.items.map(i => i.getDownloadURL()));
};
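Usage might look like this (a sketch; the logging is just for illustration):
getListOfImages()
    .then(urls => console.log('download URLs:', urls))
    .catch(err => console.error(err));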