We're currently trying to port a Python web scraper to Node.js. The source is the Pastebin API. When scraping, the response is a JavaScript array of objects like this:
[
  {
    scrape_url: 'https://scrape.pastebin.com/api_scrape_item.php?i=FD1BhNuR',
    full_url: 'https://pastebin.com/FD1BhNuR',
    date: '1580299104',
    key: 'FD1BhNuR',
    size: '19363',
    expire: '0',
    title: 'Weight Loss',
    syntax: 'text',
    user: 'loscanary'
  }
]
Our Python script uses the requests library to fetch data from Pastebin's API. To get the actual body of each paste, in addition to the parameters above, we loop through the entries and retrieve each one's text value. Here is an excerpt:
response = requests.get("https://scrape.pastebin.com/api_scraping.php?limit=1")
parsed_json = response.json()
print(parsed_json)
for individual in parsed_json:
    p = requests.get(individual['scrape_url'])
    text = p.text
    print(text)
This brings back the actual body of the paste(s), which we can then search through to scrape for more keywords.
In Node, I don't know how to retrieve the text behind each "scrape_url" in the same way as I can with requests' .text. I've tried using axios and request, but the furthest I can get is accessing the "scrape_url" parameter with something like this:
const scrape = async () => {
  try {
    const result = await axios.get(pbUrl);
    console.log(result.data[0].scrape_url);
  } catch (err) {
    console.error(err);
  }
};

scrape();
How could I get the same result as I can with .text from the Python Requests library and in a loop?
Here is an example of how to do it (as mentioned by Olvin Roght):
const scrape = async () => {
  try {
    const result = await axios.get(pbUrl);
    // A for...of loop awaits each request in turn, so any error reaches the catch below
    for (const individual of result.data) {
      const scrapeUrl = individual['scrape_url'];
      const response = await axios.get(scrapeUrl);
      const text = response.data; // equivalent of Python's response.text
      console.log("this is the text value from the url:", text);
    }
  } catch (err) {
    console.error(err);
  }
};

scrape();
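If you prefer to fetch all the paste bodies concurrently instead of one after another, a small variation (just a sketch, still assuming pbUrl points at the scraping endpoint; whether concurrency is wise depends on the API's rate limits):

const scrape = async () => {
  try {
    const result = await axios.get(pbUrl);
    // Start every paste-body request at once and wait for them together
    const texts = await Promise.all(
      result.data.map(async (individual) => {
        const response = await axios.get(individual.scrape_url);
        return response.data; // the raw paste body
      })
    );
    texts.forEach((text) => console.log(text));
  } catch (err) {
    console.error(err);
  }
};

scrape();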
I am new to Node. When a client hits a POST endpoint on the back end, I want to download a PDF document from another URL, change the name of the file, and send the file back to the original client, where the PDF will be downloaded.
NOTE: the file should not be saved on the server.
First, there is a controller file which contains the following code:
try {
  const get_request: any = req.body;
  const result = await printLabels(get_request, res);
  res.contentType("application/pdf");
  res.status(200).send(result);
} catch (error) {
  const ret_data: errorResponse = await respondError(
    error, "Something Went Wrong.",
  );
  res.status(200).json(ret_data);
}
Then, after this, the printLabels function is defined as:
export const printLabels = async (request: any, response: any) => {
  try {
    const item_id = request.item_id;
    let doc = await fs.createReadStream(`some url with ${item_id}`);
    doc.pipe(fs.createWriteStream("Invoice_" + item_id + "_Labels.pdf"));
    return doc;
  } catch (error) {
    throw error;
  }
};
Using the above code, I am getting an error saying no such file was found. Also, I don't have access to the front end, so is it possible to test the API for the PDF with Postman as I am doing, or is my approach incorrect?
The next solution works for Express, but I'm not sure if you're using an Express-like framework. If not, please specify which framework you're using.
First, you need to use sendFile instead of send:
try {
  const get_request: any = req.body;
  const result = await printLabels(get_request, res);
  res.contentType("application/pdf");
  res.status(200).sendFile(result);
} catch (error) {
  const ret_data: errorResponse = await respondError(
    error, "Something Went Wrong.",
  );
  res.status(200).json(ret_data);
}
Second, you are returning a read stream instead of a path to a file. Note that you need to use an absolute path for this:
const printLabels = async () => {
  const sourcePath = path.join(__dirname, 'test.pdf');
  const targetPath = path.join(__dirname, 'Invoice_test_Labels.pdf');
  // Wait for the copy to finish before returning the path,
  // otherwise sendFile may read an incomplete file
  await new Promise((resolve, reject) => {
    fs.createReadStream(sourcePath)
      .pipe(fs.createWriteStream(targetPath))
      .on('finish', resolve)
      .on('error', reject);
  });
  return targetPath;
};
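If you need to honor the note about not saving the file on the server at all, one alternative (just a sketch, assuming axios is installed and a hypothetical source URL) is to pipe the upstream PDF straight into the response:

import axios from "axios";

app.post("/labels/:item_id", async (req, res) => {
  try {
    // Hypothetical remote PDF location derived from the item id
    const sourceUrl = `https://example.com/labels/${req.params.item_id}.pdf`;
    const upstream = await axios.get(sourceUrl, { responseType: "stream" });
    res.contentType("application/pdf");
    // Suggest a file name to the client without touching the server's disk
    res.setHeader(
      "Content-Disposition",
      `attachment; filename="Invoice_${req.params.item_id}_Labels.pdf"`
    );
    upstream.data.pipe(res);
  } catch (error) {
    res.status(500).json({ message: "Something Went Wrong." });
  }
});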
As for Postman, of course you can preview the PDF there or save the response to a file.
I'm relatively new to working with promises in JS. I got the API call to work on the initial homepage, but I'm having issues when I go to another page that uses the same API call.
In my api.js file I have the following:
const key = apiKey;
const commentsUrl = axios.get(`https://project-1-api.herokuapp.com/comments/?api_key=${key}`);
const showsUrl = axios.get(`https://project-1-api.herokuapp.com/showdates?api_key=${key}`);
async function getData() {
const allApis = [commentsUrl, showsUrl];
try {
const allData = await Promise.allSettled(allApis);
return allData;
} catch (error) {
console.error(error);
}
}
In my index.html
import { getData } from "./api.js";
let data = await getData(); //This works and gathers the data from the API.
In my shows.html
import { getData } from "./api.js";
let showsData = await getData(); // This does not work and says "cannot access commentsUrl (api.js) before it is initialized". But it is?
If I comment out the code from "shows", the API GET request works fine and the index page loads the API data correctly. Can anyone explain to me what's happening and why I would be getting the uninitialized error?
I should also note that if I split the API calls into two separate JS files (one for the index, one for the shows), the API calls work and display the data as intended.
On the homepage, the code executes the 2 GET requests on page load, when the JS module is initialized. When you navigate away from the homepage, presumably with some sort of client-side routing, the 2 module-level variables no longer reference fresh Promises, as the requests have already been executed.
You could instead move the GET requests into your function like:
const key = apiKey;

async function getData() {
  try {
    const commentsUrl = axios.get(`https://project-1-api.herokuapp.com/comments/?api_key=${key}`);
    const showsUrl = axios.get(`https://project-1-api.herokuapp.com/showdates?api_key=${key}`);
    const allApis = [commentsUrl, showsUrl];
    const allData = await Promise.allSettled(allApis);
    return allData;
  } catch (error) {
    console.error(error);
  }
}
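With that change, every call to getData() fires fresh requests, so both pages can share the same module. A usage sketch (the allSettled results wrap each axios response):

import { getData } from "./api.js";

const [commentsResult, showsResult] = await getData();

if (commentsResult.status === "fulfilled") {
  console.log(commentsResult.value.data); // comments payload from axios
}
if (showsResult.status === "fulfilled") {
  console.log(showsResult.value.data); // show dates payload from axios
}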
I'm trying to build a Node.js script that pulls records from an Airtable base, bumps a UPC list up against the UPC Item DB API, writes the product description ("Title") and product image array from the API response to an object, and then updates the corresponding Airtable records with that pre-formatted object using the Airtable API. I can't link directly to the Airtable API for my base, but the "Update Record" payload should look like this:
{
  record_id: 'myRecord',
  fields: {
    'Product Description': 'J.R. Watkins Gel Hand Soap, Lemon, 11 oz',
    'Reconstituted UPC': '818570001330',
    Images: [
      'https://images.thdstatic.com/productImages/b3e507dc-2d4a-48d4-a469-51a34c454959/svn/j-r-watkins-hand-soaps-23051-64_1000.jpg',
      'http://pics.drugstore.com/prodimg/332476/450.jpg',
    ]
  }
}
var Airtable = require('airtable');
var base = new Airtable({ apiKey: 'myKey' }).base('myBase');
var request = require('request');

// Function to slow the code down for easier console watching
function sleep(ms) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}

// Function to slow the code down for easier console watching
async function init(x) {
  console.log(1);
  await sleep(x * 1000);
  console.log(2);
}

// Big nasty async function
async function imagesToAirtable() {
  /// Run through the airtable list
  /// create the UPC_list that will be updated and pushed to Airtable to update records
  const upc_list = [];

  /// Pull from Airtable and assign array to an object
  const airtable_records = await base('BRAND')
    .select({ maxRecords: 3 })
    .all();

  /// Troubleshooting console.logs
  console.log(airtable_records.length);
  console.log("Entering the FOR loop");

  /// Loop through the list, append req'd fields to the UPC object, and call the UPCItemDB API
  for (var i = 0; i < airtable_records.length; i++) {
    /// Push req'd fields to the UPC object
    await upc_list.push({
      record_id: airtable_records[i].get("Record ID"),
      fields: {
        "Product Description": "",
        "Reconstituted UPC": airtable_records[i].get("Reconstituted UPC"),
        "Images": []
      }
    });

    /// Troubleshooting console.logs
    console.log(upc_list);
    console.log("Break");

    /// call API
    await request.post({
      uri: 'https://api.upcitemdb.com/prod/trial/lookup',
      headers: {
        "Content-Type": "application/json",
        "key_type": "3scale"
      },
      gzip: true,
      body: "{ \"upc\": \"" + airtable_records[i].get("Reconstituted UPC") + "\" }",
    }, /// appending values to upc_list object
    function (err, resp, body) {
      console.log("This is loop " + i);
      upc_list[i].fields["Images"] = JSON.parse(body).items[0].images;
      upc_list[i].fields["Product Description"] = JSON.parse(body).items[0].title;
      console.log(upc_list[i]);
    });
  }
};

imagesToAirtable();
I haven't gotten to writing the Airtable "Update Record" piece yet because I can't get the API response written to the upc_list array.
I get an error message on the last run of the FOR loop. In this case, the first and second time through the loop work fine and update the upc_list object, but the third time, I get this error:
upc_list[i].fields["Images"] = JSON.parse(body).items[0].images
^
TypeError: Cannot read property 'fields' of undefined
I know this has to do with async/await, but I'm just not experienced enough at this point to understand what I need to do.
I also know that this big nasty async/await function should be written into individual functions and then called in one single main() function but I can't figure out how to make everything chain together properly with async/await. Tips on that would be welcome as well :)
I have tried separating the FOR loop into two FOR loops: the first for the initial append of the upc_list item, and the second for the API call and the append with the parsed response.
I was going to skip by this question until I saw this:
I also know that this big nasty async/await function should be written
into individual functions
You are so right about that. Let's do it!
// get records from any table, up to limit
// note: don't name the parameter "base", or it would shadow the Airtable base object
async function getRecords(table, limit) {
  return base(table)
    .select({ maxRecords: limit })
    .all();
}
// return a new UPC object from an airtable brand record
// note - nothing async is being done here
function upcFromBrandRecord(brand) {
  return {
    record_id: brand.get("Record ID"),
    fields: {
      "Product Description": "",
      "Reconstituted UPC": brand.get("Reconstituted UPC"),
      "Images": []
    }
  };
}
The request module you're using doesn't use promises. There's a promise-using variant, I believe, but without installing anything new, we can "promise-ify" the post method you're using.
async function requestPost(uri, headers, body) {
  return new Promise((resolve, reject) => {
    request.post(
      { uri: uri, headers, gzip: true, body },
      (err, resp, body) => {
        err ? reject(err) : resolve(body);
      }
    );
  });
}
Now we can write a particular one for your usage...
async function upcLookup(brand) {
  const uri = 'https://api.upcitemdb.com/prod/trial/lookup';
  const headers = {
    "Content-Type": "application/json",
    // probably need an api key in here
    "key_type": "3scale"
  };
  const body = JSON.stringify({ upc: brand.get("Reconstituted UPC") });
  const responseBody = await requestPost(uri, headers, body);
  // not sure if you must parse, but copying the OP
  return JSON.parse(responseBody);
}
For a given brand record, build a complete upc record by creating the structure and calling the upc api...
async function brandToUPC(brand) {
  const result = upcFromBrandRecord(brand);
  const upcData = await upcLookup(brand);
  result.fields["Images"] = upcData.items[0].images;
  result.fields["Product Description"] = upcData.items[0].title;
  return result;
}
Now we have all the tools needed to write the OP function simply...
// big and nasty no more!
async function imagesToAirtable() {
  try {
    const airtable_records = await getRecords('BRAND', 3);
    const promises = airtable_records.map(brandToUPC);
    const upc_list = await Promise.all(promises); // edit: forgot await
    console.log(upc_list);
  } catch (err) {
    // handle error here
  }
}
That's it. Caveat: I haven't run this code, I know little or nothing about the services you're using, and I don't know whether there was a bug hidden underneath the one you've been encountering, so it seems unlikely that this will run out of the box. What I hope I've done is demonstrate the value of decomposition for making nastiness disappear.
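Since the question mentions that the Airtable "Update Record" piece hasn't been written yet, here is one possible follow-up, a minimal sketch assuming the official airtable client: its update() wants each record's id under an id key (the list above uses record_id) and accepts at most 10 records per call.

// Push the assembled upc_list back to Airtable (sketch only)
async function updateRecords(upc_list) {
  const records = upc_list.map((upc) => ({
    id: upc.record_id, // the client expects the key to be named "id"
    fields: upc.fields,
  }));
  // update() returns a promise when called without a callback
  await base('BRAND').update(records);
}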
I am writing a Lambda function that fetches data from DynamoDB and stores it in an array. Now I want to create a CSV file from this array and return it (preferably directly from the Lambda function, rather than uploading it to S3 and then sharing a link). Any idea how to do this?
My code so far:
import AWS from "aws-sdk";
import createError from "http-errors";
import commonMiddleware from "../lib/commonMiddleware";

const dynamodb = new AWS.DynamoDB.DocumentClient();

async function getFile(event, context) {
  const { id: someId } = event.pathParameters;
  let data;

  const params = {
    TableName: process.env.TABLE_NAME,
    IndexName: "GSIsomeId",
    KeyConditionExpression: "someId = :someId",
    ExpressionAttributeValues: {
      ":someId": someId,
    },
  };

  try {
    const result = await dynamodb.query(params).promise();
    data = result.Items;
  } catch (error) {
    console.error(error);
    throw new createError.InternalServerError(error);
  }

  // data is array of objects which I can change to 2d array using Object.values()
  // I want to create and return a CSV from this array
  return {
    statusCode: 200,
    body: JSON.stringify(data),
  };
}

export const handler = commonMiddleware(getFile);
Once you've generated the CSV using any of the approaches mentioned here: Convert JSON array to CSV in Node
I guess you can try sending back the byte array of the file.
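For completeness, a minimal sketch of what that could look like without any extra library, assuming the items are flat objects that share the same keys (nested attributes would need flattening first). A text body like this is returned as-is by API Gateway; base64 encoding is only needed for binary payloads.

// Build a CSV string from an array of flat objects
function toCsv(items) {
  if (!items || items.length === 0) return "";
  const headers = Object.keys(items[0]);
  const escape = (value) => `"${String(value ?? "").replace(/"/g, '""')}"`;
  const rows = items.map((item) => headers.map((h) => escape(item[h])).join(","));
  return [headers.join(","), ...rows].join("\n");
}

// Inside getFile, instead of JSON.stringify(data):
return {
  statusCode: 200,
  headers: {
    "Content-Type": "text/csv",
    "Content-Disposition": 'attachment; filename="data.csv"',
  },
  body: toCsv(data),
};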
In JavaScript I have to print to the document a JSON string with some data from a MySQL database, and I want to remove the line breaks, but I cannot achieve it.
I get the data through Node.js and I use Express.js to print it in a web browser.
This is the result:
As you can see, there is a line break between both rows, and I want to remove it since it is causing issues when I try to read the JSON in the Android application I am building.
I have searched the internet for how to achieve this; most answers suggested using str.replace(/\r?\n|\r/g, ''), but it did not work in my case.
This is the JS code I currently have:
var dbData = '<%-usersList%>';
dbData = JSON.parse(dbData.replace(/\r?\n|\r/g, ''));
document.write(JSON.stringify(dbData));
And this is how I pass the data from node.js:
app.get('/dbJSON', function (req, res) {
  getDBData().then((data) => {
    res.render('./dbJSON.ejs', {
      usersList: JSON.stringify(data.usersList)
    });
  });
});
This function calls the js file that gets the data from the db:
function getDBData() {
  const users = new Promise((resolve, reject) => {
    dbConnection
      .getUsers()
      .then(data => {
        resolve(data)
      })
  });
  const groups = new Promise((resolve, reject) => {
    dbConnection
      .getGroups()
      .then(data => {
        resolve(data)
      })
  });
  const frmTexts = new Promise((resolve, reject) => {
    dbConnection
      .getFrmTexts()
      .then(data => {
        resolve(data)
      })
  });
  return Promise.all([users, groups, frmTexts])
    .then(data => {
      return {
        usersList: data[0],
        groupsList: data[1],
        frmTextsList: data[2]
      }
    });
}
Result of printing out data.usersList (in node.js):
Edit: Fixed! The reason I wanted to delete the line breaks was that I had issues parsing the JSON in Android Studio, but I have just figured out it was because I did not add the http:// prefix to the URL string (my bad). Now my Android app is working.
@Adrian2895,
Simply convert it to a string, and this will remove the new-lines/carriage-returns. The code below also works fine; in it I used deep cloning to remove attributes.
You can use the concept of deep cloning to get rid of unwanted attributes from the JSON object:
var rawData = [{
    'iduser': 1,
    'user': 'XXX',
    'password': 'XXX123',
    'groups_idgroup': 1,
    'idgroup': 1,
    'name': 'James'
  },
  {
    'iduser': 9,
    'user': 'XXX',
    'password': 'XXX123',
    'groups_idgroup': 2,
    'idgroup': 2,
    'name': 'James'
  }];

rawData.forEach(function (item, index) {
  // Override the unwanted keys with undefined so JSON.stringify drops them
  rawData[index] = Object.assign({}, item, { 'user': undefined, 'password': undefined, 'name': undefined });
});

console.log(JSON.stringify(rawData));
Output:
[{"iduser":1,"groups_idgroup":1,"idgroup":1},{"iduser":9,"groups_idgroup":2,"idgroup":2}]