Scraping the same page forever using puppeteer - javascript

Doing scraping. How can I stay on a page and read its content to search for data every xx seconds without refreshing the page? I use the approach below, but the PC crashes after some time. Any ideas on how to make it more efficient? I would like to achieve this without using while (true). The readOdds function does not always take the same amount of time.
//...
while (true) {
  const html = await page.content();
  cant = await readOdds(html); // some code with the html
  console.info('Waiting 5 seconds to read again...');
  await page.waitFor(5000);
}
This is a section of the readOdds function:
async function readOdds(htmlPage) {
  try {
    var savedat = functions.mysqlDateTime(new Date());
    var pageHtml = htmlPage.replace(/(\r\n|\n|\r)/gm, "");
    var exp_text_all = /<coupon-section(.*?)<\/coupon-section>/g;
    var leagueLinksMatches = pageHtml.match(exp_text_all);
    var cmarkets = 0;
    let reset = await mysqlfunctions.promise_updateMarketsCount(cmarkets, table_markets_count, site);
    console.log(reset);
    if (leagueLinksMatches == null) {
      return cmarkets;
    }
    for (let i = 0; i < leagueLinksMatches.length; i++) {
      const html = leagueLinksMatches[i];
      var expc = /class="title ellipsis-text">(.*?)<\/span/g;
      var nameChampionship = functions.getDataInHtmlCode(String(html).match(expc)[0]);
      var idChampionship = await mysqlfunctions.promise_db_insert_Championship(nameChampionship, gsport, table_championship);
      var exp_text = /<ui-event-line(.*?)<\/ui-event-line>/g;
      var text = html.match(exp_text);
      // console.info(text.length);
      for (let index = 0; index < text.length; index++) {
        const element = text[index];
        ....

Simple Solution with recursive callback
Before we go into that, you can try calling the function recursively instead of using while, which loops forever without any proper control.
const readLoop = async () => {
  const html = await page.content();
  cant = await readOdds(html);
  return readLoop(); // run the loop again
};

// invoke it for infinite callbacks without any delays at all
await readLoop();
This will run the same block continuously, without any delay, as long as your readOdds function returns. You won't have to use page.waitFor or while.
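If you still want the original 5-second cadence from the question, here is a minimal sketch of the same recursive pattern with a Promise-based pause; the delay helper is my addition and not part of the answer, while page, readOdds and cant come from the question's code.

// sketch: recursive read loop that pauses 5 seconds between reads
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

const readLoopWithDelay = async () => {
  const html = await page.content();
  cant = await readOdds(html);          // same processing as before
  console.info('Waiting 5 seconds to read again...');
  await delay(5000);                     // yield back to the event loop
  return readLoopWithDelay();
};

await readLoopWithDelay();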
Memory leak prevention
For advanced cases where you need to respawn over a period of time, a queue like bull and a process manager like PM2 come into play. However, a queue would void the "without refreshing the page" part of your question.
You definitely should use pm2 though.
The usage is as follows:
npm i -g pm2
pm2 start index.js --name=myawesomeapp  # or your app file
There are a few useful arguments:
--max-memory-restart 100M limits memory usage to 100M and restarts the app when it exceeds that.
--max-restarts 50 stops the app once it has restarted 50 times due to errors (or a memory leak).
You can check the logs using pm2 logs myawesomeapp, since you set the name above.
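If you prefer to keep those flags in a config file, here is a rough sketch of a PM2 ecosystem file; the app name and entry file are placeholders, and the option names follow PM2's documented ecosystem format.

// ecosystem.config.js -- a sketch, adjust name/script to your project
module.exports = {
  apps: [{
    name: 'myawesomeapp',
    script: 'index.js',
    max_memory_restart: '100M', // restart when memory exceeds 100M
    max_restarts: 50            // give up after 50 error restarts
  }]
};

You would then start it with pm2 start ecosystem.config.js.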

Related

WebDriverIO iterate through URL and scroll through each

I'm using WebDriverIO (v5.18.7) and I'm trying to write something that can go to each URL and scroll down in increments until it reaches the bottom, then move on to the next URL. The issue I'm having is with the scrolling portion of the script: it might scroll once before it goes to the next URL.
From what I understand, the WebDriverIO documentation says the commands are sent asynchronously, which is handled by the framework behind the scenes. So I tried sticking with the framework and tried browser.execute / browser.executeAsync, but wasn't able to get it working.
Here's what I have that seems close to what I want. Any guidance would be appreciated!
const { doesNotMatch } = require('assert');
const assert = require('assert');
const { Driver } = require('selenium-webdriver/chrome');

// variable for list of URLs
const arr = browser.config.urls;

describe('Getting URL and scrolling', () => {
  it('Get URL and scroll', async () => {
    // let i = 0;
    for (const value of arr) {
      await browser.url(value);
      await browser.execute(() => {
        const elem = document.querySelector('.info-section');
        // scroll until this reaches the end.
        // add another for loop with a counter?
        elem.scrollIntoView(); // Using this for now as a placeholder
      });
      // i += 1;
    }
  })
})
Short answer: $('.info-section').scrollIntoView()
See https://webdriver.io/docs/api/element/scrollIntoView.html
WebdriverIO supports sync and async modes; see https://webdriver.io/docs/sync-vs-async.html
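If the goal is to scroll in increments until the bottom is reached before moving on, a rough sketch in the async mode already used in the question might look like this; the 500px step and 250ms pause are arbitrary values of mine, and arr comes from the question.

for (const value of arr) {
  await browser.url(value);

  // keep scrolling one step at a time until we hit the bottom of the page
  let atBottom = false;
  while (!atBottom) {
    atBottom = await browser.execute(() => {
      window.scrollBy(0, 500);
      return window.innerHeight + window.scrollY >= document.body.scrollHeight;
    });
    await browser.pause(250); // give lazy-loaded content time to appear
  }
}

The stopping condition is evaluated inside the page context, comparing the scroll position against document.body.scrollHeight after each step.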

NodeJS: Using Pipe To Write A File From A Readable Stream Gives Heap Memory Error

I am trying to create 150 million lines of data and write the data into a csv file so that I can insert the data into different databases with little modification.
I am using a few functions to generate seemingly random data and pushing the data into the writable stream.
The code that I have right now is unsuccessful at handling the memory issue.
After a few hours of research, I am starting to think that I should not be pushing each piece of data at the end of the for loop, because it seems that the pipe method simply cannot handle garbage collection this way.
Also, I found a few Stack Overflow answers and Node.js docs that recommend against using push at all.
However, I am very new to NodeJS and I feel like I am blocked and do not know how to proceed from here.
If someone can provide me any guidance on how to proceed and give me an example, I would really appreciate it.
Below is a part of my code to give you a better understanding of what I am trying to achieve.
P.S. -
I have found a way to successfully handle the memory issue without using the pipe method at all -- I used the drain event -- but I had to start from scratch, and now I am curious to know if there is a simple way to handle this memory issue without completely changing this bit of code.
Also, I have been trying to avoid using any library because I feel like there should be a relatively easy tweak to make this work without one, but please tell me if I am wrong. Thank you in advance.
// This is my target number of data
const targetDataNum = 150000000;

// Create readable stream
const readableStream = new Stream.Readable({
  read() {}
});

// Create writable stream
const writableStream = fs.createWriteStream('./database/RDBMS/test.csv');

// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');

// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
  const id = i;
  const body = lorem.paragraph(1);
  const date = randomDate(new Date(2014, 0, 1), new Date());
  const dp = randomNumber(1, 1000);
  const data = `${id},${body},${date},${dp}\n`;
  readableStream.push(data, 'utf8');
};

// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);

// End the stream
readableStream.push(null);
Since you're new to streams, maybe start with an easier abstraction: generators. Generators generate data only when it is consumed (just like Streams should), but they don't have buffering and complicated constructors and methods.
This is just your for loop, moved into a generator function:
function * generateData(targetDataNum) {
  for (var i = 1; i <= targetDataNum; i += 1) {
    const id = i;
    const body = lorem.paragraph(1);
    const date = randomDate(new Date(2014, 0, 1), new Date());
    const dp = randomNumber(1, 1000);
    yield `${id},${body},${date},${dp}\n`;
  }
}
In Node 12, you can create a Readable stream directly from any iterable, including generators and async generators:
const stream = Readable.from(generateData(targetDataNum), { encoding: 'utf8' });
stream.pipe(writableStream);
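As a follow-up (my addition, not part of the original answer), if you also want error handling for the whole chain, Node's stream.pipeline can wire the generator-backed Readable to the file stream. This sketch assumes the generateData generator defined above and the CSV path from the question:

const fs = require('fs');
const { Readable, pipeline } = require('stream');

const targetDataNum = 150000000;
const writableStream = fs.createWriteStream('./database/RDBMS/test.csv');
writableStream.write('id, body, date, dp\n', 'utf8');

// pipeline forwards errors from either side to the single callback
pipeline(
  Readable.from(generateData(targetDataNum), { encoding: 'utf8' }),
  writableStream,
  err => {
    if (err) {
      console.error('pipeline failed:', err);
    } else {
      console.log('file written');
    }
  }
);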
I suggest trying a solution like the following:
const { Readable } = require('readable-stream');

class CustomReadable extends Readable {
  constructor(max, options = {}) {
    super(options);
    this.targetDataNum = max;
    this.i = 1;
  }

  _read(size) {
    if (this.i <= this.targetDataNum) {
      // your code to build the csv content
      this.push(data, 'utf8');
      return;
    }
    this.push(null);
  }
}

const rs = new CustomReadable(150000000);
rs.pipe(ws);
Just complete it with your portion of code to fill the csv and create the writable stream.
With this solution you leave calling rs.push to the internal _read stream method, which keeps being invoked until this.push(null) is called. Before, you were probably filling the internal stream buffer too fast by calling push manually in a loop, which caused the out-of-memory error.
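If it helps to see the sketch filled in, here is one possible completion; it assumes the lorem, randomDate and randomNumber helpers from the question are in scope, and everything else mirrors the class above.

const { Readable } = require('readable-stream');
const fs = require('fs');

class CustomReadable extends Readable {
  constructor(max, options = {}) {
    super(options);
    this.targetDataNum = max;
    this.i = 1;
  }

  _read(size) {
    if (this.i <= this.targetDataNum) {
      // build one csv row per call, using the helpers from the question
      const id = this.i;
      const body = lorem.paragraph(1);
      const date = randomDate(new Date(2014, 0, 1), new Date());
      const dp = randomNumber(1, 1000);
      this.push(`${id},${body},${date},${dp}\n`, 'utf8');
      this.i += 1; // advance so the stream eventually ends
      return;
    }
    this.push(null); // signal end-of-stream
  }
}

const ws = fs.createWriteStream('./database/RDBMS/test.csv');
ws.write('id, body, date, dp\n', 'utf8');

const rs = new CustomReadable(150000000);
rs.pipe(ws);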
Try piping to the WritableStream before you start pumping data into the ReadableStream, and yield before you write the next chunk.
...
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');

// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);

// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
  const id = i;
  const body = lorem.paragraph(1);
  const date = randomDate(new Date(2014, 0, 1), new Date());
  const dp = randomNumber(1, 1000);
  const data = `${id},${body},${date},${dp}\n`;
  readableStream.push(data, 'utf8');
  // somehow YIELD for the STREAM to drain out.
};
...
The entire Stream implementation of Node.js relies on the fact that the wire is slow and that the CPU can actually have a downtime before the next chunk of data comes in from the stream source or till the next chunk of data has been written to the stream destination.
In the current implementation, since the for-loop has booked up the CPU, there is no downtime for the actual piping of the data to the write stream. You can catch this if you watch cat test.csv, which will not change while the loop is running.
As (I am sure) you know, pipe helps in guaranteeing that the data you are working with is buffered in memory only in chunks and not as a whole. But that guarantee only holds true if the CPU gets enough downtime to actually drain the data.
Having said all that, I wrapped your entire code into an async IIFE and added an await on a Promise wrapped around setImmediate, which ensures that I yield for the stream to drain the data.
let fs = require('fs');
let Stream = require('stream');

(async function () {
  // This is my target number of data
  const targetDataNum = 150000000;

  // Create readable stream
  const readableStream = new Stream.Readable({
    read() { }
  });

  // Create writable stream
  const writableStream = fs.createWriteStream('./test.csv');

  // Write columns first
  writableStream.write('id, body, date, dp\n', 'utf8');

  // Pipe readable stream to writeable stream
  readableStream.pipe(writableStream);

  // Then, push a number of data to the readable stream (150M in this case)
  for (var i = 1; i <= targetDataNum; i += 1) {
    console.log(`Pushing ${i}`);
    const id = i;
    const body = `body${i}`;
    const date = `date${i}`;
    const dp = `dp${i}`;
    const data = `${id},${body},${date},${dp}\n`;
    readableStream.push(data, 'utf8');
    await new Promise(resolve => setImmediate(resolve));
  };

  // End the stream
  readableStream.push(null);
})();
This is what top looks like pretty much the whole time I am running this.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15213 binaek ** ** ****** ***** ***** * ***.* 0.5 *:**.** node
Notice the %MEM which stays more-or-less static.
You were running out of memory because you were pre-generating all the data in memory before you wrote any of it to disk. Instead, you need a strategy to write it as you generate it so you don't have to hold large amounts of data in memory.
It does not seem like you need .pipe() here because you control the generation of the data (it's not coming from some random readStream).
So, you can just generate the data and immediately write it and handle the drain event when needed. Here's a runnable example (this creates a very large file):
const { once } = require('events');
const fs = require('fs');

// This is my target number of data
const targetDataNum = 150000000;

async function run() {
  // Create writable stream
  const writableStream = fs.createWriteStream('./test.csv');

  // Write columns first
  writableStream.write('id, body, date, dp\n', 'utf8');

  // Then, write a number of data rows (150M in this case)
  for (let i = 1; i <= targetDataNum; i += 1) {
    const id = i;
    const body = lorem.paragraph(1);
    const date = randomDate(new Date(2014, 0, 1), new Date());
    const dp = randomNumber(1, 1000);
    const data = `${id},${body},${date},${dp}\n`;
    const canWriteMore = writableStream.write(data);
    if (!canWriteMore) {
      // wait for stream to be ready for more writing
      await once(writableStream, "drain");
    }
  }
  writableStream.end();
}

run().then(() => {
  console.log("done");
}).catch(err => {
  console.log("got rejection: ", err);
});

// placeholders for the functions that were being used
function randomDate(low, high) {
  let rand = randomNumber(low.getTime(), high.getTime());
  return new Date(rand);
}

function randomNumber(low, high) {
  return Math.floor(Math.random() * (high - low)) + low;
}

const lorem = {
  paragraph: function() {
    return "random paragraph";
  }
};
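One optional refinement (my addition, not part of the answer): if run() should resolve only after the file has been fully flushed to disk, you can wait for the writable stream's 'finish' event after ending it, reusing the once helper already imported above.

// sketch: replace the end of run() with this to wait for the flush
writableStream.end();
await once(writableStream, 'finish'); // resolves once everything is flushed to disk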

Javascript - process array in batches AND show progress

Dear JavaScript gurus:
I have the following requirements:
Process a large array in batches of 1000 (or any arbitrary size).
When each batch is processed, update the UI to show our progress.
When all batches have been processed, continue with the next step.
For example:
function process_array(batch_size) {
  var da_len = data_array.length;
  var idx = 0;

  function process_batch() {
    var idx_end = Math.min(da_len, idx + batch_size);
    while (idx < idx_end) {
      // do the voodoo we need to do
    }
  }

  // This loop kills the browser ...
  while (idx < da_len) {
    setTimeout(process_batch, 10);
    // Show some progress (no luck) ...
    show_progress(idx);
  }
}

// Process array ...
process_array(1000);

// Continue with next task ...
// BUT NOT UNTIL WE HAVE FINISHED PROCESSING THE ARRAY!!!
Since I am new to JavaScript, I discovered that everything is done on a single thread, and as such one needs to get a little creative with regard to processing and updating the UI. I have found some examples using recursive setTimeout calls (one key difference is that I have to wait until the array has been fully processed before continuing), but I cannot seem to get things working as described above.
Also -- I am in need of a "pure" JavaScript solution -- no third-party libraries or the use of web workers (which are not fully supported).
Any (and all) guidance would be appreciated.
Thanks in advance.
You can make a stream from the array and use batch-stream to make batches, so that you can stream in batches to the UI:
stream-array
and
batch-stream
In JavaScript, when executing scripts in an HTML page, the page becomes unresponsive until the script is finished. This is because JavaScript is single-threaded.
You could consider using a web worker in JavaScript that runs in the background, independently of other scripts, without affecting the performance of the page.
In this case the user can continue to do whatever they want in the UI.
You can send and receive messages from the web worker.
More info on Web Worker here.
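A minimal sketch of that idea follows; the file name, message shapes and batch size are placeholders of mine, while show_progress and data_array are assumed from the question.

// main.js -- spawn the worker and listen for progress/done messages
const worker = new Worker('batch-worker.js');
worker.onmessage = (e) => {
  if (e.data.type === 'progress') {
    show_progress(e.data.done);        // update the UI
  } else if (e.data.type === 'done') {
    console.log('all batches processed:', e.data.result.length);
  }
};
worker.postMessage({ items: data_array, batchSize: 1000 });

// batch-worker.js -- process in batches and post progress after each one
onmessage = (e) => {
  const { items, batchSize } = e.data;
  const result = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    batch.forEach(item => result.push(item)); // do the voodoo here
    postMessage({ type: 'progress', done: Math.min(i + batchSize, items.length) });
  }
  postMessage({ type: 'done', result });
};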
So part of the magic of recursion is really thinking about the things that you need to pass in, to make it work.
And in JS (and other functional languages) that frequently involves functions.
function processBatch (remaining, processed, batchSize,
                       transform, onComplete, onProgress) {
  if (!remaining.length) {
    return onComplete(processed);
  }
  const batch = remaining.slice(0, batchSize);
  const tail = remaining.slice(batchSize);
  const totalProcessed = processed.concat(batch.map(transform));

  return scheduleBatch(tail, totalProcessed, batchSize,
                       transform, onComplete, onProgress);
}

function scheduleBatch (remaining, processed, batchSize,
                        transform, onComplete, onProgress) {
  onProgress(processed, remaining, batchSize);
  setTimeout(() => processBatch(remaining, processed, batchSize,
                                transform, onComplete, onProgress));
}

const noop = () => {};
const identity = x => x;

function processArray (array, batchSize, transform, onComplete, onProgress) {
  scheduleBatch(
    array,
    [],
    batchSize,
    transform || identity,
    onComplete || noop,
    onProgress || noop
  );
}
This can be simplified a lot; the reality is that I'm just having a little fun here. But if you follow the trail, you should see recursion in a closed system that works with an arbitrary transform, on arbitrary objects, for arbitrary array lengths, with arbitrary code run when everything is complete and after each batch, while scheduling the next run.
To be honest, you could even swap this implementation out for a custom scheduler, by changing 3 lines of code or so, and then you could log whatever you wanted...
const numbers = [1, 2, 3, 4, 5, 6];
const batchSize = 2;

const showWhenDone = numbers => console.log(`Done with: ${numbers}`);
const showProgress = (processed, remaining) =>
  console.log(`${processed.length} done; ${remaining.length} to go`);
const quintuple = x => x * 5;

processArray(
  numbers,
  batchSize,
  quintuple,
  showWhenDone,
  showProgress
);
// 0 done; 6 to go
// 2 done; 4 to go
// 4 done; 2 to go
// 6 done; 0 to go
// Done with: 5, 10, 15, 20, 25, 30
Overkill? Oh yes. But worth familiarizing yourself with the concepts, if you're going to spend some time in the language.
Thank-you all for your comments and suggestions.
Below is the code that I settled on. It works for any task (in my case, processing an array) and gives the browser time to update the UI if need be.
The "do_task" function starts an anonymous function via setInterval that alternates between two steps -- processing the array in batches and showing the progress. This continues until all elements in the array have been processed.
function do_task() {
  const k_task_process_array = 1;
  const k_task_show_progress = 2;

  var working = false;
  var task_step = k_task_process_array;
  var batch_size = 1000;
  var idx = 0;
  var idx_end = 0;
  var da_len = data_array.length;

  // Start the task ...
  var task_id = setInterval(function () {
    if (!working) {
      working = true;
      switch (task_step) {
        case k_task_process_array:
          idx_end = Math.min(idx + batch_size, da_len);
          while (idx < idx_end) {
            // do the voodoo we need to do ...
            idx++;
          }
          task_step = k_task_show_progress;
          working = false;
          break;
        default:
          // Show progress here ...
          // Continue processing array ...
          task_step = k_task_process_array;
          working = false;
      }
      // Check if done ...
      if (idx >= da_len) {
        clearInterval(task_id);
        task_id = null;
      }
      working = false;
    }
  }, 1);
}

For loop that calls a function with a promise works just once

I am trying to get the timbre.js recording function to create multiple buffers and store them in an object. The function that creates the recording and stores it in the 'songs' object is here:
var generateSong = function(songtitle) {
  T.rec(function(output) {
    //... composition here, full at: http://mohayonao.github.io/timbre.js/RecordingMode.html
    output.send(synth);
  }).then(function(buffer) {
    var songobject = {};
    var song = T("buffer", {buffer: buffer});
    songobject['song'] = song;
    songobject['buffer'] = buffer.buffer[0];
    songobject['samplerate'] = buffer.samplerate;
    songs[songtitle] = songobject;
    console.log('test message');
  });
};
Then I try and call this multiple times with a for loop like this:
for (var i = 0; i < 5; i++) {
  var songtitle = generateSongTitle();
  var songobject = generateSong(songtitle);
}
However, I am only getting one test message logged to the console, so the loop seems to be running once and then stopping. I've tried moving the then() part of the code inside the for loop itself (i.e. generateSongTitle().then(...)), but that had the same problem. How can I get the generateSong() function to run multiple times?
Thank you
So, I guess there are actually two possible problems there.
You are not returning the result of output.send
T.rec(function(output) {
  //... composition here, full at: http://mohayonao.github.io/timbre.js/RecordingMode.html
  return output.send(synth);
})
This will enable the next handler on the promise chain to actually get the buffer value (if it indeed is returned synchronously by send).
Can T.rec run in parallel?
Do you want to record 5 songs at the same time?
You should probably wait for one song to complete before recording the next.
function generate() {
  var songtitle = generateSongTitle();
  var songobject = generateSong(songtitle);
  return songobject;
}

var nextSong = $.when(); // or any empty promise wrapper
for (var i = 0; i < 5; i++) {
  nextSong = nextSong.then(generate);
}
This will generate 5 songs sequentially... in theory! Do tell me if it works :)
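If jQuery isn't already in the project, here is a hedged variant of the same sequential idea using native Promises. It assumes generateSong is changed to return its T.rec(...).then(...) chain, so that each step actually waits for the previous recording to finish.

// generateSong must return its promise chain for the waiting to work, e.g.:
// var generateSong = function(songtitle) { return T.rec(...).then(...); };

var nextSong = Promise.resolve(); // an already-resolved starting point
for (var i = 0; i < 5; i++) {
  nextSong = nextSong.then(function () {
    return generateSong(generateSongTitle());
  });
}
nextSong.then(function () {
  console.log('all 5 songs recorded');
});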

node.js crashing after loop iteration

I'm trying to write a performance tool using node.js so I can automate it and store the results in MySQL. The tool is supposed to gather how long it takes for the browser to load a particular webpage. I'm using HttpWatch to measure the performance, and the result is displayed in seconds. The browser utilized is Firefox.
Below is a piece of script I'm using to run the performance test:
var MyUrls = [
  "http://google.com",
  "http://yahoo.com"
];

try {
  var win32ole = require('win32ole');
  var control = win32ole.client.Dispatch('HttpWatch.Controller');
  var plugin = control.Firefox.New();
  for (var i = 0; i < MyUrls.length; i++) {
    var url = MyUrls[i];
    console.log(url);
    for (var j = 0; j < 14; j++) {
      // Start recording HTTP traffic
      plugin.Log.EnableFilter(false);
      // Clear cache and cookies before each test
      plugin.ClearCache();
      plugin.ClearAllCookies();
      plugin.ClearSessionCookies();
      plugin.Record();
      // Go to the URL and wait for the page to be loaded
      plugin.GotoURL(url);
      control.Wait(plugin, -1);
      // Stop recording HTTP
      plugin.Stop();
      if (plugin.Log.Pages.Count != 0) {
        // Display summary statistics for page
        var summary = plugin.Log.Pages(0).Entries.Summary;
        console.log(summary.Time);
      }
    }
  }
  plugin.CloseBrowser();
} catch(e) {
  console.log('*** exception caught ***\n' + e);
}
After the second iteration of the inner loop, I'm getting the following error:
C:\xampp\htdocs\test\browser-perf>node FF-load-navigation.js
http://localhost/NFC-performance/Bing.htm
[Number (VT_R8 or VT_I8 bug?)]
2.718
[Number (VT_R8 or VT_I8 bug?)]
2.718
OLE error: [EnableFilter] -2147352570 [EnableFilter] IDispatch::GetIDsOfNames AutoWrap() failed
Has anyone seen this before? Can you help me?
You have to remember that node is asynchronous.
So that for loop runs simultaneously with plugin.CloseBrowser();, which is obviously not what you want, because that causes the browser to close, which will cause problems in the for loop.
Rather, you want that to run after the for loop finishes.
Look at async for a simple way to do this.
async.each(MyUrls, function (url, callback) {
  ...
  callback();
}, function (err) {
  plugin.CloseBrowser();
});
The same has to be done for your inner for loop.
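One caveat (my addition, not part of the original answer): async.each runs its iteratees in parallel, which a single HttpWatch-controlled browser probably cannot handle; async.eachSeries keeps the same shape but processes one URL at a time. A sketch, with the per-URL body elided as in the answer:

async.eachSeries(MyUrls, function (url, callback) {
  // ... run the 14 recording iterations for this url, then:
  callback();
}, function (err) {
  if (err) console.log('*** error ***\n' + err);
  plugin.CloseBrowser();
});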
