Retry failed pages with new proxyUrl - javascript

I have developed an Actor+PuppeteerCrawler+Proxy based crawler and want to rescrape failed pages. To increase the chance of success on the rescrape, I want to switch to another proxyUrl. The idea is to create a new crawler with a modified launchPuppeteer function and a different proxyUrl, and re-enqueue the failed pages. Please check the sample code below.
But unfortunately it doesn't work, although I reset the request queue by dropping and reopening it. Is it possible to rescrape failed pages with PuppeteerCrawler using a different proxyUrl, and how?
Best regards,
Wolfgang
for (let retryCount = 0; retryCount <= MAX_RETRY_COUNT; retryCount++) {
    if (retryCount) {
        // Try to reset the request queue, so that failed requests will be rescraped
        await requestQueue.drop();
        requestQueue = await Apify.openRequestQueue(); // this is necessary to avoid exceptions
        // Re-enqueue failed urls from the array failedUrls >>> ignored although using drop() and reopening the request queue!!!
        for (let failedUrl of failedUrls) {
            await requestQueue.addRequest({ url: failedUrl });
        }
    }
    crawlerOptions.launchPuppeteerFunction = () => {
        return Apify.launchPuppeteer({
            // generates a new proxy url and adds it to a new launchPuppeteer function
            proxyUrl: createProxyUrl()
        });
    };
    let crawler = new Apify.PuppeteerCrawler(crawlerOptions);
    await crawler.run();
}

I think your approach should work, but on the other hand it should not be necessary. I'm not sure what createProxyUrl does.
You can supply a generic proxy URL with the auto username, which will use all your datacenter proxies at Apify. Or you can provide proxyUrls directly to PuppeteerCrawler.
Just don't forget that you have to switch the browser to get a new IP from the proxy. More in this article: https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler
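For illustration, here is a minimal sketch of that setup, assuming the same older Apify SDK interface your code uses (Apify.PuppeteerCrawler, launchPuppeteerFunction) and your own createProxyUrl() helper. Instead of rebuilding the crawler and dropping the queue on every retry, the crawler retries failed requests itself, and every newly launched browser gets a fresh proxy URL:

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    maxRequestRetries: MAX_RETRY_COUNT, // built-in retries instead of dropping the queue
    launchPuppeteerFunction: () => Apify.launchPuppeteer({
        // called whenever a new browser is launched, so each browser can get a fresh proxy
        proxyUrl: createProxyUrl()
    }),
    handlePageFunction: async ({ request, page }) => {
        // ... scrape the page ...
    },
    handleFailedRequestFunction: async ({ request }) => {
        // runs only after all retries are exhausted
        console.log(`Request ${request.url} failed too many times.`);
    }
});
await crawler.run();

Combined with retiring browsers after a number of requests, as described in the linked article, each retry is likely to run through a different proxy.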

Related

I want to know which users are online and which are not, in real time

What I am trying to achieve is knowing whether a user is logged in or not, so on login and logout I update a column in the DB. What I have been trying to get working is that when a user closes the browser or tab, I use the fetch API to post to the DB to set them offline (I have this working). I need to store this in the database because it will be used elsewhere for other functions. The fetch request that works is shown below.
I have scoured the internet and come to the understanding that it is impossible to distinguish between a refresh of the page and closing the tab or browser (please prove me wrong if this isn't the case), so currently the code always runs, even on refresh, always setting them to "offline".
I have come across aborting a fetch request, but couldn't make it work so that the signal and abort are only used when onload is triggered, i.e. when closing the browser, onload shouldn't fire.
Am I looking at this the wrong way? Is there an easier solution? Can anyone help point me in the right direction, either with how to call abort only on load (if that will work), or how else I can differentiate between closing the tab/browser and otherwise?
const controller = new AbortController();
const signal = controller.signal;

async function onLeavingPage() {
    var url = window.location.href;
    var formData = new FormData();
    formData.append('IsUserOnline', 0);
    const response = await fetch(url, {
        keepalive: true,
        method: 'POST',
        credentials: 'include',
        signal: signal,
        body: formData
    });
}

window.addEventListener('beforeunload', function () {
    var WindowCount = sessionStorage.getItem('WindowCount');
    if (!WindowCount) {
        WindowCount = 1;
    }
    sessionStorage.setItem('WindowCount', --WindowCount);
    if (WindowCount == 0) {
        onLeavingPage();
    }
});

window.onload = function () {
    controller.abort();
    var WindowCount = sessionStorage.getItem('WindowCount');
    if (!WindowCount) {
        WindowCount = 0;
    }
    sessionStorage.setItem('WindowCount', ++WindowCount);
};

Chrome Extension: Modifying a built in function for all webworkers

So here's a bit of background. I need to intercept the return data from a fetch request, and I already know how to do that: by injecting the following code into the webpage I can read the response body of a fetch.
var proxyFetch = fetch;
window.fetch = async function () {
    var response = proxyFetch(...arguments);
    // a lot of other code as well
    return response;
};
The problem I'm having is that I can't modify the fetch function for any web workers, because I don't have access to their 'this' context. My best bet seems to be something to do with WorkerGlobalScope, but I don't know if that object can do what I want.
My second idea was to do something like the following before the page loads.
window.Worker = class extends window.Worker {
    constructor(scriptUrl, options) {
        // Take the url they passed into the web worker.
        // Fetch the url to get the raw javascript.
        // Append my javascript to it.
        // Turn it into a blob url.
        // Now I have my javascript running in every web worker.
        //
        // The problem here is that we are not allowed to use asynchronous calls in the constructor.
        // I am stumped on this option as well.
    }
};
I've looked into the webRequest API, but it doesn't let extensions look at the response body of network requests.
I also played around a lot to see if I could modify the globalThis or WorkerGlobalScope of every web worker on the page so I could patch fetch there, but I could not find a way.
var code = "postMessage('hello world');";
var blob = new Blob([code], { type: 'application/javascript' });
var url = URL.createObjectURL(blob);
var worker = new Worker(url);
worker.onmessage = function (e) {
    console.log(e.data);
};
What you are trying to achieve is meant to be done by a Service Worker.
Service workers act as a proxy between the app and the server: you can control whether a request is even made and what it returns.
Additionally, you can send messages between your worker and the service worker with the postMessage interface.
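To make that concrete, here is a hedged sketch (the file name sw.js and the message shape are assumptions, not part of the original answer). The service worker proxies every fetch it controls, reads a clone of the response body, and reports it back to the page via postMessage:

// sw.js - the service worker
self.addEventListener('fetch', event => {
    event.respondWith((async () => {
        const response = await fetch(event.request);
        // Clone the response so the original body stream stays readable for the page.
        const body = await response.clone().text();
        const client = await self.clients.get(event.clientId);
        if (client) {
            client.postMessage({ url: event.request.url, body });
        }
        return response;
    })());
});

// On the page: register the worker and listen for the intercepted bodies.
navigator.serviceWorker.register('sw.js');
navigator.serviceWorker.addEventListener('message', e => {
    console.log('intercepted', e.data.url, e.data.body.length);
});

Note that the service worker only intercepts requests from clients it controls, so the page typically needs to be reloaded once after registration; treat this as a starting point rather than a drop-in solution.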

Apify web scraper task not stable. Getting different results between runs minutes apart

I'm building a very simple scraper to get the 'now playing' info from an online radio station I like to listen to.
It's stored in a simple p element on their site.
Now, using the standard apify/web-scraper, I run into a strange issue: the scraping sometimes works and sometimes doesn't, using this code:
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
If the scraper works I get this result:
[{"nowPlaying": "Hangover Hotline - hosted by Lamebrane"}]
But if it doesn't I get this:
[{"nowPlaying": ""}]
And there is only a 5 minute difference between the two scrapes. The website doesn't change and the data is always presented in the same way. I tried checking all the boxes to circumvent security, and different mixes of options (Use Chrome, Use Stealth, Ignore SSL errors, Ignore CORS and CSP), but unfortunately that doesn't seem to fix it.
Any suggestions on how I can get this scraping task to constantly return the data I need?
It would be great if you could attach the URL; it would help me find the problem.
With the information you provided, I guess that the data you want are loaded asynchronously. You can use the context.waitFor() function.
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    await context.waitFor(() => !!$('p.js-playing-now').text());
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
You can pass a function to waitFor(), and it will wait until that function returns true. You can check the docs.

How to create a cross domain HTTP request

I have a website, and I need a way to get HTML data from a different website via an HTTP request. I've looked around for ways to implement it, and most suggest an AJAX call instead.
An AJAX call is blocked by LinkedIn, so I want to try a plain cross-domain HTTP request and hope it isn't blocked one way or another.
If you have a server running and are able to run code on it, you can make the HTTP call server side. Keep in mind, though, that most sites only allow so many calls per IP address, so you can't serve a lot of users this way.
This is a simple HttpListener that downloads a website's content when the query string contains ?site=http://linkedin.com:
// requires: using System.Net; using System.IO;

// set up a listener
using (var listener = new HttpListener())
{
    // on port 8080
    listener.Prefixes.Add("http://+:8080/");
    listener.Start();
    while (true)
    {
        // wait for a connection
        var ctx = listener.GetContext();
        var req = ctx.Request;
        var resp = ctx.Response;
        // default page with a link that triggers the proxying
        var cnt = "<html><body><a href=\"?site=http://linkedin.com\">click me</a></body></html>";
        foreach (var key in req.QueryString.Keys)
        {
            if (key != null)
            {
                // if the url contains ?site=some url to a site
                switch (key.ToString())
                {
                    case "site":
                        // let's download
                        var wc = new WebClient();
                        // store the html in cnt
                        cnt = wc.DownloadString(req.QueryString[key.ToString()]);
                        // when needed you can do caching or processing here
                        // of the results, depending on your needs
                        break;
                    default:
                        break;
                }
            }
        }
        // output whatever is in cnt to the calling browser
        using (var sw = new StreamWriter(resp.OutputStream))
        {
            sw.Write(cnt);
        }
    }
}
To make the above code work you might have to set permissions for the URL; if you're on your development box, run:
netsh http add urlacl url=http://+:8080/ user=Everyone listen=yes
On production use sane values for the user.
Once that is set run the above code and point your browser to
http://localhost:8080/
(notice the / at the end)
You'll get a simple page with a link on it:
click me
Clicking that link will send a new request to the HttpListener, but this time with the query string site=http://linkedin.com. The server-side code will fetch the HTTP content at the given URL, in this case from LinkedIn.com. The result is sent back unmodified to the browser, but you can do post-processing/caching etc., depending on your requirements.
Legal notice/disclaimer
Most sites don't like being scraped this way, and their Terms of Service might actually forbid it. Make sure you don't do illegal things that either harm the site's reliability or lead to legal action against you.
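For comparison, here is a hedged sketch of the same server-side proxy idea in Node.js, since the rest of this page is JavaScript. The port, query parameter name and default page mirror the C# example; it uses only Node's built-in http/https modules and is a sketch, not a hardened proxy:

const http = require('http');
const https = require('https');

http.createServer((req, res) => {
    const url = new URL(req.url, 'http://localhost:8080');
    const site = url.searchParams.get('site');
    if (!site) {
        // default page with the link that triggers the proxying
        res.end('<html><body><a href="?site=https://linkedin.com">click me</a></body></html>');
        return;
    }
    // fetch the remote site server side and stream its body back to the browser
    const client = site.startsWith('https') ? https : http;
    client.get(site, remote => {
        res.writeHead(remote.statusCode, { 'Content-Type': remote.headers['content-type'] || 'text/html' });
        remote.pipe(res);
    }).on('error', err => {
        res.statusCode = 502;
        res.end('Fetch failed: ' + err.message);
    });
}).listen(8080);

The same legal caveat applies.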

Should I open an IDBDatabase each time or keep one instance open?

I have a SPA application that will make multiple reads/writes to IndexedDB.
Opening the DB is an asynchronous operation with a callback:
var db;
var request = window.indexedDB.open("MyDB", 2);

request.onupgradeneeded = function(event) {
    // Upgrade to the latest version...
};
request.onerror = function(event) {
    // Uh oh...
};
request.onsuccess = function(event) {
    // DB open, now do something
    db = event.target.result;
};
There are two ways I can use this db instance:
1. Keep a single db instance for the life of the page/SPA?
2. Call db.close() once the current operation is done and open a new one on the next operation?
Are there pitfalls of either pattern? Does keeping the indexedDB open have any risks/issues? Is there an overhead/delay (past the possible upgrade) to each open action?
I have found that opening a connection per operation does not substantially degrade performance. I have been running a local Chrome extension for over a year now that involves a ton of indexedDB operations and have analyzed its performance hundreds of times and have never witnessed opening a connection as a bottleneck. The bottlenecks come in doing things like not using an index properly or storing large blobs.
Basically, do not base your decision here on performance. It really isn't the issue in terms of connecting.
The issue is really the ergonomics of your code: how much you are fighting against the APIs, how intuitive your code feels when you look at it, how understandable you think the code is, and how welcoming it is to fresh eyes (your own a month later, or someone else's). This is most noticeable when dealing with the blocking issue, which indirectly deals with application modality.
My personal opinion is that if you are comfortable with writing async Javascript, use whatever method you prefer. If you struggle with async code, choosing to always open the connection will tend to avoid any issues. I would never recommend using a single global page-lifetime variable to someone who is newer to async code. You are also leaving the variable there for the lifetime of the page. On the other hand, if you find async trivial, and find the global db variable more amenable, by all means use it.
Edit - based on your comment I thought I would share some pseudocode of my personal preference:
function connect(name, version) {
    return new Promise((resolve, reject) => {
        const request = indexedDB.open(name, version);
        request.onupgradeneeded = onupgradeneeded; // upgrade handler defined elsewhere
        request.onsuccess = () => resolve(request.result);
        request.onerror = () => reject(request.error);
        request.onblocked = () => console.warn('pending till unblocked');
    });
}

async function foo(bar) {
    let conn;
    try {
        conn = await connect(DBNAME, DBVERSION);
        await storeBar(conn, bar);
    } finally {
        if (conn)
            conn.close();
    }
}

function storeBar(conn, bar) {
    return new Promise((resolve, reject) => {
        // put() needs a readwrite transaction
        const tx = conn.transaction('store', 'readwrite');
        const store = tx.objectStore('store');
        const request = store.put(bar);
        request.onsuccess = () => resolve(request.result);
        request.onerror = () => reject(request.error);
    });
}
With async/await, there isn't too much friction in having the extra conn = await connect() line in your operational functions.
Opening a connection each time is likely to be slower just because the browser is doing more work (e.g. it may need to read data from disk). Otherwise, no real down sides.
Since you mention upgrades, either pattern requires a different approach to the scenario where the user opens your page in another tab and it tries to open the database with a higher version (because it downloaded newer code from your server). Let's say the old tab was version 3 and the new tab is version 4.
In the one-connection-per-operation case you'll find that your open() at version 3 fails, because the other tab was able to upgrade to version 4. You can detect that the open failed with a VersionError and inform the user that they need to refresh the page.
In the one-connection-per-page case your connection at version 3 will block the other tab. The v4 tab can respond to the "blocked" event on the request and let the user know that older tabs should be closed. Or the v3 tab can respond to the versionchange event on the connection and tell the user that it needs to be closed. Or both.
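As a rough sketch of that second scenario (the event names are the standard IndexedDB ones; the database name, versions and user-facing messages are placeholders), each tab registers the corresponding handler:

// In the old (v3) tab: react when a newer tab wants to upgrade.
const openRequest = indexedDB.open('MyDB', 3);
openRequest.onsuccess = () => {
    const db = openRequest.result;
    db.onversionchange = () => {
        db.close(); // release the connection so the v4 tab can proceed
        alert('This tab is outdated, please close or refresh it.');
    };
};

// In the new (v4) tab: react when an older tab is still holding a connection.
const upgradeRequest = indexedDB.open('MyDB', 4);
upgradeRequest.onblocked = () => {
    alert('Please close the other open tabs of this app so it can update.');
};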
