Chrome Extension: Modifying a built-in function for all web workers - javascript

So here's a bit of background. I need to intercept the return data from a fetch request, and I already know how to do that: by injecting code into the webpage, I can read the response body of a fetch, like this:
var proxyFetch = window.fetch.bind(window);
window.fetch = async function () {
    var response = await proxyFetch(...arguments);
    // a lot of other code as well
    return response;
};
The problem I'm having is that I can't modify the fetch function for any web workers, because I don't have access to their 'this' context. My best bet seems to be something to do with WorkerGlobalScope, but I don't know if that object can do what I want.
My second idea was to do something like this before the page loads:
window.Worker = class extends window.Worker {
    constructor() {
        // Take the url they passed into the web worker.
        // Fetch the url to get the raw javascript.
        // Append my javascript to it.
        // Turn it into a blob url.
        // Now I have my javascript running in every web worker.
        //
        // The problem here is that we are not allowed to make asynchronous
        // calls in the constructor (super() must be called synchronously),
        // so I am stumped on this option as well.
    }
}
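One way around the async constraint, sketched here as an untested assumption (and only for classic workers, since importScripts is unavailable in module workers): don't fetch the original script at all. Instead, synchronously build a blob that injects the patch and then pulls the original script in via importScripts():
const NativeWorker = window.Worker;
window.Worker = class extends NativeWorker {
    constructor(scriptURL, options) {
        // Resolve against the page URL, because a blob worker would otherwise
        // resolve relative paths against its blob: URL.
        const absolute = new URL(scriptURL, location.href).href;
        const injected = `
            // hypothetical patch applied inside every worker
            const proxyFetch = self.fetch.bind(self);
            self.fetch = async function () {
                const response = await proxyFetch(...arguments);
                // inspect response.clone() here
                return response;
            };
            importScripts(${JSON.stringify(absolute)});
        `;
        const blobURL = URL.createObjectURL(
            new Blob([injected], { type: 'application/javascript' })
        );
        super(blobURL, options);
    }
};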
I've looked into webRequest, but it doesn't let extensions look at the response body of network requests.
I played around with this a lot to see if I could modify globalThis or WorkerGlobalScope for every web worker on the page so I could modify fetch, but I could not find a way.
var code = "postMessage('hello world');"
var blob = new Blob([code], { type: 'application/javascript' });
var url = URL.createObjectURL(blob);
var worker = new Worker(url);
worker.onmessage = function(e) {
console.log(e.data);
};

What you are trying to achieve is meant to be done by a Service Worker.
A service worker works as a proxy between the app and the server: you can control whether a request is even made and what it returns.
Additionally, you can send messages between your workers and the service worker through the postMessage interface.
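A minimal sketch of that proxying, assuming a service worker registered from the page via navigator.serviceWorker.register('/sw.js') (the file name and the message shape are placeholders):
// sw.js — intercepts fetches from the page (and from dedicated workers in its scope)
self.addEventListener('fetch', (event) => {
    event.respondWith((async () => {
        const response = await fetch(event.request);
        // read a copy of the body so the page still gets a usable response
        const body = await response.clone().text();
        const clients = await self.clients.matchAll();
        for (const client of clients) {
            client.postMessage({ url: event.request.url, body });
        }
        return response;
    })());
});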

Related

How to get the resource initiator like chrome-dev-tools?

I am making a javascript tracker using CefSharp.
While searching for solutions, I found that a service worker can hijack http/https requests with code like this, but it didn't work:
navigator.serviceWorker.addEventListener('fetch', event => {
    event.respondWith(async function (e) {
        var err = new Error();
        ccw.hhh(err.stack); // print call-stack info using my own js-object
    }());
});
Since I didn't create the web pages myself, I couldn't use code like navigator.serviceWorker.register('sw.js').
Also, https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Intercept_HTTP_requests does not work in CefSharp.
And the IRequest passed to IResourceHandler::Open(IRequest ...) by CefSharp contains no call stack or initiator information.
How can I get the call stack information for requests in CefSharp?

Javascript Workers - why is the worker message handled so late, and can I do something about it?

I have a Worker that shares a SharedArrayBuffer with the "main thread". To work correctly, I have to make sure that the worker has access to the SAB before the main thread accesses it. (EDIT: The code creating the worker has to be in a separate function (EDIT2: which returns an array pointing to the SAB).) (Maybe this alone is not possible, you'll tell me.)
The initial code looks like this:
function init() {
    var code = `onmessage = function(event) {
        console.log('starting');
        var buffer = event.data;
        var arr = new Uint32Array(buffer); // I need to have this done before accessing the buffer again from the main thread
        // some other code, manipulating the array
    }`;
    var buffer = new SharedArrayBuffer(BUFFER_ELEMENT_SIZE);
    var blob = new Blob([code], { "type": 'application/javascript' });
    var url = window.URL || window.webkitURL;
    var blobUrl = url.createObjectURL(blob);
    var counter = new Worker(blobUrl);
    counter.postMessage(buffer);
    let res = new Uint32Array(buffer);
    return res;
}
function test() {
    let array = init();
    console.log('main');
    // accessing the SAB again
}
The worker code is always executed after test(): the console always shows main, then starting.
Using timeouts does not help. Consider the following code for test:
function test() {
    let array = [];
    console.log('main');
    setTimeout(function () {
        array = init();
    }, 0);
    setTimeout(function () {
        console.log('main');
        // accessing the SAB again
    }, 0);
    console.log('end');
}
The console shows end first, followed by main, followed by starting.
However, assigning the buffer to a global array outside the test() function does the job, even without timeouts.
My questions are the following:
Why does the worker not start directly after the message is sent (= received)? AFAIK, workers have their own event queue, so they should not rely on the main stack becoming empty.
Is there a specification detailing when a worker starts working after a message is sent?
Is there a way to make sure the worker has started before accessing the SAB again, without using global variables? (One could use busy waiting, but I'd rather not.) There is probably no way, but I want to be sure.
Edit
To be more precise:
In a completely parallel scenario, the worker would be able to handle the message immediately after it was posted. This is obviously not the case.
Most browser APIs (and Worker is such an API) use a callback queue to handle calls to the API. But if that applied here, the message would be posted/handled before the timeout callbacks were executed.
To go even further: if I busy-wait after postMessage, reading from the SAB until one value changes, the program blocks infinitely. To me this means that the browser does not post the message until the call stack is empty. As far as I know, this behaviour is not documented, and I cannot explain it.
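A minimal sketch of that blocking experiment (the worker body is reduced to the single write):
const workerCode = `onmessage = (e) => { new Int32Array(e.data)[0] = 1; };`;
const blobUrl = URL.createObjectURL(new Blob([workerCode], { type: 'application/javascript' }));
const sab = new SharedArrayBuffer(4);
const view = new Int32Array(sab);
const worker = new Worker(blobUrl);
worker.postMessage(sab);
while (Atomics.load(view, 0) === 0) {
    // spins forever: the message is still parked on the main thread's side
    // of the port and cannot be delivered while we block the call stack
}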
To summarize: I want to know how the browser determines when to post the message and have the worker handle it, when the call to postMessage happens inside a function. I already found a workaround (global variables), so I'm more interested in how it works behind the scenes. But if someone can show me a working example, I'll take it.
EDIT 2:
The code using the global variable (the code that works fine) looks like this:
function init() {
    // Unchanged
}
var array = init(); // global
function test() {
    console.log('main');
    // accessing the SAB again
}
It prints starting, then main to the console.
Also worth noticing: if I debug the code in Firefox (Chrome not tested), I get the result I want without the global variable (starting before main). Can someone explain?
why does the worker not start directly after the message was sen[t] (= received)? AFAIK, workers have their own event queue, so they should not rely on the main stack becoming empty?
First, even though your Worker object is available synchronously in the main thread, the actual worker thread has a lot to do before it can handle your message:
it has to perform a network request to retrieve the script's content. Even with a blob URI, this is an async operation.
it has to initialize the whole js context, so even if the network request were lightning fast, this would still add to the parallel execution time.
it has to wait for the event-loop frame following the main script's execution before handling your message. Even if the initialization were lightning fast, it would still wait some time.
So under normal circumstances, there is very little chance that your Worker could execute your code by the time you need the data.
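A rough way to see that startup cost, measuring only on the main thread (the delay shown covers the whole fetch + init + first-frame pipeline, not a precise breakdown):
const t0 = performance.now();
const url = URL.createObjectURL(new Blob(['postMessage("ready");'], { type: 'application/javascript' }));
const w = new Worker(url);
w.onmessage = () => {
    console.log(`first worker message after ~${(performance.now() - t0).toFixed(1)} ms`);
};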
Now, you talked about blocking the main thread.
If I try busy waiting after postMessage by reading from the SAB until it changes one value will block the program infinitely
During the initialization of your Worker, the messages are temporarily kept on the main thread, in what is called the outside port. It's only after the fetching of the script is done that this outside port is entangled with the inside port, and the messages actually pass to the parallel thread.
So if you block the main thread before the ports have been entangled, the message can never be passed to the worker's thread.
Is there a specification detailing when a worker starts working after sending a message?
Sure, and more specifically: in the spec's "run a worker" algorithm, the port message queue is enabled at step 26, and the event loop is actually started at step 29.
Is there a way to make sure the worker has started before accessing the SAB again without using global variables? [...]
Sure: have your Worker post a message back to the main thread once it has started.
// some precautions because all browsers still haven't re-enabled SharedArrayBuffers
const has_shared_array_buffer = window.SharedArrayBuffer;
function init() {
    // Since our worker will do only a single operation,
    // we can promisify it.
    // If we were to use it for more than a single task,
    // we could promisify each task by using a MessagePort.
    return new Promise((resolve, reject) => {
        const code = `
            onmessage = function(event) {
                console.log('hi');
                var buffer = event.data;
                var arr = new Uint32Array(buffer);
                arr.fill(255);
                if (self.SharedArrayBuffer) {
                    postMessage("done");
                } else {
                    postMessage(buffer, [buffer]);
                }
            }`;
        let buffer = has_shared_array_buffer ? new SharedArrayBuffer(16) : new ArrayBuffer(16);
        const blob = new Blob([code], { "type": 'application/javascript' });
        const blobUrl = URL.createObjectURL(blob);
        const counter = new Worker(blobUrl);
        counter.onmessage = e => {
            if (!has_shared_array_buffer) {
                buffer = e.data;
            }
            const res = new Uint32Array(buffer);
            resolve(res);
        };
        counter.onerror = reject;
        if (has_shared_array_buffer) {
            counter.postMessage(buffer);
        } else {
            counter.postMessage(buffer, [buffer]);
        }
    });
}
async function test() {
    let array = await init();
    // accessing the SAB again
    console.log(array);
}
test().catch(console.error);
According to MDN:
Data passed between the main page and workers is copied, not shared. Objects are serialized as they're handed to the worker, and subsequently, de-serialized on the other end. The page and worker do not share the same instance, so the end result is that a duplicate is created on each end. Most browsers implement this feature as structured cloning.
Read more about transferring data to and from workers
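To illustrate the quote, here is a quick sketch of copying versus transferring (assume worker is an already-created Worker):
const arr = new Uint8Array([1, 2, 3]);
worker.postMessage(arr);                      // structured clone: the worker gets a copy
worker.postMessage(arr.buffer, [arr.buffer]); // transfer: arr.buffer is detached on this side
// A SharedArrayBuffer is the exception: posting one shares the same memory.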
Here's a basic example that shares a buffer with a worker. It fills the array with even values (i*2) and sends it to the worker, using Atomics operations to change the buffer values.
To make sure the worker has started, you can just use different messages.
var code = document.querySelector('[type="javascript/worker"]').textContent;
var blob = new Blob([code], { "type": 'application/javascript' });
var blobUrl = URL.createObjectURL(blob);
var counter = new Worker(blobUrl);
var sab;
var initBuffer = function (msg) {
    sab = new SharedArrayBuffer(16);
    counter.postMessage({
        init: true,
        msg: msg,
        buffer: sab
    });
};
var editArray = function () {
    var res = new Int32Array(sab);
    for (let i = 0; i < 4; i++) {
        Atomics.store(res, i, i * 2);
    }
    console.log('Array edited', res);
};
initBuffer('Init buffer and start worker');
counter.onmessage = function (event) {
    console.log(event.data.msg);
    if (event.data.edit) {
        editArray();
        // share new buffer with worker
        counter.postMessage({ buffer: sab });
        // end worker
        counter.postMessage({ end: true });
    }
};
<script type="javascript/worker">
var sab;
self['onmessage'] = function(event) {
if (event.data.init) {
postMessage({msg: event.data.msg, edit: true});
}
if (event.data.buffer) {
sab = event.data.buffer;
var sharedArray = new Int32Array(sab);
postMessage({msg: 'Shared Array: '+sharedArray});
}
if (event.data.end) {
postMessage({msg: 'Time to rest'});
}
};
</script>

Retry failed pages with new proxyUrl

I have developed an Actor+PuppeteerCrawler+Proxy based crawler and want to rescrape failed pages. To increase the chances of the rescrape, I want to switch to another proxyUrl. The idea is to create a new crawler with a modified launchPuppeteer function and a different proxyUrl, and to re-enqueue the failed pages. Please check the sample code below.
Unfortunately, it doesn't work, although I reset the request queue by dropping and reopening it. Is it possible to rescrape failed pages using PuppeteerCrawler with a different proxyUrl, and if so, how?
Best regards,
Wolfgang
for (let retryCount = 0; retryCount <= MAX_RETRY_COUNT; retryCount++) {
    if (retryCount) {
        // Try to reset the request queue so that failed requests will be rescraped
        await requestQueue.drop();
        requestQueue = await Apify.openRequestQueue(); // this is necessary to avoid exceptions
        // Re-enqueue failed urls from the array failedUrls >>> ignored although using drop() and reopening the request queue!!!
        for (let failedUrl of failedUrls) {
            await requestQueue.addRequest({ url: failedUrl });
        }
    }
    crawlerOptions.launchPuppeteerFunction = () => {
        return Apify.launchPuppeteer({
            // generates a new proxy url and adds it to a new launchPuppeteer function
            proxyUrl: createProxyUrl()
        });
    };
    let crawler = new Apify.PuppeteerCrawler(crawlerOptions);
    await crawler.run();
}
I think your approach should work, but on the other hand it should not be necessary. I'm not sure what createProxyUrl does.
You can supply a generic proxy URL with the auto username, which will use all your datacenter proxies at Apify. Or you can provide proxyUrls directly to PuppeteerCrawler.
Just don't forget that you have to switch browsers to get a new IP from the proxy. More in this article - https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler
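For illustration, a sketch of that built-in rotation with a recent Apify SDK (createProxyUrl is the asker's own helper; option names may differ between SDK versions):
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' }); // placeholder

    // rotate through several proxy URLs instead of rebuilding the crawler
    const proxyConfiguration = await Apify.createProxyConfiguration({
        proxyUrls: [createProxyUrl(), createProxyUrl(), createProxyUrl()],
    });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        proxyConfiguration,
        // failed requests are retried automatically; each retry can run in a
        // browser that uses a different proxy URL from the list above
        maxRequestRetries: 5,
        handlePageFunction: async ({ request, page }) => {
            // scrape here
        },
    });
    await crawler.run();
});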

How to create a cross-domain HTTP request

I have a website, and I need a way to get HTML data from a different website via an HTTP request. I've looked around for ways to implement this, and most suggest an ajax call instead.
An ajax call is blocked by LinkedIn, so I want to try a plain cross-domain HTTP request and hope it isn't blocked one way or another.
If you have a server running and are able to run code on it, you can make the HTTP call server side. Keep in mind, though, that most sites only allow so many calls per IP address, so you can't serve a lot of users this way.
This is a simple HttpListener that downloads a website's content when the query string contains ?site=http://linkedin.com:
// requires: using System.IO; using System.Net;
// set up a listener
using (var listener = new HttpListener())
{
    // on port 8080
    listener.Prefixes.Add("http://+:8080/");
    listener.Start();
    while (true)
    {
        // wait for a connection
        var ctx = listener.GetContext();
        var req = ctx.Request;
        var resp = ctx.Response;
        // default page
        var cnt = "<html><body><a href=\"?site=http://linkedin.com\">click me</a></body></html>";
        foreach (var key in req.QueryString.Keys)
        {
            if (key != null)
            {
                // if the url contains ?site=<some url to a site>
                switch (key.ToString())
                {
                    case "site":
                        // let's download
                        var wc = new WebClient();
                        // store the html in cnt
                        cnt = wc.DownloadString(req.QueryString[key.ToString()]);
                        // when needed you can do caching or processing here
                        // of the results, depending on your needs
                        break;
                    default:
                        break;
                }
            }
        }
        // write whatever is in cnt back to the calling browser
        using (var sw = new StreamWriter(resp.OutputStream))
        {
            sw.Write(cnt);
        }
    }
}
To make the above code work you might have to set permissions for the URL. If you're on your development box, run:
netsh http add urlacl url=http://+:8080/ user=Everyone listen=yes
In production, use sane values for the user.
Once that is set, run the above code and point your browser to
http://localhost:8080/
(notice the / at the end).
You'll get a simple page with a link on it:
click me
Clicking that link sends a new request to the HttpListener, this time with the query string site=http://linkedin.com. The server-side code fetches the content at the given URL, in this case from LinkedIn.com, and the result is sent back one-to-one to the browser; you can add post-processing/caching etc., depending on your requirements.
Legal notice/disclaimer
Most sites don't like being scraped this way, and their Terms of Service may actually forbid it. Make sure you don't do illegal things that either harm site reliability or lead to legal action against you.

Execute web worker from different origin

I am developing a library which I want to host on a CDN. The library is going to be used on many different domains across multiple servers. The library itself contains one script (let's call it script.js for now) which loads a web worker (worker.js).
Loading the library itself is quite easy: just add the <script type="text/javascript" src="http://cdn.mydomain.com/script.js"></script> tag to the domain on which I want to use the library (www.myotherdomain.com). However, since the library loads a worker from http://cdn.mydomain.com/worker.js via new Worker('http://cdn.mydomain.com/worker.js'), I get a SecurityException. CORS is enabled on cdn.mydomain.com.
Web workers are not allowed to be loaded from a remote domain, and using CORS does not help: browsers seem to ignore it and don't even execute the preflight check.
A way around this is to perform an XMLHttpRequest to get the source of the worker, create a blob URL from it, and create the worker using that URL. This works in Firefox and Chrome, but does not seem to work in Internet Explorer or Opera.
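That workaround looks roughly like this (a sketch; it assumes CORS headers on the CDN response):
var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://cdn.mydomain.com/worker.js');
xhr.onload = function () {
    // run the cross-origin source from a same-origin blob: URL
    var blob = new Blob([xhr.response], { type: 'application/javascript' });
    var worker = new Worker(URL.createObjectURL(blob));
};
xhr.send();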
A solution would be to place the worker on www.myotherdomain.com, or to place a proxy file there (which simply loads the worker from the CDN using XHR or importScripts). I don't like this solution, however: it requires me to place additional files on the server, and since the library is used on multiple servers, updating would be difficult.
My question consists of two parts:
Is it possible to have a worker on a remote origin in IE 10+?
If 1 is the case, what is the best way to make this work cross-browser?
The best approach is probably to dynamically generate a simple worker script that internally calls importScripts(), which is not limited by this cross-origin restriction.
To understand why you can't use a cross-domain script directly as a Worker init-script, see this answer: basically, the Worker context would get its own origin set to the one of that script.
// The script there simply posts back a "Hello" message.
// Obviously cross-origin here.
const cross_origin_script_url = "https://greggman.github.io/doodles/test/ping-worker.js";

const worker_url = getWorkerURL( cross_origin_script_url );
const worker = new Worker( worker_url );
worker.onmessage = (evt) => console.log( evt.data );
URL.revokeObjectURL( worker_url );

// Returns a blob:// URL which points
// to a javascript file which will call
// importScripts with the given URL
function getWorkerURL( url ) {
    const content = `importScripts( "${ url }" );`;
    return URL.createObjectURL( new Blob( [ content ], { type: "text/javascript" } ) );
}
For those who find this question:
YES.
It is absolutely possible: the trick is to leverage an iframe on the remote domain and communicate with it through postMessage. The remote iframe (hosted on cdn.mydomain.com) is able to load the web worker (located at cdn.mydomain.com/worker.js) since they both have the same origin.
The iframe then acts as a proxy between the postMessage calls. script.js is, however, responsible for filtering the messages so that only valid worker messages are handled.
The downside is that communication (and data transfer) speed takes a performance hit.
In short:
script.js appends an iframe with src="//cdn.mydomain.com/iframe.html".
iframe.html, at cdn.mydomain.com/iframe.html, executes new Worker("worker.js") and acts as a proxy for message events from window and worker.postMessage (and the other way around).
script.js communicates with the worker using iframe.contentWindow.postMessage and the message event from window (with the proper checks for the correct origin, and worker identification when there are multiple workers), as sketched below.
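A minimal sketch of that proxy (origins as assumed above; real code should also tag messages so unrelated postMessage traffic is filtered out):
// iframe.html on cdn.mydomain.com — same origin as the worker
const worker = new Worker('worker.js');
window.addEventListener('message', (event) => {
    if (event.origin !== 'https://www.myotherdomain.com') return;
    worker.postMessage(event.data); // page -> worker
});
worker.onmessage = (event) => {
    window.parent.postMessage(event.data, 'https://www.myotherdomain.com'); // worker -> page
};

// script.js on www.myotherdomain.com
const iframe = document.createElement('iframe');
iframe.style.display = 'none';
iframe.src = 'https://cdn.mydomain.com/iframe.html';
document.body.appendChild(iframe);
iframe.onload = () => {
    iframe.contentWindow.postMessage({ cmd: 'ping' }, 'https://cdn.mydomain.com');
};
window.addEventListener('message', (event) => {
    if (event.origin !== 'https://cdn.mydomain.com') return;
    console.log('from worker:', event.data);
});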
It's not possible to load a web worker directly from a different domain.
Similar to your suggestion, you could fetch the worker source, then base64 it. Doing so allows you to do:
const worker = new Worker(`data:text/javascript;base64,${btoa(workerJs)}`)
You can find out more info about data URIs here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs.
This is the workaround I prefer because it doesn't require anything as involved as an iframe with a message proxy, and it is very simple to get working, provided you set up CORS correctly on your CDN.
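Spelled out, the whole flow looks roughly like this (the worker URL is a placeholder, and note that btoa only handles Latin-1, so UTF-8 sources need extra encoding):
// inside an async function
const resp = await fetch('https://cdn.mydomain.com/worker.js'); // needs CORS
const workerJs = await resp.text();
const worker = new Worker(`data:text/javascript;base64,${btoa(workerJs)}`);
worker.onmessage = (e) => console.log(e.data);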
Since @KevinGhadyani's answer (and blob techniques in general) requires loosening your CSPs (by adding a worker-src data: or blob: directive, for example), here is a little example of how you can take advantage of importScripts inside a worker to load a worker script hosted on another domain, without loosening your CSPs.
It may help you load a worker from any CDN allowed by your CSPs.
As far as I know, it works on Opera, Firefox, Chrome, Edge and all browsers that support workers.
/**
 * This worker allows us to import a script from our CDN as a worker,
 * avoiding having to reduce our security policy.
 */

/**
 * Send a formatted response to the main thread. Can handle regular errors.
 * @param {('imported'|'error')} resp
 * @param {*} data
 */
function respond(resp, data = undefined) {
    const msg = { resp };
    if (data !== undefined) {
        if (data && typeof data === 'object') {
            msg.data = {};
            if (data instanceof Error) {
                msg.error = true;
                msg.data.code = data.code;
                msg.data.name = data.name;
                msg.data.stack = data.stack.toString();
                msg.data.message = data.message;
            } else {
                Object.assign(msg.data, data);
            }
        } else msg.data = data;
    }
    self.postMessage(msg);
}

function handleMessage(event) {
    if (typeof event.data === 'string' && event.data.match(/^#worker-importer/)) {
        const [
            action = null,
            data = null
        ] = event.data.replace('#worker-importer.', '').split('|');
        switch (action) {
            case 'import':
                if (data) {
                    try {
                        importScripts(data);
                        respond('imported', { url: data });
                        // The work is done, we can just unregister the handler
                        // and let the imported worker do its work without us.
                        self.removeEventListener('message', handleMessage);
                    } catch (e) {
                        respond('error', e);
                    }
                } else respond('error', new Error(`No url specified.`));
                break;
            default: respond('error', new Error(`Unknown action ${action}`));
        }
    }
}

self.addEventListener('message', handleMessage);
How to use it?
Obviously, your CSPs must allow the CDN domain, but you don't need any additional CSP rules.
Let's say your domain is my-domain.com, and your CDN is statics.your-cdn.com.
The worker we want to import is hosted at https://statics.your-cdn.com/super-worker.js and contains:
self.addEventListener('message', event => {
    if (event.data === 'who are you ?') {
        self.postMessage("It's me ! I'm useless, but I'm alive !");
    } else self.postMessage("I don't understand.");
});
Assuming that you host a file with the worker-importer code on your domain (NOT your CDN) under the path https://my-domain.com/worker-importer.js, and that you start your worker inside a script tag at https://my-domain.com/, this is how it works:
<script>
    window.addEventListener('load', async () => {

        function importWorker(url) {
            return new Promise((resolve, reject) => {
                // The worker importer
                const workerImporter = new Worker('/worker-importer.js');
                // Will only be used to import our worker
                function handleImporterMessage(event) {
                    const { resp = null, data = null } = event.data;
                    if (resp === 'imported') {
                        console.log(`Worker at ${data.url} successfully imported !`);
                        workerImporter.removeEventListener('message', handleImporterMessage);
                        // Now we can work with our worker. It's ready !
                        resolve(workerImporter);
                    } else if (resp === 'error') {
                        reject(data);
                    }
                }
                workerImporter.addEventListener('message', handleImporterMessage);
                workerImporter.postMessage(`#worker-importer.import|${url}`);
            });
        }

        const worker = await importWorker("https://statics.your-cdn.com/super-worker.js");
        worker.addEventListener('message', event => {
            console.log('worker message : ', event.data);
        });
        worker.postMessage('who are you ?');

    });
</script>
This will print:
Worker at https://statics.your-cdn.com/super-worker.js successfully imported !
worker message : It's me ! I'm useless, but I'm alive !
Note that the code above works even if it is itself hosted on the CDN.
This is especially useful when you have several worker scripts on your CDN, or if you build a library that must be hosted on a CDN and you want your users to be able to call your workers without having to host all the workers on their own domain.
