I'm trying to scrape values from a page that are generated by a JS script. When I inspect the page, I see the values there, but my selectors return null/undefined.
The purpose of the extension is to allow people, on click of a button, to scrape their personalised data from a page that requires login WITHOUT having to provide any login details to the extension.
In the Chrome console, the static "title" values are returned, so I'm pretty sure my selectors are fine and that reading the document just doesn't pick up what the executed scripts generated.
From what I've read, I might need something like Puppeteer or Selenium, but it seems they either fire up their own browser instance (bad, as I'd need to take the user's login details to mock the sign-in process) or require changing how the Chrome browser starts, with --remote-debugging-port=A_PORT_NUMBER, which I want to avoid.
From chrome console and my extension, I can retrieve the values highlighted green, (so it is not an issue with iframes as some posts suggest) and can't retrieve values highlighted red.
HTML structure in image
From popup.html
document.addEventListener("DOMContentLoaded", function () {
...
document.querySelector('button[id="scrape"]').addEventListener("click", function onclick() {
chrome.tabs.query({ currentWindow: true, active: true },
function (activeTab) {
chrome.tabs.sendMessage(activeTab[0].id, { action: "putSource_scrapeSalePage", index: activeTab[0].index })
}
)
})
...
}, false)
From content.js
// Need to import Puppeteer/Selenium here? How else to use it for the active tab?
chrome.runtime.onMessage.addListener(
function (request, sender, sendResponse) {
...
else if (request.action === "putSource_scrapeSalePage") {
let htmlvar = $(document)
console.log(htmlvar);
let test = $('td[desc= "transactionType"]').text().trim() //returns fine
let tableData1raw = $('table.tableDataOne tbody tr').find("tbody").find("tr")
let tableData1raw_almost = $(tableData1raw).each(function (i, element) {
console.log(element)
const $element = $(element).find("td")
console.log($element)
...
The Question:
If there is no better way to do this, how can I do it from a content script with something like Puppeteer?
In the end I was able to take the value I knew I COULD get (the "transaction type" title), traverse to its sibling element (+) and retrieve whatever div was there, instead of trying to target the div's class directly.
$('td[desc= "transactionType"] + td').find("div").text();
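If the values are written into the page some time after it loads, retrying the selector until it returns something may also help. The following is only a sketch: pollUntil is a made-up helper name, and the interval and retry count are arbitrary assumptions, not something the page dictates.

```javascript
// Hypothetical helper: call `read` repeatedly until it returns a non-empty value.
// `onFound` receives the value, or null if all tries were used up.
function pollUntil(read, onFound, options) {
  var opts = options || {};
  var intervalMs = opts.intervalMs || 250; // assumed delay between tries
  var triesLeft = opts.maxTries || 40;     // assumed upper bound on tries
  var schedule = opts.schedule || function (fn, ms) { setTimeout(fn, ms); };

  function attempt() {
    var value = read();
    if (value !== null && value !== undefined && value !== "") {
      onFound(value);
      return;
    }
    triesLeft -= 1;
    if (triesLeft > 0) {
      schedule(attempt, intervalMs);
    } else {
      onFound(null); // gave up
    }
  }

  attempt();
}

// Usage in the content script (selector taken from the post above):
// pollUntil(
//   function () { return $('td[desc= "transactionType"] + td').find("div").text().trim(); },
//   function (text) { console.log(text); }
// );
```

The scheduler is injectable only so the loop can be exercised without real timers; in the extension the default setTimeout is what you would use.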
What I want:
My purpose is to check whether new content was added to a page (that I do not own), so I was thinking of making a script that saves the last content added in a cookie and refreshes the page every minute. If the cookie doesn't match the last content added, that means there is new content and I would receive a notification.
Let's try with pseudocode:
main_file:
include: functions.js;
cookie last_content_added= get_first_paragraph();
//Refresh script
do (every_minute){
page_reload();
}
when.page.reload.complete {
run script_check_content
}
functions.js
script_check_content{
var content_check = get_first_paragraph();
if (content_check == cookie[last_content_added])
{
//do nothing
}
else{
//new content was added
play.notification.mp3
cookie[last_content_added] = get_first_paragraph();
}
}
Am I missing an easier solution for what I'm looking for?
I'm new to Chrome extensions; if you could separate the code into different files as in a real extension, I would appreciate it very much.
I recommend using 'chrome.tabs.query' to get all tabs that have the specified properties (or all tabs if no properties are specified), and 'chrome.tabs.executeScript' to inject JavaScript code into the page that calls 'window.location.reload()' to refresh it.
Here's sample code that gets the current tab and reloads it using chrome.tabs methods:
chrome.tabs.query({active: true, currentWindow: true}, function (arrayOfTabs) {
var code = 'window.location.reload();';
chrome.tabs.executeScript(arrayOfTabs[0].id, {code: code});
});
Also, add an 'onCompleted' listener to be notified when the page is completely loaded and initialized:
chrome.webNavigation.onCompleted.addListener(function (details) { /* page finished loading */ });
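The check that runs after each reload could be sketched like this. It is only an illustration: checkForNewContent and the "last_content_added" key are hypothetical names, and it assumes a storage object with getItem/setItem (such as localStorage) instead of a cookie, and that the first <p> on the page is the content to watch.

```javascript
// Hypothetical helper: compare the latest content with the stored copy.
// `storage` is anything with getItem/setItem, e.g. window.localStorage.
function checkForNewContent(currentText, storage, key) {
  var previous = storage.getItem(key);
  storage.setItem(key, currentText);
  // First run: nothing stored yet, so there is nothing to compare against.
  if (previous === null || previous === undefined) {
    return false;
  }
  return previous !== currentText;
}

// In a content script it might be wired up like this (assumed selector "p"):
if (typeof document !== "undefined" && typeof localStorage !== "undefined") {
  var firstParagraph = document.querySelector("p");
  if (firstParagraph &&
      checkForNewContent(firstParagraph.textContent, localStorage, "last_content_added")) {
    console.log("New content was added");
  }
}
```

Because the value is stored before the comparison result is returned, the same change is only reported once.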
Take a look at MutationObserver, it provides a way to react to changes in a DOM. You can provide a callback to react to DOM changes and don't need to use a timer.
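A minimal sketch of that idea, assuming the new content shows up as nodes added somewhere under document.body (watchForAddedNodes and countAddedNodes are hypothetical helper names):

```javascript
// Count how many nodes were added across a batch of mutation records.
function countAddedNodes(mutations) {
  return mutations.reduce(function (sum, record) {
    return sum + record.addedNodes.length;
  }, 0);
}

// Start observing `target`; calls onAdded(count) whenever nodes are added.
// Returns the observer, or null where MutationObserver is not available.
function watchForAddedNodes(target, onAdded) {
  if (typeof MutationObserver === "undefined") {
    return null;
  }
  var observer = new MutationObserver(function (mutations) {
    var added = countAddedNodes(mutations);
    if (added > 0) {
      onAdded(added);
    }
  });
  // childList + subtree: report nodes added anywhere below the target.
  observer.observe(target, { childList: true, subtree: true });
  return observer;
}

// Usage in a content script:
// watchForAddedNodes(document.body, function (n) { console.log(n + " node(s) added"); });
```

Call disconnect() on the returned observer when you no longer need the notifications.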
I'm using an <iframe> (I know, I know, ...) in my app (single-page application with ExtJS 4.2) to do file downloads because they contain lots of data and can take a while to generate the Excel file (we're talking anything from 20 seconds to 20 minutes depending on the parameters).
The current state of things is : when the user clicks the download button, he is "redirected" by Javascript (window.location.href = xxx) to the page doing the export, but since it's done in PHP, and no headers are sent, the browser continuously loads the page, until the file is downloaded. But it's not very user-friendly, because nothing shows him whether it's still loading, done (except the file download), or failed (which causes the page to actually redirect, potentially making him lose the work he was doing).
So I created a small non-modal window docked in the bottom right corner that contains the iframe as well as a small message to reassure the user. What I need is to be able to detect when it's loaded and to differentiate 2 cases:
No data : OK => Close window
Text data : Error message => Display message to user + Close window
But I tried all 4 events (W3Schools doc) and none is ever fired. I could at least understand that if it's not HTML data returned, it may not be able to fire the event, but even if I force an error to return text data, it's not fired.
If anyone knows of a solution for this, or an alternative system that may fit here, I'm all ears! Thanks!
EDIT : Added iframe code. The idea is to get a better way to close it than a setTimeout.
var url = 'http://mywebsite.com/my_export_route';
var ifr = $('<iframe class="dl-frame" src="'+url+'" width="0" height="0" frameborder="0"></iframe>');
ifr.appendTo($('body'));
setTimeout(function() {
$('.dl-frame').remove();
}, 3000);
I wonder if it would require some significant changes in both frontend and backend code, but have you considered using AJAX? The workflow would be something like this: the user sends an AJAX request to start generating the file and the frontend keeps polling its status from the server; when it's done, show a download link to the user. I believe that workflow would be more straightforward.
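A rough sketch of the polling part, assuming the server exposes some status request that reports "pending", "done" or "failed" (the function names, the state strings and the 2-second interval are all assumptions for illustration):

```javascript
// Hypothetical: `getStatus(cb)` asks the server for the job state and calls
// cb({ state: "pending" | "done" | "failed", url: ..., error: ... }).
function watchExport(getStatus, handlers, schedule, intervalMs) {
  var later = schedule || function (fn, ms) { setTimeout(fn, ms); };
  var delay = intervalMs || 2000; // assumed polling interval

  function tick() {
    getStatus(function (status) {
      if (status.state === "done") {
        handlers.onReady(status.url);    // e.g. show the download link
      } else if (status.state === "failed") {
        handlers.onError(status.error);  // e.g. show the error message
      } else {
        later(tick, delay);              // still generating, ask again later
      }
    });
  }

  tick();
}

// Usage with jQuery (the status URL is a placeholder):
// watchExport(
//   function (cb) { $.getJSON("/my_export_status", cb); },
//   { onReady: function (url) { /* show link */ }, onError: function (msg) { /* show msg */ } }
// );
```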
Well, you could also try this trick. In parent window create a callback function for the iframe's complete loading myOnLoadCallback, then call it from the iframe with parent.myOnLoadCallback(). But you would still have to use setTimeout to handle server errors/connection timeouts.
And one last thing: how did you try to catch the iframe's events? Maybe it's something browser-related. Have you tried setting event callbacks directly in HTML attributes? Like
<iframe onload="done()" onerror="fail()"></iframe>
That's bad practice, I know, but sometimes the job needs to be done fast, eh?
UPDATE
Well, I'm afraid you have to spend a long and painful day with a JS debugger. load event should work. I still have some suggestions, though:
1) Try to set the event listener before setting the element's src. Maybe the load event fires so fast that it slips in between creating the element and setting the event's callback.
2) At the same time, check whether your server code plays nicely with iframes. I have made a simple test which attempts to download a PDF from Dropbox; try replacing my URL with your backend route.
<script src="https://code.jquery.com/jquery-1.11.1.min.js"></script>
<iframe id="book"></iframe>
<button id="go">Request downloads!</button>
<script>
var bookUrl = 'https://www.dropbox.com/s/j4o7tw09lwncqa6/thinkpython.pdf';
$('#book').on('load', function(){
console.log('WOOT!', arguments);
});
$('#go').on('click', function(){
$('#book').attr('src', bookUrl);
});
</script>
UPDATE 2
3) Also, look at the Network tab of your browser's debugger to see what happens when you set the iframe's src; it should show the request and the server's response with headers.
I've tried with jQuery and it worked just fine as you can see in this post.
I made a working example here.
It's basically this:
<iframe src="http://www.example.com" id="myFrame"></iframe>
And the code:
function test() {
alert('iframe loaded');
}
$('#myFrame').load(test);
Tested on IE11.
I guess I'll give a hackier alternative to the more proper ways of doing it that the others have posted. If you have control over the PHP download script, perhaps you can simply output JavaScript when the download is complete, or redirect to an HTML page that runs JavaScript. That JavaScript can then try to call something in the parent frame. What will work depends on whether your app runs in the same domain or not.
Same domain
Frames on the same domain can just use the frame JavaScript objects to reference each other. So in your single-page application you can have something like
window.downloadHasFinished=function(str){ //Global pollution. More unique name?
//code to be run when download has finished
}
And for your download php script, you can have it output this html+javascript when it's done
<script>
if(parent && parent.downloadHasFinished)
parent.downloadHasFinished("if you want to pass a data. maybe export url?")
</script>
Demo jsfiddle (Must run in fullscreen as the frames have different domain)
Parent jsfiddle
Child jsfiddle
Different Domains
For different domains, we can use postMessage. So in your single-page application it will be something like
$(window).on("message",function(e){
var e=e.originalEvent
if(e.origin=="http://downloadphp.anotherdomain.com"){ //for security
var message=e.data //data passed if any
//code to be run when download has finished
}
});
and in your php download script you can have it output this html+javascript
<script>
parent.postMessage("if you want to pass data",
"http://yourapp.example.com"); // targetOrigin must be the PARENT page's origin (placeholder URL), not the download domain
</script>
Parent Demo
Child jsfiddle
Conclusion
Honestly, if the other answers work, you should probably use those. I just thought this was an interesting alternative so I posted it up.
You can use the following script. It comes from a project of mine.
$("#reportContent").html("<iframe id='reportFrame' sandbox='allow-same-origin allow-scripts' width='100%' height='300' scrolling='yes' onload='onReportFrameLoad();'></iframe>");
Maybe you should use
$($('.dl-frame')[0].contentWindow.document).ready(function () {...})
Try this (pattern)
$(function () {
var session = function (url, filename) {
// `url` : URL of resource
// `filename` : `filename` for resource (optional)
var iframe = $("<iframe>", {
"class": "dl-frame",
"width": "150px",
"height": "150px",
"target": "_top"
})
// `iframe` `load` `event`
.one("load", function (e) {
$(e.target)
.contents()
.find("html")
.html("<html><body><div>"
+ $(e.target)[0].nodeName
+ " loaded" + "</div><br /></body></html>");
alert($(e.target)[0].nodeName
+ " loaded" + "\nClick link to download file");
return false
});
var _session = $.when($(iframe).appendTo("body"));
_session.then(function (data) {
var link = $("<a>", {
"id": "file",
"target": "_top",
"tabindex": "1",
"href": url,
"download": url,
"html": "Click to start {filename} download"
});
$(data)
.contents()
.find("body")
.append($(link))
.addBack()
.find("#file")
.attr("download", function (_, o) {
return (filename || o)
})
.html(function (_, o) {
return o.replace(/{filename}/,
(filename || $(this).attr("download")))
})
});
_session.always(function (data) {
$(data)
.contents()
.find("a#file")
.focus()
// start 6 second `download` `session`,
// on `link` `click`
.one("click", function (e) {
var timer = 6;
var t = setInterval(function () {
$(data)
.contents()
.find("div")
// `session` notifications
.html("Download session started at "
+ new Date() + "\n" + --timer);
}, 1000);
setTimeout(function () {
clearInterval(t);
$(data).replaceWith("<span class=session-notification>"
+ "Download session complete at\n"
+ new Date()
+ "</span><br class=session-notification />"
+ "<a class=session-restart href=#>"
+ "Restart download session</a>");
if ($("body *").is(".session-restart")) {
// start new `session`,
// on `.session-restart` `click`
$(".session-restart")
.on("click", function () {
$(".session-restart, .session-notification")
.remove()
// restart `session` (optional),
// or, other `session` `complete` `callback`
&& session(url, filename ? filename : null)
})
};
}, 6000);
});
});
};
// usage
session("http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf", "ECMA_JS.pdf")
});
jsfiddle http://jsfiddle.net/guest271314/frc82/
In regards to your comment about to get a better way to close it instead of setTimeout. You could use jQuery fadeOut option or any of the transitions and in the 'complete' callback remove the element. Below is an example you can dump right into a fiddle and only need to reference jQuery.
I also wrapped inside listener for 'load' event to not do the fade until the iFrame has been loaded as question originally was asking.
// plugin your URL here
var url = 'http://jquery.com';
// create the iFrame, set attrs, and append to body
var ifr = $("<iframe>")
.attr({
"src": url,
"width": 300,
"height": 100,
"frameborder": 0
})
.addClass("dl-frame")
.appendTo($('body'))
;
// log to show its part of DOM
console.log($(".dl-frame").length + " items found");
// create listener for load
ifr.one('load', function() {
console.log('iframe is loaded');
// call $ fadeOut to fade the iframe
ifr.fadeOut(3000, function() {
// remove iframe when fadeout is complete
ifr.remove();
// log after, should no longer exist in DOM
console.log($(".dl-frame").length + " items found");
});
});
If you are doing a file download from an iframe, the load event won't fire :) I was doing this a week ago. The only solution I found to this problem is to call a download proxy script with a token and have it return that token through a cookie when the file is loaded. Meanwhile, you need a setInterval on the page which watches for that specific cookie.
// Just to clarify
var token = String(new Date().getTime()); // ticks, used as a unique cookie name
$('<iframe>', {src: "yourproxy?file=somefile.file&token=" + token}).appendTo('body');
var timer = setInterval(function () {
    var cookie = $.cookie(token); // $.cookie comes from the jquery.cookie plugin
    if (typeof cookie != "undefined") {
        // File has been downloaded
        $.removeCookie(token);
        clearInterval(timer);
    }
}, 400);
In your proxy script, add a cookie whose name is the string sent in the token URL parameter.
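If you would rather not depend on a cookie plugin, the cookie check can be done by hand. A small sketch (readCookie is a hypothetical helper; in the browser you would pass it document.cookie):

```javascript
// Hypothetical helper: find one cookie by name in a "k1=v1; k2=v2" string.
// Returns the decoded value, or undefined when the cookie is not present.
function readCookie(cookieString, name) {
  var parts = cookieString ? cookieString.split("; ") : [];
  for (var i = 0; i < parts.length; i++) {
    var eq = parts[i].indexOf("=");
    var key = eq === -1 ? parts[i] : parts[i].slice(0, eq);
    if (decodeURIComponent(key) === String(name)) {
      return eq === -1 ? "" : decodeURIComponent(parts[i].slice(eq + 1));
    }
  }
  return undefined;
}

// Usage inside the interval callback:
// if (readCookie(document.cookie, token) !== undefined) { /* download started */ }
```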
If you control the server script that generates the Excel file (or whatever you are sending to the iframe), why not use a UID flag stored in the session with value 0? When the iframe is created and the server script is called, set the UID flag to 1, and when the script has finished (the iframe will be loaded), set it to 2.
Then you only need a timer and a periodic AJAX call to the server to check the UID flag: if it's 0 the process hasn't started, if it's 1 the file is being created, and if it's 2 the process has finished.
What do you think? If you need more information about this approach just ask.
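On the client, interpreting the flag could be as small as this (a sketch only; the function name and labels are made up, and the 0/1/2 values follow the description above):

```javascript
// Flag values as described: 0 = not started, 1 = generating, 2 = finished.
var EXPORT_STATES = { 0: "not started", 1: "generating", 2: "finished" };

// Hypothetical helper: turn the flag returned by the server into something
// the timer callback can act on. Accepts numbers or numeric strings.
function describeExport(flag) {
  var n = Number(flag);
  var known = Object.prototype.hasOwnProperty.call(EXPORT_STATES, n);
  return {
    finished: n === 2,
    label: known ? EXPORT_STATES[n] : "unknown"
  };
}

// In the periodic AJAX callback:
// var info = describeExport(response.flag);
// if (info.finished) { clearInterval(timer); /* close the window */ }
```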
What you are saying could be done for images and other media formats using $(iframe).load(function() {...});
For PDF files or other rich media, you can use the following Library:
http://johnculviner.com/jquery-file-download-plugin-for-ajax-like-feature-rich-file-downloads/
Note: You will need JQuery UI
You can use this library. The code snippet for your purpose would be something like:
window.onload = function () {
rajax_obj = new Rajax('',
{
action : 'http://mywebsite.com/my_export_route',
onComplete : function(response) {
// This will only be called if you have returned any response
// instead of file from your export script
// In your case 2
// Text data : Error message => Display message to user
}
});
}
Then you can call rajax_obj.post() on your download link click.
Download
NB: You should add some headers to your PHP script so that it forces a file download
header('Content-Disposition: attachment; filename="'.$file.'"');
header('Content-Transfer-Encoding: binary');
There are two solutions that I can think of. One is to have PHP post its progress to a MySQL table, from which the frontend pulls information using AJAX calls to check on the progress of the generation. Using some kind of unique key that is generated when accessing the page would be ideal for handling multiple people generating Excel files at the same time.
Another solution would be to use Node.js and have PHP post the progress of the Excel file to a Node.js service using cURL or a socket. When Node.js receives updates from PHP, you simply write the progress of the Excel file to the right socket. This will cut off some browser support, though, unless you use external libraries that bring WebSocket support to pretty much all browsers and versions.
Hope this answer helps. I was having the same issue last year and ended up doing AJAX polling, with PHP posting progress on the fly.
Try this:
Note: You should be on the same domain.
var url = 'http://mywebsite.com/my_export_route',
iFrameElem = $('body')
.append('<iframe class="dl-frame" src="' + url + '" width="0" height="0" frameborder="0"></iframe>')
.find('.dl-frame').get(0),
iDoc = iFrameElem.contentDocument || iFrameElem.contentWindow.document;
$(iDoc).ready(function (event) {
console.log('iframe ready!');
// do stuff here
});
In the background page we're able to detect extension updates using chrome.runtime.onInstalled.addListener.
But after the extension has been updated, none of the content scripts can connect to the background page, and we get an error: Error connecting to extension ....
It's possible to re-inject content scripts using chrome.tabs.executeScript... But what if we have sensitive data that should be saved before an update and used after the update? What could we do?
Also if we re-inject all content scripts we should properly tear down previous content scripts.
What is the proper way to handle extension updates from content scripts without losing the user data?
If you've established a communication through var port = chrome.runtime.connect(...) (as described on
https://developer.chrome.com/extensions/messaging#connect), it should be possible to listen to the runtime.Port.onDisconnect event:
port.onDisconnect.addListener(function(msg) {...})
There you can react and, e.g., apply some sort of memoization, let's say via localStorage. But in general, I would suggest keeping content scripts as tiny as possible and performing all data manipulation in the background, leaving the content script only to collect/pass data and render some state, if needed.
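As a sketch of that memoization idea (persistOnDisconnect, restoreState and the "pendingData" key are names made up for illustration; keep in mind that localStorage in a content script belongs to the host page, so this is not a place for truly sensitive data):

```javascript
// Hypothetical helper: when the port to the background page goes away,
// write the current state into `storage` so a fresh content script can pick it up.
function persistOnDisconnect(port, storage, key, getState) {
  port.onDisconnect.addListener(function () {
    storage.setItem(key, JSON.stringify(getState()));
  });
}

// Hypothetical helper: read back whatever was saved, or null if nothing was.
function restoreState(storage, key) {
  var raw = storage.getItem(key);
  return raw === null || raw === undefined ? null : JSON.parse(raw);
}

// Usage in a content script:
// var port = chrome.runtime.connect();
// persistOnDisconnect(port, localStorage, "pendingData", function () { return collected; });
// var previous = restoreState(localStorage, "pendingData");
```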
Once a Chrome extension update happens, the "orphaned" content script is cut off from the extension completely. The only way it can still communicate is through the shared DOM. If you're talking about really sensitive data, this is not secure from the page. More on that later.
First off, you can delay an update. In your background script, add a handler for the chrome.runtime.onUpdateAvailable event. As long as the listener is there, you have a chance to do cleanup.
// Background script
chrome.runtime.onUpdateAvailable.addListener(function(details) {
// Do your work, call the callback when done
syncRemainingData(function() {
chrome.runtime.reload();
});
});
Second, suppose the worst happens and you are cut off. You can still communicate using DOM events:
// Content script
// Get ready for data
window.addEventListener("SendRemainingData", function(evt) {
processData(evt.detail);
}, false);
// Request data
var event = new CustomEvent("RequestRemainingData");
window.dispatchEvent(event);
// Be ready to send data if asked later
window.addEventListener("RequestRemainingData", function(evt) {
var event = new CustomEvent("SendRemainingData", {detail: data});
window.dispatchEvent(event);
}, false);
However, this communication channel is potentially eavesdropped on by the host page. And, as said previously, that eavesdropping is not something you can bypass.
Yet, you can have some out-of-band pre-shared data. Suppose that you generate a random key on first install and keep it in chrome.storage - this is not accessible by web pages by any means. Of course, once orphaned you can't read it, but you can at the moment of injection.
var PSK;
chrome.storage.local.get("preSharedKey", function(data) {
PSK = data.preSharedKey;
// ...
window.addEventListener("SendRemainingData", function(evt) {
processData(decrypt(evt.detail, PSK));
}, false);
// ...
window.addEventListener("RequestRemainingData", function(evt) {
var event = new CustomEvent("SendRemainingData", {detail: encrypt(data, PSK)});
window.dispatchEvent(event);
}, false);
});
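For completeness, generating the key on first install might look roughly like this in the background script. It is a sketch under assumptions: toHex and createPreSharedKey are hypothetical helpers, the 32-byte length is arbitrary, and the encrypt/decrypt functions used above are still left for you to supply.

```javascript
// Convert a byte array to a lowercase hex string.
function toHex(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += (bytes[i] < 16 ? "0" : "") + bytes[i].toString(16);
  }
  return out;
}

// Hypothetical helper: build a 32-byte key using the supplied random filler.
function createPreSharedKey(fillRandom) {
  var bytes = new Uint8Array(32);
  fillRandom(bytes);
  return toHex(bytes);
}

// Background script: only runs where the chrome.* extension APIs exist.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onInstalled) {
  chrome.runtime.onInstalled.addListener(function (details) {
    if (details.reason === "install") {
      var key = createPreSharedKey(function (b) { crypto.getRandomValues(b); });
      chrome.storage.local.set({ preSharedKey: key });
    }
  });
}
```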
This is of course proof-of-concept code. I doubt that you will need more than an onUpdateAvailable listener.
I'm writing an extension that checks every document a user views for certain data structures, makes some back-end server calls and displays the results as a dialog. The problem is starting and continuing the sequence properly with event listeners. My current idea is:
Load: function()
{
var Listener = function(){ Fabogore.Start();};
var ListenerTab = window.gBrowser.selectedTab;
ListenerTab.addEventListener("load",Listener,true);
}
(...)
ListenerTab.removeEventListener("load", Listener, true);
Fabogore.Load();
The Fabogore.Load function is first initialized when the browser gets opened. It works only once I get these data structures, but not afterwards. But theoretically the script should initialize a new listener, so maybe it's the selectedTab. I also tried listening to focus events.
If someone has an alternative solution for accessing the page a user is currently viewing, I would be happy with that as well.
The common approach is using a progress listener. If I understand correctly, you want to get a notification whenever a browser tab finished loading. So the important method in your progress listener would be onStateChange (it needs to have all the other methods as well however):
onStateChange: function(aWebProgress, aRequest, aFlag, aStatus)
{
if ((aFlag & Components.interfaces.nsIWebProgressListener.STATE_STOP) &&
(aFlag & Components.interfaces.nsIWebProgressListener.STATE_IS_WINDOW) &&
aWebProgress.DOMWindow == aWebProgress.DOMWindow.top)
{
// A window finished loading and it is the top-level frame in its tab
Fabogore.Start(aWebProgress.DOMWindow);
}
},
OK, I found a way that works, from the MDN documentation, which lets your extension access every document a user opens. Accessing every document a user focuses is too much; I want the code to be executed only once. So I start by initializing the extension and listening to the DOMContentLoaded event:
window.addEventListener("load", function() { Fabogore.init(); }, false);
var Fabogore = {
init: function() {
var appcontent = document.getElementById("appcontent"); // browser
if(appcontent)
appcontent.addEventListener("DOMContentLoaded", Fabogore.onPageLoad, true);
},
This executes the code every time a page is loaded. What's important now is that you run your code against the newly loaded page, not the old one. You can access it through the variable aEvent:
onPageLoad: function(aEvent)
{
var doc = aEvent.originalTarget;//Document which initiated the event
With the variable "doc" you can check data structures using XPCNativeWrapper etc. Thanks Wladimir for pointing me in the right direction; I suppose that if you need more sophisticated event listening, you should go his way with the progress listeners.
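One thing to watch out for: DOMContentLoaded also fires for frames inside the page. A small filter like the following can be called at the top of onPageLoad; it is a sketch, and isTopLevelDocument is a name I made up:

```javascript
// Hypothetical helper: true only for the top-level document of a tab,
// false for frames or for events whose target is not a document.
function isTopLevelDocument(doc) {
  if (!doc || doc.nodeName !== "#document") {
    return false; // the event target was not a document
  }
  var win = doc.defaultView;
  if (!win) {
    return false;
  }
  // A framed document has a different top window and a frameElement.
  return win === win.top && !win.frameElement;
}

// Usage inside onPageLoad:
// var doc = aEvent.originalTarget;
// if (!isTopLevelDocument(doc)) return;
```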