Why won't JavaScript let me scrape webpages? - javascript

In trying to advance my skills in JavaScript, I am attempting to make a little webpage that scrapes static articles from Wikipedia. I have the page all set up, and everything works in creating links and an iframe. But when I try to access the contents of the iframe:
function getIframeContent(frameId) {
    var frameObj = document.getElementById(frameId);
    var frameContent = frameObj.contentWindow.document.body.innerHTML;
    alert("frame content : " + frameContent);
}
It throws this error:
Uncaught DOMException: Blocked a frame with origin "https://nottimtam.github.io" from accessing a cross-origin frame.
at getIframeContent (https://nottimtam.github.io/wikipedia-scraper/main.js:44:23)
at trigger_download (https://nottimtam.github.io/wikipedia-scraper/main.js:50:25)
at HTMLButtonElement.onclick (https://nottimtam.github.io/wikipedia-scraper/:22:54)
So I researched the cross-origin error and I understand why it exists. What I don't understand is why I cannot download the static text and images on this page. I can do something very similar, without issue, with BeautifulSoup in Python...
I am VERY new to JavaScript, so it is likely that I am just overlooking something.
Here's my code: https://github.com/NotTimTam/wikipedia-scraper/tree/gh-pages
The scraper is running from here: https://nottimtam.github.io/wikipedia-scraper/
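As an aside for anyone scraping Wikipedia specifically: the iframe isn't needed at all. The MediaWiki API permits cross-origin requests for anonymous use when origin=* is added to the query string, so the article HTML can be fetched directly. A minimal sketch (the action=parse response shape follows the MediaWiki documentation; the article title is just an example):

```javascript
// Build a MediaWiki API URL. The origin=* parameter opts in to CORS for
// anonymous requests, so the browser allows the cross-origin fetch.
function buildApiUrl(title) {
    return 'https://en.wikipedia.org/w/api.php' +
        '?action=parse&format=json&origin=*' +
        '&page=' + encodeURIComponent(title);
}

// Fetch the rendered article HTML (runs in any modern browser).
function fetchArticle(title) {
    return fetch(buildApiUrl(title))
        .then(function (res) { return res.json(); })
        .then(function (data) { return data.parse.text['*']; });
}
```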

Related

How to access the DOM of a Contentful page with a bookmarklet?

After a page like this is fully loaded
https://app.contentful.com/spaces/.../entries/...
(an internal customized company page for editing a page on a site), I can easily access elements and text like this from the console:
document.querySelectorAll("p[data-test-id='cf-ui-paragraph']")[8].textContent;
But using the same query in a bookmarklet fails on the very first line -- even after the page is fully loaded.
javascript:(function() {
    var path = document.querySelectorAll("p[data-test-id='cf-ui-paragraph']")[8].textContent;
    // ...
})();
Uncaught TypeError: Cannot read properties of undefined (reading 'textContent')
This seems to be a common issue, but all I've seen here is the observation that it's 'not the same DOM'.
Any wisdom on how to get this (or any query at all) working on a Contentful backend editor page?
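One hedged guess at what's going on: the devtools console runs in whichever frame is selected in its context dropdown, while a bookmarklet always runs in the top window. If the Contentful editor renders that paragraph inside a same-origin iframe, a sketch like this could locate it (the selector and index come from the question; everything else here is an assumption):

```javascript
// Recursively search a window and its same-origin child frames for the
// ninth matching paragraph. Accessing a cross-origin frame's document
// throws, so that access is guarded with try/catch.
function findParagraph(win) {
    try {
        var nodes = win.document.querySelectorAll("p[data-test-id='cf-ui-paragraph']");
        if (nodes.length > 8) return nodes[8].textContent;
    } catch (e) { /* cross-origin frame: skip it */ }
    for (var i = 0; i < win.frames.length; i++) {
        var found = findParagraph(win.frames[i]);
        if (found !== null) return found;
    }
    return null;
}

// As a bookmarklet: javascript:(function(){ alert(findParagraph(window)); })();
// with findParagraph inlined.
```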

SecurityError: Permission denied to access property "document" on cross-origin object when acces info of a iFrame

I have a html page with an iFrame on another domain.
<iframe id="myframe" src="https://www.example.com" height="1000" width="100%"></iframe>
I want to get some info from that website and use it on my HTML page. For example, I want to extract the title tag and set it as the title of my HTML.
I used this code I found on W3Schools:
var iframe = document.getElementById("myframe");
var y = (iframe.contentWindow || iframe.contentDocument);
if (y.document) y = y.document;
var strHeader = y.body.getElementsByClassName("header");
I wrote it in the IDE HelloWebFree (similar to Notepad++). When I try my code within the IDE it works fine, but when I open it with Safari, Google Chrome, or Firefox it doesn't work.
I get the error: "SecurityError: Permission denied to access property "document" on cross-origin object".
I don't need to change anything in the iframe content; I just want to read it and use it in my own HTML, so I don't see why it is a security issue. Why does it work within my IDE and not in the browser?
I have read a lot of posts about this issue, but most of them are more than 5 years old and no longer applicable. Does anyone have an idea?
The reason this happens in browsers is to protect the user's security.
It is in place to stop any random website from reading the user's personal info from their Facebook, their banking website, or whatever.
If you want to access public information from a website, then I suggest scraping it on the backend.
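A minimal sketch of that backend approach, assuming Node 18+ (which ships a global fetch); the regex-based title extraction is only an illustration, not a robust HTML parser:

```javascript
// Pull the <title> text out of raw HTML (a crude, regex-based example;
// use a real HTML parser for anything serious).
function extractTitle(html) {
    const match = html.match(/<title>([^<]*)<\/title>/i);
    return match ? match[1] : null;
}

// Download a page server-side, where the browser's same-origin policy
// does not apply, and return its title.
async function scrapeTitle(url) {
    const res = await fetch(url);
    return extractTitle(await res.text());
}
```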

print a pdf via iframe (cross domain)

I need to print a PDF... But I get an error
Is there a workaround? I just need to print a PDF file with one click
error:
Uncaught SecurityError: Blocked a frame with origin "https://secure.domain.com" from accessing a frame with origin "https://cdn.domain.com". Protocols, domains, and ports must match.
code:
var iframe = $('<iframe src="' + url + '" style="display:none"></iframe>')
    .appendTo($('#main'))
    .load(function() {
        iframe.get(0).contentWindow.print();
    });
The error you are dealing with is related to cross-domain protection and the same-origin policy.
In your case, you can print a cross-domain iframe if you nest it in another local iframe that we can call a proxy iframe.
Since the proxy iframe is local and has the same origin, you can print it without any issue, and doing so will also print the cross-domain iframe.
See below for an example:
index.html (container)
$(function() {
    var url = 'proxy.html'; // We're not loading the PDF directly but a proxy page which loads the PDF in another iframe.
    var iframe = $('<iframe src="' + url + '"></iframe>').appendTo($('#main'));
    iframe.on('load', function() {
        iframe.get(0).contentWindow.print();
    });
});
proxy.html (proxy)
<body>
<iframe src="http://ANOTHER_DOMAIN/PDF_NAME.pdf"></iframe>
</body>
With this solution, you no longer have cross-domain issues and you can use the print() function. The only things you need to deal with are a way to pass the PDF URL from the container to the proxy and a way to detect when the iframe with the PDF has actually loaded, but these depend on the solution/languages you're using.
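One hedged way to handle both open points is to carry the PDF URL in the proxy's query string and print from the inner iframe's load event. A sketch (proxy.html and the pdf parameter name are assumptions, as are the function names):

```javascript
// Container side: build the proxy URL carrying the PDF location.
function buildProxyUrl(pdfUrl) {
    return 'proxy.html?pdf=' + encodeURIComponent(pdfUrl);
}

// proxy.html side: read the parameter back out of the query string.
function getPdfParam(search) {
    return new URLSearchParams(search).get('pdf');
}

// proxy.html side: load the PDF in an inner iframe and print once it
// has actually loaded (no setTimeout guessing).
function loadAndPrint() {
    var inner = document.createElement('iframe');
    inner.src = getPdfParam(location.search);
    inner.onload = function () { window.print(); };
    document.body.appendChild(inner);
}
```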
I needed to print a PDF embedded through a data:application/pdf;base64,… iframe, and I ran into the same cross-origin issue.
The solution was to convert the Base64 contents into a Blob, and then put the blob's object URL into the iframe src. After doing that I was able to print that iframe.
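A sketch of that conversion (atob, Blob, and URL.createObjectURL are standard browser APIs; the function name is mine):

```javascript
// Convert a Base64-encoded PDF into a Blob and return an object URL.
// The object URL is same-origin, so an iframe that loads it can be printed.
function base64PdfToObjectUrl(base64Data) {
    var byteString = atob(base64Data);            // decode Base64 to a binary string
    var bytes = new Uint8Array(byteString.length);
    for (var i = 0; i < byteString.length; i++) {
        bytes[i] = byteString.charCodeAt(i);
    }
    var blob = new Blob([bytes], { type: 'application/pdf' });
    return URL.createObjectURL(blob);
}

// usage: iframe.src = base64PdfToObjectUrl(base64String);
```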
I know link-only answers are discouraged, but copying someone else's answers into my own didn't feel right either.
There is a workaround for this.
Create an endpoint in your server to return the HTML content of the external url. (because you can't get external content from the browser - same-origin policy)
Use $.get to fetch the external content from your URL and append it to an iframe.
Something similar to this:
HTML:
<div id="main">
<iframe id="my-iframe" style="display:none"></iframe>
</div>
JS:
$.get('https://secure.domain.com/get-cdndomaincom-url-content', function(response) {
    var iframe = $('#my-iframe')[0],
        iframedoc = iframe.contentDocument || iframe.contentWindow.document;
    iframedoc.body.innerHTML = response;
    iframe.contentWindow.print();
});
C# implementation for get-cdndomaincom-url-content:
Easiest way to read from a URL into a string in .NET
You do not need a proxy server for this workaround. You can create a proxy iframe and then dynamically create another iframe inside it, with onload="print()" attached.
Something like this
/**
 * Loads a cross-origin iframe via a local proxy iframe
 * and then invokes the print dialog.
 * It is not possible to call window.print() on the target iframe directly
 * because of the cross-origin policy.
 *
 * Downside: the iframe stays loaded.
 */
function printIframe(url) {
    var proxyIframe = document.createElement('iframe');
    var body = document.getElementsByTagName('body')[0];
    body.appendChild(proxyIframe);
    proxyIframe.style.width = '100%';
    proxyIframe.style.height = '100%';
    proxyIframe.style.display = 'none';
    var contentWindow = proxyIframe.contentWindow;
    contentWindow.document.open();
    // Set dimensions according to your needs.
    // You may need to calculate them dynamically after the content has loaded.
    contentWindow.document.write('<iframe src="' + url + '" onload="print();" width="1000" height="1800" frameborder="0" marginheight="0" marginwidth="0"></iframe>');
    contentWindow.document.close();
}
--Issue--
HiDeo is right, this is a cross-domain issue. It is part of CORS, which is a great thing for the security of the web but also a pain.
--Philosophy--
There are ways to work around CORS, but I believe in finding a solution that works for most to all cases and reusing it. This creates code that is easier to reason about and keeps code consistent, rather than changing and creating code for edge cases. It can make the initial solution harder, but since you can reuse the method regardless of use case, you end up saving time.
--Answer--
The way our team handles cross-domain request issues is to bypass them completely. CORS is something for browsers only, so the best way to solve all cases of this issue is not to give the responsibility to the browser. We have the server fetch the document and give it to the browser on the same domain.
(I'm an Angular guy)
The client would say something like
$http.get('/pdf/x').then(function() {
    // do whatever you want with any cross-domain document
});
The Server would have something like what you see here HTTP GET Request in Node.js Express
It is a CORS issue. There is a library that acts as a CORS alternative; you can find it here: Xdomain CORS alternative Github. It bypasses CORS requests seamlessly to render cross-domain resources effectively.
It has JavaScript, AngularJS, and jQuery wrappers too. I think this will provide you a more elegant solution, and it is simple to integrate. Give it a try.

Cross-origin issue loading images for skybox

This is driving me nuts. I am trying to get a skybox working with images I got from the internet. My code:
As global variables:
var path = "file:///C:/Users/Tyler/Desktop/ComS_336_workspace/";
var imageNames = [
path + "meadow_bk.jpg",
path + "meadow_ft.jpg",
path + "meadow_up.jpg",
path + "meadow_dn.jpg",
path + "meadow_rt.jpg",
path + "meadow_lf.jpg"
];
In the main function:
// load the six images
//THREE.ImageUtils.crossOrigin = "Anonymous";
var ourCubeMap = THREE.ImageUtils.loadTextureCube(imageNames);
But get this error:
DOMException: Failed to execute 'texImage2D' on 'WebGLRenderingContext': The cross-origin image at file:///C:/Users/Tyler/Desktop/ComS_336_workspace/meadow_ft.jpg may not be loaded.
And when I add the THREE.ImageUtils.crossOrigin = "Anonymous"; I get this error:
Image from origin 'file://' has been blocked from loading by Cross-Origin Resource Sharing policy: Invalid response. Origin 'null' is therefore not allowed access.
I saw some posts saying to run Chrome with --allow-file-access-from-files or whatever, but I don't want to risk any security issues. Then I saw others say to put the images on a webserver. I had absolutely no idea how to do that, and why must this be that complicated??
My html and js files are in the same folder (ComS_336_workspace) as my images.
I just really need help on what to do, and if I must use a webserver, how do I put my images on it and set the path to my images? That I don't get.
To solve this problem you need to either save the remote files to your site or use a proxy script on your site (PHP, for example) to fetch the images through your site.
Either way, you need to serve the page from a proper origin to get rid of CORS errors. Loading WebGL and images from file:/// will not work.
What are you using to edit your code? If you are trying to run the page from your desktop for testing purposes, I would recommend brackets.io.
It's great, free, and will get you around the CORS issue as you build your code up.

Load Wikipedia page and print locally

This is a weird one. I am attempting the following.
I have a local HTML and JavaScript file which generates a random Wikipedia page. When I get the URL for the random Wikipedia page I want to send it to the printer. However, both Chrome and Firefox seem to have a real problem with this.
In Chrome I get an error:
Unsafe JavaScript attempt to access frame with URL https://secure.wikimedia.org/wikipedia/en/w/index.php?title=Popran_National%20Park&printable=yes from frame with URL my local
file. Domains, protocols and ports must match.
gol.js:99 Uncaught TypeError: Object [object DOMWindow] has no method 'print'
In Firefox:
Permission denied to access property 'print'
[Break On This Error]
infoWindow.print();
Do you think this could be a because I am running things locally?
My code for spawning the new window is:
var printURL = "https://secure.wikimedia.org/wikipedia/en/w/index.php?";
infoWindow = window.open(printURL, 'wiki');
setTimeout(printWin, 2000);
where printWin() is:
function printWin() {
    infoWindow.print();
    infoWindow.close();
}
It's the security policy stuff that you are running into. Read this and this.
What you need to do is run the GET request for the Wiki page through a server, so the server acts as a proxy. The browser will allow this because, from its perspective, the content is from the same origin as your hosting page.
You might still get broken links. You might have to come up with a way to proxy all of those as well -- or rewrite the HTML. If you do that, you are getting into the land of copyright, and I'm not sure what's what when it comes to all that.
Are you allowed to proxy Wikipedia content through a server, thereby masking its origin? Maybe you are, as long as you don't change the content. But if you adjust the HTML to make it look like it was meant to look, are you being a bad boy or a good boy? I have no idea whatsoever.
I think I answered your technical question, though.
