How can I load multiple sites in different context? - javascript

I want to create a page that "silently" queries other pages. It then crawls them for a result. When it's done querying all the pages, then it should calculate its own result according to the results retrieved.
What I mean by "silent" is:
the other website's code shall not appear on my page, nor affect it in any way
I want each page to be queried in a separate session (like when I open a new tab for each in my browser) or something similar, so that there will be no namespace problems.
I heard Chrome offers something that might be helpful for that?
edit This is NOT about crawling web pages. It is for fetching data from other local stand-alone projects
edit2 I am just looking for an alternative to simply looping over URLs and querying them, because there are namespace issues

You can't access pages from other domains with JavaScript unless the domain explicitly allows your domain.
Typically you would use a server-side language for this, or better, use the website's public API. If they don't have a public API, they probably won't appreciate you crawling their site.
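For the local stand-alone projects mentioned in the edits, one way to avoid namespace clashes is to request each page as plain text, so that none of its scripts ever run in your page. The following is only a minimal sketch, assuming each project permits the request (same origin, or suitable CORS headers). The names queryAll, summarize and fetchText are hypothetical, not part of any library.

```javascript
// Hypothetical helper: request every URL and collect the outcomes.
// fetchText is injected so the caller decides how a URL becomes text.
async function queryAll(urls, fetchText) {
  const settled = await Promise.allSettled(urls.map((url) => fetchText(url)));
  return settled.map((outcome, i) => ({
    url: urls[i],
    ok: outcome.status === "fulfilled",
    value: outcome.status === "fulfilled" ? outcome.value : String(outcome.reason),
  }));
}

// Pure step: derive the page's own result once every query has finished.
function summarize(results) {
  const succeeded = results.filter((r) => r.ok).length;
  return { succeeded, failed: results.length - succeeded };
}

// Example wiring in a browser (the URL is a placeholder):
// queryAll(["http://localhost:3001/status"], (u) => fetch(u).then((r) => r.text()))
//   .then((results) => console.log(summarize(results)));
```

Because the responses are handled as strings, nothing from the other projects is evaluated, so their globals cannot collide with yours.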

Related

how to limit access to my iframe widget using CSP cookies and http referer

I am developing a web application (like a widget) that my potential clients will use on their websites for the benefit of their users. I was thinking about the best way to deliver the application to them and at the same time be able to control who is using my widget so that I can bill them correctly.
I checked a few previous posts like iframe for a widget and iframe best practices limitations and JS to load iframe but they are 7-10yr old and not exactly what I'm trying to do.
That being said, so far ... the best way to deliver seems to be a combination of:
iframe
Content-Security-Policy frame-ancestors HTTP header
cookies + $http_referer checks on the server side to avoid sneaky users
On load I'm going to send a secret key in the URL to deliver a customized/branded version, and I'm planning to rely on cookies for subsequent calls
I have a few questions here:
Should I use an iframe tag with specific URL directly, like
<iframe src="https://superwidget.com/SecretKey=12345678"></iframe>
or should I use JavaScript to create and load the iframe element using the same URL? Is there any benefit to one over the other, apart from being able to defer loading the iframe in the JS version?
So I'm planning to use an iframe / CSP / HTTP referer / cookie combo ... Is there any other (better) way to deliver a widget and make sure only the allowed audience is using it?
Anything else I'm missing here
Any help appreciated!
My recommendation would be to use javascript.
That way, in your javascript, you can validate if the DOMAIN NAME for the page that the javascript is called from is authorized for that client's token.
If it is, load the IFrame with the custom content.
This will also allow you to have greater control over user experience.
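A rough sketch of the JavaScript loader idea, assuming the widget lives at https://superwidget.com/ as in the question and takes the key as a query parameter named key. buildWidgetUrl and mountWidget are hypothetical names. Any check done in this script runs in the client's page and can be edited there, so the server should still validate the key on its side.

```javascript
// Build the iframe URL for a given client key (the parameter name is an assumption).
function buildWidgetUrl(key, base = "https://superwidget.com/") {
  const url = new URL(base);
  url.searchParams.set("key", key);
  return url.toString();
}

// Create the iframe and attach it to a container element.
// doc is passed in so the function can be exercised outside a browser.
function mountWidget(doc, container, key) {
  const frame = doc.createElement("iframe");
  frame.src = buildWidgetUrl(key);
  frame.loading = "lazy"; // defer loading until the frame is near the viewport
  container.appendChild(frame);
  return frame;
}

// In a page: mountWidget(document, document.getElementById("widget"), "12345678");
```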
If I were you I would use a simple iframe. The page should be retrieved with a key (e.g. ?key=some-special-key-in-base-16-58-or-64).
Your backend should then verify that the value of the Referer header (e.g. Referer: not-your-site.com) is whitelisted for that specific API key.
If, instead, you need to use a JS widget, you could pass the key as a parameter when requesting the JS file and let the backend run the same check against the Referer header of that request. (The Host header names the server being requested, which is your own, so it cannot identify the embedding site.)
You could serve a custom widget that asks them to pay for or renew the service if the key or the referrer is not valid. Some people visiting the site might not like this idea, so think carefully before implementing it. If you are not at the top of the team's pyramid, let someone with more responsibility choose.
The advantage of an iframe widget over a JS one is that it is sandboxed and therefore cannot be accessed by the parent site. Note that this might be a disadvantage if you want to let your customers modify the widget with their own JS.
Please note that the CSP always has to be set correctly if you want all of this to work.
Last tip: Using the hosts file to fake two sites on the same machine won't work, on Windows 10 at least, so you'll have to use two different machines.
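The key-plus-Referer check described in this answer could be sketched on the server roughly as follows. The allowlist contents and the names ALLOWED_HOSTS, isEmbedAllowed and frameAncestorsHeader are assumptions for illustration. The Referer header can be missing or stripped, so treat this as one signal alongside the frame-ancestors policy rather than a guarantee.

```javascript
// Hypothetical allowlist: API key -> hostnames allowed to embed the widget.
const ALLOWED_HOSTS = {
  "12345678": ["client-a.example", "www.client-a.example"],
};

// Extract the hostname from a Referer value; returns null if absent or malformed.
function refererHost(referer) {
  try {
    return new URL(referer).hostname;
  } catch {
    return null;
  }
}

// True only when the key is known and the referring host is listed for it.
function isEmbedAllowed(key, referer, allowed = ALLOWED_HOSTS) {
  const hosts = allowed[key];
  const host = refererHost(referer);
  return Boolean(hosts && host && hosts.includes(host));
}

// Value for the Content-Security-Policy response header of the widget page.
function frameAncestorsHeader(key, allowed = ALLOWED_HOSTS) {
  const hosts = allowed[key] || [];
  const sources = hosts.length ? hosts.map((h) => `https://${h}`).join(" ") : "'none'";
  return `frame-ancestors ${sources}`;
}
```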

How to stop users to manipulate the popup and at the same time let googlebot crawl my page

I have a very confusing problem.
I have a page that only paid users are allowed to view. If the user is not valid, I show a popup with a grey background to block the page. However, there is a potential flaw: a clever user can bypass the popup using Inspect Element. Another solution that comes to mind is to redirect the user to another page instead of showing a popup, like:
window.location = "http://www.example.com";
However, there is a potential problem with this, or maybe I am wrong:
I think that with this approach Googlebot won't be able to crawl the page, since a redirect happens, whereas with the first approach Google will definitely be able to crawl it.
Now my question is: if I use the first approach, is there any way to stop users from manipulating the popup, or any way to distinguish whether a user or Google is browsing the page?
Also, if I use the second approach, will Googlebot be able to crawl the page?
You can't implement a paywall, or any kind of truly secure blocking, on the frontend. I would suggest preventing access to that page on the backend.
There's no clean, 100% reliable way to do this on the frontend. The user can always bypass it.
As for Google, it will be able to crawl the page, since the content is still accessible in the rendered HTML; it does not care how the page is shown. It gets access to the content anyway, just like you would by fetching the HTML via a GET request without a browser.
You could indeed just redirect, but still do it on the backend, not the frontend.
Your current solution does not make the page private: as you rightly point out, anyone can manipulate the page using the dev tools, and crawlers can read the whole source anyway. Using server-side scripts to block access, and/or vary the content based on an authorisation token, is the only way to secure it properly and ensure that only your legitimate paying users get privileged access.
You state a concern about the inability for Google (and other search engines, I assume) to crawl the page if you employ better security. But your logic is flawed: If you make it so that a google bot can still crawl the page, then by definition it must be readable without authorisation. Anyone could view it in the google cache, and parts of its content could show up in google searches. This means it isn't private. Once that's the case, then what are your users paying for, exactly?
What you might realistically want to do is have a cut-down version of the page that is displayed when the user is not authorised, containing enough information for search engines to get an idea of the overall content, and for visitors to be tempted into paying for the rest. Then if the user logs in, the server recognises that and displays the rest of the content as well when the page refreshes. That appears to be roughly what paid-content news sites do, for instance.
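The cut-down-versus-full idea can be sketched as a server-side decision. This is only an illustration: selectArticleBody, PAYWALL_NOTICE and the hasActiveSubscription flag are hypothetical, and how the user is authenticated is left to your backend.

```javascript
// Hypothetical notice appended to the teaser for visitors who have not paid.
const PAYWALL_NOTICE = "Subscribe to read the full article.";

// Decide on the server which body to send. The full text never leaves the
// server unless the request belongs to a paying user.
function selectArticleBody(article, user) {
  const isPaying = Boolean(user && user.hasActiveSubscription);
  if (isPaying) {
    return article.full;
  }
  return `${article.teaser}\n${PAYWALL_NOTICE}`;
}
```

Crawlers and anonymous visitors both receive only the teaser, so there is nothing hidden in the page for dev tools to reveal.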

can local storage store the whole page

I have seen a lot of people using local storage to store certain parts of a web page, but not an entire web page. Is it possible? If so, how? If not, is there a way to store an entire web page's data so the user can come back to it how they left it?
This can be done if you use JavaScript to save document.body.innerHTML into web storage and load it back from storage the next time the page is loaded. If the web page is not in web storage, you could redirect the user to the web page.
But this depends on the design of your web page and on whether there is a session index etc. in the body of the web page.
You should also think of some way to handle versions. You don't want your users to see only the cached version of your web page; it should be updated once you update the page.
Web storage is limited to roughly 5 MB per origin in most browsers, so you can't save very much, especially not pictures.
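A minimal sketch of the save/restore idea with the version handling suggested above. The key name savedPage and the functions savePage and loadPage are hypothetical. The storage argument is any object with getItem and setItem, such as window.localStorage.

```javascript
const PAGE_KEY = "savedPage"; // hypothetical storage key

// Store the markup together with a version tag. Returns false if the
// browser refuses the write, for example when the quota is exceeded.
function savePage(storage, html, version) {
  try {
    storage.setItem(PAGE_KEY, JSON.stringify({ version, html }));
    return true;
  } catch (err) {
    return false;
  }
}

// Return the saved markup only if it was stored under the current version;
// otherwise return null so the caller falls back to the fresh page.
function loadPage(storage, currentVersion) {
  const raw = storage.getItem(PAGE_KEY);
  if (raw === null) return null;
  try {
    const saved = JSON.parse(raw);
    return saved.version === currentVersion ? saved.html : null;
  } catch {
    return null;
  }
}

// In a page:
// savePage(localStorage, document.body.innerHTML, "v2");
// const html = loadPage(localStorage, "v2");
// if (html !== null) document.body.innerHTML = html;
```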
Since localStorage allows you to store about 5 MB, you can store a full web page there and then write it back with document.write().
The following code does it:
Storing it:
var HTML = ""; // HTML of the page goes here
localStorage.setItem("content", HTML);
Retrieving it:
document.write(localStorage.getItem("content"));
Although this is possible, it is common practice to save only settings and load them into the right elements rather than the entire web page.
This is not really answering your question, but, if you are only curious how this can be done and don't need to have wide browser support, I suggest you look into Service Workers, as making websites offline is something that they solve very well.
One of their many capabilities is that they can act as a proxy for any request your website makes, and respond with locally saved data, instead of going to the server.
This allows you to write your application code exactly the same way as you would normally, with the exception of initializing the ServiceWorker (this is done only once)
https://developers.google.com/web/fundamentals/getting-started/primers/service-workers
https://jakearchibald.github.io/isserviceworkerready/
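To illustrate the proxy behaviour described in this answer, here is a rough cache-first sketch for a service worker script. The helper name cacheFirst and the cache name pages-v1 are assumptions, and a real worker would also need install and activate handling plus a strategy for refreshing stale entries.

```javascript
// Answer from the cache when possible; otherwise go to the network and
// keep a copy of successful responses for next time.
async function cacheFirst(request, cache, fetchFn) {
  const cached = await cache.match(request);
  if (cached) return cached;
  const response = await fetchFn(request);
  if (response.ok) {
    await cache.put(request, response.clone());
  }
  return response;
}

// Only wire up the listener when running inside a service worker.
if (typeof ServiceWorkerGlobalScope !== "undefined") {
  self.addEventListener("fetch", (event) => {
    if (event.request.method !== "GET") return; // only GET requests are cached here
    event.respondWith(
      caches.open("pages-v1").then((cache) => cacheFirst(event.request, cache, fetch))
    );
  });
}

// The page registers the worker once (the file name is a placeholder):
// navigator.serviceWorker.register("/sw.js");
```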
Local storage is actually just an endpoint: it has an IP address and can be accessed from the web.
First of all, you need to make sure that your DNS service points to your index page.
For example, if your local storage's IP is 10.10.10.10 and the files on that local storage are organized like:
contants:
  pages:
    index.html
    page2.html
  images:
    welcome.png
So you can point your DNS like:
10.10.10.10/index -> /contants/pages/index.html
In most web frameworks (a web framework is a library that provides built-in tools that let you build your website with more functionality and more easily) there is a built-in module called 'route' that provides more functionality like this.
That way, from your index.html file you can import the entire website, for example:
and in your routes you define for example:
For all the files with the .html extension, route to -> 10.10.10.10/contants/pages/
For all the files with the .png/.jpg extension, route to -> 10.10.10.10/contants/images/
Local storage is usually for storing key-value pairs; storing a whole page in it is a ridiculous idea. Try instead an Ajax call that returns a partial view, and use that to manipulate the DOM.

Why is it impossible to gather parent.window.location from iframe when different sites are involved on Chrome?

Let's suppose that I am developing a website which will be used as a widget inside iframes of the users. Let's suppose that I intend to analyze things done by users on my little widget and I want to have some aggregated data by hosts (for instance, how many users visited from website1 and how many users visited from website2).
To achieve this, I would prefer to be able to read parent.window.location. Unfortunately, I am not able to do so because of Chrome's restrictions. Currently, to work around the issue, I need some unintuitive hacks involving postMessage: the iframe asks the parent to send its host and waits for the parent's response before it can do this analysis. However, this is obviously much inferior to what I intended, because:
the outer page will need to execute the scripts specified which makes the usage of the iframe more complicated and confusing
nobody forces the outer page to execute the scripts designed for it and there is absolutely no technical guarantee that the received data will be accurate
As a result, I wonder why parent.window.location is unreachable in chrome from the page inside the iframe?
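For the postMessage workaround described above, the receiving side inside the iframe might look roughly like this. The name parentHostFromMessage is hypothetical. The sketch relies on event.origin, which the browser fills in itself, so it does not depend on whatever host string the parent chooses to put in the message body.

```javascript
// Return the host of the embedding page for a message that really came
// from the parent window, or null for anything else.
function parentHostFromMessage(event, parentWindow) {
  if (event.source !== parentWindow) return null; // ignore other frames
  try {
    return new URL(event.origin).host;
  } catch {
    return null; // opaque origins are reported as the string "null"
  }
}

// Inside the iframe:
// window.addEventListener("message", (event) => {
//   const host = parentHostFromMessage(event, window.parent);
//   if (host !== null) { /* record the host for aggregation */ }
// });
```

This still requires the parent page to send a message, so the first drawback listed above remains.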

Google Analytics: how to use custom dimension on different website to identify intranet users

SITUATION
I have a main public Liferay website, that is therefore accessible both by intranet and not-intranet (i.e. public) users.
I also have a Liferay intranet website, which is accessible only to intranet users because it is protected via a login page.
The login page to the intranet website is public.
After you successfully login, the intranet website is loaded.
EXPECTED:
In my Google Analytics account for the main website, I want to differentiate intranet users from public users (e.g. in order to understand how the 2 categories behave).
Questions
Can I use a custom dimension to solve this problem, or is there a better way?
Custom dimension data has to be sent via hits (UPDATE: by "hits" I meant either pageview or event hits, I am not referring to the dimension scope, cfr. https://developers.google.com/analytics/devguides/collection/analyticsjs/custom-dims-mets), therefore I should:
load the Google Analytics tracking code of the main website on the intranet website (the site displayed after successfully logging in)
send a pageview hit from this Intranet website to the main website together with a custom dimension, e.g.
ga('send', 'pageview', {
'dimension1': 'I am a intranet user'
});
Is this correct?
Does the above mentioned solution have any impact on my Analytics data in the main website (e.g. more pageviews due to the tracking code added to the intranet website, or strange behaviours in counting user sessions, etc.)?
Thanks a lot.
UPDATE:
Actually, the solutions proposed below would not work because the 2 websites (intranet and not-intranet) are considered different domains.
So, even if I had the following domains
intranet website: http://intranet.mycompany.com
company website: http://www.mycompany.com
and I sent data to the same UA account (i.e. the company website UA account), they would be counted as different visits.
Quoting Google (see https://developers.google.com/analytics/devguides/collection/gajs/gaTrackingSite#profilesKey)
If a user independently visits two sites that are tracking in the same
view (profile), such as through a bookmark, these visits will still be
counted under separate sessions. In this scenario, the linking methods
are not invoked, and thus there is no way to determine the initiating
session for a given user.
So, how could I solve my problem?
Would it be possible to solve it by implementing cross-domain tracking (https://support.google.com/analytics/answer/1034342?hl=en), and how?
Thanks a lot.
Can I use a custom dimension to solve this problem, or is there a better way?
Yes, custom dimension is perfect for this.
Custom dimension data has to be sent via hits
The User-level scope is more appropriate than the hit-level one for what you want to achieve. The linked document explains in detail why, and gives an example similar to your use case.
Does the above mentioned solution have any impact on my Analytics data in the main website
Yes, impact is mainly that you will have extra data corresponding to the visits to the intranet.
A custom dimension works well for your purpose. You will get additional hits for visits on your intranet site, but you can segment them out via the custom dimension to separate between inter/intranet.
Since the intranet requires a login there is one other way you could try, which would have the additional benefit of allowing for cross-device tracking (if that is beneficial to you).
Google calls this "userID", despite the fact that it must not be used to identify individual users. On login you pass in a unique value per user that is set by your backend system (UUID format is suggested, but any unique string would work). Since it is not assigned by the tracking code but set by your system, it will be the same ID on every device. It is used to de-duplicate users, i.e. persons that log in from multiple devices will be recognized as single users (also useful if people delete their cookies: the userID can be used to aggregate sessions into unique visitors).
To make this work you need to set up a special view that contains only data from visits where the userId is set (so you would have a view for your public site and a view only for your logged-in users). You get a few special reports, for example one to tell you how many users log in from different device categories.
What the userID should not do, and in fact must not do according to Google's terms of service, is to identify individuals. The userId is not exposed in the interface, and you must not store it as a custom dimension. If you store it on the client side in a cookie, you must unset it once the user logs out. It is merely there to allow continuous tracking of users independently from cookies (plus you need to amend your privacy policy if you want to use this).
Of course you can combine both approaches to get even more insights.
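Combining both approaches with analytics.js could look roughly like the sketch below. The wrapper tagIntranetUser and the value 'intranet' are assumptions. dimension1 is the index used in the question and must match the custom dimension configured in your property. The ga function is passed in so the wrapper can be exercised without the tracking snippet.

```javascript
// Mark the visitor as an intranet user and, if a backend-generated ID is
// available, attach it before sending the pageview.
function tagIntranetUser(ga, userId) {
  ga("set", "dimension1", "intranet");
  if (userId) {
    ga("set", "userId", userId); // must not be personally identifying
  }
  ga("send", "pageview");
}

// On the intranet page, after the analytics.js snippet has defined ga:
// tagIntranetUser(ga, backendGeneratedId);
```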
