Does the Facebook crawler currently interpret JavaScript before parsing the DOM? - javascript

The following link seems to say that it can't: How does Facebook Sharer select Images and other metadata when sharing my URL?
But I wanted to know whether that is still the case today...
(The documentation on the Facebook developer site doesn't say anything specific about this point.)

In the tests I've run I've never seen it interpret the JS, but that might be contextual / domain-specific (who knows).
To test your specific case, use the Facebook linter: https://developers.facebook.com/tools/debug
(log into FB first)
That's the only way to be 100% sure how FB will parse your page (i.e. what properties it will infer).

Yes, that is still the case (and I wouldn’t expect it to change anytime soon).
The Open Graph meta information must be provided by the server, so that it can be read from the HTML code when the URL is fetched.
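For illustration only, here is a minimal sketch of what that looks like with a Node/Express server (Express, the route and the values are assumptions chosen for the example, not anything Facebook-specific):

// Sketch: the Open Graph tags are written into the initial HTML response,
// so the crawler can read them without executing any JavaScript.
const express = require('express');
const app = express();

app.get('/article/:id', (req, res) => {
  // In a real app these values would come from your database.
  const title = 'Example title';
  const image = 'https://example.com/preview.jpg';
  res.send('<!DOCTYPE html><html><head>' +
    '<meta property="og:title" content="' + title + '" />' +
    '<meta property="og:image" content="' + image + '" />' +
    '</head><body>...</body></html>');
});

app.listen(3000);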

Related

copying html from another website in javascript

How can I copy the source code from a website (with JavaScript)? I want to copy the text that shows the temperature on this website: http://www.accuweather.com/
I want to copy only the number that displays the temperature. Is there a way of copying that exact line from the source code of the website? I've heard about HTML scraping. If not JavaScript, what would be the simplest way of doing it? Just copying the temperature and displaying it on my webpage.
Well, a simple way to do something like that is to load the site into a hidden HTML element via AJAX and then search the DOM for the element you want.
There is also a jQuery command that does this directly. It would be something like:
<div id='temp'></div>
<script>
// Load just the fragment selected after the space in the URL string.
// (jQuery's .load() has no "limit" option, and passing a data object
// would turn the request into a POST, so no second argument is used.)
$('div#temp').load('https://www.accuweather.com/ #popular-locations-ul .large-temp');
</script>
#popular-locations-ul .large-temp is a CSS selector for the specific elements that contain the temperature.
However, the web has for some time had a security feature called CORS (cross-origin resource sharing). To be able to load something from another site via AJAX, the target site has to explicitly send CORS headers allowing it. This particular site doesn't send such headers, so any attempt to load its content via AJAX from another origin will be blocked by the browser.
You can only use a command like the one above on a site you control, where you can configure the CORS headers yourself, or on a site that already sends them.
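As an illustration of what "allowing CORS" means on a site you control, here is a minimal sketch assuming an Express server (the route and values are made up):

// Sketch: the server explicitly allows other origins to read this response.
// Without such a header the browser blocks cross-origin AJAX reads.
const express = require('express');
const app = express();

app.get('/temperature', (req, res) => {
  // '*' allows any origin; a real site would usually list specific trusted origins.
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.json({ temperature: 21 });
});

app.listen(3000);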
But as people have told you, that's not a good approach to begin with, due to websites' impermanent nature. Things change a lot. So even if you could get a value this way from some other site, sooner or later the site would change and your code would break.
The reason I answered is that you are just learning and need guidance, not trying to do 'serious work'. Serious work would mean using an API, as people told you.
A web API is a special URL you access (something like https://www.accuweather.com:1234/api/temperature/somecity), normally with some kind of authentication, and it responds with the result you need for the function you want. For this kind of service CORS is allowed, because you are accessing it in a secure and 'official' way.
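As a rough sketch of what calling such an API from the browser could look like (the URL and response shape below are invented placeholders, not AccuWeather's real API):

// Hypothetical endpoint and JSON shape, shown only to illustrate the idea.
fetch('https://api.example-weather.com/v1/temperature?city=somecity&apikey=YOUR_KEY')
  .then(function (response) { return response.json(); })
  .then(function (data) {
    // Display just the number on your own page.
    document.getElementById('temp').textContent = data.temperature;
  })
  .catch(function (err) { console.error('Failed to load temperature', err); });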
Hope I clarified a bit.

Why can Chrome execute javascript on other pages but I can't?

Apologies if this is a roundabout way of asking this question, but I am a little confused about how the web and javascript work.
What I want to do: execute JavaScript on all pages of a list of URLs I have found. (Specifically, use jQuery to pull info from them.)
Problem: I can't execute JavaScript on these pages because they aren't mine and don't have the Access-Control-Allow-Origin header. So I can't load them (with AJAX) in order to use jQuery on them.
BUT Google Chrome can both load pages and execute JavaScript on them (with its developer console). So if I wanted to, I could go to each page, open the developer console, and pull the information from there. If there's nothing stopping Chrome from accessing these, then why am I stopped? And, is there a way around this?
Thank you, and I hope my description makes sense. I've been researching this for a while but have found nothing that explains how seemingly inconsistent CORS is.
I could go to each page, open the developers console, and pull the information from there. If there's nothing stopping Chrome from accessing these, then why am I stopped?
You're not stopped. You, the human at the keyboard, can do exactly as you say, by visiting each page as a top-level page.
What is stopped -- happily -- is any and all scripts on the Web you happen to run having the same level of visibility that you do. Based on your cookies and your network topology, you have a unique view into the Web. You can see your home router's control interface (on 192.168.1.1 or similar). You can see any local web server you're running on 127.0.0.1. No one else can see these. If the same-origin policy were not in place, then any script that you loaded on the Web could inspect these.
And, is there a way around this?
If you have some scripts that you trust absolutely (hopefully a significant subset of "all scripts that exist on the Web") that you want to be able to bypass the same-origin policy and see your full, cross-domain view of the Web, you could load them as an extension, which can act with elevated permissions beyond the abilities of normal web pages. (See How does Same Origin Policy apply to browser extensions?)
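As a rough sketch of that route (assuming a manifest v2 extension; the URL pattern and page are placeholders), host permissions let the extension's own scripts make the cross-origin requests a normal page can't:

// background.js (sketch). The manifest must grant host permissions, e.g.
// "permissions": ["https://example.com/*"] in a v2 manifest.json.
// With that permission the extension can fetch cross-origin pages.
fetch('https://example.com/some/page.html')
  .then(function (response) { return response.text(); })
  .then(function (html) {
    // Parse the fetched HTML and pull out whatever you need.
    var doc = new DOMParser().parseFromString(html, 'text/html');
    console.log(doc.querySelector('title').textContent);
  });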
I'm going to assume that you are looking to grab data from these pages that aren't yours and store it somewhere. I have done this before with curl using PHP. If you are looking to display these sites for users to interact with in a different way, but starting from a page that is yours, you may be able to render these pages by grabbing the source HTML with curl and serving it as a sort of proxy.
I've used this tutorial for something similar: https://www.youtube.com/watch?v=_kQN-3aNCeI . Hopefully this gives you a start. I think you should be a little more detailed in your question, though, to get more help.
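In the same spirit, here is a minimal sketch of that proxy idea using Node instead of PHP/curl (the target URL is a placeholder); your server fetches the remote HTML so the browser only ever talks to your own origin:

// Sketch of a trivial proxy: the server fetches the page and relays the HTML.
const express = require('express');
const https = require('https');
const app = express();

app.get('/proxy', (req, res) => {
  https.get('https://example.com/', (remote) => {
    let body = '';
    remote.on('data', (chunk) => { body += chunk; });
    remote.on('end', () => res.send(body));
  }).on('error', () => res.status(502).send('Upstream fetch failed'));
});

app.listen(3000);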

how to disable view source option in firefox and chrome?

I have created a webpage, but my friends or colleagues always copy the source code and all the data easily, so is there any way to hide the page source option from the browser?
As a rule, if you are putting information on another user's computer (whether because you made a document or they viewed your webpage), you really can't control what they do with it.
This is an issue that larger companies deal with often. Have you heard of DRM? It's a mechanism that companies like to try to use to control how people can connect to their services, use their content and in general, try to exert control over their data while it's on your system.
Now, a web page is a relatively simple container for holding information. You expressed an urge to prevent your friends from copying the source code. You could try to encrypt it, but if it's using local data to decrypt itself, there still isn't going to be anything that stops them from just copying what's in the View Source window and running it again (even if they can't really read it).
I'd suggest that you don't worry about it. If what you have on your page is so important that others shouldn't be able to see it, don't put it on a webpage.
Finally, Google doesn't much care that you're able to view the source to their home page. Why not? Because the value of the search engine isn't in what the home page looks like, but in the data on the back-end that you don't have direct access to. The value is in the algorithms that execute on the server when you hit that Google Search button that queries that data and returns the information you're looking for. There's very little relative value in the generated HTML that you see in the page. Take a leaf from their book and don't stress that they copy your HTML.
No, there isn't any way to do it. However, you can disable right-clicking in the browser via JavaScript, but they can still use keyboard shortcuts to open the developer view (F12 in Chrome) and see the source. You cannot hide HTML or JavaScript from the client, but maybe you can make it harder to read.
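For reference, the right-click trick amounts to something like this (a sketch; as said, it only hides the context menu and does nothing to hide the source):

// Blocks the context menu only; F12, Ctrl+U, "Save page" etc. still work.
document.addEventListener('contextmenu', function (event) {
  event.preventDefault();
});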
No. Your HTML output is in the user's realm. Even if there were a way to disable view source in one client, a user could use a different one.
Always assume that your site's HTML is fully available to end users.
Yes and no. You can definitely make HTML and JS harder to interpret by obfuscating your code - that is, taking your code and making it look confusing. Here is a tool that can do that: http://www.colddata.com/developers/online_tools/obfuscator.shtml
However, these things all use code, and code can be deciphered through any number of methods. If you post a song to the internet, even if they cannot find the mp3, they can simply record their speakers. If you upload an image and prevent users from downloading it, they can take a screenshot or use their camera. In order for HTML and JavaScript to work, it has to be interpreted by their computer, and even if you do find a way to disable "View Source" there are other ways, like a DOM inspector (F12 in IE/Chrome, Ctrl+Shift+K in Firefox).
As a workaround, use copyright, warn your users they will be punished if they copy your code, and put watermarks, labels and logos over any mp3s or images you don't want stolen. In the end, disabling right clicking (which is also possible, see How do I disable right click on my web page? ) or disabling selection (also possible) does nothing, because there is more than one way to get your code, like searching through temporary internet files.
However, you ask "what if I want a site where my users can log in and I need security? How can I make it so nobody can see my code then? Doesn't it have to be secure and not out in the open?"
And the answer is, yes, it needs to be secure. That's what server-side languages, like PHP, are for. PHP does all the work on the server itself so the user cannot see it. Rather than running in real time in the browser, the work is done by the website before the page is sent, so the code is never put onto the user's computer; the user only receives the output. SSL is often paired with this so that the data exchanged with the server cannot be read or tampered with in transit.
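Here is a small sketch of that idea, shown with Node/Express rather than PHP purely for illustration (the route and the "secret" rule are invented): the sensitive logic runs on the server, and the browser only ever receives the result:

// The "secret" logic lives only on the server and is never sent to the browser.
const express = require('express');
const app = express();

function secretPricingRule(quantity) {
  // Imaginary business rule the user should not be able to read.
  return quantity > 100 ? quantity * 0.9 : quantity;
}

app.get('/price', (req, res) => {
  const qty = parseInt(req.query.qty, 10) || 0;
  // Only the computed result goes over the wire.
  res.json({ price: secretPricingRule(qty) });
});

app.listen(3000);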
But HTML and Javascript have to be done in real time on the user's computer, so you cannot disable View Source because it is useless. There are many, many ways that users could get around it, even if View Source is disabled, and even if right clicking is disabled.
If your code doesn't need to be secure, however, I'd recommend you consider keeping it open source. :)

How to check the authenticity of a Chrome extension?

The Context:
You have a web server which has to provide an exclusive content only if your client has your specific Chrome extension installed.
You have two possibilities to provide the Chrome extension package:
From the Chrome Web Store
From your own server
The problem:
There is a plethora of solutions allowing to know that a Chrome extension is installed:
Inserting an element when a web page is loaded by using Content Scripts.
Sending specific headers to the server by using Web Requests.
Etc.
But there seems to be no solution to check if the Chrome extension which is interacting with your web page is genuine.
Indeed, as the source code of the Chrome extension can be viewed and copied by anyone who wants to, there seems to be no way to know whether the Chrome extension currently interacting with your web page is the one you published or a cloned (and maybe somewhat altered) version made by another person.
It seems that you are only able to know that some Chrome extension is interacting with your web page in an "expected way" but you cannot verify its authenticity.
The solution?
One solution may consist in using information contained in the Chrome extension package which cannot be altered or copied by anyone else:
Sending the Chrome extension's ID to the server? But how?
The ID has to be sent by your JavaScript code, and there seems to be no way to do it with an "internal" Chrome function.
So if someone else just sends the same ID to your server (some kind of Chrome extension ID spoofing), then your server will consider his Chrome extension a genuine one!
Using the private key which was used when you packaged the application? But how?
There seems to be no way to access or use this key programmatically!
One other solution may consist in using NPAPI plugins and embedding authentication methods like GPG, etc. But this solution is not desirable, mostly because of the big "Warning" section of its API docs.
Is there any other solution?
Notes
This question attempts to raise a real security problem in the Chrome extension API: how to check the authenticity of your Chrome extension when it interacts with your services.
If there are any missing possibilities, or any misunderstandings please feel free to ask me in comments.
I'm sorry to say, but this problem as posed by you is in essence unsolvable for one simple reason: you can't trust the client. And since the client can see the code, you can't solve the problem.
Any information coming from the client side can be replicated by other means. It is essentially the same problem as trying to prove that when a user logs into their account it is actually the user not somebody else who found out or was given their username and password.
The internet security models are built around 2 parties trying to communicate without a third party being able to imitate one, modify or listen the conversation. Without hiding the source code of the extension the client becomes indistinguishable from the third party (A file among copies - no way to determine which is which).
If the source code is hidden it becomes a whole other story. Now the user or malicious party doesn't have access to the secrets the real client knows and all the regular security models apply. However it is doubtful that Chrome will allow hidden source code in extensions, because it would produce other security issues.
Some source code can be hidden using NPAPI Plugins as you stated, but it comes with a price as you already know.
Coming back to the current state of things:
Now it becomes a question of what is meant by interaction.
If interaction means that, while the user is on the page, you want to know whether it is your extension or some other one, then the closest you can get is to list your page in the extension's manifest under the app section, as documented here.
This will allow you to ask on the page if the app is installed by using
chrome.app.isInstalled
This will return a boolean showing whether your app is installed or not. The property is documented here.
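For example, on a page declared in the extension's manifest you could check something like this (a sketch; as noted below, it runs on the client and can be faked or overridden):

// Purely client-side check; easy to spoof, so treat it as a hint only.
if (window.chrome && chrome.app && chrome.app.isInstalled) {
  console.log('The app appears to be installed');
} else {
  console.log('The app is not installed (or the check was tampered with)');
}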
However, this does not really solve the problem, since the extension may be installed but not enabled, while another extension mocks the communication with your site.
Furthermore, the validation is on the client side, so any function that uses it can be overwritten to ignore the result of this check.
If, however, the interaction means making XMLHttpRequests, then you are out of luck. It can't be done using current methods, because of the visibility of the source code, as discussed above.
However, if the goal is limiting your site's usability to authorized entities, I suggest using regular means of authentication: having the user log in will allow you to create a session. This session will be propagated to all requests made by the extension, so you are down to the regular client log-in trust issues, like account sharing etc. These can of course be managed by making the user log in, say, via their Google account, which most are reluctant to share, and further mitigated by blocking accounts that seem to be misused.
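A rough sketch of that server-side gate, assuming an Express server with session middleware (e.g. express-session) already configured and req.session.userId set at login; the exclusive content is returned only when the request carries a valid logged-in session, regardless of which extension made it:

// Sketch: assumes session middleware is already wired up elsewhere.
const express = require('express');
const app = express();

app.get('/exclusive-content', (req, res) => {
  if (!req.session || !req.session.userId) {
    return res.status(401).send('Please log in first');
  }
  // The extension's requests carry the same session cookie as the browser tab,
  // so a normal login is enough to gate the content.
  res.json({ content: 'members only' });
});

app.listen(3000);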
I would suggest doing something similar to what Git utilises (have a look at http://git-scm.com/book/en/Git-Internals-Git-Objects to understand how Git implements it), i.e. create a SHA1 value for the content of every file in your Chrome extension and then create another SHA1 value from the concatenation of the SHA1 values obtained earlier.
In this way, you can share the SHA1 value with your server and authenticate your extension, as the SHA1 value will change if anyone changes any of your files.
Explaining it in more detail with some pseudo-code:
// Client side (pseudo-code): get_all_files_in_extension, get_file_content and
// sha1_hex are hypothetical helpers used only to illustrate the idea.
function get_authentication_key() {
    var files = get_all_files_in_extension(),
        concatenated_sha_values = '',
        authentication_key;
    for (var i = 0; i < files.length; i++) {
        // Hash each file's content and concatenate the hex digests.
        concatenated_sha_values += sha1_hex(get_file_content(files[i]));
    }
    $.ajax({
        url: 'http://example.com/getauthkey',
        type: 'post',
        async: false,
        // You may send either the concatenated SHA values or a single SHA value of them.
        data: { string: concatenated_sha_values },
        success: function (data) {
            authentication_key = data;
        }
    });
    return authentication_key;
}
# Server side (Sinatra-style pseudo-code)
post('/getauthkey') do
    # One can apply a further hash (or an HMAC with a server-side secret) to the string passed in.
    authentication_key = Digest::SHA1.hexdigest(params['string'])
    return authentication_key
end
This method allows you to check whether any kind of file has been changed, be it an image file, a video file or any other file. I would be glad to know if this approach can be broken as well.

HTML 5 / Site config causing FB block

I created a new website using a newly registered domain.
When trying to share it as a link on Facebook, it is classed as "spammy" and I'm unable to share it.
After a few weeks of research and reporting to FB, I copied the site entirely and placed it on a new TLD.
This instantly became blocked on Facebook, which made me think there's something within the structure of the site that is causing it to be marked as spam.
Using the object debugger on the original URL has given a number of varying responses, such as:
"Error parsing input URL, no data was scraped"
Response code 206
Response code 203
I read that using Chrome can bug it out, so I used Firefox and Safari to check.
Does anyone have any idea why the response codes vary for a static site?
Are there any specific site setups which are currently causing FB to block?
I have read that certain .htaccess configs, such as www to non-www redirects, can upset FB. Is this true?
The sites in question are:
Link 1 (this was intended to be the only domain)
Link 2 (this was set up only when the original domain was blocked)
These domains are new and have never been used for spamming or mail.
I have checked all the blacklists I could possibly search and have not found anything that indicates problems.
It really does seem that there is something in the configuration of the site that is causing it to be blocked. Does anyone have any idea or experience with this?
I was able to scrape the page correctly using the Debug Lint tool:
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fwww.sophie-mcelligott.com%2F
Maybe Facebook blocks newly registered domains for a fixed time before letting you share them, presumably to stop spammers. Are you able to scrape both sites correctly?
