Does HTML snapshot required by googlebot need to be styled - javascript

To make an AJAX site crawlable by Googlebot, your website is required to send back an HTML snapshot when a page is requested with the _escaped_fragment_ variable set (for more info see here).
Does this HTML snapshot need to be styled correctly, i.e. does Googlebot use the snapshot to give a preview of your site (as you see on the search results page)? I ask because some of my pages use JavaScript to resize images dynamically, as this can't be done in CSS.
Thanks

If you need to take a snapshot of the page as it is rendered on the client side, then it might make sense to generate these snapshots with a headless browser such as HtmlUnit if you are using Java or php. It may be less work to try and move your image processing to the server side if feasible.
While it is unlikely that your snapshots will be used as a preview (all the #! crawled sites I checked said "No preview available"), it could technically be considered "cloaking" if your snapshot differs from your served page. I doubt Google would get mad over a few CSS differences, but it is worth noting.
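For reference, under the AJAX crawling scheme the crawler requests a `#!` URL with the fragment moved into an `_escaped_fragment_` query parameter, so the server first has to recognise such a request and work out which client-side state to snapshot. The following is a minimal sketch of that mapping only, using the Python standard library. The function name and the example URLs are made up for illustration, and producing the snapshot itself (for example with a headless browser) is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ESCAPED_FRAGMENT = "_escaped_fragment_"


def crawler_url_to_hashbang(url):
    """Map a crawler request URL back to the #! URL it stands for.

    Returns None when the URL is not a snapshot request.
    (Hypothetical helper, not part of any library.)
    """
    parts = urlsplit(url)
    fragment = None
    remaining = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == ESCAPED_FRAGMENT:
            fragment = value  # parse_qsl has already percent-decoded it
        else:
            remaining.append((key, value))
    if fragment is None:
        return None
    # Rebuild the URL without the special parameter and with "#!<fragment>".
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(remaining), "!" + fragment)
    )
```

A request handler could call this first: if it returns a URL, serve the pre-rendered snapshot for that state; otherwise serve the normal page.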

Related

Print Portion of the Browser Window to PDF and Send To Server to be Faxed

Like the title says, I have a requirement to print a part of the browser window to PDF and then send it to the server so that it can be faxed. I have already found a faxing service so the real problem is in figuring out how to generate the pdf to begin with. I have come up with several options to do this, however all of them come with significant downsides. They are:
Use window.print() on a button click along with print media queries and have the user download the resulting PDF and re-upload it to be faxed. The problem with this is that it is a multi-step process and my users would prefer to just click a single button.
Use a library like jsPDF to generate the PDF, output it to a byte array and upload it to the server. This will work, however it looks terrible because I lose all styling and my print media queries will not be applied. So far, this is my best option.
Render the HTML server side and generate a PDF from that. This will work fine, however it requires duplicating all of the work I have done client side (this is a SPA app) along with duplicating the maintenance.
Use a rendering service or library to run the client side application in a headless browser and generate the PDF from that output. This would be very complicated from a security perspective as the application lives behind a login page.
I would appreciate any suggestions not listed above as well as any advice on how to eliminate the cons posed by these options. Thanks in advance!
Looks like using jspdf to render the PDF along with a library such as html2canvas to convert the printable region into an image to preserve styles will get the job done. It has taken some work to configure everything properly (and I'm not through yet) but I am confident this solution will be effective.

How do I get the data of a website as shown in INSPECT ELEMENT and not in VIEW PAGE SOURCE?

I want to get the INSPECT ELEMENT data of a website, let's say Truecaller, so that I can get the name of the person whose mobile number I searched.
But whenever I write a Python script, it gives me the PAGE SOURCE, which does not contain the required information.
Kindly help me. I am a beginner, so please excuse any mistakes in the question.
TL;DR: Use Selenium (and PhantomJS)
The view page source will give you the HTML that was loaded when you made a request for the page (which is most likely what you are getting when you make a request from Python).
Since nowadays a lot of pages load things and modify the DOM after the initial html was loaded, you will not get most of the information you want just by looking into that initial response.
To get the inspect element information you will need some sort of web browser to actually go to the page, wait for the information you want to load, and then use it. However you still want to do this in your python script.
Enter selenium, which is a tool for browser automation (mostly used for testing webpages). You can create a python script that opens a browser page and executes whatever code you write for it to do (even wait for a while and search for an after load DOM element!). Your script will still open a browser (which is kind of weird I would guess).
Enter PhantomJS, another library that you can use to have a headless browser to do all your web testing without having to rely on the actual browser UI.
Using selenium only you might achieve your goals, but with phantomjs you can do that in an even cleaner way! Good Luck.
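A rough sketch of the Selenium approach described above, assuming the `selenium` package and a Chrome driver are installed. It uses headless Chrome rather than PhantomJS, since PhantomJS is no longer maintained and recent Selenium releases dropped its driver. The function name, the URL, and the CSS selector are placeholders, not anything taken from the site in the question.

```python
def rendered_html(url, css_selector, timeout=10):
    """Return the page's DOM as HTML once an element matching css_selector exists.

    Hypothetical helper: the selector must be one you found via Inspect Element.
    """
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run without opening a browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has inserted the element we care about.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        # page_source reflects the current DOM, not the original response body.
        return driver.page_source
    finally:
        driver.quit()
```

Something like `rendered_html("https://example.com/search?q=123", "div.result")` would then return the markup you see in the inspector, provided the selector matches an element that only appears after the scripts have run.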
INSPECT ELEMENT and VIEW PAGE SOURCE are not the same.
View source shows you the original HTML source of the page. When you view source from the browser, you get the HTML as it was delivered by the server, not after javascript does its thing.
The inspector shows you the DOM as it was interpreted by the browser. This includes for example changes made by javascript which cannot be seen in the HTML source.
What you see in the element inspector is not the source code anymore.
You see a version that has been manipulated by JavaScript.
Instead of trying to execute all the scripts on your own, which may lead to problems such as cross-origin security restrictions,
search the network tab for the actual search request and its parameters.
Then request the data from there; that is the trick.
It also seems that you need to be logged in to search on the URL you provided, so you will eventually need to send the same cookies, session data and headers that a request from your browser would.
In short: if the data you are looking for is not in the source, analyse where it is coming from.
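As a sketch of that idea with only the standard library: once the network tab shows which endpoint the page calls, the same request can be rebuilt in a script. Everything specific here is an assumption for illustration: the endpoint URL, the `q` parameter name, and the cookie value are placeholders you would replace with what your own browser actually sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(endpoint, params, cookie, user_agent="Mozilla/5.0"):
    """Build a GET request that mimics the XHR seen in the network tab.

    endpoint, params and cookie are placeholders copied from the browser.
    """
    url = endpoint + "?" + urlencode(params)
    headers = {
        "Cookie": cookie,          # session cookie from a logged-in browser
        "User-Agent": user_agent,  # some sites reject the default Python agent
        "Accept": "application/json",
    }
    return Request(url, headers=headers, method="GET")


def fetch_json(request, timeout=10):
    """Send the request and decode a JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(build_search_request("https://example.invalid/api/search", {"q": "5551234"}, "session=abc"))` would perform the call, assuming the real endpoint returns JSON. Check the site's terms of use before automating requests against it.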

How does Google render its results on the search engine results page

I first noticed that the results are not in the HTML received from the server, so I opened Dev Tools and started looking at the other files that my browser downloaded.
I noticed that there was only one XHR request, named gcosuc, and no JSON (nor XML or other data files) downloaded, so I thought the results were embedded in the JS itself.
I then searched all the .js files downloaded by Chrome and still could not find where the search results came from.
Then I thought that the search results might be inside an iframe element and for that reason not shown in Dev Tools. With this hypothesis in mind, I looked at the HTML generated by the JS after the DOM was loaded, thinking that the results might be embedded in an iframe, and again I was wrong.
Does anyone have any idea how Google gets and renders its search results?
As well as I've been able to figure out, it is all done using JavaScript and rendered dynamically as you scroll.
Google+ works similarly. Only what you see on the screen is actually on the DOM and it all disappears as you scroll.
If you are trying to screen scrape the results, I would recommend rendering into a browser and looking at the resulting DOM. I use Chromium.
I made a quick try, and it is indeed an AJAX request. The URI is /s and it passes GET parameters, including what you've searched for.
The response is some kind of awkward JSON that includes text, styles, and some JS. It looks like a series of JSON objects, separated by these characters: /*""*/
Here's an example (word wrapped because lines are really long):
Edit: I added the DevTools requests made by Google:
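Independently of the capture mentioned above, a response in the format this answer describes (JSON objects separated by `/*""*/`) could be split apart roughly as follows. This is a naive sketch based only on that description. The sample body in the usage note is invented, and a separator occurring inside a JSON string would break the split.

```python
import json

SEPARATOR = '/*""*/'


def split_chunks(body):
    """Split a response body on the separator and parse each piece as JSON."""
    chunks = []
    for piece in body.split(SEPARATOR):
        piece = piece.strip()
        if piece:  # skip the empty piece after a trailing separator
            chunks.append(json.loads(piece))
    return chunks
```

Calling `split_chunks('{"a": 1}/*""*/{"b": [2, 3]}/*""*/')` yields a list of two dictionaries, one per chunk.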

How to disable the view source option in Firefox and Chrome?

I have created a webpage, but my friends and colleagues always copy the source code and all the data easily. Is there any way to hide the page source option in the browser?
As a rule, if you are putting information on another user's computer (whether because you made a document or they viewed your webpage), you really can't control what they do with it.
This is an issue that larger companies deal with often. Have you heard of DRM? It's a mechanism that companies like to try to use to control how people can connect to their services, use their content and in general, try to exert control over their data while it's on your system.
Now, a web page is a relatively simple container for holding information. You expressed an urge to prevent your friends from copying the source code. You could try to encrypt it, but if it's using local data to decrypt itself, there still isn't going to be anything that stops them from just copying what's in the View Source window and running it again (even if they can't really read it).
I'd suggest that you don't worry about it. If what you have on your page is so important that others shouldn't be able to see it, don't put it on a webpage.
Finally, Google doesn't much care that you're able to view the source to their home page. Why not? Because the value of the search engine isn't in what the home page looks like, but in the data on the back-end that you don't have direct access to. The value is in the algorithms that execute on the server when you hit that Google Search button that queries that data and returns the information you're looking for. There's very little relative value in the generated HTML that you see in the page. Take a leaf from their book and don't stress that they copy your HTML.
No, there isn't any way to do it. You can disable right-clicking in the browser via JavaScript, but users can still use keyboard shortcuts to open the developer view (F12 in Chrome) and see the source. You cannot hide HTML or JavaScript from the client, but maybe you can make it harder to read.
No. Your HTML output is in the user's realm. Even if there was a way to disable view source in one client, a user could use a different one
Always assume that your site's HTML is fully available to end users.
Yes and no. You can definitely make HTML and JS harder to interpret by obfuscating your code, that is, taking your code and making it look confusing. Here is a tool that can do that: http://www.colddata.com/developers/online_tools/obfuscator.shtml
However, obfuscated code is still code, and it can be deobfuscated through any number of methods. If you post a song to the internet, even if they cannot find the mp3, they can simply record their speakers. If you upload an image and prevent users from downloading it, they can take a screenshot or use their camera. In order for HTML and JavaScript to work, they have to be interpreted by the user's computer, and even if you do find a way to disable "View Source" there are other ways, like a DOM inspector (F12 in IE/Chrome, Ctrl+Shift+K in Firefox).
As a workaround, use copyright, warn your users they will be punished if they copy your code, and put watermarks, labels and logos over any mp3s or images you don't want stolen. In the end, disabling right clicking (which is also possible, see How do I disable right click on my web page? ) or disabling selection (also possible) does nothing, because there is more than one way to get your code, like searching through temporary internet files.
However, you ask "what if I want a site where my users can log in and I need security? How can I make it so nobody can see my code then? Doesn't it have to be secure and not out in the open?"
And the answer is, yes, it needs to be secure. That's what server-side languages, like PHP, are for. PHP does all the work on the server itself, so the user cannot see it. Rather than running in real time on the user's computer, PHP runs before the page is sent, and only its output reaches the browser. The code is never put onto the user's computer, because the user's computer doesn't need it. SSL/TLS is often used alongside it to encrypt the traffic between the browser and the server.
But HTML and JavaScript have to be processed in real time on the user's computer, so disabling View Source is useless. There are many, many ways that users could get around it, even if View Source is disabled, and even if right-clicking is disabled.
If your code doesn't need to be secure, however, I'd recommend you consider keeping it open source. :)

Pre-rendering html page into image

I want to pre-cache next web page into a thumbnail. Is it possible to pre-render a html page (with css) into an image on-the-fly with javascript/jQuery? And how to persist that temporary image on the client?
You could make an AJAX request to a server-side script that returns an image or a link to an image.
This script needs to request the required data from the website and render it using a rendering mechanism.
The returned information could be a link to the generated image on the server.
Performance could be pretty low, depending on the data to be retrieved and rendered.
This question will show you a solution to render a website and produce a PDF.
You could use this approach and convert the PDF into an image using ImageMagick (which needs to be installed on your server).
Afaik, that's not possible on the client-side, because it raises security concerns. Even the <canvas> element cannot render HTML-elements (only browser plugins are allowed to use the methods provided for that purpose).
What is the site written in? If you have server-side capabilities, you could probably do it there and send the image to be cached. It is not possible from jQuery or JavaScript as far as I know.
Unless your page is absurdly complex, it is more likely that your bottleneck is the network rather than rendering. You can easily preload the HTML page and all its important resources (e.g. images, multimedia, etc.), so that when the user goes to the next page, you don't need to hit the network anymore and can load it from the local cache.
There are a few techniques you can use to preload HTML files, invisible iframe is probably the easiest (though I never tried it myself).
