I have signed up (paid) for Google Site Search. They gave me the URL of a sort of web service: I send it a query, it searches my site, and it returns the search results as XML. I am trying to load this XML via Ajax from a page on my site, but I cannot. I can load XML from any page on my own domain, so I am assuming the problem is that the XML comes from Google's domain. There has got to be a way to load it, though; I don't think they would have given me the URL if I couldn't do anything with it. Does anyone know how to do this?
Thanks!
UPDATE:
This is what the page on Google that gave me the XML URL says:
How to get XML
You can get XML results for your search engine by replacing query+terms with your search query in this URL:
http://www.google.com/cse?cx=MY_UNIQUE_KEY&client=google-csbe&output=xml_no_dtd&q=query+terms
Where MY_UNIQUE_KEY = my unique key.
You can't load files from another domain with AJAX; the browser's same-origin policy blocks it. However, you can set up a script on your own server that makes the content available from your domain. For instance, in PHP you could write a file googlexml.php:
<?php
// Proxy Google's XML so our own pages can fetch it with AJAX, passing the search term through.
$q = isset($_GET['q']) ? urlencode($_GET['q']) : 'query+terms';
readfile("http://www.google.com/cse?cx=MY_UNIQUE_KEY&client=google-csbe&output=xml_no_dtd&q=" . $q);
?>
And then you could access that with AJAX. I'm not sure if Google's terms of use will let you do that, but if they do, then this is an option.
Doesn't Google offer the ability to point a DNS name at the IP of their service, folding it into your domain? That way your AJAX request could go to
googleAlias.mydomain.com
Google should support this, but I don't know for sure. I imagine they would in the same way they do with GMail and external-domain mail.
That removes your cross-domain JavaScript issues.
Edit: I expanded on this below, and another user helpfully pointed out that this should work (thanks Stobor).
Well, to get my company mail into GMail, if I recall correctly, I needed to change the MX record on my DNS to point to a Google IP. If Google supports it, you may be able to add an A record to your domain so that an AJAX request to foo.yourdomain.com is the same as one to search.google.com, or whatever the host is. Google would need to recognize requests arriving under the hostname in your A record and say, "Oh yes, that's me, on my client's behalf."
For those coming across this now, the AJAX Search API may be what you want: http://code.google.com/apis/ajaxsearch/documentation/
EDIT: Actually, upon further review, that may not hook in with the site search...
I have been looking for Javascript code to access the Google Sites API, but I can't find anything definite.
All I want to do is be able to take the contents of a page I made on Google Sites and display them on another website I own.
Is this possible, and if so, are there tutorials or example code available?
You can retrieve the contents of Google Sites via the API:
https://developers.google.com/google-apps/sites/docs/1.0/developers_guide_protocol#ContentFeedGET
You could create a PHP/Python script on your own server that executes the API calls on demand and returns the result. Via JavaScript/AJAX you could then access it locally on your own server, without cross-origin problems.
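For illustration, a minimal PHP sketch of such a server-side proxy might look like the following. The feed URL format and the GData-Version header reflect my reading of the linked docs, and the domain/site names and access token are placeholders you would replace with your own.

<?php
// Hypothetical proxy: fetch a Google Sites content feed server-side and hand it
// to our own pages, so the browser never has to make a cross-origin request.
$feedUrl = 'https://sites.google.com/feeds/content/example.com/my-site'; // placeholder domainName/siteName
$accessToken = 'YOUR_OAUTH_ACCESS_TOKEN';                                // placeholder: obtain via OAuth 2.0

$ch = curl_init($feedUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Authorization: Bearer ' . $accessToken,
    'GData-Version: 1.4',
));
$response = curl_exec($ch);
curl_close($ch);

header('Content-Type: application/atom+xml');
echo $response; // a page on this same server can now load it with a plain AJAX request
?>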
I have a few single-page web apps on multiple domains that rely heavily on JavaScript/AJAX to fetch and show content. Based on logs and search results, I can tell that Googlebot runs JavaScript on some of the domains but not on others. On some it indexes everything that is only available via JS; on others it doesn't even seem to run JS at all.
Can anybody tell me how Googlebot decides what JS to run, and whether I can do anything to get it to run JS on my other domains?
PS: I know that normally I should use something like server-side rendering for this, but I'm not at all dependent on search results and rankings, so it's not really worth the effort. I'm just curious how Googlebot decides whether it should run JS or not, and whether there's anything easy I can do to change that on my other domains.
You can learn more about how Google renders Ajax-based websites, along with a list of best practices, directly from the Google developer sites:
https://webmasters.googleblog.com/2014/10/updating-our-technical-webmaster.html
https://developers.google.com/webmasters/ajax-crawling/
Regarding your specific problem: as a first step, I suggest you analyse each domain using the "Fetch as Google" feature in Google Webmaster Tools and go through every technical aspect mentioned in the Google guide:
https://support.google.com/webmasters/answer/158587?hl=en
I think Google has updated its research on the subject:
http://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157
The functionality to fetch your page as Googlebot and see the results has now moved into Google Search Console.
You can use the URL Inspection Tool to analyze your live URL.
I've tested it on an AngularJS app, and Googlebot was able to crawl page content with data fetched from an AJAX request.
One very important restriction is that Googlebot does not allow AJAX requests while the page is loaded.
In my blog post I explain how to adapt a Single Page Application so that it becomes crawlable, without the need to render HTML snapshots on the server.
Is there a method, or is it even possible, to get a product's details by using a URL? Let's say I paste the URL of a product from a store like Walmart or Best Buy; would it be possible to write something to retrieve the product info (price, name, description, etc.)? Does this exist, or would it have to be something site-specific that I write for each store?
One solution I see is to parse the HTML of the page the URL redirects to, using, for example, Tika, but I'm not sure the e-commerce website in question will like that very much :) Maybe you could ask them whether they have implemented an API for accessing their product data?
Yes, it is possible, but not using JavaScript alone, due to the same-origin policy. You must send that URL to your server, read the external page on the server side, and return the results back to the browser.
On the server side (in whichever language you are using), download the web page, parse it (using XML/XPath if you can) and extract the relevant information.
As already noted, watch out: some websites forbid such access (called web scraping), and others might actively try to prevent it, e.g. by detecting clients that aren't real browsers.
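As a rough example of that server-side step, here is a minimal PHP sketch using DOMDocument and DOMXPath. The URL and the XPath expressions are purely hypothetical; every store structures its markup differently, so in practice you need selectors per site.

<?php
// Hypothetical scraper: download a product page and pull out a couple of fields.
$url  = 'https://www.example.com/some-product';   // placeholder product URL
$html = file_get_contents($url);                  // real sites may require a browser-like User-Agent, cookies, etc.

$doc = new DOMDocument();
libxml_use_internal_errors(true);                 // real-world HTML is rarely valid XML
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// These queries are invented for illustration; inspect each store's markup to write real ones.
$name  = $xpath->evaluate('string(//h1[@class="product-title"])');
$price = $xpath->evaluate('string(//span[@class="price"])');

echo json_encode(array('name' => trim($name), 'price' => trim($price)));
?>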
What you're talking about is website scraping and yes, it's possible and there are loads of tools out there to help you with it. Some websites aren't happy with you doing it though.
You could do it in C# using the HttpWebRequest class to request data from a URL and then parse it with something like XmlReader or the HTML Agility Pack (http://html-agility-pack.net/).
I need to figure out the best way to determine whether someone is the actual owner of a website. I don't just mean the domain, although in a lot of cases that might amount to the same thing.
My first inclination was to have them put a special comment in their HTML that my program can scrape. e.g.:
<!-- #webcode:1234 -->
One possible problem with that approach is that someone could, in theory, add it via the comments on a page or some other way of adding content. Although I'm not sure anything I have them do couldn't be spoofed that way.
My other idea, since I was planning on also offering a JavaScript widget, was to just scrape for that, although I didn't necessarily want to force them to add the widget.
<script type="text/javascript" src="http://yoursite.com/widget/widget/A4923D2342JF"></script>
What other mechanisms could be employed to determine ownership/control of a website?
Here are the options that Google uses for Domain verification:
Create a CNAME or TXT record in your domain's DNS settings. These methods require accessing the DNS settings for your domain at your domain host's website. Which method you can choose (CNAME or TXT record) depends on what's offered in your Google Apps control panel. We're currently rolling out the TXT record method but still ask many customers to create a CNAME record instead.

Upload an HTML file to your domain's web server. This method requires being able to upload files to your domain's web server. Try doing this if you don't have access to your domain's DNS settings.

Add a meta tag to your home page. This method is available only for some customers (it's another new method we're rolling out). It requires accessing your domain's web server but not uploading to it. Try doing this if you have write access to files on the server but can't upload new files.
CNAME/TXT records or uploading an HTML file to the root of the domain are the most secure options, since they require full control of the domain. If you want to be a bit more lax, you could use a meta tag in the head node, which at least prevents someone from simply adding a comment to a page. It all depends on how secure you want to be.
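If you implement the DNS route yourself, the check can be as small as the following PHP sketch; the "site-verification=" record format and the example token are just assumptions for illustration.

<?php
// Hypothetical check: has the owner published our verification token as a TXT record?
function domain_has_token($domain, $expectedToken) {
    $records = dns_get_record($domain, DNS_TXT);
    foreach ($records as $record) {
        if (isset($record['txt']) && trim($record['txt']) === 'site-verification=' . $expectedToken) {
            return true;
        }
    }
    return false;
}

var_dump(domain_has_token('example.com', '1234abcd')); // placeholder domain and token
?>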
Do what Google does for their Webmaster Tools. Generate a unique key, and have them put it in a meta tag in the head of their front page. It's pretty unlikely that a user who does not own the site will be able to change the contents within the <head></head> tags. If they can, the site is vulnerable to almost any kind of vandalism, and is hopeless.
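A sketch of that verification step in PHP might look like this; the meta tag name ("my-site-verification") is invented, and real code would add timeouts and error handling.

<?php
// Hypothetical verifier: fetch the claimed site's front page and look for our token
// in a meta tag that sits inside <head>.
function site_has_meta_token($siteUrl, $expectedToken) {
    $html = file_get_contents($siteUrl);
    if ($html === false) {
        return false;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // Only accept the tag if it is actually inside <head>, not buried in user content.
    $content = $xpath->evaluate('string(//head/meta[@name="my-site-verification"]/@content)');
    return $content === $expectedToken;
}

var_dump(site_has_meta_token('http://www.example.com/', '1234abcd')); // placeholder URL and token
?>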
You could go with your original idea, but only accept the comment inside, say, the <head> tag of the website. This way you avoid someone pasting the comment into a 'comments' section, which was your original concern.
In fact, I subscribed to a service that did just that: include the special comment in the header section of your page.
Make it part of the requirement that the comment be inside the <head> tag. Typically, even user-generated content won't make its way into the head.
Also, your concern about the comment hack is probably unnecessary. Any comment system worth its weight escapes comments so that they are not rendered as actual HTML markup.
Have them put a file with a hard to guess name on the server?
such as http://www.example.com/5gdbadcab234g3.txt
The only true way is to be able to access their fileserver. Anything transferred through HTTP can be reproduced.
If you don't have access to their server, then the best way would be to have an encrypted string embedded on the page (or in an image or some binary file on that page).
The string should be composed of the URI, author, and timestamp. That way, even if someone does copy this string to their website, you would still be able to determine the author and the page. An added bonus is you'll be able to determine whether there was a theft.
Granted, this is only as good as the algorithm that encrypts the page/author combination; hackers who are good at decrypting could get around it. Additionally, a dishonest author could create his own key for his page, so you'd need to host the encryption so that no one could tinker with the timestamp. Also, this requires that all authors place the code on their pages.
I know you mentioned that it isn't necessarily domain-dependent, but that would help. You could hash the domain (as domains are unique) and send the person that string to put somewhere on their site, either in a .txt file or in the header, as others have mentioned.
Then you store all the domains and their hashes in a database, and your scraper checks that the domain it is scraping matches the hashed comment string; if it checks out, then it's fine.
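A hedged PHP sketch of that scheme: the secret, the .txt location and the token format are assumptions, but the point is that the token is derived from the domain, so copying it to another site is useless.

<?php
// Hypothetical scheme: derive the token from the domain with a server-side secret,
// so a token copied onto a different domain will never verify.
$secret = 'replace-with-a-long-random-server-side-secret';

function token_for_domain($domain, $secret) {
    return hash_hmac('sha256', strtolower($domain), $secret);
}

$domain   = 'example.com';                          // domain being checked
$expected = token_for_domain($domain, $secret);

// The scraper fetches the token from wherever you told the owner to put it
// (here, a hypothetical verification.txt in the site root) and compares.
$raw   = @file_get_contents('http://' . $domain . '/verification.txt');
$found = ($raw === false) ? '' : trim($raw);

var_dump(hash_equals($expected, $found));
?>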
This flickr blog post discusses the thought behind their latest improvements to the people selector autocomplete.
One problem they had to overcome was how to parse and otherwise handle so much data (i.e., all your contacts) client-side. They tried getting XML and JSON via AJAX, but found it too slow. They then had this to say about loading the data via a dynamically generated script tag (with callback function):
JSON and Dynamic Script Tags: Fast but Insecure
Working with the theory that large string manipulation was the problem with the last approach, we switched from using Ajax to instead fetching the data using a dynamically generated script tag. This means that the contact data was never treated as string, and was instead executed as soon as it was downloaded, just like any other JavaScript file. The difference in performance was shocking: 89ms to parse 10,000 contacts (a reduction of 3 orders of magnitude), while the smallest case of 172 contacts only took 6ms. The parse time per contact actually decreased the larger the list became. This approach looked perfect, except for one thing: in order for this JSON to be executed, we had to wrap it in a callback method. Since it’s executable code, any website in the world could use the same approach to download a Flickr member’s contact list. This was a deal breaker. (emphasis mine)
Could someone please go into the exact security risk here (perhaps with a sample exploit)? How is loading a given file via the "src" attribute in a script tag different from loading that file via an AJAX call?
This is a good question, and this exact sort of exploit was once used to steal contact lists from Gmail.
Whenever a browser fetches data from a domain, it sends across any cookie data that the site has set. This cookie data can then be used to authenticate the user and fetch user-specific data.
For example, when you load a new stackoverflow.com page, your browser sends your cookie data to stackoverflow.com. Stackoverflow uses that data to determine who you are, and shows the appropriate data for you.
The same is true for anything else that you load from a domain, including CSS and Javascript files.
The security vulnerability that Flickr faced was that any website could embed this javascript file hosted on Flickr's servers. Your Flickr cookie data would then be sent over as part of the request (since the javascript was hosted on flickr.com), and Flickr would generate a javascript document containing the sensitive data. The malicious site would then be able to get access to the data that was loaded.
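To make the mechanism concrete, here is a hypothetical PHP sketch of such a cookie-authenticated JSONP endpoint (not Flickr's actual code; load_contacts_for_user and the callback name are made up). Because the browser attaches the session cookie to any <script src=...> request for this URL, whichever page embeds it receives the logged-in user's private data as executable JavaScript.

<?php
// Hypothetical contacts.js endpoint. The danger: it authenticates via the session
// cookie, which the browser sends even when a *third-party* page embeds this URL
// in a <script> tag, so the wrapped data runs in the attacker's page.
session_start();
header('Content-Type: application/javascript');

$contacts = load_contacts_for_user($_SESSION['user_id']);  // made-up helper
echo 'contactsCallback(' . json_encode($contacts) . ');';  // executable JS, readable by the embedding page
?>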
Here is the exploit that was used to steal google contacts, which may make it more clear than my explanation above:
http://blogs.zdnet.com/Google/?p=434
If I were to put an HTML page on my website like this:
<script src="http://www.flickr.com/contacts.js"></script>
<script> // send the contact data to my server with AJAX </script>
Assuming contacts.js uses the session to know which contacts to send, I would now have a copy of your contacts.
However if the contacts are sent via JSON, I can't request them from my HTML page, because it would be a cross-domain AJAX request, which isn't allowed. I can't request the page from my server either, because I wouldn't have your session ID.
In plain English:
Unauthorised computer code (JavaScript) running on people's computers is not allowed to get data from anywhere but the site on which it runs; browsers are obliged to enforce this rule.
There is no corresponding restriction on where code can be sourced from, so if you embed data in code, any website the user visits can employ the user's credentials to obtain the user's data.