Two years ago I shelved a working Ruby web scraper that automatically downloaded TV movie listings one week at a time. I started working on it again today and found that Ruby can no longer access either the controls or the data of the web page being displayed.
Debugging shows that scripts now generate the documents that load subsequent web pages, and the initial scripts modify the current document when they run (removing links). Any URL I use results in the same initial web page being loaded.
I am looking for suggestions on how to access the data in the displayed web pages. I am not very knowledgeable about JavaScript, but I would pursue it if I had a definite plan to follow. I believe I found the href that loads the second web page, but it only loads the initial page again, so other mechanisms are in play (e.g. there are cookies mentioned in the script).
Downloading the listings requires a minimum of 28 web pages, and a normal run for movie information processes several hundred pages.
As you've discovered, you can't scrape pages with dynamic content using simple HTTP requests. You need to simulate the page actually being used in a browser so the JavaScript runs and generates the content you need. This tutorial will probably help you do what you're trying to accomplish.
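To make that concrete, here is a minimal sketch in Python using Selenium to drive a headless Chrome browser (the same approach is available from Ruby via the selenium-webdriver gem). The URL and the selector are placeholders, not taken from the question:

```python
# Drive a real browser so the page's JavaScript runs before we read the DOM.
# Assumes the selenium package and a chromedriver binary are installed.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # no visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/tv-listings")  # placeholder URL
    # By now the scripts have generated the document, so the links and
    # listing data exist in the DOM, unlike with a raw HTTP GET.
    for link in driver.find_elements(By.CSS_SELECTOR, "a"):
        print(link.get_attribute("href"), link.text)
finally:
    driver.quit()
```

Because the browser also keeps cookies and session state across navigations, following the generated links works the same way it would for a human user.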
I don't know if this is PHP or JavaScript, but what do you call this technique of changing web content? For instance, the MDC Web demo site: its content is empty if you view the source, but it contains all the elements if you inspect the page.
Regarding PHP, I think it is done with PHP code in MDC Web's case, but how exactly? Is this a common technique? I want to know this method because it's useful in cases where there's no need to reload the page, yet the content and the URL can still change.
This is called a single-page application (a.k.a. SPA).
A single-page application (SPA) is a web application or web site that interacts with the user by dynamically rewriting the current page rather than loading entire new pages from a server. This approach avoids interruption of the user experience between successive pages, making the application behave more like a desktop application.
In a traditional web application I generally write JSPs that render HTML to the browser and communicate with the server through form submits or JavaScript. This usually involves transitioning from one page to another with a full browser refresh, many times over.
Now, with HTML5, I can still use the same approach, but I want to achieve more of a desktop-application look and feel, which means no browser refresh. I am really confused about how this can be achieved.
Do I need to write one big HTML5 file that contains all the application markup, use JavaScript to show or hide the divisions needed at that point in time, and communicate with the server via JavaScript?
Or should I have a minimal first HTML5 page where the user lands, then create all the HTML5 content dynamically with JavaScript and communicate with the server via JavaScript? This looks more difficult.
Or is there a way to move from one page to another without the effect of a page load/refresh?
In general, what should the approach be with HTML5?
Take a shopping cart as an example: the first view shown to the user is the list of items to purchase; the user then moves to the next view, such as the details of an item; the view after that might be payment.
If you have a resource or example that explains this, that would be great.
I'm working on a web app that uses Backbone's HTML5 History option. In order to avoid having to code everything on both the client and the server, I'm using this method to route every request to index.html.
I was wondering if there is a way to get Twitter Cards to work with this setup, as currently Twitter can't read the page, since everything is loaded dynamically with JavaScript.
I was thinking about using the User-Agent header to detect whether the request is from Twitterbot and, if it is, serving a static version of the page with the required meta tags. Would this work?
Thanks.
Yes.
At one job we did this for all the SEO/search/Facebook stuff, etc.
We would sniff the user agent, and if it was one of the following crawlers:
Facebook Open Graph
Google
Bing
Twitter
Yandex
(a few others I can't remember)
we would redirect to a special page that was written to dump all the relevant data about the page for SEO purposes into nicely formatted (but completely unstyled) markup.
This allowed us to retain our Google index position and proper Facebook sharing even though our site was a total single-page app in Backbone.
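As a rough illustration of the technique (not the actual code from that job), here is a minimal user-agent check in Python/Flask; the bot signatures and the snapshot directory are assumptions:

```python
# Serve prerendered snapshots to known crawlers, the SPA shell to everyone else.
from flask import Flask, request, send_from_directory

app = Flask(__name__)

# Substrings that identify the crawlers we want to serve static pages to.
BOT_SIGNATURES = ("facebookexternalhit", "googlebot", "bingbot",
                  "twitterbot", "yandexbot")

def is_crawler(user_agent):
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

@app.route("/", defaults={"path": "index.html"})
@app.route("/<path:path>")
def serve(path):
    if is_crawler(request.headers.get("User-Agent")):
        # Unstyled, prerendered page with all the meta tags filled in.
        return send_from_directory("snapshots", path)
    # Normal users get the single-page app shell.
    return send_from_directory("static", "index.html")
```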
Yes, serving a specific page to Twitterbot with the right metadata markup will work.
You can test your results while developing using the Card preview tool:
https://dev.twitter.com/docs/cards/preview (with your static URL or just the tags).
I'm making a Django page that has a sidebar with some info loaded from external websites (e.g. bus arrival times).
I'm new to web development and I recognize this as a bottleneck. As it is, the page hangs for a fraction of a second while it loads the data from the other sites. It doesn't display anything until it gets this info, because it runs Python scripts to fetch the data before baking it into the HTML.
Ideally, it would display the majority of the page directly from my web server, show a little "loading" gif in the sidebar, and then display the external data once it has actually been fetched.
How can I achieve this? I presume JavaScript will be useful. How can I integrate it with my existing poller scripts?
You probably don't need up-to-the-second information, so have another process load the data into a cache, and have your website read it from the local cache.
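For example, a minimal sketch using Django's cache framework (since this is a Django app); fetch_bus_arrivals is a stand-in for your existing poller code, and the cache key and timeout are illustrative:

```python
# A background process refreshes the cache; the view only ever reads it.
from django.core.cache import cache
from django.shortcuts import render

def fetch_bus_arrivals():
    """Stand-in for the existing slow call to the external site."""
    return []

def refresh_arrivals():
    # Run this from a separate process (cron job, management command,
    # Celery beat, ...) so the request cycle never blocks on it.
    cache.set("bus_arrivals", fetch_bus_arrivals(), timeout=60)

def index(request):
    # Reading the local cache is fast, so the page renders immediately.
    arrivals = cache.get("bus_arrivals", [])
    return render(request, "index.html", {"arrivals": arrivals})
```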
The easiest, though not the most beautiful, way to integrate something like this would be with iframes. Just make iframes for the secondary stuff, and they will load themselves in due time. No JavaScript required.
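In Django terms that could be a separate view that the iframe points at, so only the frame blocks on the external sites while the rest of the page renders immediately. A minimal sketch, with illustrative names:

```python
# The main template embeds <iframe src="/sidebar/"></iframe>; only this
# fragment view waits on the external sites.
from django.shortcuts import render

def fetch_bus_arrivals():
    """Stand-in for the existing slow call to the external site."""
    return []

def sidebar(request):
    # Rendered inside the iframe: the browser fills the frame in
    # whenever this fetch completes.
    return render(request, "sidebar.html", {"arrivals": fetch_bus_arrivals()})
```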
I am in the process of developing an online music magazine. We have an HTML5/Flash music player, and it forms a major part of the website. But the site also has a lot of articles and other content. So basically, I want seamless music playback across page loads, but I also want to avoid a complete JavaScript application, because I want all the content to be spider-friendly and indexable by Google.
I use the HTML5 History API with a hashbang (#!) fallback for loading various content within the main page on clicks, and the loaded URLs also point to pages containing the content.
For example:
A munimkazia.com/page1.html link on my index page munimkazia.com will load the content from page1.html and insert it. The URL will change to munimkazia.com/#!/page1.html in Firefox and IE, and to munimkazia.com/page1.html in Chrome.
Since the href is munimkazia.com/page1.html, spiders will follow the link and fetch the content.
I have the page set up properly at page1.html, ready for viewing. But now I have problems:
If I decide to use AJAX loads from this page, the URLs appearing in the browser location bar will not be consistent with the hashbang fallback (http://munimkazia.com/page1.html/#!/page2.html).
If I decide to redirect all clicks to the main container page at http://munimkazia.com and load page2.html from there, everything will work fine afterwards, but that page load will interrupt any music playing at the time.
Also, I don't want to rewrite every http://munimkazia.com/page1.html URL to http://munimkazia.com/#!/page1.html, because I want all the content to be present in the page rather than fetched and written by JavaScript, so that search engine spiders can read it.
I am aware that Google has a spec for reading content from #! URLs, but I want the page to load with the article content even if the user has JavaScript disabled.
Any ideas/advice/workarounds?
Edit: Those URLs are just examples to explain my point. There is no JavaScript code fetching pages at munimkazia.com.
Hash-bang (#!) URLs can be indexed by Google; that's kinda the whole point of them, otherwise people would just use the hash (#) on its own.
The idea is that Google sees the #! URL and converts it into a query-string parameter, e.g. example.com/#!/products/123/ipod-nano-32gb becomes example.com/?_escaped_fragment_=/products/123/ipod-nano-32gb, but users still use the hash-bang URL. You program the server to respond to the ?_escaped_fragment_ parameter, while JavaScript users get redirected to the proper #! URL.
Check out Google's specification here: http://code.google.com/web/ajaxcrawling/docs/getting-started.html
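A minimal sketch of the server side of that scheme in Python/Flask, with the snapshot rendering stubbed out (the routes and markup are illustrative, not from the spec):

```python
# Serve prerendered content for ?_escaped_fragment_= requests, the normal
# app shell for everyone else.
from flask import Flask, request

app = Flask(__name__)

APP_SHELL = "<html><body><script>/* SPA boots and reads the #! part */</script></body></html>"

def render_snapshot(fragment):
    """Stand-in: return static HTML for the state the SPA would show
    at example.com/#!<fragment>."""
    return "<html><body>Content for {}</body></html>".format(fragment)

@app.route("/")
def index():
    fragment = request.args.get("_escaped_fragment_")
    if fragment is not None:
        # Googlebot asked for /?_escaped_fragment_=/products/123/...
        return render_snapshot(fragment)
    return APP_SHELL
```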
I don't think it's a good idea to use both types of URL, as you'd have two URLs being posted on blogs, Twitter, etc. for the same page, and it would also be a nightmare to write the code to handle that reliably. You'd probably have to settle for hash-bangs for now, until the HTML5 History API is more broadly supported.