Scraping and External Resources - javascript

I've just started to learn about scraping and have a quick question.
Scraping images and files through the DOM is no problem, but I was curious whether it is possible to scrape external resources linked to a document, such as web fonts (sorry, I couldn't think of another example off the top of my head). Things like this are used within the page but not linked through typical means.
Could anyone tell me whether this is possible? I only know Ruby and a bit of JS. Also, if you can give me other examples of resources like web fonts that aren't linked normally, that would be cool too.
Thanks.

Related

How do I scrape data generated with javascript using BeautifulSoup?

I'm trying to migrate some comments from a blog using web scraping with Python and BeautifulSoup. The content I'm looking for isn't in the HTML itself and seems to be generated by a script tag (which I can't find). I've seen some answers about this, but most of them are specific to a certain problem and I can't figure out how to apply them to my site. I'm just trying to scrape comments from pages like this one:
http://www.themasterpiececards.com/famous-paintings-reviewed/bid/92327/famous-paintings-duccio-s-maesta
I've also tried Selenium, but I'm using a Cloud9-based IDE currently and it doesn't seem to support web drivers.
I apologize if I botched any of the lingo; I'm pretty new to programming. If anyone has any tips, that would be helpful. Thanks!
There are several ways to scrape such content. One is to find out how comments are loaded on this website. A quick look in the Chromium developer tools shows that the comments for the page mentioned are loaded via this API call.
This may not suit you, as you may not be able to generate this URL for every page.
Another, more reliable way is to render the JavaScript-generated content using a headless browser. For ease of implementation I would suggest using Scrapy with Splash. Splash is a JavaScript rendering service written in Python; it exposes an HTTP API and returns the rendered page for the URLs you request.
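As a rough sketch of the second approach, the snippet below builds a request to Splash's render.html endpoint, which returns the page's HTML after its scripts have run. It assumes a Splash instance is already running on http://localhost:8050 (the default port); the helper names buildSplashUrl and fetchRenderedHtml and the two-second wait are illustrative choices, not part of the original answer. The HTTP API is language-agnostic, so the same request can be made from Python and the result passed to BeautifulSoup.

```javascript
// Build the URL for Splash's render.html endpoint.
// Assumption: Splash is reachable at splashBase (default http://localhost:8050).
function buildSplashUrl(pageUrl, { splashBase = "http://localhost:8050", wait = 2 } = {}) {
  const endpoint = new URL("/render.html", splashBase);
  endpoint.searchParams.set("url", pageUrl);     // page to render
  endpoint.searchParams.set("wait", String(wait)); // seconds to let scripts run
  return endpoint.toString();
}

// Fetch the rendered HTML (needs a runtime with a global fetch, e.g. Node 18+).
async function fetchRenderedHtml(pageUrl, options) {
  const response = await fetch(buildSplashUrl(pageUrl, options));
  if (!response.ok) {
    throw new Error(`Splash returned HTTP ${response.status}`);
  }
  return response.text();
}
```

The returned string is ordinary HTML, so any parser can then extract the comments from it.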

Using JavaScript for scanning wikipedia articles

I recently saw http://www.histography.io/, a system that uses HTML, CSS and JavaScript to scan Wikipedia articles: when you hover over a point, it grabs the article and the related YouTube video so they can be displayed to you.
I have been exploring the system for the past two hours but can't find how it fetches the large amount of data it uses.
Any pointers to the technique or functions used to fire the events in JS would be very helpful.
In terms of how the videos are being rendered, it looks like there is a large manifest file with video IDs, which I presume correspond to YouTube video IDs. Refer to: http://www.histography.io/int.js
The main scripts are uglified so it's hard to tell you exactly what each function is doing.
For future reference, I'd suggest checking the network tab of a website in dev tools to get a better understanding of where resources are coming from.
Also, when you request a wiki page, the site makes a request to a server which takes the following inputs at the endpoint /wiki_page.php:
link
title
year
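To illustrate, a request to that endpoint could be assembled as below. This is only a sketch: it assumes the three inputs are sent as query-string parameters and that the endpoint lives on the same host as int.js, neither of which the answer confirms (the site may well use POST). The function name buildWikiPageUrl and the sample values are made up for the example.

```javascript
// Assemble a URL for the /wiki_page.php endpoint with its three inputs.
// Assumptions: parameters travel in the query string, same host as int.js.
function buildWikiPageUrl({ link, title, year }, base = "http://www.histography.io") {
  const url = new URL("/wiki_page.php", base);
  url.searchParams.set("link", link);
  url.searchParams.set("title", title);
  url.searchParams.set("year", String(year));
  return url.toString();
}

// Hypothetical sample values, only to show the shape of the call.
const exampleUrl = buildWikiPageUrl({
  link: "https://en.wikipedia.org/wiki/Moon_landing",
  title: "Moon landing",
  year: 1969,
});
```

Comparing a URL built this way with what the network tab actually shows is a quick way to check whether the assumptions hold.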

API development, Bootstrap and jQuery. Good idea to inject them all in user's code?

We are developing an API that injects HTML code into the user's code. The main idea is that the user defines a couple of divs and the API injects the HTML code into them. The idea is simple and it's already developed. It works nicely.
The problem is that we are using jQuery and Bootstrap in that HTML code, and we are a little lost about how to handle those frameworks with regard to the user's code. Should we inject them into the user's page as well? We think that could cause some kind of trouble if the user is already using them in his own code... or are we wrong?
Anyway, in my opinion, this solution is inelegant and even a little bit crappy. Is there a smarter way to accomplish it?
Thanks!
If you have to have jQuery and Bootstrap, the only acceptable solution for an embeddable widget would be an iframe.
You could write a loader script which places the iframe with your main content onto the page.
I've written a quite extensive article about how to build embeddable widgets and their best practices on my blog here:
http://codeutopia.net/blog/2012/05/26/best-practices-for-building-embeddable-widgets/
(based on a Stack Overflow answer I gave some time ago, but couldn't find the link to it)
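A minimal sketch of the loader script mentioned above might look like this. The function name mountWidget, the container id and the widget URL are all hypothetical. Because jQuery and Bootstrap are loaded inside the iframe's own document, they cannot clash with whatever versions the host page uses.

```javascript
// Place an iframe with the widget's content into a div the user has defined.
// `doc` is the host page's document; passing it in keeps the function testable.
function mountWidget(doc, containerId, widgetUrl) {
  const container = doc.getElementById(containerId);
  if (!container) {
    return null; // the user has not defined the expected div
  }
  const frame = doc.createElement("iframe");
  frame.src = widgetUrl;           // page that loads jQuery/Bootstrap itself
  frame.title = "Embedded widget"; // accessible name for the frame
  frame.style.border = "0";
  frame.style.width = "100%";
  container.appendChild(frame);
  return frame;
}

// In a browser (hypothetical id and URL):
// mountWidget(document, "my-widget", "https://widgets.example.com/embed.html");
```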

How to reuse parts of WordPress site e.g. header, footer, part of header for multiple WordPress sites?

I am looking for a way to reuse the header and footer navigation links (with their styles, of course) from one of my WordPress websites on several other WordPress sites.
Please note that I'm trying to share header and footer among WordPress sites, not from WordPress site to a PHP page.
The sites I'm referring to are on the same server. I have the following directory structure:
example.com/            # main site is here
    some-other-site/
        wp-admin/
        wp-content/
        wp-include/
        ...
    wp-admin/
    wp-content/
    wp-include/
    ...
I would really appreciate some direction on how to achieve this goal, along with best practices if possible, since I am still new to WordPress.
I have a few ideas in mind, but I am not sure which one is the best programming practice or how much effort each approach requires (for a cost-benefit analysis).
1) Write a custom get_header() function in the main site's functions.php to allow extraction of navigation links
Use file_get_contents() to get the navigation links from wp-content/themes/my-theme/inc/footer.php
In some-other-site/ I use:
define('WP_USE_THEMES', false);
require($_SERVER['DOCUMENT_ROOT'] . '/wp-blog-header.php');
Currently, I get "Background" as output so it doesn't work for me yet.
I found one similar topic, but the question is a bit unclear to me, and I was told that its solution of using absolute URLs is not good practice.
2) Expose those navigation links as a web service. I have a feeling that a web service is not even relevant here, but I include it just in case.
3) Use the Multisite setting, or create a network for all my WordPress sites. While this appears to be the best way, it seems quite complicated, and there are currently issues with setting up my main site in a network. I doubt it's necessary to go through this complication to achieve my goal.
As far as I know, sites in a WordPress network share certain databases, so I'm afraid of losing some or all of the large amount of data in my main site.
It would also be helpful to point out the best practices for sharing CSS stylesheets and JavaScript files among WordPress sites, if you are kind enough :)
Updates
I've decided to:
1) Stick to WordPress Multisite as much as possible
2) Abandon the poor-practice hierarchy mentioned above (nested WordPress directories)
If you'd like to include the entire header.php and footer.php files in the parent and all child WordPress blogs, the only way of doing that without uploading copies would be to use an absolute path, or to create a nested relative path individually for each area that you'd like to include it in.
The way your site hierarchy is laid out is generally considered... poor practice. Perhaps I am misunderstanding the reasoning behind it, but I would suggest evaluating the current situation with the site owner and proposing an alternative (with some major benefits, both for UX and structurally).

Firefox extension that interacts with existing page content

I am not a web programming expert, but I would like to create a Firefox extension that rewrites pages' HTML and JavaScript code. This is a personal project, so I can take my time and learn things as I go.
I haven't been able to locate a tutorial or existing extension that does both tasks.
Would you be able to point me in the right direction?
Thank you so much!
-CxT
You're trying to accomplish two different things. My advice is to learn to do both independently. For extensions, these are great tutorials:
https://developer.mozilla.org/en/building_an_extension
http://www.rietta.com/firefox/Tutorial/overview.html
For "rewriting" a page's HTML, CSS and JS:
http://ejohn.org/blog/hacking-digg-with-firebug-and-jquery/
For anything you don't understand in any of the tutorials, either Google it or ask here.
Enjoy!
