How to scrape a page with JavaScript effects

I am new to web scraping, and so far I only know how to scrape basic HTML pages using Python's Beautiful Soup. I want to extract the information on this page. Specifically, I would like to get the following data for all the fellows (around 700 of them):
name
background
insight project
current employer
However, that page is rendered by JavaScript, and the desired information only shows up in a separate box when a mouseover event is triggered on each fellow's picture.
How can I extract the text in this case? Any information (books, web resources) is appreciated. Python solutions are preferred if possible. Many thanks.

Check the page source of the website.
The information is already present in the DOM, just hidden using CSS. At first glance, it seems the JavaScript logic is only doing CSS manipulations.
The fact that the information is hidden by CSS will not prevent you from scraping it from the source using a web scraping tool.
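As a rough illustration of that point, here is a minimal sketch using only Python's standard-library html.parser. The markup, class names (fellow, details, name, employer) and sample values are all hypothetical, since the real page's structure is not shown here; inspect the page source and adjust them. The same idea carries over to Beautiful Soup: the parser reads the source and never evaluates CSS, so `display: none` makes no difference.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page's tags and class names will differ.
SAMPLE_HTML = """
<div class="fellow">
  <img src="a.jpg">
  <div class="details" style="display: none">
    <span class="name">Ada Example</span>
    <span class="employer">Example Corp</span>
  </div>
</div>
<div class="fellow">
  <img src="b.jpg">
  <div class="details" style="display: none">
    <span class="name">Bob Sample</span>
    <span class="employer">Sample Inc</span>
  </div>
</div>
"""

# Hypothetical class names of the spans we want to collect.
FIELDS = {"name", "employer"}


class FellowParser(HTMLParser):
    """Collect the text of the field spans inside each 'fellow' block."""

    def __init__(self):
        super().__init__()
        self.fellows = []   # one dict per fellow
        self._field = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "fellow" in classes:
            self.fellows.append({})
        elif tag == "span" and self.fellows:
            matches = FIELDS.intersection(classes)
            self._field = matches.pop() if matches else None

    def handle_data(self, data):
        if self._field and data.strip():
            self.fellows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_fellows(html):
    """Return a list of dicts, one per fellow, from the raw page source."""
    parser = FellowParser()
    parser.feed(html)
    return parser.fellows
```

For the real page you would download the source first (for example with urllib.request) and pass it to extract_fellows in place of SAMPLE_HTML.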

Related

Inserting user provided content into a document--validating HTML string, insertAdjacentHTML and iframe usage?

If I want to accept HTML built by a user of an extension, and not from a web source, and display it within an existing extension document, is there an alternative to using an iframe?
For example, if a user provides mathematical expressions using MathML to be displayed in the current web page, and the user may add a <div> or <p> tag incorrectly and leave the HTML incomplete, how can it be added to the page without corrupting the page layout, other than with an iframe?
Does insertAdjacentHTML really accomplish this? This MDN article seems to imply so, where it reads, "It does not reparse the element it is being used on, and thus it does not corrupt the existing elements inside that element."
Or, is there a way to validate the HTML string before inserting into the DOM, such as DOMParser?
Also, for users who are knowledgeable in HTML, CSS, and JS and can construct a small interactive document rather than just an expression, to be displayed within the page, is an iframe the only option? The user-provided code will be stored in indexedDB and rendered only on the user's machine and within this extension tool, so something similar to a snippet on Stack Overflow. I have this working in an iframe now, but the user could add about a dozen of these to the page at any one time, and I wondered if there is a better way of accomplishing this, regarding memory usage and in general.
Thank you.

Photo galleries that have support for multiple albums and for comments?

I'm looking for a photo gallery (jQuery/JavaScript probably?) that supports multiple albums and comments. It just needs to display the comments and have a box to enter a comment; I can handle storing them in the DB easily enough.
Any that are similar to Facebook would be great, since everyone uses it and it would be intuitive to them.
Edit: This will be for a web app using ASP.NET and MS SQL Server 2008 R2, but I could use any platform as long as I can communicate with the MS SQL database.
I know this is a little late, but I think this may be what you were looking for:
http://www.jqueryrain.com/?t2bpYeU2
EDIT-
In an effort to keep this question current: this came out recently and may be better than the original plugin: http://codecanyon.net/item/jquery-facebook-gallery/full_screen_preview/3879813?ref=jqueryrain
EDIT [Sept. 7th 2013] - Albumize recently came out. Another option for anyone reading. http://palerdot.github.io/albumize/
At the time of this writing, minishowcase and qGallery seem to be the more popular multi-album jQuery/JavaScript libraries. I believe they both have some PHP running somewhere inside as well.
If you are averse to PHP for any reason and want straight JavaScript, CSS, and HTML, try JonDesign's Smooth Gallery. It too is open source and allows for multiple albums. It is not as aesthetically pleasing as the first two and has no native comment plug-in, but the code is very straightforward and can easily be integrated with a jQuery comment plug-in such as Easy Comment.
Edit: Also check out TN3. It apparently supports multiple albums, but I have never used it, as it is kind of expensive for a jQuery license.
I know this is an older post, but I would like to add a pure JavaScript solution for future visitors.
The gambarize plugin runs solely on client-side JavaScript, with no need for any backend language. (For that reason it cannot support comments.)
Basically, you can create a div structure like the following:
<div class="gmbz" data-title="Album1">
    <div class="gmbz" data-title="Nested">
        <div class="gmbz" data-title="Inner">
        </div>
    </div>
    <div class="gmbz" data-title="Empty" data-cover="data/Black-287.png"></div>
</div>
It will end up in:
Explanation
Every "div" represents an album (including nested album support).
Every "a" represents a picture link: the "href" attribute is the image which will be shown when you click the thumbnail inside the gallery, whereas the thumbnail image can be specified through the "data-img-thumb" attribute.
Visit the example described: generate image gallery with albums using javascript
Link to the plugin home page

How to build an SEO-friendly JavaScript menu that is used on over 20 pages

I built a JavaScript menu list from an XML file and have used it as the navigation menu on over 20 pages. I used jQuery's Ajax functionality to implement this, because if the menu list changes I only have to edit the XML file for the changes to show up everywhere. I only realized later that this technique is not SEO friendly, since search engines don't index dynamic JavaScript content. That said, I have provided a fallback for users who have disabled JavaScript by linking the XML file from an object tag inside a noscript tag:
<noscript>
<div>
<object data="menu/Menu.xml" type="all"></object>
</div>
</noscript>
I'm not too sure if this is SEO friendly.
So my question really is: how does one go about creating a menu list that is user friendly and can be updated easily? If questions similar to mine have been answered before, please point me to the links. I have done some searching and was not happy with the results I found, but I'm still looking for answers.
JavaScript is not SEO friendly. In any case, you should do this on the server, using something like PHP includes or Server Side Includes.
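To make the server-side idea concrete, here is a minimal sketch in Python rather than PHP or SSI (a swap made purely for illustration). It turns a menu XML file into a plain HTML list before the page is sent, so crawlers see ordinary links, while the menu is still maintained in one XML file. The XML layout (item elements with an href attribute) and the render_menu helper are assumptions; the structure of the real menu/Menu.xml is not shown in the question.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structure for menu/Menu.xml; the real file's schema is not shown.
MENU_XML = """
<menu>
  <item href="/index.html">Home</item>
  <item href="/about.html">About &amp; Contact</item>
</menu>
"""


def render_menu(xml_text):
    """Return the menu as a plain HTML list that crawlers can read."""
    root = ET.fromstring(xml_text.strip())
    lines = ['<ul class="menu">']
    for item in root.iter("item"):
        # Escape both the URL and the label before putting them into HTML.
        href = escape(item.get("href", "#"), quote=True)
        label = escape((item.text or "").strip())
        lines.append(f'  <li><a href="{href}">{label}</a></li>')
    lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_menu(MENU_XML))
```

Each of the 20 pages would then include the rendered list through whatever templating or include mechanism the server offers.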

How do I build an 'embeddable widget'?

My webapp uses both Rails and JS and I would like users to be able to embed the images they upload to any blog/site.
What do I need to know, from a development point of view, to create the functionality that generates an 'embed' link? It can be either a link like YouTube's, a JS snippet, or anything else.
Just want to get a high-level overview of what I need to be able to do and how to proceed.
Thanks.
I would try using iframe. I created a widget which used javascript and I put it all into a single html file hosted on my website. Then I gave away an iframe snippet like this for example...
<iframe src="http://mywesbite.com/myWidget.html"></iframe>
The user can simply place the iframe snippet into their website and that's it!
I'm a little bit late to the party here, but I just wanted to add to Jacob's answer.
You can easily allow the user to customize the embedded content (perhaps choose light on dark vs. dark on light text to more closely match the page's environment/design) by using query params within the iframe src:
<iframe src="http://___.com/widget?theme=light&size=large"></iframe>
Of course, you'd probably want to build a UI that lets the user make these changes; you can't expect average users to do that by hand :)
Vimeo's UI for customizing embedded videos is pretty nice if you want a best case scenario.
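A minimal sketch of the server-side half of such a UI, assuming the options end up as query parameters on the iframe src. It is written in Python only for illustration (the question's app is Rails), and embed_snippet, the example.com URL and the option names are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


def embed_snippet(widget_url, **options):
    """Build the iframe snippet a user can paste into their own site."""
    query = urlencode(options)  # percent-encodes option values
    src = f"{widget_url}?{query}" if query else widget_url
    # Escape the URL for use inside an HTML attribute ("&" becomes "&amp;").
    return f'<iframe src="{escape(src, quote=True)}"></iframe>'


if __name__ == "__main__":
    print(embed_snippet("http://example.com/widget", theme="light", size="large"))
    # <iframe src="http://example.com/widget?theme=light&amp;size=large"></iframe>
```

The widget page itself then reads theme and size from its query string and styles itself accordingly.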

If I'm adding content in page through JavaScript will it be crawlable by Search engine spider

If I add content to a page through JavaScript, will it be crawlable by search engine spiders and accessible to screen readers?
For example:
var tip = "<p>Most computers will open PDF documents ";
tip += "automatically, but you may ";
tip += "need to download <a title='Link to Adobe website-opens in a new window'";
tip += " href='http://www.adobe.com/products/acrobat/readstep2.html'";
tip += " target='_blank'>Adobe Reader</a>.</p>";
$(document).ready(function () {
    // If the div with id "maincontent" contains at least one PDF link,
    // insert the tip paragraph after the div's last child.
    if ($("div#maincontent a[href*='/pdf']").length > 0) {
        $("div#maincontent").children(":last-child").after(tip);
    }
});
Edit: I want to hide this from search engines but at the same time keep it accessible to screen readers. Is that possible?
It depends on the crawler, but don't expect most bots to interpret JavaScript.
Short answer: probably not. But Google is getting more sophisticated all the time, so I have my suspicions that they actually render JavaScript as part of the indexing process.
Is there a particular reason to do it this way? I'd recommend doing this logic server-side if possible, then you know your HTML is readable by search engines.
Re: will content generated dynamically (on the browser) be crawlable by a search engine?
Normally, no.
But Google has invented a way to solve the problem. See ajax crawling
Note: they do it by crawling your URLs with various query parameters representing the different states of the dynamic page. They do not attempt to run the JS on your page.
No, most web crawlers do not execute JavaScript, and older screen readers do not read it either. Your best bet is to use JavaScript for presentation purposes only, and to handle the logic on the server side (PHP, Ruby, .NET, etc.) with some CSS magic to achieve what you are trying to do above. Always insert content on the server side if you are concerned about web crawlers and screen readers. Alternatively, you can use a Flash and JavaScript sniffer for screen readers to redirect the user to an alternate page that does not rely on dynamic content.
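For what the server-side version of the question's snippet might look like, here is a rough sketch in Python (the answers mention PHP, Ruby and .NET; Python is used here only as an example). The with_pdf_tip helper and the regular expression are assumptions: the regex is a simple stand-in for the a[href*='/pdf'] selector, and a real implementation would more likely make this decision in a template or with an HTML parser.

```python
import re

TIP_HTML = (
    "<p>Most computers will open PDF documents automatically, but you may "
    "need to download <a title='Link to Adobe website-opens in a new window' "
    "href='http://www.adobe.com/products/acrobat/readstep2.html' "
    "target='_blank'>Adobe Reader</a>.</p>"
)

# Rough equivalent of the jQuery selector a[href*='/pdf']:
# an <a> tag whose href value contains "/pdf".
PDF_LINK = re.compile(r'''<a\b[^>]*\bhref=["'][^"']*/pdf''', re.IGNORECASE)


def with_pdf_tip(main_content_html):
    """Append the tip to the main content if it links to at least one PDF."""
    if PDF_LINK.search(main_content_html):
        return main_content_html + TIP_HTML
    return main_content_html
```

Because the tip is part of the HTML the server sends, it no longer depends on the client running JavaScript. Note that this makes the tip visible to crawlers as well, so it does not address the wish to hide it from search engines.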
