Is there anything for Python that is like readability.js?

Is there anything for Python that is like readability.js? - javascript

I'm looking for a package / module / function etc. that is approximately the Python equivalent of Arc90's readability.js
http://lab.arc90.com/experiments/readability
http://lab.arc90.com/experiments/readability/js/readability.js
so that I can give it some input.html and the result is cleaned up version of that html page's "main text". I want this so that I can use it on the server-side (unlike the JS version that runs only on browser side).
Any ideas?
PS: I have tried Rhino + env.js and that combination works but the performance is unacceptable it takes minutes to clean up most of the html content :( (still couldn't find why there is such a big performance difference).

Please try my fork https://github.com/buriy/python-readability which is fast and has all features of latest javascript version.

We've just launched a new natural language processing API over at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free so feel free to use to your heart's content. And it's implemented in python. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.

hn.py via Readability's blog. Readable Feeds, an App Engine app, makes use of it.
I have bundled it as a pip-installable module here: http://github.com/srid/readability

I have done some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup prior to applying the algorithm, like removing head/script/iframe elements, hidden elements, etc., but this was the core of it.
Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a heavy link to text ratio (ie. navigation bars, menus, ads, etc.):
def link_list_discriminator(html, min_links=2, ratio=0.5):
"""Remove blocks with a high link to text ratio.
These are typically navigation elements.
Based on an algorithm described in:
http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf
:param html: ElementTree object.
:param min_links: Minimum number of links inside an element
before considering a block for deletion.
:param ratio: Ratio of link text to all text before an element is considered
for deletion.
"""
def collapse(strings):
return u''.join(filter(None, (text.strip() for text in strings)))
# FIXME: This doesn't account for top-level text...
for el in html.xpath('//*'):
anchor_text = el.xpath('.//a//text()')
anchor_count = len(anchor_text)
anchor_text = collapse(anchor_text)
text = collapse(el.xpath('.//text()'))
anchors = float(len(anchor_text))
all = float(len(text))
if anchor_count > min_links and all and anchors / all > ratio:
el.drop_tree()
On the test corpus I used it actually worked quite well, but achieving high reliability will require a lot of tweaking.

Why not try using Google V8/Node.js instead of Rhino? It should be acceptably fast.

I think BeautifulSoup is the best HTML parser for python. But you still need to figure out what the "main" part of the site is.
If you're only parsing a single domain, it's fairly straight forward, but finding a pattern that works for any site is not so easy.
Maybe you can port the readability.js approach to python?

Related

Javascript library to manage translation forms

Is anybody aware of any javascript tool (compatible with jQuery, tinymce or any other clientside library) able to manage the following requirements?
I need to show translation forms in which every field (either input or textarea) could contain some segment variables or code sections (mostly HTML).
For example:
"Hello {{firstname}}, this is your personal page."
or
"You improved your personal score of <strong>{{n}} points</strong>."
Of course I obtain these segments from a template parser and I need to show them to a set of translators that will perform localization towards many languages. I know that in many cases I can (and should!) avoid variables and code inside translation segments, but in many other cases I really can't.
The problem is: I would like to manage coherence about variables and code directly on the browser (I trust my translators but a bit more of UI/UX help is always a good thing!).
A nice approach could be providing the set of variables and code tags, ready to be inserted by means of a single click (in order to avoid mispelled variables or incorrect code syntax) and a bit of pre-submit validation to be sure everything was inserted.
I've seen this approach in other websites, such as Facebook or Freelancer.com (who have the power and the ability to reimplement the whole thing from scratch!).
Do you know about any almost-ready tool/library for this purpose?
Thank you all in advance for any suggestion.

If you are asking for a library to translate text - here is Google Translate API: https://developers.google.com/translate/?csw=1
If you are asking for a library which can take user input, perform validation, and insert into the DOM - then Jquery has everything you need.
If you are asking for something else, let me know and I'll edit my question.

Jquery/Javascript solution for converting wiki-text to HTML and vice versa?

For my web front end I have to implement subsets of the wiki-syntax in my system. Do I need to manually specify rules and reinvent the wheel? Is there an existing javascript library or jquery plugin that could help out with it?
For example a user enters == Header == Since this needs to get converted to a medium header for example (assuming medium is defined in this context as a span as below)
<span class="mediumHeader" id = "Header">Header</span>
Now when the user edits the above text I'm guessing it'll involve replacing the
<span...> ... </span> with ==...==
Now for every system I design this will be as per 'my rules' and will almost always have to reinvent the wheel. Is there something that I could use to ease this wiki to/from HTML transformation using Jquery/Javascript? I'm sure it's a problem with a known solution.
I would prefer to customize what's acceptable and what isn't i.e. I don't everything to be translated into wiki syntax (or HTML) only subsets of it. Should I just roll my own for my application?

It's been long enough that you may not need this, but yours was the top SO hit when I started looking into it.
There are a couple javascript options - you're probably looking at instaview (check out test/test.js), or maybe Wiky.js (the less fully documented).
If you aren't limited to Javascript, check out the exhaustive list of MediaWiki parsers at http://www.mediawiki.org/wiki/Alternative_parsers - lots of tools for C++, Java, Perl, ruby, and more. That's the link to watch for new developments.

At the time of writing, Parsoid seems to be the only one which translates in both directions. This one also powers the visual editor on Wikipedia. But this is no handy client-site lib to include in your app, but a full-blown parsing and transformation server suite. A production version of Parsoid on the Wikimedia cluster can be accessed at http://parsoid-lb.eqiad.wikimedia.org/.
Other JavaScript Libraries, which are translating from WikiText to HTML only (ordered by popularity), are:
Wiky.js - doesn't support full WikiText syntax. (by tanin47, not to be confused with Wiki.js from Requarks - a different project completely)
wtf_wikipedia - isn't directly translating to HTML but JSON, which results in much more powerful possiblities (e.g. info-boxes as key-value pairs). This is the most up-to-date library and "its a combination of instaview, txtwiki, and uses the inter-language data from Parsoid javascript parser."
instaview - no updates in the last 2 years.
Also checkout the current and full list of alternative MediaWiki parsers.

Fulltext search ignoring comments

I want fulltext search for my JavaScript code, but I'm usually not interested in matches from the comments.
How can I have fulltext search ignoring any commented match? Such a feature would increase my productivity as a programmer.
Also, how can I do the opposite: search within the comments only?
(I'm currently using Text Mate, but happy to change.)

See our Source Code Search Engine (SCSE). This tool indexes your code base using the langauge structure to guide the indexing; it can do so for many languages including JavaScript. Search queries are then stated in terms of abstract language tokens, e.g., to find identifiers involving the string "tax" multiplied by some constant, you'd write:
I=*tax* '*' N
This will search all indexed languages only for identifiers (in each language) following by a '*' token, followed by some kind of number. Because the tool understands language structure, it isn't confused by whitespace, formatting or interverning comments. Because it understands comments, you can search inside just comments (say, for authors):
C=*Author*
Given a query, the SCSE finds all the hits across the code base (possibly millions of lines), and offers these as set of choices; clicking on choice pulls up the file with the hit in the middle outlined where the match occurs.
If you insist on searching just raw text, the SCSE provides grep-style searches. If you have only a small set of files, this is still pretty fast. If you have a big set of files, this is a lot slower than language-structure based searches. In both cases, grep like searches get you more hits, usually at the cost of false positives (e.g., finding "tax" in a comment, or finding a variable named "Authorization_code"). But at least you have the choice.
While this doesn't operate from inside an editor, you can launch your editor (for most editors) on a file once you've found the hit you want.

Use ultraedit , It fully supports full text search ignoring comment or also within the comment search

How about NetBeans way (Find Symbol in the Navigate Menu),
It searches all variables,functions,objects etc.
Or you could customize JSLint and customize it if you want to integrate it in a web application or something like that.

I personnaly use Notepad++ wich is a great free code editor. It seems you need an editor supporting regular expression search (in one or many files). If you know Reg you can use powerfull search like in/out javascript comments...the work will be to build the right expression and test it with one file with all differents cases to be sure it will not miss things during real search, or maybe you can google for 'javascript comments regular expression' or something like...
Then must have a look at Notepad++ plugins, one is 'RegEx Helper' wich helps for building regular expressions.

Complex wolframalpha ajax query

I want to write formulas in Mathematica format in my blog, inside tag's formula. What js should I use (and what libary), to replace those tag's, with http://www.wolframalpha.com/ search result image, when Dom gets loaded?
For example:
<formula>Limit[((3 + h)^(-1) + -1/3)/h, h -> 0]</formula>
gets replaced with:
If it's to complex or can not be done with javascript, please explain why.

You need to use wolframalpha API service, but it's not free and you can't do this using javascript because of same origin policy, only using server-side language.

It seems that your real question is: "how do I get nice math on Blogger.com", so I will answer this instead.
A proven solution (see e.g. Psychedelic Geometry, no affiliation) is to use the MathTex system by John Forkosh (who also hosts a public cgi) together with the ReplaceMath script by Randall Farmer. Browse some source for modified versions of the script.
Installation is a matter of including a single javascript file, after which you can inline LaTeX in your html as $latex <some LaTeX to be png'ified> $.

If you want to format output from Wolfram Alpha then #antyrat is correct. But if you just want to write nice-looking mathematical expressions for your blog you should look at Presentation MathML and LaTeX. There is a variety of utilities for rendering LaTeX into gif or png or similar formats.

Add spell check to my website

I have an asp-based website which I would like to add spell checking capabilities to the textarea elements on the page. Most of the pages are generated from an engine, though I can add JavaScript to them. So my preferred solution is a JavaScript-based one. I have tried JavaScriptSpellCheck and it works okay, though I would like to see what some of my other options may be. I also found spellchecker.net but at $3500 for a server license it seems excessive.
Spell checking can be in a separate window and must support multiple languages (the more the better). Ultimately I would like to send the spell check object a collection or delimited string of textarea names or id's (preferably names as they already exist in the pages) and have it spell check all of them, updating the text as spelling is corrected.

Check out using Google's api for this: http://www.asp101.com/articles/jeremy/googlespell/default.asp

Here is a free, open source Javascript library for spell checking that I authored:
https://github.com/LPology/Javascript-PHP-Spell-Checker
There's a link to a live demo at the top. It's designed to have the feel of a spell checker in a desktop word processor. I wrote it after being dissatisified with these same options.
To use, just include the JS and CSS files into your page, and then add this:
var checker = new sc.SpellChecker(
button: 'spellcheck_button', // opens the spell checker when clicked
textInput: 'text_box', // HTML field containing the text to spell check
action: '/spellcheck.php' // URL of the server side script
);
It includes a PHP script for spell checking, but it could be ported to another language fairly easily as long as it returns the correct JSON response.

If I were you, I'd look into something like aspell - this is used as one of the supported spellchecking backends in TinyMCE. Personally, I use pspell because it's integrated into PHP.
EDIT
There's an aspell integration here that has a PHP or a Perl/CGI version; might be worth checking out.

If I am not wrong, Firefox's English dictionary for spell checking takes around 800KB of data.
If you like to do everything in JavaScript -- for a full-featured spell checking engine, it means you need to load that 800KB data in every page load. It's really not a good idea.
So, instead of doing that in JavaScript, send the data to the server with AJAX, check it server side, and return it back; that's the best way.

Well this is quite old question, but my answer might help people who are looking for latest options on this question.
"JavaScript SpellCheck" is the industry leading spellchecker plugin for javascript. It allows the developer to easily add and control spellchecking in almost any HTML environment. You can install it in about 5 minutes by copying a folder into your website.
http://www.javascriptspellcheck.com/
Also support multiple languages - http://www.javascriptspellcheck.com/Internationalization_Demo

I might be a bit late on the answer to this question. I found a solution a long while ago. You must have a spell checker installed on your browser first. Then create a bookmark with the following code as the link.
javascript:document.body.contentEditable='true'; document.designMode='on'; void 0

Develop Reference

JavaScript is the programming language of the Web.