Python 3, Web-scraping, and Javascript [Oh My]

I have come to the point of entering the melee on web-scraping webpages using Javascript, with Python3. I am well aware that my boot may be making contact with a dead horse, but I feel like drawing my six-shooter anyway. It's a spaghetti western; be my gray hat?
::Backstory::
I am using Python 3.2.3.
I am interested in gathering historical stock//etf//mutual_fund price data for YTD, 1-yr, 3-yr, 5-yr, 10-yr... and/or similar timeframes for a user-defined stock, etf, or mutual fund. I set my sights on Morningstar.com, as they tend to provide as much data as possible without necessarily requiring a log-in; other folks such as finance.google.com &c tend to be inconsistent in what data they provide regarding stocks vs etfs vs mutual funds.
The trade-off in using Morningstar for this historical data, or "Trailing Total Returns" as they call it, is that for producing this data they use Javascript.
Here are some example links from Morningstar:
A Mutual Fund;
An ETF;
A Stock.
I am interested in the "Trailing Returns" portion, top row or so of numbers in the Javascript-produced chart.
::Attempted So Far::
I've confirmed that wget doesn't play with Javascript; even downloading all of the associated files [css, .js, &c] hasn't allowed me to locally render the javascript in browser or in script. Research here on StackOverflow confirmed this. Am willing to be corrected here.
My research informed me that Mechanize doesn't exist for Python3. I tried anyway, and turned into Policeman Javert crying out "I knew it!" at the error message "module does not exist".
::I've Heard Of...::
->Selenium. However, my understanding is that this requires Thy Favorite Browser to actually open up a webpage, navigate around, and then not close because there's no "close this tab//window" command//option for Selenium. What if I//my_user want to get historical data for many etfs, stocks, and/or mutual funds? That's a lot of tabs//windows opening up in a browser which was not necessarily desired to be opened.
->httplib2. I think this is nice, but I'm doubtful if it will play with Javascript. Does it, for example using the .cache and get options?
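import httplib2
conn = httplib2.Http(".cache")  # cache responses in a local ".cache" directory
# request() returns a (response, content) pair
response, content = conn.request("http://the_url", "GET")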
->Windmill. See 'Selenium'. I am, however, off-key enough to sing 'Man of La Mancha'.
->Google's webscraping code. Would an attempt at downloading a Javascript-laden page result in ... positive results?
I've read chatter about having to "emulate a browser without a browser". Sounds like Mechanize, but not for Python3 as I currently understand.
::My Question::
Any suggestions, pointers, solutions, or "look over here" directions?
Many thanks,
Miles, Dusty Desert Villager.

When a page loads data via javascript, it has to make requests to the server to get that data via the XMLHttpRequest function (XHR). You can see what requests they are making, and then make them yourself, using wget!
To find out which requests they are making, use the Web Inspector (Chrome and Safari) or Firebug (Firefox). Here's how to do it in Chrome:
wrench/tools/developer tools/Network (tab at the top of the tools)/XHR filter at the bottom.
Here's an example request they make in javascript
If you look closely at the XHR request url, you notice that all trailing returns have the same format:
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=
You just need to specify t. For example:
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VAW
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=INTC
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VHCOX
Now you can wget those URIs and parse out the data directly.
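As a rough sketch of the same idea in script form: the base URL below is copied from the examples above and may no longer be served, and the helper names trailingReturnsUrl and fetchTrailingReturns are made up for illustration. Parsing is left out because the response format is not shown in this thread. It assumes a runtime with a global fetch (a browser, or a recent Node.js).

```javascript
// Base URL copied from the answer above; it may no longer be served.
const BASE_URL =
  "http://performance.morningstar.com/Performance/cef/trailing-total-returns.action";

// Build the request URL for one ticker (hypothetical helper name).
function trailingReturnsUrl(ticker) {
  const url = new URL(BASE_URL);
  url.searchParams.set("t", ticker.trim().toUpperCase());
  return url.toString();
}

// Fetch the raw response body for one ticker (hypothetical helper name).
// Assumes a global fetch is available.
async function fetchTrailingReturns(ticker) {
  const response = await fetch(trailingReturnsUrl(ticker));
  if (!response.ok) {
    throw new Error(`Request for ${ticker} failed: HTTP ${response.status}`);
  }
  return response.text();
}
```

Calling trailingReturnsUrl for each ticker in a list and fetching the results one at a time avoids opening any browser window at all.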

Related

Reveal JavaScript decryption algorithm

I have been trying to understand how exchange rates are updated in real time on this website. With a quick look at the 'network' tab in developer tools, it became clear that the website is getting responses periodically from this url. The problem is that the response text from the requests consists of sequences of random letters and numbers. It seems that the actual content is encrypted, and since exchange rates are displayed on the client side, the response data should be somehow decrypted with JavaScript on the front end (I think).
So, my question is, what are some hints for exploring the JavaScript decryption algorithm, since all 'js' files are minified and variable names are just letters? What kind of tools and practices could you use to solve this kind of problem?
Any suggestion or help on this matter would be very much appreciated.

The source code (not minified) can be seen here. You will notice that it uses a function rc4decrypt to decrypt the data. rc4decrypt is defined as:
function rc4decrypt (a){
return rc4(key,hexDecode(a))
};
where key is a global (window) variable. Further steps should be easy.
(Please be aware of any legal implications of your actions).
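For anyone who wants to see what the two helpers might look like, here is a minimal sketch. It assumes that the site's rc4 is the textbook RC4 stream cipher and that hexDecode turns a hex string into bytes; neither implementation is shown in this thread, so both function bodies below are guesses, and the key is passed as a parameter here instead of being read from a global.

```javascript
// Convert a hex string such as "bbf3" into an array of byte values.
function hexDecode(hex) {
  if (hex.length % 2 !== 0) {
    throw new Error("hex string must have an even length");
  }
  const bytes = [];
  for (let pos = 0; pos < hex.length; pos += 2) {
    bytes.push(parseInt(hex.slice(pos, pos + 2), 16));
  }
  return bytes;
}

// Textbook RC4: key scheduling, then XOR each input byte with the keystream.
function rc4(key, bytes) {
  const keyBytes = Array.from(key, (ch) => ch.charCodeAt(0) & 0xff);
  const s = Array.from({ length: 256 }, (_, n) => n);
  let i = 0;
  let j = 0;
  for (i = 0; i < 256; i++) {
    j = (j + s[i] + keyBytes[i % keyBytes.length]) & 0xff;
    [s[i], s[j]] = [s[j], s[i]];
  }
  i = 0;
  j = 0;
  let out = "";
  for (const byte of bytes) {
    i = (i + 1) & 0xff;
    j = (j + s[i]) & 0xff;
    [s[i], s[j]] = [s[j], s[i]];
    out += String.fromCharCode(byte ^ s[(s[i] + s[j]) & 0xff]);
  }
  return out;
}

// Same shape as the site's rc4decrypt, with the key made explicit.
function rc4decrypt(key, hexText) {
  return rc4(key, hexDecode(hexText));
}
```

With the widely published RC4 test vector, rc4decrypt("Key", "bbf316e8d940af0ad3") gives back "Plaintext". If the site's output does not decrypt with this, its rc4 or hexDecode differs from the assumptions above.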

Text-To-Speech Library (issue with pauses)

I've done quite a bit of snooping around the internet.
Right now I'm using the ResponsiveVoice library for which I pay ~$25/month.
https://code.responsivevoice.org/responsivevoice.js
The problem is that it seems to insert long breaks into text. The text is user generated, so it is out of my control (I can't optimize the sentence structure to sound good).
I'm assuming it's a problem with ResponsiveVoice. They acknowledged the issue, but say they can't do anything about it. It's how text-to-speech behaves.
Here are some examples of text that's causing issues (inserts a pause).
A psychologist that takes a cross-cultural approach might consider
which of the |pause| following influences?
Who of the following first used scientific research methods to investigate
reaction |pause| times?
a method of investigation of thought processes and the |pause| mind
The ego uses defense mechanisms indirectly and |pause| unconsciously.
I'm not sure if text-to-speech has to insert random pauses. These sites seem to be able to handle text-to-speech without "strange" pauses.
I can't insert their links... because of my sucky reputation.
naturalreaders
acapela-box
oddcast
ttsreader
ivona
ispeech
It could also be an implementation issue, but ResponsiveVoice support said it's normal to get these long pauses.
Here is a screenshot from the console, which shows the "break" that is causing a pause.
screenshot of console in chrome
It would be great to get some insight from you guys (who understand the technology better).

I had the exact same problem and found the cause in my case. On our site the text to read out was generated by jQuery like so:
$('#text-to-read').text().trim().replace(/(?:\r\n|\r|\n)/g, '');
The regex at the end only stripped line breaks, so the tabs and spaces that surrounded them stayed in the text. I simply had to adjust the regex:
$('#text-to-read').text().trim().replace(/\s\s+/g, ' ');
I know this is a very rare cause maybe, but it might help others out there!
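The same clean-up without jQuery, as a small sketch (normalizeForSpeech is a made-up name): every run of whitespace, including tabs and line breaks, becomes one space before the text is handed to the speech library.

```javascript
// Collapse every run of whitespace (spaces, tabs, line breaks) into a
// single space and trim the ends before handing text to a speech engine.
function normalizeForSpeech(text) {
  return text.replace(/\s+/g, " ").trim();
}
```

Logging the string before and after this step is a quick way to check whether hidden whitespace is what causes the pause in your case.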

XML File Parse in javascript how to. Large File maybe use SAX?

G'day All,
I am pulling my hair out, getting headaches and my eyes hurt. I have been hither and thither and I seem to get whither.
This will be my first experience with xml and I would really like to get this working. It is a large file. Well, large in my eyes: about 5 MB. I can not imagine that this file would be loaded into memory to process. Users will get a bit peeved with this.
Basically we are using a 3rd parties site to do our ecommerce. So we have no access to the database other than via the admin area.
What we want to do is make sure that there is no stuff ups when it comes to addresses. Therefore we got this xml file put together listing all postcodes with areas and states:
<?xml version="1.0"?>
<POSTCODES>
  <PostCode id="2035">
    <Area>2035 1</Area>
    <Area>2035 2</Area>
    <Area>2035 3</Area>
    <State>NSW</State>
  </PostCode>
  <PostCode id="2038">
    <Area>2038 1</Area>
    <Area>2038 2</Area>
    <Area>2038 3</Area>
    <State>NSW</State>
  </PostCode>
  <PostCode id="2111">
    <Area>2111 1</Area>
    <Area>2111 2</Area>
    <Area>2111 3</Area>
    <State>NSW</State>
  </PostCode>
</POSTCODES>
Someone suggested SAX but suddenly died when asked how? The web is not helping unless I am not looking properly. I see a lot of examples. Either they do not show how to read the file but rather do it from a textarea or the example is in java.
What do we want? User enters a post code of 2038. We want to go to the javascript with that data and have returned to us all the suburbs that fall within that post code.
Anyone out there that can please tell me what to download and how to use it to get what I need?
Please, please, please. It is hard to see a grown man begging and crying but I am.

Sounds like you want a script on the server which will suggest suburbs based on the user's postcode selection? You could use jQuery's ajax functionality to do this.
You might also be able to use jQueryUI's autocomplete control to parse XML and make suggestions: http://jqueryui.com/demos/autocomplete/#xml
It's also possible to do this entirely in javascript without any script on the server side, but it would be pretty slow at loading if the XML file is 5MB. You might be able to get a significant reduction in file size, though, by gzipping it before transmission from the server.
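If you do go the pure-javascript route, one possible sketch is to scan the downloaded text once and keep a lookup table in memory. The function name buildPostcodeIndex is made up, and the regular expressions assume the flat, regular layout shown in the question; this is not a general XML parser (in a browser, DOMParser would be the sturdier choice).

```javascript
// Build a Map from postcode to { areas, state } by scanning the XML text.
// Assumes the flat layout from the question. Matching is case-insensitive
// so inconsistent tag capitalisation does not break the lookup.
function buildPostcodeIndex(xmlText) {
  const index = new Map();
  const entryPattern = /<PostCode\s+id="([^"]+)"\s*>([\s\S]*?)<\/PostCode>/gi;
  for (const [, id, body] of xmlText.matchAll(entryPattern)) {
    const areas = Array.from(
      body.matchAll(/<Area>([^<]*)<\/Area>/gi),
      (match) => match[1].trim()
    );
    const stateMatch = body.match(/<State>([^<]*)<\/State>/i);
    index.set(id, { areas, state: stateMatch ? stateMatch[1].trim() : null });
  }
  return index;
}
```

After the one-off build, a lookup such as index.get("2038") is a plain Map access, so the cost is paid once at load time and not on every keystroke.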

If you need to parse this in Javascript, you can use jQuery.
http://www.switchonthecode.com/tutorials/xml-parsing-with-jquery

C.S. Basics: Understanding Data Packets, Protocols, Wireshark

The Quest
I'm trying to talk to a SRCDS Server from node.js via the RCON Protocol.
The RCON Protocol seems to be explained well enough, and implementations can be found at the bottom of the site in every major programming language. Using those is simple enough, but understanding the protocol and developing a JS library is what I set out to do.
Background
Being a self taught programmer, I skipped a lot of Computer Science Basics - learned only what I needed, to accomplish what I wanted. I started coding with PHP, eventually wrapped my head around OO, talked to databases etc. I'm currently programming with JavaScript, more specifically doing web stuff with node.js ..
Binary Data?!?!
I've read and understood the absolute binary basics. But when it comes to the packet data I'm totally lost. I'd like to read and understand the wireshark output, but I can't make any sense of it. My biggest problem is probably that I don't understand what the binary representation of the various INT and STRING (char ..) from JS look like and how I convert from data I got from the server to something usable in the program.
Help
So I'd be more than grateful if someone can point me to a tutorial on these topics. Tutorial as in "explanation that mere mortals can understand, preferably not written by a C.S. professor". :)
When I'm looking at the PHP reference implementation I see (too much) magic happening there which I can't translate to JS. Sending and reading data from a socket is no problem, but I need to know how PHP's unpack function works and how I can do that in JS with node.js.
So I hope you can see what I'm trying to accomplish here. First and foremost is understanding the whole theory needed to make implementing the protocol a breeze. But because I'm only good with scripting languages it would be incredibly helpful if someone could guide me a bit in the HOWTO part in PHP/JS..
Thank you so much for your time!

I applaud the low level protocol pursuit.
I'll tell you the path I took. My approach was to use the client and server that already spoke the protocol and use libpcap to do analysis. I created a library that was able to unpack the custom protocol I was analyzing during this phase.
It's super helpful to start with diagrams like this one:
From the wiki on TCP. It's an incredibly useful way to visualize the structure of the binary data. It's tightly packed, so slicing it apart requires attention to detail.
Buffers and Binary
I read up on Buffer. It's the way you deal with binary in node. http://nodejs.org/docs/v0.4.8/api/buffers.html -- the first thing to realize here is that buffers can be accessed byte by byte via array syntax, ie buffer[0] and such.
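A tiny sketch of what array-style access gives you (the byte values are arbitrary, and Buffer.from needs a newer node than the v0.4.8 docs linked above): each index returns one whole byte as a number from 0 to 255, and a single bit has to be masked out of that byte.

```javascript
// Four bytes of example data (values chosen arbitrarily for illustration).
const buf = Buffer.from([0x0a, 0x00, 0x00, 0x00]);

const firstByte = buf[0];          // one whole byte, as a number 0..255
const lowBit = buf[0] & 0x01;      // a single bit needs a mask
const asHex = buf.toString("hex"); // quick hex dump without extra libraries
```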
Visualization
It's helpful to be able to dump your binary data into a hex representation. I used https://github.com/a2800276/hexy.js to achieve this.
node_pcap
I grabbed https://github.com/mranney/node_pcap -- this is the equivalent to wireshark, but you can programmatically poke at all outgoing and incoming traffic. I added udp payload support: https://github.com/jmoyers/node_pcap/commit/2852a8123486339aa495ede524427f6e5302326d
I read through all mranney's "unpack" code https://github.com/mranney/node_pcap/blob/master/pcap.js#L116-171
I found https://github.com/rmustacc/node-ctype
I read through all their "unpack" code https://github.com/rmustacc/node-ctype/blob/master/ctio.js
Now, things to remember when you're looking through this stuff. Most of the time they're taking a binary Buffer representation and converting to a native javascript type, like say Number or String. They'll use advanced techniques to do so -- bitwise operations like shifts and such. You don't necessarily need to understand all that.
The key things are:
1) endianness -- the ordering of bytes within a multi-byte value (network and host byte order can be the reverse of each other), which determines how things are unpacked
2) Javascript Number representation is quirky -- node-ctype goes into detail in the comments about how they convert the various number types in javascript's Number. Integer, float, double etc are all Number in javascript land.
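To make the endianness point concrete, here is a small sketch using the Buffer read methods built into current node versions (the four byte values are arbitrary). The same bytes decode to different integers depending on the byte order you ask for.

```javascript
// The same four bytes, read with both byte orders.
const bytes = Buffer.from([0x00, 0x00, 0x01, 0x00]);

const bigEndian = bytes.readInt32BE(0);    // most significant byte first: 0x00000100
const littleEndian = bytes.readInt32LE(0); // least significant byte first: 0x00010000

console.log(bigEndian, littleEndian); // 256 65536
```

If an unpacked length or id looks absurdly large, reading it with the other byte order is usually the first thing to try.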
In the end, it's likely fine if you just USE these unpackers for your adventures. I ended up having to unpack things that weren't covered in these libraries, like GUIDs and such, and it was tremendously helpful to study the source.
Isolate the traffic you're looking at
Filter, filter, filter. Target one host. Target one direction. Target one message type. Focus on stripping off data that has a known fixed length first -- oftentimes the header in a protocol is a good place to start. Once you get the header unpacking into a nice json structure from binary, you are well on your way.
After that, it's one field at a time, top to bottom, one message at a time. You can use Buffer#slice and the unpack functions from node-ctype to grab each piece of data at a time.
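Here is a sketch of the "header first, then one field at a time" approach using only node's built-in Buffer methods. The packet layout is an assumption based on my understanding of Source RCON (three little-endian 32-bit integers for size, id and type, then an ASCII body followed by two NUL bytes); check it against the protocol description before relying on it. encodePacket and decodePacket are made-up names.

```javascript
// Build a packet in the assumed layout so the example is self-contained.
function encodePacket(id, type, body) {
  const bodyBytes = Buffer.from(body, "ascii");
  const size = 4 + 4 + bodyBytes.length + 2; // id + type + body + two NULs
  const packet = Buffer.alloc(4 + size);      // zero-filled, so the NULs are in place
  packet.writeInt32LE(size, 0);
  packet.writeInt32LE(id, 4);
  packet.writeInt32LE(type, 8);
  bodyBytes.copy(packet, 12);
  return packet;
}

// Unpack the fixed-length header first, then cut out the variable-length body.
function decodePacket(packet) {
  const size = packet.readInt32LE(0);
  const id = packet.readInt32LE(4);
  const type = packet.readInt32LE(8);
  const bodyEnd = 4 + size - 2; // drop the two trailing NUL bytes
  const body = packet.toString("ascii", 12, bodyEnd);
  return { size, id, type, body };
}
```

Round-tripping your own packets like this, and comparing the hex dump with what the capture shows, is a cheap way to confirm the layout before talking to a real server.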

What is the best way to filter spam with JavaScript?

I have recently been inspired to write spam filters in JavaScript, Greasemonkey-style, for several websites I use that are prone to spam (especially in comments). When considering my options about how to go about this, I realize I have several options, each with pros/cons. My goal for this question is to expand on this list I have created, and hopefully determine the best way of client-side spam filtering with JavaScript.
As for what makes a spam filter the "best", I would say these are the criteria:
Most accurate
Least vulnerable to attacks
Fastest
Most transparent
Also, please note that I am trying to filter content that already exists on websites that aren't mine, using Greasemonkey Userscripts. In other words, I can't prevent spam; I can only filter it.
Here is my attempt, so far, to compile a list of the various methods along with their shortcomings and benefits:
Rule-based filters:
What it does: "Grades" a message by assigning a point value to different criteria (i.e. all uppercase, all non-alphanumeric, etc.) Depending on the score, the message is discarded or kept.
Benefits:
Easy to implement
Mostly transparent
Shortcomings:
Transparent- it's usually easy to reverse engineer the code to discover the rules, and thereby craft messages which won't be picked up
Hard to balance point values (false positives)
Can be slow; multiple rules have to be executed on each message, a lot of times using regular expressions
In a client-side environment, server interaction or user interaction is required to update the rules
Bayesian filtering:
What it does: Analyzes word frequency (or trigram frequency) and compares it against the data it has been trained with.
Benefits:
No need to craft rules
Fast (relatively)
Tougher to reverse engineer
Shortcomings:
Requires training to be effective
Trained data must still be accessible to JavaScript; usually in the form of human-readable JSON, XML, or flat file
Data set can get pretty large
Poorly designed filters are easy to confuse with a good helping of common words to lower the spamacity rating
Words that haven't been seen before can't be accurately classified; sometimes resulting in incorrect classification of entire message
In a client-side environment, server interaction or user interaction is required to update the rules
Bayesian filtering- server-side:
What it does: Applies Bayesian filtering server side by submitting each message to a remote server for analysis.
Benefits:
All the benefits of regular Bayesian filtering
Training data is not revealed to users/reverse engineers
Shortcomings:
Heavy traffic
Still vulnerable to uncommon words
Still vulnerable to adding common words to decrease spamacity
The service itself may be abused
To train the classifier, it may be desirable to allow users to submit spam samples for training. Attackers may abuse this service
Blacklisting:
What it does: Applies a set of criteria to a message or some attribute of it. If one or more (or a specific number of) criteria match, the message is rejected. A lot like rule-based filtering, so see its description for details.
CAPTCHAs, and the like:
Not feasible for this type of application. I am trying to apply these methods to sites that already exist. Greasemonkey will be used to do this; I can't start requiring CAPTCHAs in places that they weren't before someone installed my script.
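To make the rule-based option from the list above concrete, here is a minimal sketch of the scoring idea. The three rules, their point values, and the threshold are made up purely for illustration; balancing them is exactly the hard part noted under its shortcomings.

```javascript
// Each rule is a test plus a weight; all values here are illustrative only.
const RULES = [
  { name: "all caps", points: 2, test: (t) => /[A-Z]/.test(t) && t === t.toUpperCase() },
  { name: "contains link", points: 2, test: (t) => /https?:\/\//i.test(t) },
  {
    name: "mostly symbols",
    points: 3,
    test: (t) => t.replace(/[a-z0-9\s]/gi, "").length > t.length / 2,
  },
];

// Sum the points of every rule that matches the message.
function scoreMessage(text) {
  return RULES.filter((rule) => rule.test(text)).reduce(
    (sum, rule) => sum + rule.points,
    0
  );
}

// A message is treated as spam once its score reaches the threshold.
function isSpam(text, threshold = 3) {
  return scoreMessage(text) >= threshold;
}
```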
Can anyone help me fill in the blanks? Thank you,

There is no "best" way, especially for all users or all situations.
Keep it simple:
Have the GM script initially hide all comments that contain links and maybe universally bad words (F*ck, Presbyterian, etc.). ;)
Then the script contacts your server and lets the server judge each comment by X criteria (more on that, below).
Show or hide comments based on the server response. In the event of a timeout, show or hide based on a user preference setting ("What to do when the filter server is down?": show/hide comments with links).
That's it for the GM script; the rest is handled by the server.
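The client-side half of that could look roughly like this. It is only a sketch: the word list and both function names are hypothetical, and the call to the server is left out because its API is up to you.

```javascript
// Hypothetical word list; a real script would load the user's own list.
const BLOCKED_WORDS = ["viagra", "casino"];

// True when a comment should be hidden until the server has judged it.
function shouldHideInitially(commentText) {
  const hasLink = /\bhttps?:\/\/|\bwww\./i.test(commentText);
  const lower = commentText.toLowerCase();
  const hasBlockedWord = BLOCKED_WORDS.some((word) => lower.includes(word));
  return hasLink || hasBlockedWord;
}

// Decide whether a comment stays hidden when the server does not answer.
// `preference` is the user's setting: "show" or "hide".
function staysHiddenOnTimeout(hiddenInitially, preference) {
  return hiddenInitially && preference === "hide";
}
```

Keeping these decisions in small pure functions means the userscript only needs a thin layer that walks the comment elements and toggles their visibility.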
As for the actual server/filtering criteria...
Most important is do not dare to assume that you can guess what a user will want filtered! This will vary wildly from person to person, or even mood to mood.
Set up the server to use a combination of bad words, bad link destinations (.ru and .cn domains, for example) and public spam-filtering services.
The most important thing is to offer users some way to choose and ideally adjust what is applied, for them.
