If you're unfamiliar with Asirra, it's a CAPTCHA technique developed by Microsoft that uses the identification of cats and dogs, rather than a string of text, for human verification.
I'd like to use their database of millions of pictures of cats and dogs for some machine learning experiments, and so I'm trying to write a script that will automatically refresh their site and download 12 images at a regular interval. Unfortunately, I'm a novice when it comes to JavaScript.
The problem is, for very obvious security reasons, it's hard to find the actual URL of an image, because it's all behind obfuscated JavaScript. I tried using curl from a terminal to see what HTML was returned, and it's the same deal: just JavaScript. So, using a script, how do I access the actual images? Obviously the images are being transferred to my computer, since they're showing up on my screen, but I don't know how to capture them with a script.
Another problem is that I don't want the smaller images that load first; I need the larger ones that only show up when you mouse over them. So I guess I need to override that JavaScript function so the script receives the larger images as well.
I'd prefer something in Python or C#, but I'll take anything - thanks!
Edit: Their public corpus doesn't have nearly enough images for my uses, so that won't work. Also, I'm not necessarily asking you to write my script for me, just for some guidance on how to access the full-size images from a script.
Try using their public corpus: http://research.microsoft.com/en-us/projects/asirra/corpus.aspx
While waiting for an answer here, I kept digging and eventually figured out a somewhat hacky way of getting done what I wanted.
First off, the reason this is a somewhat complicated problem (at least to a JavaScript novice like me) is that the images from Asirra are loaded onto the page via JavaScript, which is a client-side technology. This is a problem when you download the page with something like wget or curl, because those tools don't actually run the JavaScript; they just download the source HTML. Therefore, you don't get the images.
However, I realized that Firefox's "Save Page As..." did exactly what I needed: it ran the JavaScript that loaded the images, and then saved everything into the usual page-plus-files directory structure on my hard drive. That's exactly what I wanted to automate. So I found a Firefox add-on called iMacros and wrote this macro:
VERSION BUILD=6240709 RECORDER=FX
TAB T=1
URL GOTO=http://www.asirra.com/examples/ExampleService.html
SAVEAS TYPE=CPL FOLDER=C:\Cat-Dog\Downloads FILE=*
Set to loop 10,000 times, it worked perfectly. In fact, since it was always saving to the same folder, duplicate images were overwritten (which is what I wanted).
I'm new to PHP, but I'm looking for a very specific piece of functionality and I don't know if PHP can do what I want.
I would like to load an HTML page, wait several seconds (to allow JavaScript to make changes to the page), and then download the content that changed.
For example, there is an HTML document with a <video> tag whose src attribute is changed every 10 seconds (by JavaScript), and what I want to do is grab all of those src values in one PHP script.
I know it's possible to download the first attribute; from some research it seems I should use the file_get_contents($url) function. But I don't know if it's even possible to load the document, wait until the attribute changes, and then download the changed attribute.
This is not, as you've described it (that is, assuming that the src attribute really is changed by JavaScript), something that PHP can do on its own. PHP doesn't run JavaScript, browsers do. Once your PHP code downloads the HTML, what you have is simply a string of characters; PHP alone doesn't know any difference between that and "hello world". It's not going to change in memory, no matter how long you wait.
But all is not lost. Look at the HTML and JavaScript of the page; this may give you some ideas about how to proceed. The JavaScript must be getting the new src from somewhere, right? The only obvious options are that the list is already embedded in the source somewhere (an array of sources, for example, which it cycles through), or that it's being retrieved from a server via Ajax. If it's the former, you can just extract that list directly, no waiting required. If it's the latter, you may be able to send your own requests to the server to fetch them all, though server-side security checks could cause problems there.
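If it turns out to be the embedded case, the extraction needs no browser at all. Here's a minimal sketch of that idea in Node.js (the same approach works in PHP with file_get_contents() and preg_match_all()); the URL and the regex are assumptions, since the real page's markup isn't shown:

// Sketch: download the raw HTML and pull out every src="..." value.
// Assumes Node 18+ for the built-in fetch(); the URL is a placeholder.
const pageUrl = "http://example.com/video-page.html"; // hypothetical

(async () => {
  const html = await (await fetch(pageUrl)).text();
  // Collect every src attribute that appears in the static source.
  const srcs = [...html.matchAll(/src\s*=\s*["']([^"']+)["']/g)].map(m => m[1]);
  console.log(srcs);
})();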
To do what you're seeking, you'll need a browser engine that can actually execute the page's JavaScript, just as it would for a real user.
Look into a headless browser, such as SlimerJS, or one of the many headless Chromium APIs. You can tell the browser engine to load a page and execute its scripts. After some time (or a certain trigger), you can use the DOM API just like you would in-browser.
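For example, here's a minimal sketch using Puppeteer, one of those headless Chromium APIs. The URL, the 12-second wait, and the video selector are placeholders taken from the question's description, not tested against the real page:

// Sketch: load the page, let its scripts run, then read the live DOM.
// Requires: npm install puppeteer
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("http://example.com/video-page.html"); // placeholder URL
  // Give the page's own JavaScript time to swap the src attribute.
  await new Promise(resolve => setTimeout(resolve, 12000));
  // Read the attribute from the rendered DOM, not the raw source.
  const src = await page.$eval("video", el => el.getAttribute("src"));
  console.log(src);
  await browser.close();
})();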
Because of a system of dependencies that is less formalized than I'd like, I now have a very, very large "master" JS file that contains a pretty large kitchen sink for use on every page in my app; that is, once you've logged in.
It only takes a few seconds to load, but those few seconds aren't great for a first-time experience at the logged-in homepage. So, I'd like to find a way of loading my script on my login page, so that when the browser requests it on the homepage, it either gets a 304 Not Modified response, or simply knows not to re-request it. Here's what I've tried.
Just including the <script>
This unfortunately doesn't work, because the script in question doesn't have "definition guards" in place. Including it on the login page messes things up because of certain <div>s it expects to be present. It's built through Dojo, and I don't want to hack the built file, so I don't want to surround its code with such a check.
Grab it with XHR
I actually had this fix in place for a while, and it appears to work okay in Chrome; once the login page is completely loaded, my script then sends out an XHR to "js/masterFile.js" and does nothing with it. The assumption is that as long as the cache headers are okay, the browser will hold onto it when it later needs that file as a script. As I said, it turns out most browsers don't seem to work this way. Some reuse the "text" they got from the XHR, others seem to cache scripts differently from other content; it's possible that's a security-related issue to them.
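The attempt looked roughly like this (a reconstruction; only the js/masterFile.js path comes from what I described above):

// Sketch of the XHR warm-up: fetch the master script after the login
// page loads and ignore the response, hoping the browser keeps the
// file in its HTTP cache for the next page.
window.addEventListener("load", function () {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "js/masterFile.js", true);
  xhr.send();
});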
Load it in an iframe
This is kind of entering rocky territory, as I don't like iframes, and it's an extra request. But, doing this would at least let the browser cache the script in the right way. It introduces a lot of code complication though, and I'm hesitant to settle on this.
If it helps at all, the scripts are AMD-compatible; but, the master script in question is a "boot layer" that contains the basic definitions of require/define.
Well, I just found a possible way of doing this; I don't currently have time to thoroughly test it, but it appears to work on a basic level.
<script src="myMasterScript.js" type="text/definitelynotjavascriptnopenosir"></script>
This works in an interesting way: I can see that the request for the script source is made, and if the response is valid JavaScript it will even execute (as I found by just pointing it at the jQuery CDN). However, the browser's console essentially squelches any errors that come from running it; i.e., "Oh. Maybe I wasn't supposed to run that."
For my own purposes, I still need to figure out, based on the jQuery scenario, whether this might still mess up the page to some degree.
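If the browser really does cache the file, the logged-in homepage would then include the same URL with a normal type and, in principle, be served from cache. A sketch of the pair (this assumes the file is sent with cache-friendly headers, which I haven't verified):

<!-- Login page: request the file, but (hopefully) don't execute it. -->
<script src="myMasterScript.js" type="text/definitelynotjavascriptnopenosir"></script>

<!-- Logged-in homepage: same URL, normal type, ideally a cache hit. -->
<script src="myMasterScript.js"></script>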
To start things off: I have basically no JavaScript experience, and I'm trying to modify a pre-existing .js file used in my office to quickly open several web pages from one file. It's really helpful for a quick start, and though it was written by someone who is, unfortunately, no longer around to maintain it, I have been able to keep it up to date with the changes that happen around here.
The downside is that the script exclusively opens the pages in IE, and there are a couple of links that I want to open in Chrome instead, as they run much more efficiently there. So far I haven't been able to find the right way to code it, nor the right sort of variables/definitions to use. The original script is as follows (page addresses removed, of course, except the last one for reference):
// Flag value telling IE to open each URL in a new background tab.
var navOpenInBackgroundTab = 0x1000;
// Launch Internet Explorer through its COM automation interface.
var oIE = new ActiveXObject("InternetExplorer.Application");
// Queue each page in its own background tab.
oIE.Navigate2("http://[address]", navOpenInBackgroundTab);
oIE.Navigate2("http://www.carfax.com/", navOpenInBackgroundTab);
// Show the IE window once the tabs have been queued.
oIE.Visible = true;
This section edited to update on the issue's progress:
The .js file being used is a standalone file sitting on the Windows desktop; it is not run within, or embedded in, any HTML environment, and it relies on ActiveX objects to function. There is no user interaction beyond a basic double-click to run it. Thanks to your assistance and suggestions so far, it has been established that ActiveX does not connect to or drive Chrome unless an additional plugin is installed. I have been unable to clarify whether that plugin lets Chrome host ActiveX content within a page, or whether it lets ActiveX itself call on Chrome as a valid object. Either way, it's nonviable as a solution due to admin restrictions in my situation.
My question is now about alternative ways I can target a link at Chrome, such as old-fashioned HTML or a JavaScript equivalent that would let me call a link and set a target without using ActiveX. Is this possible? Does code exist that can be used within the same .js file without mucking things up? (Preferably something that can be done in one or two lines. I don't have the skill to be writing my own libraries. ... Also, I am lazy. ¬_¬)
I'm quite sure you'd get the same result by calling window.open("website url", "_blank") multiple times in Chrome.
It might warn that the page is trying to open other pages, but since it's from a "trusted source", you can disable that warning and it won't annoy you anymore.
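Note that window.open exists only inside a browser page. If the file keeps running as a standalone Windows Script Host script, one hedged alternative is to shell out to Chrome directly. WScript.Shell ships with Windows itself, so no extra plugin is needed, but this sketch assumes chrome.exe is resolvable on the PATH, which may not hold on a locked-down machine:

// Sketch: launch Chrome from the same standalone .js file.
// Assumes chrome.exe can be found via the PATH.
var shell = new ActiveXObject("WScript.Shell");
shell.Run('chrome.exe "http://www.carfax.com/"');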
I believe this question has been asked in a few different forms, and I've read quite a few different responses.
At first, I had a web application written mostly with jQuery that used servlets to retrieve information from various places JavaScript could not access (i.e., feeds, images from a server, etc.). Now, however, I've been told to do away with the servlets and application configuration classes, so that this project of mine contains only HTML, CSS, and JavaScript/jQuery. Rather than pulling the images off the server, I need to retrieve them from a local file on the computer. I know that allowing this might seem like terrible design, but it's what I've been asked to do. At any rate, what I really need to do is count the number of image files in a directory and then perhaps compile an array of the filenames themselves. I could do this fine in Java with the servlets, but without them, I'm not sure how, or even if, this can be done.
I'm basically trying to use the jQuery Cycle plug-in to cycle through these images like a slideshow. I inject (or $("#div").append()) these images into the div by using a loop based on the number of images present.
So, is there a way I can do this using JavaScript, HTML, a jQuery plug-in, etc.? I'd like to avoid using PHP and Java at this point...
You can't just read a directory with JavaScript; however, there appears to be a way to "exploit" how browsers function, described at http://www.irt.org/articles/js014/. It may not be pretty, but the demo works in the latest Chrome and in IE7-9 for me. I'm sure some of the techniques could be updated to use cleaner code if you'd like to improve upon it.
EDIT:
Another technique you could use can be found in "Javascript read files in folder".
It definitely looks to be a cleaner solution. What I'd recommend is extracting the body contents to inject into a hidden div, or using the path for an iframe that you can read from.
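A sketch of that idea with jQuery: request the folder's auto-generated index page and scrape the image links out of it. This assumes the server has directory listing enabled for the folder; the images/ path and the #div id are placeholders:

// Sketch: list images by scraping the server's directory index page.
$.get("images/", function (listing) {
  var files = [];
  // Parse the returned index page and collect the link targets.
  $("<div>").html(listing).find("a").each(function () {
    var href = $(this).attr("href");
    if (/\.(jpe?g|png|gif)$/i.test(href)) files.push(href);
  });
  // files.length gives the count; append each image for the Cycle plug-in.
  $.each(files, function (i, name) {
    $("#div").append($("<img>").attr("src", "images/" + name));
  });
});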
I'm working on a web application. The application runs fine, but the first time I open it in the browser it shows a blank page, and I have to hit refresh three or four times before the page loads completely and correctly.
I think my application is too heavy to load; however, once it is loaded, it's good to go. I have about 5 JavaScript files, around 1.3 MB in total, plus some UI components.
Is there a way to control this so that when I load the application, it returns the entire application without my having to hit refresh again and again? Is there a way to optimize this page?
Please help, and thank you in advance.
Hi again. Is there a way to automatically reload the page if it didn't load the first time?
Check whether you can optimize your JavaScript. Do you need all of the functions defined in those 5 JavaScript files? If not, you can split them up and load each part only from the pages that need that functionality.
Also, try to find out which part of the code is making it slow.
1.3 MB of JavaScript is too much. Try compressing (minifying) your JavaScript:
http://jscompress.com/
After compression, try delay-loading the JavaScript files wherever possible:
http://www.websiteoptimization.com/speed/tweak/defer/
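The technique behind that link boils down to injecting the script element only after the page has rendered. A minimal sketch (extras.js is a placeholder for whichever of your 5 files can wait):

// Sketch: load a non-critical script only after the page has rendered.
window.onload = function () {
  var script = document.createElement("script");
  script.src = "extras.js"; // placeholder file name
  document.body.appendChild(script);
};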
Run the YSlow add-on to gather more information about possible optimizations:
http://developer.yahoo.com/yslow/
The easiest method is to run YSlow from a Firefox console.
You should also compress your JavaScript files using the YUI Compressor.
Have you minified your JavaScript? This makes it harder for humans to read but can significantly reduce the file size. If one of those scripts is jQuery, you might consider referencing the copy hosted by Google rather than hosting it on your own server. Google's servers are probably faster than yours, and a lot of users will already have Google's copy of jQuery cached.
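For example (the version number here is just an illustration; use whichever version you currently ship):

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>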
If the site is image-heavy and uses PNGs, you might consider stripping some data out of them to make them smaller, using tools like pngcrush.
As mentioned by a few others, running the page through YSlow is very likely to help you find issues that could cause slow performance.