How to scrape website data called by JavaScript

First post here; apologies in advance if I've made any mistakes. I would like to scrape the median house price for all the suburbs on this map. The median house price appears when you roll your mouse over a suburb on the map. Ideally the output would be an Excel file with the suburb name in the first column and the median price in the second, for about 100 or so suburbs (i.e. rows). I do not know any programming and have tried to use ParseHub (https://www.parsehub.com/) to do this, but have not had any luck.
If anyone has any suggestions, please let me know! Please find the link to the map below:
http://www.realestate.com.au/invest/2-bed-unit-in-st+marys,+nsw+2760?zoom=10
Thanks in advance...

You can use PhantomJS:
"PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG."
This will allow you to use JavaScript to scrape data from a page, even when that data is rendered client-side. Another suggestion is Import.IO. The service is free and doesn't require any programming to scrape the data. It will give you an API to call which returns JSON data. The data is stored on their servers and is relatively fast.
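For example, here is a minimal PhantomJS sketch. The selector and data attributes are assumptions; you would need to inspect the map's actual markup to find the real hover targets:

```javascript
// Run with: phantomjs scrape.js > suburbs.csv
var page = require('webpage').create();

page.open('http://www.realestate.com.au/invest/2-bed-unit-in-st+marys,+nsw+2760?zoom=10', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
  }
  // Give the JavaScript-rendered map time to finish loading.
  setTimeout(function () {
    var results = page.evaluate(function () {
      var rows = [];
      // '.suburb-overlay', 'data-name' and 'data-median-price' are assumed
      // names; replace them with whatever the real map markup uses.
      var suburbs = document.querySelectorAll('.suburb-overlay');
      for (var i = 0; i < suburbs.length; i++) {
        rows.push(suburbs[i].getAttribute('data-name') + ',' +
                  suburbs[i].getAttribute('data-median-price'));
      }
      return rows.join('\n');
    });
    console.log('suburb,median_price');
    console.log(results);
    phantom.exit();
  }, 5000);
});
```

The redirected output is a plain .csv file that Excel can open directly, giving you the two-column layout you described.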

One of the founders of ParseHub here. At the moment, ParseHub can only extract data from SVG-based maps, not canvas-based maps. I realize this is a technical detail that probably doesn't matter to you. We're looking into a workaround for your problem, expect an email from me soon.
On a side note, while we may be able to solve this from a technical standpoint, we strongly encourage you to abide by the terms of service of any site that you extract data from.
-Serge

Related

Can a Google Analytics result be passed to a JavaScript function while the user is still on the page?

On my website, users enter some personal information, including their ZIP code. This information is passed to a function that determines what the next page displays.
The problem is that the function relies on an underlying statistical model, for which ZIP codes have too many possible values (~43,000) to be useful. I want to map ZIP codes to something broader, like designated market area (DMA, which has around 200 possible values).
But using Google Analytics and BigQuery, I already have the user's DMA before they even enter their ZIP code. Is there a way to access that information while they are still on the page, so I can feed it into the function?
In case you are wondering whether you can use Google Analytics information in real time (it's not quite clear from the question): that will not work. GA does not operate in real time; data processing time is announced as four hours for the "premium" version and up to 24 hours for the standard version, and even if it's often faster in practice, you probably do not want to build your business on an undocumented behavior that may or may not work as expected.
API limits also make real-time data retrieval unfeasible, even for smaller sites.
If however you have a stash of precomputed data that can be linked to the current user via an identifier (clientId or similar) it would probably be best to export this to external storage as suggested by Willian Fuks.
Since you mentioned personal data, keep in mind that this must not be stored within Google Analytics as per Google's TOS.
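To illustrate the precomputed-lookup idea: a minimal client-side sketch, assuming analytics.js is loaded on the page and you run a hypothetical back-end endpoint (/dma-lookup) that serves DMA values previously exported from BigQuery, keyed by clientId:

```javascript
// Runs once the analytics.js tracker is ready.
ga(function (tracker) {
  var clientId = tracker.get('clientId'); // the same ID your export is keyed on

  fetch('/dma-lookup?clientId=' + encodeURIComponent(clientId))
    .then(function (res) { return res.json(); })
    .then(function (data) {
      // Feed the broader DMA value into the statistical model
      // instead of the raw ZIP code. runModel is the page's own
      // function and just a placeholder here.
      runModel({ dma: data.dma });
    });
});
```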
(not quite an answer, more like some thoughts of mine)
I don't think that running BigQuery queries for each user you encounter in production is a good approach.
Costs will increase considerably, performance will not be satisfactory by any means in this scenario, and you might start hitting quota limits for jobs against a single table.
One possibility that might work is having your back end use a Google Analytics client library to retrieve data from GA. Still, you should check whether the quotas are adequate for you.
Another possibility (I suspect this might be the best option) that you may consider for your scenario is Google Datastore. It might suit your needs quite well: you could export a table from BigQuery to Datastore and have your back-end system query it directly for the user's DMA.
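A minimal back-end sketch of that last option in Node.js, assuming the BigQuery export landed in a hypothetical Datastore kind named UserDma keyed by the GA clientId:

```javascript
const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

async function getDmaForClient(clientId) {
  // Direct key lookup: no query planning, no per-table job quotas.
  const key = datastore.key(['UserDma', clientId]);
  const [entity] = await datastore.get(key);
  // Fall back to null when there is no precomputed DMA for this visitor.
  return entity ? entity.dma : null;
}
```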

How to Display Direction list on Google Maps API?

When using directions in Google Maps, I would like to display a list of multiple routes using the API, like the red box in the attached screenshot, but I have not found a good guide for this in the API documentation.
I want to know how to implement it.
I would appreciate your help.
This is a pretty complicated endeavor to simply have someone walk you through without you having built your own code. That said, I would recommend reading through the Google resource manual. And here is a walkthrough for once you understand some basics of the Google Maps JavaScript control:
tutorial
There are free resources to figure it out, though there are even better paid resources. Just check out Udemy for a $10 course (you have to sign in and search for a coupon, but you can get access to almost any great course for $10).
Give it a shot and get back to us when you have tried some things specifically.
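As a starting point, here is a minimal sketch using the Maps JavaScript API's DirectionsService and DirectionsRenderer; the renderer's panel option is what produces the turn-by-turn list shown in the red box. The element IDs, center, origin, and destination are assumptions:

```javascript
function initMap() {
  const map = new google.maps.Map(document.getElementById('map'), {
    center: { lat: 37.5665, lng: 126.978 }, // assumed center
    zoom: 12,
  });

  const directionsService = new google.maps.DirectionsService();
  const directionsRenderer = new google.maps.DirectionsRenderer({
    map: map,
    // panel renders the textual step list into any DOM element you choose.
    panel: document.getElementById('directions-panel'),
  });

  directionsService.route(
    {
      origin: 'Seoul Station',        // assumed origin
      destination: 'Gangnam Station', // assumed destination
      travelMode: google.maps.TravelMode.TRANSIT,
      provideRouteAlternatives: true, // request multiple paths
    },
    (result, status) => {
      if (status === 'OK') {
        directionsRenderer.setDirections(result);
      }
    }
  );
}
```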

Building a geolocation photo index - crawling the web or relying on an existing API?

I'm developing a geo-location service which requires a photo per POI, and I'm trying to figure out how to match the right photo to a given location.
I'm looking for an image that will give an overview for the location rather than some arbitrary image from a given coordinate.
For example, when searching for "nyc" in Google you get the following image, filtered out from http://www.filmsofcrawford.com/talesofnyctours/
Of course Google is Google; however, I've found a similar approach on other sites, for example: https://roadtrippers.com/us/san-francisco-ca/attractions/conservatory-of-flowers?lat=37.81169&lng=-122.69478&z=11&a2=p!5
Q: For an index like [POI NAME] -> [Overview image URL], what would be your approach (crawling, an API, etc.)?
Please add your thoughts :)
I would highly suggest using an existing API. Matching images with locations is quite hard to achieve. In my view, the Google Images search API gives too many irrelevant results; it's built that way, processing images based on meta tags and ranking results by SEO.
If you are still considering building a web crawler, take a look at Scrapy. It's open source, well documented, and pretty stable.
You should also take a look at other open APIs that provide location-based queries. Some examples:
FourSquare has a great API; you can fetch your results using each city as an endpoint.
Instagram uses the FourSquare API to map images to locations. Its popularity should be considered.
Flickr has well-curated image results. You should also give it a try, as you can index images based on the licence you are seeking.
Google Places provides an API too. I have never worked with this service, but I thought I should add it to my list.
Writing your own image crawler would not be an easy task. What happens if your target sites change their format, terms of use, or take down links, or even replace an image altogether? There's a great answer on Quora regarding the complexity of web crawlers, and even if you simplify things by narrowing down your sources to a small list of sites, you'll have to figure out how to process images, not text, and that might entail having to save hundreds of images locally for processing, which won't be fun to maintain.
I would strongly suggest leveraging Google's image search API to do the heavy 'technical lifting' for you. Your job is then to find the right combination of filters that will get you the best results. Here are some to consider:
Keywords. You could try to search by location (coordinates), but then you would have to rely on the accuracy of image metadata. Instead, how about generalizing the location of the coordinates and doing a lookup based on the relative location instead? For example, you could generalize (40.812694, -74.074177) as the New York Giants stadium rather than a generic skyline of New York.
Resolution. It's safe to assume higher resolution pictures are more likely to be overview shots and taken with professional equipment. You can also consider the aspect ratio: images taller than they are wide tend to focus on a single object of interest, while images wider than they are tall tend to have more variety.
Licensing. Google's image search is capable of filtering by license and can ensure (for the most part) that you can reuse the images it finds.
Of course you don't need to crawl the web for this. You can use an API from Google to search for images and retrieve them. Take a look at this article.
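For instance, a minimal sketch against the Custom Search JSON API, combining the keyword, resolution, and licensing filters discussed above; API_KEY and SEARCH_ENGINE_ID are placeholders for your own credentials:

```javascript
async function findOverviewImage(poiName) {
  const params = new URLSearchParams({
    key: 'API_KEY',
    cx: 'SEARCH_ENGINE_ID',
    q: poiName,                 // e.g. 'New York Giants stadium'
    searchType: 'image',
    imgSize: 'xlarge',          // favor high-resolution overview shots
    rights: 'cc_publicdomain',  // restrict to licenses you can reuse
    num: '1',
  });
  const res = await fetch('https://www.googleapis.com/customsearch/v1?' + params);
  const data = await res.json();
  // Return the first image URL, or null when nothing matched the filters.
  return data.items && data.items.length ? data.items[0].link : null;
}

// Usage:
findOverviewImage('New York Giants stadium').then(url => console.log(url));
```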

Use JavaScript to generate a spreadsheet including formulas and user inputs

Using Javascript, I want to let users download or generate a spreadsheet that contains several parameter values they've chosen, as well as formulas that perform calculations using those parameters. After they generate/download the spreadsheet, users need to have the ability to change parameters and potentially customize the formulas.
I've seen that this can probably be done server-side, but I think that requires learning yet another new language (I'm learning to code just for this project), so I suspect that's not feasible for me in the near future.
My solution thus far has been to convert the spreadsheet into delimited format, then code it into an array of arrays so it can be downloaded as a .csv file when someone clicks a link.
Unfortunately, I'll need to make somewhat frequent changes to the formulas, and it's a complicated spreadsheet with complicated, repetitive formulas, so it's tedious to find where to make changes (to give a sense of scale, the code for the array currently takes up 19 pages when copied into MS Word at 12 pt TNR).
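To give a concrete (much simplified) picture of what I mean, my current approach looks roughly like this; the cell values and formulas here are made-up examples:

```javascript
// Array of arrays: plain values for parameters, strings starting with '='
// for formulas that Excel evaluates when the .csv file is opened.
const rows = [
  ['Parameter', 'Value'],
  ['Rate', 0.05],        // user-chosen parameter
  ['Years', 10],
  ['Result', '=B2*B3'],  // formula referencing the cells above
];

function downloadCsv(rows, filename) {
  // Quote every cell and escape embedded quotes per the CSV convention.
  const csv = rows
    .map(row => row.map(cell => '"' + String(cell).replace(/"/g, '""') + '"').join(','))
    .join('\r\n');
  const blob = new Blob([csv], { type: 'text/csv' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  link.click();
  URL.revokeObjectURL(link.href);
}

downloadCsv(rows, 'model.csv');
```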
So a few questions:
Is there an easier/better way to download or generate a spreadsheet with formulas and user input?
If the answer to #1 is no, is there a way to break up an array in the code and make it easier to find/edit an individual cell/value?
As an alternative, should I consider switching to Google spreadsheets? Haven't done this because a handful of users might not be able to access Google at work, but the majority can. Maybe the pros outweigh the cons?
Is there another idea I haven't considered?
Very much appreciate any suggestions or feedback!

Parsing Unstructured Data

I am working on a bookmarklet that will pull information from a site and send it off to a user's account to be saved for later use. This generally involves taking unstructured information and making it structured. Take, for example, a hobbyist who wants to save a project for later: there are a number of parts they need to obtain and instructions to follow. On one blog, the writer could refer to the instructions as directions, a recipe, or any number of synonyms. One person may list the information with <li> tags to order the steps, while another may not.
What are general strategies to turn unstructured data into structured information? Are there other strategies to determine which content is relevant? (i.e. Instapaper or Readability)
Hmm... maybe you could use this in conjunction with Google? Taking a look at head & meta tags is a good idea too. You could also tally how often words are used. Heck, you could even have a popup that asks the user to enter data about the page.
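For instance, a minimal sketch of what the bookmarklet could collect from the page's head and meta tags; the og: fallbacks and the ordered-list heuristic are just assumptions about how authors typically structure their posts:

```javascript
function extractPageHints() {
  // Look up a <meta> tag by either its name or property attribute.
  function meta(name) {
    var el = document.querySelector(
      'meta[name="' + name + '"], meta[property="' + name + '"]'
    );
    return el ? el.getAttribute('content') : null;
  }

  return {
    title: document.title,
    description: meta('description') || meta('og:description'),
    image: meta('og:image'),
    // If the author used an ordered list, treat its items as candidate steps.
    steps: Array.prototype.slice
      .call(document.querySelectorAll('ol li'))
      .map(function (li) { return li.textContent.trim(); })
  };
}
```

Whatever extractPageHints returns could then be posted to the user's account as a first-pass structure, with the popup fallback filling in anything the heuristics missed.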
It doesn't seem like there is a good computer science answer to this question, so I have decided to change the approach and have users organize the data as they see fit.
