is it possible to do some simple web scraping in chrome extension? - javascript

Thanks in advance, and I'm sorry if this isn't a well-formed question; I'm relatively new to CS and Stack Overflow.
I am hoping to make a simple Chrome extension which overrides the new tab page to display some simple data collected from a couple of websites. Is it possible to web scrape with plain JS or the Chrome APIs? Any information or guidance would be greatly appreciated; I have been trying to research the subject and haven't found any recent or clear answers.
Thanks for your help!
Here is an older Stack Overflow question asking the same thing, but I wasn't able to make any progress from the answers:
Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)

Absolutely, and not just simple scraping.
If you think about it, using the browser itself is as close as you can get to replicating a real user session. You don't have to worry about manually setting cookies or discovering and constructing JSON HTTP requests; the browser does it all for you. After a page has been rendered (with or without JavaScript) you can access the DOM and extract any content you like.
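To make that concrete, here is a minimal sketch of what extraction from an extension's content script might look like. The selector and message shape are invented for illustration; the reusable logic is split into a plain function so it works outside the browser too.

```javascript
// Pure helper: tidy up a list of raw text values scraped from the DOM.
function cleanText(values) {
  return values.map((v) => v.trim()).filter((v) => v.length > 0);
}

// In the extension's content script (hypothetical selector), you would
// collect the raw values from the rendered page and hand them off, e.g.:
//
//   const raw = [...document.querySelectorAll("h2.headline")]
//     .map((el) => el.textContent);
//   chrome.runtime.sendMessage({ headlines: cleanText(raw) });

// Example with static data:
console.log(cleanText(["  First story ", "", "Second story\n"]));
```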
Take a look at https://github.com/get-set-fetch/extension, an open-source browser extension that does more than basic scraping. It supports infinite scrolling, clicking, and extracting content from single-page JavaScript apps.
Disclaimer: I'm the extension author.
If you're serious about the subject, start by developing a simple Chrome extension (from my own experience, Chrome throws more verbose extension errors than Firefox): https://developer.chrome.com/extensions/getstarted
Afterwards, take a look at the main get-set-fetch background plugins: FetchPlugin (loads a URL in a tab and waits for the DOM to stabilize), ExtractUrlPlugin (identifies additional URLs to scrape from the current URL), and ExtractHtmlContentPlugin (does the actual scraping based on CSS selectors).
There are downsides, though. It's a lot easier to run a scraping script in your favorite language and dump the scraped content into a database than to automatically start the browser, load the extension, control the extension, export the scraped data to a format like CSV, and import that data into a database.
In my opinion, it only makes sense to use a browser extension if you don't want to automate the data extraction, or if the page you're trying to scrape is so JavaScript-heavy that it's easier to automate the extension than to write a scraping script.
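For the CSV export step mentioned above, a minimal serializer might look like the sketch below. This is illustrative only; a real export should use a proper CSV library to handle encodings and locale quirks.

```javascript
// Minimal CSV serializer for scraped rows.
function toCsv(rows, columns) {
  const escape = (value) => {
    const s = String(value ?? "");
    // Quote any field containing commas, quotes, or newlines.
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const header = columns.join(",");
  const lines = rows.map((row) => columns.map((c) => escape(row[c])).join(","));
  return [header, ...lines].join("\n");
}

console.log(toCsv(
  [{ title: 'A "quoted" title', url: "https://example.com" }],
  ["title", "url"]
));
```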

I am wondering if it is possible to web scrape within the basic JS or chrome API's?
Yes, use fetch to call REST APIs.
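For instance, from the new-tab page's script, it could look like the sketch below. The endpoint is hypothetical, and note that in Manifest V3 the host also has to be listed under "host_permissions" in manifest.json.

```javascript
// Fetch JSON from a REST API and surface HTTP errors.
async function loadData(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}

// Usage in the extension's new-tab page (hypothetical endpoint):
//
//   loadData("https://api.example.com/summary")
//     .then((data) => { document.body.textContent = JSON.stringify(data); });
```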

How do I scrape data generated with javascript using BeautifulSoup?

I'm trying to migrate some comments from a blog using web scraping with Python and BeautifulSoup. The content I'm looking for isn't in the HTML itself and seems to be generated by a script tag (which I can't find). I've seen some answers about this, but most of them are specific to a particular problem, and I can't figure out how to apply them to my site. I'm just trying to scrape comments from pages like this one:
http://www.themasterpiececards.com/famous-paintings-reviewed/bid/92327/famous-paintings-duccio-s-maesta
I've also tried Selenium, but I'm currently using a Cloud9-based IDE, and it doesn't seem to support web drivers.
I apologize if I botched any of the lingo, I'm pretty new to programming. If anyone has any tips, that would be helpful. Thanks!
There are many ways to scrape such content. One would be to find out how the comments are loaded on this website. A quick look in the Chromium developer tools shows that the comments for the page mentioned are loaded via an API call.
This may not be a suitable approach for you, as you may not be able to construct that URL for every page.
Another, more reliable way would be to render such JS content with a GUI-less browser. For ease of implementation, I would suggest using Scrapy with Splash. Splash is a JavaScript rendering service that integrates with Scrapy and renders most of the content for your requests.

use of google script editor

Hey, I'm currently working on my first personal project, so bear with the questions!
I'm trying to create a JavaScript program that will parse info from Google Forms to produce slides displaying that info. From my research, the best way I've found to facilitate this process is Google's Apps Script editor. However, I was wondering if I can run this code by requesting it from a different JavaScript (or maybe even Java) program that I will write in WebStorm. If I can't do this, what is the best way to utilize the Google Apps Script editor?
Thanks!
Google Apps Script is just JavaScript with extra built-in APIs (like SpreadsheetApp, FormApp, etc.).
It also has a UrlFetchApp API.
So you can run code like this:
// The code below logs the HTML code of the Google home page.
var response = UrlFetchApp.fetch("http://www.google.com/");
Logger.log(response.getContentText());
As such, if you want to pull in JavaScript from elsewhere, you could fetch it and then eval it on the Google Apps Script side (but we all know how tricky eval can get).
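A minimal sketch of that fetch-then-eval pattern follows. The URL is hypothetical, and eval should only ever run code you control.

```javascript
// Run a string of JavaScript and return the value of its last expression.
// Risky by design: only feed this code from a source you trust.
function runRemoteCode(source) {
  return eval(source);
}

// In Apps Script you would fetch the source first (hypothetical URL):
//
//   const source = UrlFetchApp.fetch("https://example.com/snippet.js")
//     .getContentText();
//   runRemoteCode(source);

console.log(runRemoteCode("1 + 1"));
```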
One other option is to have your own server side written with Google App Engine (or any other framework), use Google's OAuth, and authorize your app to fetch data from the form.
Slides and Google Apps Script
You might like to take a look at the add-on "Slides Merge" by Bruce McPherson. I've never used it, but it sounds like it might work for you.
Getting information from Google Forms is a snap with Google Apps Script, since you can link the form right up to a spreadsheet. The Google Apps Script documentation is really quite good these days; here's the documentation link. Google Apps Script is loosely based on JavaScript 1.6. If you're already a programmer, my guess is that you'll have few problems learning to use it. In my experience, the most difficult thing was dealing with the arrays of arrays produced by the getValues() method of ranges, and I made a short video that might be of some help to you.
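As an illustration of those arrays of arrays: getValues() returns one array per row, each row being an array of cell values, and a common trick is to map the header row onto each data row. The sketch below is plain JavaScript, so it runs in Apps Script's V8 runtime too; the sheet layout shown is hypothetical.

```javascript
// Convert getValues()-style data (header row + data rows) into objects.
function rowsToObjects(values) {
  const [header, ...rows] = values;
  return rows.map((row) =>
    Object.fromEntries(header.map((key, i) => [key, row[i]]))
  );
}

// In Apps Script you would get `values` from the linked response sheet:
//
//   const values = SpreadsheetApp.getActiveSheet().getDataRange().getValues();
//   const responses = rowsToObjects(values);

console.log(rowsToObjects([
  ["Name", "Score"],
  ["Ada", 10],
  ["Grace", 9],
]));
```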
I also have a script, written in Google Apps Script, that produces a slide show inside of a spreadsheet.
I've found that using the Script Editor is pretty easy, and there's some documentation in the support section. It can be a bit buggy at times, but overall I think it's a pretty good tool.

Add enhancements to a website (whether it be by C#, Chrome Extensions, etc.) -- Not sure what would work?

There is a website that I visit often... let's call it www.example.com. I am able to interact with parts of this website. The interactions send XMLHttpRequests and get responses back through JavaScript (jQuery, I believe).
I'm not sure what technology will let me achieve what I want to do, or where to start. Basically, I want to add additional options/shortcuts that the site does not provide. I thought about using a macro, but macro recording software is just a pain in the butt.
I inspected (using Google Chrome's Developer Tools) the XMLHttpRequests being sent back and forth and noticed that they are simple JSON messages. I figured the best way to add enhancements to the site, without waiting for the actual owners to do so, would be to simulate the website sending/receiving these XMLHttpRequests/responses and make additional adjustments to the DOM to provide extra shortcuts.
I don't want to interfere with the original site's functionality, though... i.e., if I send a request and receive a response, I want both the original script and my script to process the response. So here is where I'm stuck: I'm not sure whether to go down the path of creating a C# application, a Google Chrome extension (I use Google Chrome), or something else altogether. Any pointers on what dev tools/languages will let me do what I want would be great. Thanks!
Chrome has built-in support for user scripts. You can use these to modify the page as you see fit and also to make requests. Without more details regarding what exactly you want to do with these AJAX requests, it's hard to advise further.
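If you do go the user-script route, one way to watch the site's AJAX responses without breaking its own handlers is to wrap XMLHttpRequest. This is a sketch; the wrapper takes the XHR class as a parameter so the same code can be exercised against any XHR-like class.

```javascript
// Wrap an XMLHttpRequest-like class so every response is also passed to
// an observer callback. The page's own handlers still run untouched.
function wrapXhr(XhrClass, onResponse) {
  return class extends XhrClass {
    constructor(...args) {
      super(...args);
      // Listen alongside the page's handlers; never replace them.
      this.addEventListener("load", () => onResponse(this.responseText));
    }
  };
}

// In a user script you might install it like this:
//
//   window.XMLHttpRequest = wrapXhr(window.XMLHttpRequest, (body) => {
//     console.log("saw response:", body);
//   });
```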
I'm not 100% sure what your question is, but as I understand it, you want to be able to make changes to a certain website. If these changes can be done with JS, I would recommend Greasemonkey for Firefox. It basically lets you run a custom script when you visit a certain webpage/domain, and you can be as specific as you want about which pages use the script. Once your script loads jQuery, it is really easy to add any functionality.
https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/
You can find pre-written scripts for tons of sites here:
http://userscripts.org/

Is it possible to create a service to manipulate a webpage DOM dynamically?

Hi, I am relatively new to this topic, so I have no idea if this is possible.
What I want to do is create a widget that can be attached to any web page out there dynamically. This widget has nothing to do with any web page in particular, but once the widget is created, all visitors of those web pages should be able to see it (is this possible?).
I don't know where to start... should this service be a browser plugin (add-on), or is there a way to dynamically manipulate someone else's DOM?
Any thoughts, help etc would be a great help!
Thanks.
If I have understood your question correctly, you want a script that is injected into every webpage the user visits and displays a widget, correct?
You could create an add-on, although you would have to create a separate one for each browser you plan to support, and they can sometimes be more complicated than they need to be for something as simple as script injection.
A better alternative is to create a user script: basically a JavaScript file that runs whenever a visitor visits a website matching a pattern you specify (for instance, all websites). Firefox has support for user scripts through the Greasemonkey extension, and Opera and Google Chrome have built-in support.
If you want to learn how to make your own user scripts, you can check out the Greasemonkey wiki or study some existing scripts.
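For illustration, a bare-bones user script might look like the sketch below. All names and styles here are invented, and the markup-building helper is split out so it can be tested outside a browser.

```javascript
// ==UserScript==
// @name     Example widget
// @include  *
// ==/UserScript==

// Pure helper: build the widget's HTML from a message.
function widgetHtml(message) {
  return `<div style="position:fixed;bottom:1em;right:1em;` +
         `background:#eee;padding:0.5em;">${message}</div>`;
}

// In the browser, the script would inject it into the current page:
//
//   const container = document.createElement("div");
//   container.innerHTML = widgetHtml("Hello from my widget");
//   document.body.appendChild(container);
```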

How do I use Mechanize to process JavaScript?

I'm connecting to a web site, logging in.
The website redirects me to new pages, and Mechanize deals with all the cookie and redirection jobs, but I can't get the last page. I used Firebug, did the same job again, and saw that there are two more pages I had to pass through with Mechanize.
I took a quick look at the pages and saw that there is some JavaScript and HTML code, but I couldn't understand it because it doesn't look like normal page code. What are those pages for? How can they redirect to other pages? What should I do to get past them?
If you need to handle pages with JavaScript, try WATIR or Selenium; those drive a real web browser and can thus handle any JavaScript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
Your other option would be understanding what the Javascript on the offending page does and bypassing it manually, but that seems onerous.
At present, Mechanize doesn't handle JavaScript. There's talk of eventually merging Johnson's capabilities into Mechanize, but until that happens, you have two options:
Figure out the JavaScript well enough to understand how to traverse those pages.
Automate an actual browser that does understand JavaScript using Watir.
What are those pages for? How can they redirect to other pages? What should I do to pass these?
Sometimes work is done on those pages. Sometimes the JavaScript is there to prevent automated access like what you're trying to do :). A lot of websites have unnecessary checks to make sure you have a "good" browser, so make sure that your user_agent is set to something common, like IE. Sometimes setting the user_agent to look like an old browser will let you get past without JavaScript.
Website automation is fun because you have to outsmart the website and its software developers, using multiple strategies. Like the others said, Watir is the best tool for getting past JavaScript at the moment.
