I intend to create an app using Electron to extract data from a website and save it in a DB. The website is similar to Google (an input and a search button, but no API), except that the first search requires entering a captcha code. My concern is finding the simplest way to automate the process of sending requests and collecting the results. Could Selenium be a way to do this, or can it be done without additional tools? Please offer me some advice. (I don't have Electron experience.)
I'm sorry, I don't completely understand the problem.
You want to scrape data from a website, but why do you want to do it with Electron?
You can do it with Selenium in a lot of different ways (and languages), using the drivers (like chromedriver).
Is the question about Electron and Selenium, or about how to get past the captcha and automate the process?
In the first case, you can follow:
https://github.com/electron/electron/blob/master/docs/tutorial/using-selenium-and-webdriver.md
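For example, a minimal sketch along the lines of that tutorial, assuming a chromedriver matching your Electron version is running on port 9515; the binary path and the CSS selectors below are placeholders you would replace with your own:

```js
// Minimal sketch: drive an Electron app with selenium-webdriver.
// Prerequisites: `npm install selenium-webdriver` and a chromedriver
// started separately, e.g. `chromedriver --port=9515`.
const webdriver = require('selenium-webdriver');
const { By, until } = webdriver;

const driver = new webdriver.Builder()
  .usingServer('http://localhost:9515')        // chromedriver endpoint
  .withCapabilities({
    'goog:chromeOptions': {                    // older chromedrivers use 'chromeOptions'
      binary: '/path/to/your/electron/app'     // placeholder path to the built app
    }
  })
  .forBrowser('chrome')
  .build();

async function search(term) {
  // Placeholder selectors: adjust them to the actual search form
  const input = await driver.wait(until.elementLocated(By.css('input[type="text"]')), 10000);
  await input.sendKeys(term);
  await driver.findElement(By.css('button[type="submit"]')).click();

  const results = await driver.findElements(By.css('.result'));
  for (const r of results) {
    console.log(await r.getText());
  }
}

search('example query').finally(() => driver.quit());
```

Note that the captcha on the first search would still have to be solved by hand in the window that opens; Selenium only automates the searches after that.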
In the other case, it depends on the website and on a lot of other things.
PS: If I set up a captcha, it's because I don't want anyone scraping my data :)
Related
I'm attempting to create my first ever full-blown website. The site will host a form that takes a user's link and scrapes the submitted website. Besides scraping, I'm also looking for click and input functionality (for future projects).
I managed to create a PhantomJS function that scrapes a website depending on the parameters. However, I have absolutely no idea how to integrate this into the XAMPP server which locally hosts my website... PhantomJS only runs from the command line, as far as I'm aware.
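(For context, a PhantomJS scrape script of the kind described might look roughly like this; it is an illustrative sketch, not the actual function, and the selectors are generic.)

```js
// scrape.js - run as: phantomjs scrape.js <url>
// Illustrative sketch: load the given URL and print the page title
// and every link on the page as JSON.
var system = require('system');
var page = require('webpage').create();
var url = system.args[1];

page.open(url, function (status) {
  if (status !== 'success') {
    console.log('Failed to load ' + url);
    phantom.exit(1);
    return;
  }
  var data = page.evaluate(function () {
    return {
      title: document.title,
      links: Array.prototype.map.call(document.querySelectorAll('a'), function (a) {
        return a.href;
      })
    };
  });
  console.log(JSON.stringify(data));
  phantom.exit();
});
```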
Is it even possible to host a PhantomJS program on a website? Should I be using something else, such as NodeJS or another language? Ultimately, I'll go learn anything in order to make this happen.
Any ideas/suggestions are much appreciated. I apologize if this question seems dumb.
I'm trying to access data from a government website designed for "point-and-click" downloads. My objective is to figure out the pattern for getting to the CSVs, and then create an easy API for other people to get to that data. The website is supposed to be open data, but it is quite obscure as to how to get the data programmatically.
However, I have failed to figure out the pattern for finding a URL to the CSVs, because they seem to be hidden behind some JavaScript.
An example of a page is this one, and I want to know what the link behind the PNG image on the page is.
How can I programmatically get to the links behind this button?
How can I get to the links behind this button?
Investigate your web browser's "web developer" features. There should be a way to get the browser to log the full URLs for all of the requests that it is making.
Then reverse engineer the pattern from the examples. (This may or may not be possible. But if it is not possible, you should let the people who designed the site know that it is unfriendly to people trying to use it ... programmatically.)
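Once the pattern is known, fetching the files directly is straightforward. A minimal Node.js sketch (the URL template below is purely hypothetical; substitute whatever pattern the browser's network log reveals):

```js
// Download a CSV directly once the URL pattern has been reverse engineered.
// The URL template here is a placeholder, not the site's real pattern.
const https = require('https');
const fs = require('fs');

function downloadCsv(datasetId, year, destination) {
  const url = `https://data.example.gov/datasets/${datasetId}/${year}.csv`; // hypothetical pattern
  const file = fs.createWriteStream(destination);
  https.get(url, (response) => {
    response.pipe(file);
    file.on('finish', () => file.close());
  });
}

downloadCsv('some-dataset', 2014, 'some-dataset-2014.csv');
```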
How can I programmatically get to the links behind this button?
Different question. Here are some possible options:
Use a web-scraping framework that also understands how to execute JavaScript.
Use a web testing framework like Selenium.
There is a "headless browser" framework called PhantomJS that may help.
Note that it is a lot more complicated to do this programmatically. If reverse engineering is possible, that would be simpler.
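As an illustration of the headless-browser option, here is a small PhantomJS sketch that loads a page and logs every URL it requests, so that CSV links triggered by JavaScript show up in the output (the page URL is a placeholder):

```js
// phantomjs log-requests.js
// Load the page headlessly and print every URL it requests; links generated
// by JavaScript appear here even though they are not in the static HTML.
var page = require('webpage').create();

page.onResourceRequested = function (requestData) {
  console.log('Requested: ' + requestData.url);
};

page.open('https://data.example.gov/some-page', function (status) { // placeholder URL
  if (status !== 'success') {
    console.log('Failed to load page');
  }
  // Give any asynchronous scripts a moment to fire their requests
  setTimeout(function () {
    phantom.exit();
  }, 3000);
});
```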
Alright, so I'm in a small pickle. I'm running into issues with JSoup since the page needs JavaScript to finish loading parts of the page. Fortunately, I've worked around it in the past (by parsing the raw JavaScript code), but it's very tedious. Recently, I tried to make a program to log in to a website, but it requires a token from a form element. That form element is not visible unless JavaScript is executed, so it won't show up at all for me to extract. So I decided to look into Selenium.
First question: is this the library I should be looking into? The reason I'm so bent on using HttpClient is that some of these websites have very high traffic and don't load all the way, BUT I don't need these pages to load all the way. I just need them to load enough for me to retrieve the login token. I prefer to communicate with the web server using raw JSON/POST requests once I discover the methods required, versus having Selenium automate a click/wait/type sequence.
Basically, I only need Selenium to load about a quarter of the page, just to retrieve the request tokens. The rest of my program will send POST requests using HttpClient.
Or should I just let Selenium do all the work? My goal is speed. I need to log in and purchase an item fast.
Edit: Actually, I might go with HtmlUnit because it's very minimal. I only need to scrape information, and I don't want to run Selenium's standalone server. Is this the better approach?
Basically, HtmlUnit is quicker than Selenium, so if you are going for speed you should use that. Anyway, keep in mind that Selenium has its own HtmlUnitDriver implementation, so as another option you could use Selenium with HtmlUnit. The difference between them is that HtmlUnit is itself a browser without a GUI, whereas Selenium works by driving a real browser's features. You may want to take a look at this other question for further details: Selenium vs HtmlUnit?
I have an iOS app in which I use parse.com as a backend service. Now, I've hired someone to build a website interface using HTML and CSS. I want to share the same data between the iOS app and the website. I know parse.com offers a few ways to do this, including creating a JavaScript application. The problem is that my programmer doesn't have any experience with JavaScript, nor do I.
My question is: is it possible to use what I already have (Objective-C, Xcode) to retrieve data from parse.com and show it on the website? Even if I need to code something new, is it possible to use Objective-C together with HTML and CSS?
Thanks.
Parse has several APIs, one of which is REST. Your web developer should use the REST API to get data from Parse:
https://www.parse.com/docs/rest
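For illustration, a minimal sketch of calling that API from the browser; the class name, application ID and REST key below are placeholders, and the real values come from your Parse dashboard:

```js
// Fetch objects from a hypothetical "Item" class via Parse's REST API.
// Replace the app ID, REST API key and class name with your own values.
fetch('https://api.parse.com/1/classes/Item', {
  headers: {
    'X-Parse-Application-Id': 'YOUR_APP_ID',
    'X-Parse-REST-API-Key': 'YOUR_REST_API_KEY'
  }
})
  .then(function (response) { return response.json(); })
  .then(function (data) {
    // data.results is an array of the objects stored in the class
    console.log(data.results);
  });
```

This keeps the data on Parse itself, so nothing new has to be written in Objective-C for the website.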
Where there's a will there's a way, but you'll be making something really specific to your use case that is non-standard and will quickly become hard to maintain. I recommend that you hire another developer and do things properly using the technologies given to you by Parse! If the cost seems high now, I can promise you it will be much higher if you go down the path you're on now.
So my answer is:
Yes, everything is possible, and no, don't do it! :)
Edit: Added an example of a possible way to do it, to actually answer the OP's question.
Example case:
1- Create a simple Mac application in Xcode that fetches data exactly as you do on iOS, and store the needed data in a database of your choice on your server.
2- You now have access to the data you need from Parse, but in a local mirror. You will need some tool to fetch that data, though; I recommend a simple PHP script.
Note that this will require an OS X server to always be running to fetch that data. You'll also need to find a way to fetch data on demand when a user needs it vs. polling at specified intervals. This will hardly scale and will be costly, as I said.
This question might not be suitable for this website, and I am only asking for information purposes.
So, please let me know if this is not suitable and I will delete it.
I have created a web application using PHP and JavaScript.
What I want to do is find a way to turn the entire thing into a small piece of JavaScript code, so that I only give that snippet to users and they can copy and paste it into their own website in order to use the application there, without being able to edit its contents!
Could someone please advise on this?
Again, please let me know if this question is not suitable for this site and I will delete it.
There are two ways to go about this: make it available as an iframe widget (which could theoretically be injected through JavaScript), or make a cross-domain API with which the JavaScript would interface. From the sound of it, going down the iframe route seems most sensible, although it does come with a clickjacking vulnerability.
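For illustration, a minimal sketch of the iframe-widget approach; the widget URL is a placeholder for wherever you host the PHP application, and users would paste this snippet (or a single script tag pointing at it) into their own pages:

```js
// Snippet users embed in their own site. It injects an <iframe> that loads
// the application from your domain, so the host page cannot read or edit
// the widget's contents (the same-origin policy keeps them separate).
(function () {
  var iframe = document.createElement('iframe');
  iframe.src = 'https://example.com/widget';   // placeholder URL of your PHP app
  iframe.width = '300';
  iframe.height = '400';
  iframe.style.border = 'none';

  // Insert the iframe where this script tag was placed
  var script = document.currentScript || document.scripts[document.scripts.length - 1];
  script.parentNode.insertBefore(iframe, script);
})();
```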