Scrape website after form submit and data is loaded - javascript

I have to scrape a website. After reviewing it, I realised that I don't need to submit any form; I already have the URLs needed to get the data.
I'm using Node.js and Phantom.
The source of my problem seems to be related to the session or cookies (I think).
In my web browser I can open https://www.infosubvenciones.es/bdnstrans/GE/es/convocatorias and click the blue form button labelled "Procesar consulta"; the table below it gets filled. On the Network tab of the dev tools you can see an XHR request to a URL like https://www.infosubvenciones.es/bdnstrans/busqueda?type=convs&_search=false&nd=1594848133517&rows=50&page=1&sidx=4&sord=desc. If you open that URL in a new tab of the same browser, the data is displayed; but if you open it in a different browser, you get 0 results.
That's exactly what happens to me with Node.js and Phantom, and I don't know how to fix it.

If you want to give Scrapy a try, https://docs.scrapy.org/en/latest/topics/dynamic-content.html explains how to deal with this type of scenario, and I would suggest reading it after completing the tutorial.
The page can also be handy if you use another scraping framework, as there's not much in it that is Scrapy-specific, and I'm sure the Python-specific parts have JavaScript counterparts.
As for Cheerio and Phantom, I’m not familiar with them, but it is most likely doable with them as well.
It's doable with any web client; it's just a matter of knowing how to use the tool for this purpose. Most of the work involves using your web browser's developer tools to understand how the website works underneath.
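For illustration, here is a minimal Node.js sketch (no Phantom needed) of that idea: request the search page first so the server issues its session cookies, then replay the XHR with those cookies attached. It assumes Node 18.14+ for the global fetch and Headers.getSetCookie(); the query parameters are copied from the question, and the X-Requested-With header is a guess at what the site may check.

```javascript
// Sketch: establish a session, then reuse its cookies on the XHR endpoint.
const BASE = 'https://www.infosubvenciones.es/bdnstrans';

async function fetchConvocatorias() {
  // 1. Hit the search page so the server sets its session cookies.
  const page = await fetch(`${BASE}/GE/es/convocatorias`);
  const cookies = page.headers
    .getSetCookie()                // available in Node 18.14+ / 20+
    .map((c) => c.split(';')[0])   // keep only the name=value part
    .join('; ');

  // 2. Replay the XHR seen in dev tools, sending those cookies back.
  const params = new URLSearchParams({
    type: 'convs', _search: 'false', nd: Date.now().toString(),
    rows: '50', page: '1', sidx: '4', sord: 'desc',
  });
  const res = await fetch(`${BASE}/busqueda?${params}`, {
    headers: { Cookie: cookies, 'X-Requested-With': 'XMLHttpRequest' },
  });
  return res.json();
}

fetchConvocatorias().then((rows) => console.log(rows));
```

If the endpoint still returns 0 results, compare the full request headers in dev tools with what the script sends; some sites also check Referer or User-Agent.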

Related

How to prevent browser Ctrl+U?

I want to disable Ctrl+U in the browser to stop users from viewing the source (HTML + JavaScript) of a page.
This unfortunately is not how it works.
When a user visits your website, there's a lot going on behind the scenes:
The user requests a page on your site.
Your server does some fancy things.
Your server transforms those fancy things into something for the user's browser to use.
Your server sends its final product back to the browser.
The browser then gets a bunch of code, such as HTML or JavaScript.
The browser then reads that HTML and JavaScript and organizes it to look and work how it's supposed to on the user's screen.
Basically, another way of saying all this is that the HTML and JavaScript you want to hide are executed client-side. Your browser gets a bunch of code, executes it, and displays the results to the user. If someone really wanted to see the source code of your website, they could easily bypass your blocking of Ctrl+U. All they have to do is tell the browser not to execute the code!
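To make that concrete, here is a trivial sketch: any HTTP client receives exactly the HTML and JavaScript your server sends, without running any of it, so key handlers and right-click blockers never even execute. (The URL is a placeholder.)

```javascript
// Fetch a page's raw source without executing its scripts.
fetch('https://example.com/protected-page.html')
  .then((res) => res.text())
  .then((html) => console.log(html)); // the full source, untouched
```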
Ultimately, if a user really wants to see your source code, they will. There is no way to stop it. For this reason, it is recommended to keep anything that must remain secret in your server-side code (such as your PHP).
You simply cannot prevent the user from viewing the HTML source. A site may block right-clicking, but the fact is that you can still press Ctrl+U in Firefox and Chrome to view the source!
It is impossible to effectively hide the HTML, JavaScript, or any other resource sent to the client. Impossible, and it isn't all that useful either.
Furthermore, don't try to disable right-click, as there are many other items on that menu (such as print!) that people use regularly.
Unfortunately, Ctrl+U is "View Source" and you can't disable browser functionality, but you can keep whatever you don't want to show in secure server-side code.

Writing and downloading files client-side, cross-browser

I have a program where the user does some actions (i.e. clicking on several buttons). I want to record their clicks and the buttons that they click to allow the user to then download a text file with a record of their clicks when they click a separate "download" button. I looked at the File-system APIs for HTML 5, but they seemed to not have cross-browser support. I would ideally like to have this entire file generation and download scheme be entirely client-side, but I am open to server-side ideas as well.
TL;DR: Essentially I'm looking for an equivalent to Java's FileWriter, FileReader, ObjectOutputStream, and ObjectInputStream within Vanilla JS or jQuery (would like to stay away from php, but I'll use it as a last option).
Also, why don't all browsers support the filesystem api? (I'm guessing that it would make MSWord and Pages go out of business with all the open source clientside text editors that could come out.)
Unfortunately the HTML5 File System API is no longer part of the spec. Long story short, Firefox refused to implement it because they claimed everything you could do with the File System API was doable with HTML5 IndexedDB (which was mostly true). Please see this blog post for more on why Firefox didn't implement it; I do not know IE's story. (I may have exaggerated Firefox's reasoning; I'm still bummed, because you cannot actually do everything in IndexedDB that you can in the current Chrome File System API.)
Typically, if two of those three browsers implement a spec, it stays in the spec; otherwise the spec gets orphaned. However, I'm fairly certain a large reason the File System API didn't take off is that the IndexedDB API (see caniuse for IndexedDB) really took off when both specs were introduced. If you want cross-browser support, check that API out.
That all said, if you are still set on the File System API, some developers wrote a nice wrapper around IndexedDB. The File System API wouldn't actually supply you with a stream anyway: you would have to keep appending events to a given file through a FileWriter object, then read the entire file, send it to the server via an AJAX request, and download it back from the server once it was successfully uploaded.
The better route would be to use the IndexedDB API, which, as stated on developer.mozilla.org, follows this basic pattern (a sketch follows the list):
Open a database.
Create an object store in the upgraded database.
Start a transaction and make a request to do some database operation, like adding or retrieving data.
Wait for the operation to complete by listening to the right kind of DOM event.
Do something with the results (which can be found on the request object).
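Here is a minimal sketch of those steps applied to the click-recording use case; the database and store names ('clicklog', 'clicks') are made up for illustration.

```javascript
// 1. Open a database (version 1).
const open = indexedDB.open('clicklog', 1);

// 2. Create an object store while the database is being upgraded.
open.onupgradeneeded = () => {
  open.result.createObjectStore('clicks', { autoIncrement: true });
};

open.onsuccess = () => {
  const db = open.result;
  document.addEventListener('click', (e) => {
    // 3. Start a transaction and request a write for every click.
    const tx = db.transaction('clicks', 'readwrite');
    tx.objectStore('clicks').add({ target: e.target.id, time: Date.now() });
    // 4./5. Wait for the right DOM event to know the write finished.
    tx.oncomplete = () => console.log('click stored');
  });
};
```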
Here are a couple of tutorials on IndexedDB:
https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Using_IndexedDB
http://www.html5rocks.com/en/tutorials/indexeddb/todo/
As for giving the user that file: as mentioned briefly before, you would have to upload the file to the server and then download it when the "download" button is clicked. Unfortunately, you have to take that detour just to hand users data that is already on their machine. Anyway, hope this all helps.

Using JS to create a "link" to a password protected site

Quite a few of the sites that the schools I work in use have user accounts to protect the content from people who haven't paid for it, which means that the users (aged 5+) have to type in some pretty weird usernames/passwords before they can do their work.
I was wondering if it is possible to use JavaScript to create a page that would let me do something along the lines of:
Fetch the Login Page
Fill out the form
Submit It
Redirect the user to the site
1-3 would happen in the background without the user seeing it.
In most cases these accounts are shared and the details are on displays etc... in the classrooms so there is no issue with the details being publicly accessible.
I have used Mechanize in Ruby before and imagine a solution like it, but running client-side.
I know that some inspection of the target site will be needed but once I have an in-principle example I should be able to tailor it to each site later.
If your users are on a standardized browser, you should consider building a plugin (extension) for that browser; that's the easiest way to interact with the web pages. Otherwise you'll run into anti-CSRF protections and cross-domain policies.
As for the language, Chrome extensions are written in JavaScript and are pretty easy to build. I don't know about the other browsers.
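As a rough illustration, here is what the heart of such an extension could look like: a content script that Chrome injects into the login page (via a content_scripts entry in the manifest whose matches pattern points at that page). The field selectors and credentials below are placeholders; inspect each target site to find the real names.

```javascript
// content-script.js: fill and submit a login form automatically.
const user = document.querySelector('input[name="username"]');
const pass = document.querySelector('input[name="password"]');

if (user && pass) {
  user.value = 'class-account';   // the shared classroom credentials
  pass.value = 'class-password';
  // Submit the form; the site then redirects the pupil as usual.
  user.closest('form').submit();
}
```

Note that form.submit() skips any JavaScript submit handlers the site has attached; if login breaks, programmatically clicking the submit button (querySelector('[type=submit]').click()) is often the safer route.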

How to disable the view source option in Firefox and Chrome?

I have created a webpage, but my friends and colleagues always copy the source code and all the data easily. Is there any way to hide the page source option in the browser?
As a rule, if you are putting information on another user's computer (whether because you made a document or they viewed your webpage), you really can't control what they do with it.
This is an issue that larger companies deal with often. Have you heard of DRM? It's a mechanism that companies like to try to use to control how people can connect to their services, use their content and in general, try to exert control over their data while it's on your system.
Now, a web page is a relatively simple container for holding information. You expressed an urge to prevent your friends from copying the source code. You could try to encrypt it, but if it's using local data to decrypt itself, there still isn't going to be anything that stops them from just copying what's in the View Source window and running it again (even if they can't really read it).
I'd suggest that you don't worry about it. If what you have on your page is so important that others shouldn't be able to see it, don't put it on a webpage.
Finally, Google doesn't much care that you're able to view the source to their home page. Why not? Because the value of the search engine isn't in what the home page looks like, but in the data on the back-end that you don't have direct access to. The value is in the algorithms that execute on the server when you hit that Google Search button that queries that data and returns the information you're looking for. There's very little relative value in the generated HTML that you see in the page. Take a leaf from their book and don't stress that they copy your HTML.
No, there isn't any way to do it. You can disable right-clicking in the browser via JavaScript, but users can still use keyboard shortcuts to open the developer view (F12 in Chrome) and see the source. You cannot hide HTML or JavaScript from the client, but you can perhaps make it harder to read.
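For reference, the oft-suggested right-click block is a one-liner, and just as easily bypassed:

```javascript
// Suppress the context menu (cosmetic only; Ctrl+U and F12 still work).
document.addEventListener('contextmenu', (e) => e.preventDefault());
```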
No. Your HTML output is in the user's realm. Even if there were a way to disable view source in one client, a user could simply use a different one.
Always assume that your site's HTML is fully available to end users.
Yes and no. You can definitely make HTML and JS harder to interpret by obfuscating your code - that is, taking your code and making it look confusing. Here is a tool that can do that: http://www.colddata.com/developers/online_tools/obfuscator.shtml
However, these things all use code, and code can be deciphered through any number of methods. If you post a song to the internet, even if people cannot find the mp3, they can simply record their speakers. If you upload an image and prevent users from downloading it, they can take a screenshot or use their camera. For HTML and JavaScript to work, they have to be interpreted by the user's computer, and even if you do find a way to disable "View Source" there are other ways in, like a DOM inspector (F12 in IE/Chrome, Ctrl+Shift+K in Firefox).
As a workaround, use copyright, warn your users they will be punished if they copy your code, and put watermarks, labels and logos over any mp3s or images you don't want stolen. In the end, disabling right-clicking (which is also possible, see How do I disable right click on my web page?) or disabling selection (also possible) does nothing, because there is more than one way to get your code, like searching through temporary internet files.
However, you ask "what if I want a site where my users can log in and I need security? How can I make it so nobody can see my code then? Doesn't it have to be secure and not out in the open?"
And the answer is: yes, it needs to be secure. That's what server-side languages, like PHP, are for. PHP does all the work on the server itself so the user cannot see it. Rather than running in the user's browser, the work is done beforehand, before the page is sent; the code is never put onto the user's computer, because the user's computer doesn't need it. SSL is often paired with server-side code so that the traffic between the server and the user cannot be read in transit.
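The same principle applies in any server-side stack. Here is a minimal sketch using Node with Express (the route name and secret are made up) rather than PHP; the point is that the secret and the comparison live only on the server, so View Source can never reveal them.

```javascript
const express = require('express');
const app = express();
app.use(express.json());

const SECRET_ANSWER = '42'; // never shipped to the client

// The browser only ever sees this URL and the yes/no verdict.
app.post('/check', (req, res) => {
  res.json({ correct: req.body.answer === SECRET_ANSWER });
});

app.listen(3000);
```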
But HTML and JavaScript have to run in real time on the user's computer, so disabling View Source is useless: there are many, many ways users could get around it, even if View Source is disabled and even if right-clicking is disabled.
If your code doesn't need to be secure, however, I'd recommend you consider keeping it open source. :)

How to create web pages in which network requests cannot be captured by the browser's debugger

This doesn't involve coding. I am just curious on how to make a page like the one described below.
I came across a website where we can attend quiz/tests.
I tried debugging the browser to see if I could hack through the code by looking at the values passed around in the debugger.
But to my surprise, the debugger does not come up when I press F12 on that page.
Somehow I opened the debugger for that page and clicked on the Network tab to capture the requests that are sent.
But as I proceeded through the test, not even one request was captured in the debugger, yet the answers were validated and the scores updated!! I was not even able to inspect elements.
I guess it's a Java applet, as I saw the line below on the launch button:
flagPlayerCourse = true;launchApplet(secureSessionId,courseName, courseType,winParams, use508);disablePlayButton(1, 0);
The URL had SinglePassUserCmd.cfm?sessionid=3xxxxx in it.
So my question is: how can we create such a webpage, in which requests are not captured in the debugger? I would be happy if someone could tell me how to do the same in ASP.NET. In which language can we develop such web pages?
Applets are a completely different world; it's almost like running a .NET application on the client machine.
What you see in the debugger are AJAX requests and resource loads. If a site doesn't make any, you won't see any network requests in the browser.
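To see why, consider this sketch of a quiz that validates answers entirely in the page (all names here are illustrative): since there is no fetch/XHR call, the Network tab stays empty while the score still updates.

```javascript
// A quiz validated purely client-side: no network traffic to capture.
const quiz = [{ question: '2 + 2 = ?', answer: '4' }];
let score = 0;

function check(index, givenAnswer) {
  // The comparison happens in memory; nothing is sent over the wire.
  if (quiz[index].answer === givenAnswer) score++;
  document.getElementById('score').textContent = 'Score: ' + score;
}
```

Of course, the flip side is that the correct answers are then sitting in the page's own code, exactly the kind of thing the asker was hoping to find.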
That doesn't mean you can't capture the data being sent. You can always use a debugging proxy like Fiddler to see what traffic is going across. Of course, a secure site would protect its traffic with HTTPS.
Applets require a Java plugin in your browser. There are similar plugins, like Silverlight and Flash/Shockwave (SWF), that can also make their own network requests.
