I'm trying to write a Python web scraper that takes a Pandora account and gets all the stations from it.
However, the stations do not all show up immediately; I need to click the "show all" button to view all of them. Moreover, even after I click "show all", the page source remains unchanged!
My question is: where is the HTML that displays these extra, seemingly invisible elements?
Example)
if you go to http://www.pandora.com/people/nenadbach#tbl_stations_table,all
(the #tbl_stations_table,all makes all the stations show up; this is where the "show all" button takes you)
and view source, the stations after The Girl From Ipanema Radio aren't stored in the immediate source.
Thanks for the help!
If you view the source from Firebug (if you use Firefox) or Inspector (if you use Safari or Chrome) you can see that the data is there. It's most likely being pulled in via ajax (JavaScript).
You would either need a scraper that understands JavaScript, or you would need to find the HTTP Ajax calls the page is making and call them yourself. The call that you are probably looking for is:
http://www.pandora.com/favorites/profile_tablerows_station.vm?webname=nenadbach&countRowsOnBrowser=10&countRowsNeeded=25
Note that this is most likely using a cookie to detect who you are and what list to show.
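A minimal sketch of calling that endpoint yourself, using only the Python standard library. The endpoint and parameter names are taken from the URL above; the function names, the default values, and the reading of countRowsOnBrowser / countRowsNeeded as "rows already shown" / "rows wanted" are my assumptions, and the site may have changed since this was written.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode

# Endpoint copied from the request seen in the browser's network log.
ROWS_ENDPOINT = "http://www.pandora.com/favorites/profile_tablerows_station.vm"


def station_rows_url(webname, rows_on_browser=10, rows_needed=25):
    """Build the Ajax URL for a profile (hypothetical helper)."""
    query = urlencode({
        "webname": webname,
        "countRowsOnBrowser": rows_on_browser,  # assumed: rows already displayed
        "countRowsNeeded": rows_needed,         # assumed: rows to return
    })
    return f"{ROWS_ENDPOINT}?{query}"


def fetch_station_rows(webname):
    """Fetch the extra table rows, keeping cookies across requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(station_rows_url(webname), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(station_rows_url("nenadbach"))
```

If the server really does depend on a cookie, you would probably need to load the profile page through the same opener first so the cookie jar is populated before requesting the rows.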
I have a small app which calls a URL and scrapes the data returned from it. I now want to do something similar for another site, but this site uses JavaScript and the results are not included in the HTML. I've found a way to retrieve the data by using "stringByEvaluatingJavaScript", but to complicate things, the results I want are displayed on the webpage only after I click a button / function on the website:
i.e. To get to display the results I want, I have to:
1) go to the website. (data is displayed but not what I want)
2) click one of the options on the site. (data I really want is displayed)
The URL of this page never changes, as expected with JavaScript. So I want to know if there's a way to call the page so that when the page is displayed, it is already on the option I want, e.g. "https://example.com/page1?option" etc...
I don't know if this is possible since I don't know JavaScript but technically I think it should be?
Thanks.
I would use the Developer Tools/JavaScript console in your browser (Chrome has a pretty good one) to see what the browser sends to the server when you click on the button, then use that as the basis for your query. – cowbert
@cowbert's suggestion really did the trick! Upon digging more, I found more results in the Chrome console, and one of them actually has the link to the data, which is what I need!
Thank you to all who contributed! This is my first post here so if I didn't do something right, please forgive me.
A co-worker took this url: https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx which has month/year pagination via Javascript (see the elements on the right) and was able to give me this url:
https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx?__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKMTg0MTg0MzQ2NmRk1lDKkbV9IbwhES0FyX%2BlSLhp%2FzA%3D&__VIEWSTATEGENERATOR=380F4D6F&__EVENTVALIDATION=%2FwEdAAiUUGGuo52vbcR6TOSGc2%2FnlK%2BXrsQEVyjeDxQ0A4GYXFBwzdjZXczwplb2HKGyLlqLrBfuDtX7nV3nL%2B5njT0xZDpy7WJnvc3tgXY08CYLJD%2BrfdwJAuBoVBISURIXWlx9xf1loRXvygROM%2FA1O%2BNHJounKCGGAHd04zzVhBPZz4BK5Wx46wqhV0iQkxGw1Nhr9A6c&hdnYear=2016&hdnMonth=12&UsrFontCntr%24txtSearch=&UsrFontCntr%24btn=
where I can replace the year after hdnYear and the month after hdnMonth with any year and month, and it will bring me directly to that page. I asked him how he did it, and he said "I used the Network tab in Chrome dev tools." That's about all I could get out of him.
Does anyone know exactly how this is done? For example, I'm now trying to discover a similar way to get the actual URL for each page of this site: http://www.ojk.go.id/id/regulasi/otoritas-jasa-keuangan/peraturan-dan-keputusan-dewan-komisioner/Default.aspx by looking at the Network tab as I change pages. There is nothing I can see in there that's similar to the above example.
This is how it was done for the rbi.org.in URL you've mentioned
Open Chrome and go to the URL you've given
Right click on the page and select Inspect
Click on the Network tab.
Click on one of the year/month links on the website (the pagination you referred to)
In the Network tab, you'll see a list of GET/POST requests being made by the client (ie, the browser) to the server.
In the Filter box (on the top-left of the Network tab), type in the search filter method:POST.
Click on the entry in the Name column. This will open up more details about the POST request. Scroll down to the section titled Form Data.
Click on the view encoded button in the Form Data section
These are the parameters your friend included in the URL. You'll notice hdnYear and hdnMonth also listed in there. The URL your friend gave can be obtained by clicking on view source
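Once you have a URL like that, swapping the year and month can be scripted. This is a rough sketch using only the standard library; replace_query_params is a name I made up, and it assumes the server keeps accepting the same __VIEWSTATE / __EVENTVALIDATION values, which may not hold for long.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def replace_query_params(url, **overrides):
    """Return url with the named query parameters replaced (hypothetical helper).

    Parameters not mentioned in overrides are kept, including empty ones.
    """
    parts = urlsplit(url)
    # keep_blank_values=True preserves entries such as "__EVENTTARGET=".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    new_pairs = [(key, str(overrides.get(key, value))) for key, value in pairs]
    return urlunsplit(parts._replace(query=urlencode(new_pairs)))


if __name__ == "__main__":
    # Shortened stand-in for the full URL; paste the real one in practice.
    base = ("https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx"
            "?__EVENTTARGET=&hdnYear=2016&hdnMonth=12")
    print(replace_query_params(base, hdnYear=2015, hdnMonth=3))
```

Note that the values are decoded and re-encoded along the way, so sequences like %2F in the view state survive the round trip.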
Well I can't really tell you how to exactly reproduce this in the site you're trying to, but I can tell you what your co-worker did.
In the page https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx:
Open the Network tab in dev tools and clear the log if there's anything there.
Click on a year and month.
In the network log, search for BS_PressReleaseDisplay.aspx in the "Name" column and click on it.
Inside the Headers tab, go to "Form Data" and click on "view source".
And that's it: those are the URL parameters that your co-worker gave you. You can try doing this on the site you want to reproduce it on by clicking on another page and searching for Default.aspx, but you'll have to figure out what each parameter means to find which one is the page number or whatever you're looking for (check it in the parsed view for easier reading).
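To work out which parameter carries the page, one approach is to capture the "view source" form data for two different clicks and compare them. A small sketch, assuming you have pasted both strings in by hand; changed_params is a hypothetical helper, and repeated keys are collapsed to their last value.

```python
from urllib.parse import parse_qsl


def changed_params(form_a, form_b):
    """Map each parameter that differs to its (value_in_a, value_in_b) pair."""
    a = dict(parse_qsl(form_a, keep_blank_values=True))
    b = dict(parse_qsl(form_b, keep_blank_values=True))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Shortened, illustrative form data for two consecutive clicks.
    first = "__EVENTTARGET=&hdnYear=2016&hdnMonth=12"
    second = "__EVENTTARGET=&hdnYear=2016&hdnMonth=11"
    print(changed_params(first, second))
```

Anything that changes between two otherwise identical clicks is a good candidate for the page, month, or year selector; tokens such as view state values may also change and can usually be recognized by their length.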
Screenshots:
http://prnt.sc/emsl2w
http://prnt.sc/emsm2z
Hope this helps you.
The URL he sent you has URL parameters (a query string) that are read by the server, which then sends you the selected page.
So basically the server picks up the request and reads these parameters, which are then most likely passed to a method of some sort that queries a database and returns the result to you.
If you're the owner of the linked website, you can implement such a solution; otherwise you're stuck, since it requires coding on the backend.
I am replacing the showModalDialog function which no longer works in Chrome and FF. We have many applications using that. The problem is, pop up windows do post instructions to the web server and update the database. For instance if there's a list of accounts on screen and edit is clicked on one of the accounts, an edit page appears as a pop up, posts changes back to the web server, then the list is refreshed with changes. The entire list may be refreshed or just text that changed.
I made a javascript function to do pop up content using overlays. I thought it would be simple to replace showModalDialog calls with the javascript function, but I did not consider post instructions sent by the pop up page to update the database, and complexity to facilitate that. Posting can be done via ajax-like functionality, encapsulated in a set of functions. Before I start writing code to do this I'd like to know what other people have done in this circumstance. Thanks
I wrote some javascript to do everything I want. Since my pop up windows had javascript, I needed to run javascript upon rendering modal content, and also when the modal content went away. This will produce any number of overlays on top of each other, managing each. Content can optionally appear in a frame with a title bar, closely matching the functionality of showModalDialog.
Download at http://bikehappy.org/modal.html . If used, please give feedback saying if it works and provide update suggestions.
I have a little web app (which only has 1 page) that allows user to input and select some options. The input texts and selections will be displayed in another div in the form of table. You may want to refer to the example here: http://jsfiddle.net/xaKXM/5/
In this fiddle, you can type anything, and after you click submit it will take the text input and append it to another table, #configtableTable:
$('#labels #labelTable tr:last').after(addmore);
$('#configtable #configtableTable tr:last').after(displaymore);
I'm using CherryPy as a mini web server (and thus most of the code is written in Python), and I know that it has sessions, but I have no idea how to use them, as the example given is not really what I want to see.
FYI, I'm not using PHP at all and everything is in a single page. I simply show and hide the tables. But I want the page to keep showing #configtableTable and hiding #labelTable even after a refresh. Note that the fiddle is just part of the web app, which will only show all of this after getting a reply from another device.
Not sure about cookies, because all the links I've found seem broken. How about jQuery session? Is it applicable in my case? I need some examples of its use though :(
okay, to conclude my questions:
1. Can I save the page state after a refresh? And how? Which of the methods mentioned above is worth trying? Are there any examples for me to refer to? Or any other suggestions?
2. Can I simply DISABLE refresh or back after reaching a page?
Thanks everyone in advance :)
Don't disable Refresh and/or back navigation. It's a terrible idea: users have a certain expectation of what actions those buttons will perform, and modifying that leads to a bad user experience.
As for saving state, while you could use session or cookies, if you don't need that data server side, you can save the state on client side as well.
For example, you could use localStorage
Alternatively, you could create an object out of the data in the table, JSON.stringify() it and append it to the url like this: example.com#stateData.
With either option, at page load you'd have to check whether there is state data. If there is, use it to recreate the table instead of displaying the form.
The disadvantage of the first is that not all browsers support localStorage.
The disadvantage of the second is that URLs have a length limit and so this solution won't necessarily work for you if you're expecting large amounts of data.
EDIT
It appears that Midori does support most HTML5 features, including localStorage; however, it's turned off by default. (I'm trying to find a better reference.) If you can, just point Midori to html5test to see what HTML5 features it supports.
I'm a student studying bioinformatics.
I'm trying to make a crawler where I can put the lists of queries and get the results automatically.
The site I'm interested in is the GEO DataSet site.
www.ncbi.nlm.nih.gov/gds/
If I wish to send a query like 'lung cancer', I can use the following address.
http://www.ncbi.nlm.nih.gov/gds/?term=lung+cancer.
And there are 549 pages showing up.
I can get the results of the first page, but I don't know how to move to the next page.
I mean, how can I move to the next page by changing the URL?
The Next button is linked as "www.ncbi.nlm.nih.gov/gds/?term=lung+cancer#" and I don't think it's the actual URL that button is linked to.
I'm new to JavaScript, but I heard the hash sign (#) is processed by JavaScript.
I wonder if there is something I can do like
"http://www.ncbi.nlm.nih.gov/gds/?term=lung+cancer&page=2"
so that I can move to the second page.
If you use any debugger tool (Firebug for Firefox, WebDeveloper for Chrome), you should be able to monitor the network traffic. If you do that, you'll see that clicking the next button submits a form, sending data via the POST method. However, if you append the POST data to a GET string, you can also get to the next page. The following URL lets you access the second page of the result set (warning: really, really long!):
http://www.ncbi.nlm.nih.gov/gds/?term=lung+cancer?term=lung+cancer&EntrezSystem2.PEntrez.Gds.Entrez_PageController.PreviousPageName=results&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.sPresentation=docsum&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.sPageSize=20&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.sSort=none&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.FFormat=docsum&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.FSort=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.FileFormat=docsum&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.LastPresentation=docsum&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.Presentation=docsum&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.PageSize=20&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.LastPageSize=20&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.Sort=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.LastSort=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.FileSort=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.Format=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.LastFormat=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Pager.cPage=1&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Pager.CurrPage=2&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_ResultsController.ResultCount=10973&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_ResultsController.RunLastQuery=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Pager.cPage=1&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.sPresentation2=docsum&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.sPageSize2=20&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.sSort2=none&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.FFormat2=docsum&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_DisplayBar.FSort2=&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Filters.CurrFilter=all&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Filters.LastFilter=all&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_MultiItemSupl.Taxport.TxView=list&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_MultiItemSupl.Taxport.TxListSize=5&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_MultiItemSupl.RelatedDataLinks.rdDatabase=rddbto&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Gds_MultiItemSupl.RelatedDataLinks.DbName=gds&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Discovery_SearchDetails.SearchDetailsTerm=%22lung+neoplasms%22%5BMeSH+Terms%5D+OR+lung+cancer%5BAll+Fields%5D&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.HistoryDisplay.Cmd=PageChanged&EntrezSystem2.PEntrez.DbConnector.Db=gds&EntrezSystem2.PEntrez.DbConnector.LastDb=gds&EntrezSystem2.PEntrez.DbConnector.Term=lung+cancer&EntrezSystem2.PEntrez.DbConnector.LastTabCmd=&EntrezSystem2.PEntrez.DbConnector.LastQueryKey=1&EntrezSystem2.PEntrez.DbConnector.IdsFromResult=&EntrezSystem2.PEntrez.DbConnector.LastIdsFromResult=&EntrezSystem2.PEntrez.DbConnector.LinkName=&EntrezSystem2.PEntrez.DbConnector.LinkReadableName=&EntrezSystem2.PEntrez.DbConnector.LinkSrcDb=&EntrezSystem2.PEntrez.DbConnector.Cmd=PageChanged&EntrezSystem2.PEntrez.DbConnector.TabCmd=&EntrezSystem2.PEntrez.DbConnector.QueryKey=&p%24a=EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Pager.Page&p%24l=EntrezSystem2&p%24st=gds
This complete GET string contains all search parameters like items per page, search terms, display and way more. You should be able to figure out which parameter is used for the offset (cPage and CurrPage are your friends) and then alter it to your needs.
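As a rough sketch of altering the page parameter in a crawler: the code below rewrites every query parameter whose name ends in Entrez_Pager.CurrPage and leaves the rest alone. url_for_page and the suffix match are my own choices, and treating CurrPage (rather than cPage) as the target page is an assumption you should check against what the site returns.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed to be the parameter that selects the page to show.
PAGE_SUFFIX = "Entrez_Pager.CurrPage"


def url_for_page(template_url, page):
    """Return template_url with the CurrPage parameter set to page."""
    parts = urlsplit(template_url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    new_pairs = [
        (key, str(page) if key.endswith(PAGE_SUFFIX) else value)
        for key, value in pairs
    ]
    return urlunsplit(parts._replace(query=urlencode(new_pairs)))


if __name__ == "__main__":
    # Shortened stand-in for the long GET string; use the captured one in practice.
    template = ("http://www.ncbi.nlm.nih.gov/gds/?term=lung+cancer"
                "&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Pager.cPage=1"
                "&EntrezSystem2.PEntrez.Gds.Gds_ResultsPanel.Entrez_Pager.CurrPage=2")
    for page in range(2, 5):
        print(url_for_page(template, page))
```

If the results stop changing after the first few pages, cPage probably needs to be updated as well, for example to the page you are coming from.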
EDIT: Btw, to find javascript events bound to an HTML element, you can use the bookmarklet found at http://www.sprymedia.co.uk/article/Visual+Event+2