I am trying to scrape the following website for a project: https://www.tunefind.com/show/chicago-fire/season-1/12210
The last step is to scrape the links to the Spotify songs mentioned on a page. Normally I look into the source code and it is clear from there.
However, not in this case. Looking into the source around the Spotify button, I cannot find a hyperlink directing me to the song. Probably done on purpose, to prevent scraping? (Oops)
Is there a way to get the hyperlink from the button? I am aware of an 'internet' interface in Python which clicks on the buttons, but I would rather not use this, as it will affect the load time tremendously.
Thanks!
If you are using Chrome, open DevTools, go to the Network tab and reload the site. You'll find that the data you need is at this URL:
https://www.tunefind.com/api/frontend/show/chicago-fire/season/1?fields=episodes,theme-song,music-supervisors,hot-songs,top-users,related-questions-season,composers,albums&metatags=1
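For what it's worth, here is a minimal Python sketch of calling that endpoint directly instead of parsing the HTML. It only uses the standard library. The User-Agent header and the shape of the returned JSON are assumptions (the response format isn't shown here), and `build_url` / `fetch_json` are just illustrative helper names.

```python
import json
import urllib.parse
import urllib.request

# Base URL and field list copied from the request seen in the Network tab.
BASE = "https://www.tunefind.com/api/frontend/show/chicago-fire/season/1"
FIELDS = [
    "episodes", "theme-song", "music-supervisors", "hot-songs",
    "top-users", "related-questions-season", "composers", "albums",
]

def build_url(base=BASE, fields=FIELDS):
    """Rebuild the query string; safe="," keeps the commas unescaped."""
    query = urllib.parse.urlencode(
        {"fields": ",".join(fields), "metatags": 1}, safe=","
    )
    return f"{base}?{query}"

def fetch_json(url):
    """Download and decode the JSON body (performs a real HTTP request)."""
    # Assumption: the server may reject requests without a browser-like User-Agent.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(build_url())
```

Calling `fetch_json(build_url())` should then give you a plain dict to dig through, assuming the endpoint still responds the way it does in the browser.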
Not sure if that helps you, but all that data is loaded into window.__INITIAL_STATE__.
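If you go the window.__INITIAL_STATE__ route, a rough sketch of pulling it out of the page source could look like this. It assumes the value is assigned as valid JSON inside an inline script (a JavaScript object literal that isn't strict JSON would raise a json.JSONDecodeError). The sample HTML and its keys are made up for illustration and are not the real structure of the site.

```python
import json
import re

def extract_initial_state(html, marker="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `marker` in the HTML, or None if absent."""
    # Find "window.__INITIAL_STATE__ =" and skip any whitespace before the value.
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the trailing ";" and the rest of the script).
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state

# Hypothetical sample page, only to show the call.
sample = (
    '<script>window.__INITIAL_STATE__ = '
    '{"songs": [{"name": "Example", "spotify": "/forward/1"}]};</script>'
)
state = extract_initial_state(sample)
print(state["songs"][0]["spotify"])
```

Using raw_decode rather than a regex for the whole object avoids having to guess where the nested braces end.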
You're doing it wrong by scraping the site. If they change one single thing, your whole project stops working. You should use their API instead: https://www.tunefind.com/product/api
I have a small app which calls a URL and scrapes the data returned from it. I now want to do something similar for another site, but this site uses JavaScript and the results are not included in the HTML. I've found a way to retrieve the data by using "stringByEvaluatingJavaScript", but to complicate things, the results I want are displayed on the webpage only after I click a button / function on the website.
That is, to display the results I want, I have to:
1) Go to the website (data is displayed, but not what I want).
2) Click one of the options on the site (the data I really want is displayed).
The URL of this page never changes, as expected since the page is driven by JavaScript. So I want to know if there's a way to call the page so that when it is displayed, it is already on the option I want, e.g. "https://example.com/page1?option" etc...
I don't know if this is possible since I don't know JavaScript but technically I think it should be?
Thanks.
I would use the Developer Tools/javascript console on your browser
(Chrome has a pretty good one) to see what the browser sends to the
server when you click on the button, then use that as the basis for
your query. – cowbert
@cowbert's suggestion really did the trick! Upon digging more, I found more results in the Chrome console, and one of them actually has the link to the data, which is what I need!
Thank you to all who contributed! This is my first post here so if I didn't do something right, please forgive me.
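For anyone landing here later, this is roughly what "use the request the browser sends as the basis for your query" can look like, sketched in Python with the standard library rather than in the app's own language. The endpoint, the form field and the X-Requested-With header are all hypothetical placeholders. Copy the real ones from the request that shows up in the Network tab when you click the button.

```python
import urllib.parse
import urllib.request

# Hypothetical values: replace with the URL and form fields of the request
# that appears in the browser's Network tab after clicking the button.
ENDPOINT = "https://example.com/data"
FORM_FIELDS = {"option": "2"}

def build_replay_request(url=ENDPOINT, fields=FORM_FIELDS):
    """Build (but do not send) a POST request that mimics the button click."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    # Passing data= makes urllib issue a POST instead of a GET.
    return urllib.request.Request(
        url,
        data=body,
        headers={"X-Requested-With": "XMLHttpRequest"},  # assumed header
    )

def send(request):
    """Send the request and return the body as text (assumes UTF-8)."""
    with urllib.request.urlopen(request) as resp:
        return resp.read().decode("utf-8")

req = build_replay_request()
print(req.get_method(), req.full_url)
```

If the browser's request turns out to be a plain GET with a query string, drop the `data=` argument and put the parameters in the URL instead.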
I want to scrape data from a website within my Java application. The data I want to collect is inside an HTML table element. I tried two different methods:
I tried to load the website with a BufferedReader into a String and collect the data from the String.
I tried to use Jsoup to get access to the exact html-element, but it's empty.
It turns out that the table exists, but it is empty as long as the user has not pressed a button (labeled "load raw data"). I inspected the source code of the webpage. When the user presses the button, a load_table() function is called which loads the data into the table. Obviously, the URL remains the same, otherwise I could've just used the other URL where the data is already loaded into the table. Does anyone have an idea how to scrape data from a website when the data only appears after the user presses a button once the page has loaded?
I'm not really a trained Javascript-coder, but I tried to look through the script which is executed after the user presses the button. It's kind of hard to understand for me but I made a pastebin of the script with a highlighting where I think the rows are added to the table if that helps. The code for the button is:
Load raw data
The code I use to access the html element with Jsoup would be (all the child(x) methods are called on different div-elements to go deeper into the html-document until I finally reach the table-element):
Jsoup.connect(url).get().body().children().get(5).child(0).child(4).child(1).child(1);
As I stated above, the element is empty. I hope the description of my problem is detailed enough and somebody has at least an idea of what I'm trying to say. Sorry for my clumsy expressions. Not a native speaker.
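If you are familiar with Selenium WebDriver, you could use Selenium to load the page and then pass the page source to BeautifulSoup:
html = driver.page_source  # driver is an already-started Selenium WebDriver
soup = BeautifulSoup(html, "html.parser")  # from bs4 import BeautifulSoup
You could parse the page this way, I guess.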
To go back to the appropriate tumblr page, we're using:
Back
However, we're getting a lot of traffic directly from the Twitter app, and this stops the function from working. Is there a way to make it so that, if history.go doesn't work (or would take you outside the site), it just takes you to index.html?
This is one of the pages the history button is on: http://lexican.info/post/49265445109/sesquipedalophobia
Thanks for any help at all.
Sadly I don't think this is possible as there is no relationship between a post and what page of the index the post is displayed on.
Try checking how many pages are in the history list with history.length:
http://www.w3schools.com/jsref/prop_his_length.asp
I have a section of a site with multiple categories of Widget. There is a menu with each category name. For anybody with Javascript enabled, clicking a category reveals the content of the category within the page. They can click between categories at will, seeing the DOM updated as needed. The url is also updated using the standard hash/hashbang (if we are being Google-friendly). So for somebody who lands on example.com/widgets, they can navigate around to example.com/widgets#one, example.com/widgets#two, example.com/widgets#three etc.
However, to support user agents without Javascript enabled, following one of these category links must load a new page with the category displayed, so for someone without javascript enabled, they would navigate to example.com/widgets/one, example.com/widgets/two, example.com/widgets/three etc.
My question is: What should happen when somebody with Javascript enabled lands on one of these URLs? What should someone with Javascript enabled be presented with when landing on example.com/widgets/one, for example? Should they be redirected to example.com/widgets#one?
Please note that I need a single page site experience for anybody with Javascript enabled, but I want a multi-page site for a user agent without JavaScript. Any answer that doesn't address this fact doesn't answer the question. I am not interested in the merits or problems of hashbangs or single-page-sites vs multi-page-sites.
This is how I would structure it:
Use HistoryJS to manage the URL. JS browsers with pushState support get full, correct URLs, and JS browsers without pushState get hashed URLs. Non-JS users go to the full URL as normal, with a page reload.
When a user clicks a link:
If they have JS:
All clicks to other pages are handled by a function that prevents the default action, grabs the HREF and passes the URL to an ajax request and updates the URL at the same time. The http response for that ajax request is then parsed and then loaded into the content area.
Non JS:
Page refreshed as normal and loads the whole document.
When a page loads:
With JS: Attach an event handler to all your links to prevent the default so their href is dealt with via Ajax.
Without JS: Nothing. Allow anchors to work as normal.
I think you should definitely have all of your content accessible via a full, correct URL, and be loading it in via Ajax, then updating the URL to reflect the address where you got your content from. That way, when JS isn't running, you don't have to change anything.
Is that what you mean?
Apparently your question already contains the answer. You say:
I need a single page site experience for anybody with Javascript enabled
and then ask:
What should someone with Javascript enabled be presented with when landing on example.com/widgets/one for example? Should they be redirected to example.com/widgets#one?
I'd say yes, they should be redirected. I don't see any other option, given your requirements (and the fact that information about JavaScript capabilities and the hash fragment of the URL are not available on the server side).
If you can accept relaxing the requirements a bit, I see another option. Remember when the web was crowded with framesets, and we landed on a specific frame via AltaVista (Google wasn't around yet!) search? It was common to see a header saying that page was supposed to be displayed as a frame, and a link to take the user to the frameset version.
You could do something similar: when scripting is available, detect that you're at example.com/widgets/one and add a link to the single-page version. I know that's not ideal, but it's better than nothing, and maybe better than a nasty client-side redirect.
Why should you need to redirect them to a different page? The user arrived at the page looking for an answer. He gets the answer even if he has Javascript enabled. It doesn't matter. The user's query has been fulfilled.
But what would happen if the user lands on example.com/widgets#one? You would need to set up an automatic redirect to example.com/widgets/one in that case. That could be done by checking whether Javascript is enabled in the onload event and redirecting to the appropriate page.
One way for designing such pages is to design without javascript first.
You can use anchors in the page so:
example.com/widgets#one
Will be a link to the element with id 'one'
Once your page works without javascript, you add the javascript layer. You can prevent links from being followed by using event.preventDefault
(https://developer.mozilla.org/fr/docs/DOM/event.preventDefault), then add the desired javascript functionality.
Does anyone have any idea how to get the number of times a user visits a particular site? For instance, if you do a search on Google and there's a link that you have already clicked, Google will tell you how many times you have visited that particular link. Any ideas on how to code something like that using javascript?
Thanks.
In Google's case, they can track that you have clicked on a link. It's a specific action that they can attach a javascript listener to. If you want to do the same thing on your own site, you can add some javascript that does something similar: any time a link is clicked, an AJAX call can be made that allows you to track that it was clicked.
However, if you are just looking to get some basic stats about pages on your site you can add Google Analytics to it, and it will gather a large amount of useful data for you.
http://www.google.com/analytics/
If you want to know how many people are visiting your page, you probably want to check out something like Google Analytics rather than making it yourself. It will give you a lot of data that you'd have to make a lot of effort to gather yourself.