How to scrape 'src' or 'href' value when it uses Javascript? - javascript

Perhaps this is a simple solution, but I'm just really stuck on this one.
Say you want to pull the value of 'href' from a webpage using BeautifulSoup, for example:
soup.find("a", {"id" : "home-page"})['href']
How would you do this if the element looked like this:
<a id="main_lnkWool" class="WhiteLinkText Canela-Medium-Web" href="javascript:__doPostBack('ctl00$main$lnkWool','')">Wool</a>
That is, when the URL is produced by a JavaScript call instead of being written in the href?
I can see the jquery.js file the site is using, I'm just not sure how to put all the pieces together to get the URL. All I'm trying to do is use requests to scrape the URLs of certain ranges of products.
Here is a link for reference: https://www.kersaintcobb.co.uk/home
The links I'm trying to extract are under the tab 'Our Products'.
I know there are only 6 pages in total, and yes I could just copy and paste them at this point lol! But it's a question I need answering anyway as I've encountered this same problem on other projects so would really help me out if I knew how to solve it.
Thank you :)

Maybe not the best approach, but with JS sites what has worked for me is a webdriver, which is a real web browser you control from code (it can run headless btw, i.e. without a visible window). Wait until the page loads, then pass the page source to BS4. For more info: https://chromedriver.chromium.org/getting-started
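An alternative that avoids a browser, offered as a sketch and not tested against this particular site: an href of the form javascript:__doPostBack(target, argument) is the ASP.NET WebForms postback pattern, where the page's script copies the two arguments into hidden __EVENTTARGET and __EVENTARGUMENT fields and submits the form. The function names below are hypothetical, and hiddenFields is assumed to hold the form's other hidden inputs (such as __VIEWSTATE) scraped from the same page.

```javascript
// Parse an href such as
//   javascript:__doPostBack('ctl00$main$lnkWool','')
// into the two values the page's script would submit.
function parseDoPostBack(href) {
  const match = /__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/.exec(href);
  if (!match) return null; // not a postback link
  return { eventTarget: match[1], eventArgument: match[2] };
}

// Build the form body for replaying the postback.
// hiddenFields (assumption): the page's hidden inputs, e.g. { __VIEWSTATE: "..." }.
function buildPostBackBody(href, hiddenFields) {
  const parsed = parseDoPostBack(href);
  if (!parsed) return null;
  const body = new URLSearchParams(hiddenFields);
  body.set("__EVENTTARGET", parsed.eventTarget);
  body.set("__EVENTARGUMENT", parsed.eventArgument);
  return body;
}
```

The resulting fields would then be POSTed back to the page's own URL. With requests, the same key/value pairs can be passed as data= to requests.post. Whether the response is the product page itself or a redirect to it depends on the site.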

Related

Get data from another HTML page

I am making an on-line shop for selling magazines, and I need to show the image of the magazine. For that, I would like to show the same image that is shown in the website of the company that distributes the magazines.
For that, it would be easy with an absolute path, like this:
<img src="http://www.remotewebsite.com/image.jpg" />
But that is not possible in my case, because the name of the image changes every time there is a new magazine.
In Javascript, it is possible to get the path of an image with this code:
var strImage = document.getElementById('Image').src;
But, is it possible to use something similar to get the path of an image if it is in another HTML page?
Assuming that you know how to find the correct image in the magazine website's DOM (otherwise, forget it):
-the magazine website must explicitly allow browsers showing your website to fetch its content, by enabling CORS
-you fetch their HTML -> gets you a stream of text
-you parse it with DOMParser -> gets you a Document
-using your knowledge of their layout (or good heuristics, if you're feeling lucky), you use regular DOM navigation to find the image and get its src attribute
I'm not going to detail any of those steps (there are already lots of SO answers around), especially since you haven't described a specific issue you may have with the technical part.
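For readers who want a starting point, the steps above could be sketched roughly as follows. This is a minimal sketch, assuming the remote site sends CORS headers that permit the request. The function names and the selector argument are hypothetical, and fetchImageSrc only works in a browser, where DOMParser exists.

```javascript
// Find an image in a parsed Document and resolve its src against the
// remote page's URL, so a relative path becomes an absolute one.
function findImageSrc(doc, selector, pageUrl) {
  const img = doc.querySelector(selector);
  if (!img) return null;
  const src = img.getAttribute("src");
  return src ? new URL(src, pageUrl).href : null;
}

// Browser-only: fetch the remote page, parse it, and look up the image.
// The fetch rejects if the remote site does not allow cross-origin reads.
async function fetchImageSrc(pageUrl, selector) {
  const response = await fetch(pageUrl);
  if (!response.ok) throw new Error("HTTP " + response.status);
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, "text/html");
  return findImageSrc(doc, selector, pageUrl);
}
```

The sketch reads the raw src attribute and resolves it with URL, so that a relative path is resolved against the remote page's address and not against your own page.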
You can, but it is inefficient. You would have to do a request to load all the HTML of that other page and then in that HTML find the image you are looking for.
It can be achieved (using XMLHttpRequest or fetch), but I would maybe try to find a more efficient way.
What you are asking for is technically possible, and other answers have already gone into the details about how you could accomplish this.
What I'd like to go over in this answer is how you probably should architect this given the requirements that you described. Keep in mind that what I am describing is one way to do this, there are certainly other correct methods as well.
Create a database on the server where your app will live. A simple MySQL DB will work, but you could use anything. Create a table called magazine, with a column url. Your code would pull the url from this DB. Whenever the magazine URL changes, just update the DB and the code itself won't need to be changed.
Your front-end code needs some sort of way to access the DB. One possible solution is a REST API. This code would query the DB for the latest values (in your case magazine URLs), and make them accessible to your web page. This could be done in a myriad of different languages/frameworks, here's a good tutorial on doing something like this in Node.js and express (which is what I'd personally use).
Finally, your front-end code needs to call your REST API to get the updated URLs. This needs to be done with some kind of JavaScript based language. jQuery would make this really easy, something like this:
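$(document).ready(function() {
  // Ask the REST API for the current magazine image URL
  $.get("http://uri_to_your_rest_api", function(data) {
    // Point the <img> at the URL returned by the API
    $("#myImage").attr("src", data.url);
  });
});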
Assuming you had HTML like this:
<img id="myImage" src="">
And there you go - You have a webpage that pulls the image sources dynamically from your database.
Now if you're just dipping your toes into web development, this may seem a bit overwhelming. But I promise you, in the long run it'll be easier than trying to parse code from an HTML page :)

Pass innerText value of a page to Chrome extension without opening the page

I've been looking all over for this, and I think the problem is that I inherently suck at programming or scripting of any sort, and I don't know the right words to use...
Basically: I want to make a Chrome extension that reads the innerText value from the ticketing system at the place where I work. As an example...
<span class="infomsg">Tickets Found [<span id="tickets_count">5</span>]</span>
The goal would be for the extension to display the text "5" over the icon.
What's the best way to do this? I've tried configuring the background.html page with an iframe with the URL with the ticket count as the source, but then I run into the cross-domain scripting issue. document.getElementById("tickets_count").innerHTML can't use a specified URL, as near as I've found.
I'm sure I haven't described it very well at all - totally floundering here, to be honest...let me know what I can clarify, and I'll edit my post.
Thanks!
It depends on whether the page you're looking at is static (e.g. the server sends you HTML with this information already in it) or dynamic (e.g. some JavaScript on the page requests additional information and then adds this to the page).
If it's static, you can use XHR to request the page and find the string you need in the "raw" HTML response. You can't use getElementById in that case - you'll need to find a way to find the string yourself.
If it's dynamic, that won't work. An iframe-in-the-background approach is valid - but you can't access the contents of the iframe. Instead, you should inject a content script in that page and request the information you need.
I understand it's a broad answer - but your question is also quite broad.
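For the static case, a minimal sketch might look like the following. It swaps XHR for fetch and pulls the number out of the raw HTML with a regular expression, which is fragile but matches the markup in the question. The function names and the ticketsUrl parameter are hypothetical, and updateBadge only runs inside an extension that has permission for that URL. chrome.browserAction is the Manifest V2 name; Manifest V3 uses chrome.action instead.

```javascript
// Extract the text inside <span id="tickets_count">...</span> from raw HTML.
function extractTicketCount(html) {
  const match = /<span[^>]*\bid="tickets_count"[^>]*>([^<]*)<\/span>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Extension-only: fetch the ticket page and show the count on the icon.
// ticketsUrl (assumption): the address of the page that contains the count.
async function updateBadge(ticketsUrl) {
  const response = await fetch(ticketsUrl, { credentials: "include" });
  const count = extractTicketCount(await response.text());
  // Show "?" when the span could not be found, e.g. on a login page.
  chrome.browserAction.setBadgeText({ text: count === null ? "?" : count });
}
```

Calling updateBadge on a timer from the background page would keep the badge current without ever opening the ticket page in a tab.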

Sharepoint - How to: dynamic Url for Note on Noteboard

I'm quite new to SharePoint (about 1 week into it actually) and I'm attempting to mirror certain functionality that my company has with other products. Currently I'm working on how to duplicate the tasking environment in Box.com. Essentially it's just an email link that goes to a webpage where users can view an image and comments related to that image side by side.
I can dynamically load the image based on url parameters using just Javascript so that part is not a problem. As far as the comments part goes I've been trying to use a Noteboard WebPart, and then my desire is to have the "Url for Note" property to change dependent on the same URL parameter. I've looked over the Javascript Object Model and Class Library on MSDN but the hierarchy seems to stop at WebPart so I'm not finding anything that will allow me to update the Url for Note property.
I've read comments saying that there's a lot of exploration involved with this so I've tried the following:
-loading the javascript files into VisualStudio to use intellisense for looking up functions and properties in the SP.js files.
-console.log() on: WebPartDefinitionCollection, WebPartDefinition, WebPart, and methods .get_objectData(), get_properties() on all the previous
-embedding script in the "Builder" on the Url for Note property (where it says "click to use Builder" - I'm still not sure what more this offers than just a bigger textbox to put in the URL path)
I'm certain I've missed something obvious here but am gaining information very slowly now that I've exhausted the usual suspects. I very much appreciate any more resources or information anyone has and am willing to accept that I may be approaching this incorrectly if someone has accomplished this before.
Normally I'd keep going through whatever info I could find but I'm currently on a trial period and start school back up again soon so I won't have as much time with it. Apologies if this seems impatient, I'm just not sure where else to look at the moment.
Did you check out API libraries like SPServices or SharepointPlus? They could help you do what you want...
For example with SharepointPlus you could:
Create a Sharepoint List with a "Note" column and whatever you need to record
When the user goes to the page with the image you just show a TEXTAREA input with a SAVE button
When the user hits the SAVE button it will save the Note to the related list using $SP().list("Your list").add()
And you can easily retrieve the information (to show them to the user if he goes back to the page) with $SP().list("Your list").get()
If I understood your problem, that way it may be easier for you to deal with a customized page :-)

How do I extract data from a website using javascript.

Hi complete newbie here so bear with me. Seems like a simple job but I can't seem to find an easy way to do this.
So I need to extract a particular text from a webpage "www.example.com/index.php". I know that the text would be available in p tag with certain id. How do I extract this data out using javascript?
What I'm trying currently is that I have my javascript file (trying.js) on my computer with the following code:
$(document).ready(function () {
  $.get("www.example.com/index.php", function(data) {
    console.log(data);
  });
});
and a html that runs the javascript file.
When I open this html page with firefox it doesn't show me anything in console. How do I get the website's data? Am I on the correct track here? Is there a better way to do this?
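What you're looking for is a page scraper. JavaScript running in a browser page can't pull this off, because the same-origin policy only lets it read data from the domain the page is on (unless the other site explicitly allows it via CORS).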
You could build it in Ruby, for example, and use one of the many existing gems for this sort of task, like https://github.com/assaf/scrapi or http://nokogiri.org/
Please take a look at Can Javascript read the source of any web page?
There are multiple ways discussed. Hope it helps you.

Does google robot index text from javascript document.write()?

Let's say I have this:
<script type="text/javascript">
var p = document.getElementById('cls');
p.firstChild.nodeValue = 'Some interesting information';
</script>
<div id="cls"> </div>
So, will Google's robots index the text "Some interesting information" or not?
Thanks!
AFAIK, Googlebot now indexes AJAX and JavaScript content. For reference, see:
http://www.submitshop.com/2011/11/03/google-bot-now-indexing-ajax-javascript
Get google to index links from javascript generated content
Update
Search Engine Watch recently mentioned that Googlebot has been improved to read JavaScript. To quote exactly:
it can now read and understand certain dynamic comments implemented
through AJAX and JavaScript. This includes Facebook comments left
through services like the Facebook social plugin.
We've had a need to hide pieces of information on pages from GoogleBot. As the information wasn't extremely sensitive, we've used document.write()-s to avoid searchbots indexing content in question.
Later, in 2011 Q3, I found that GoogleBot did index the scripted content, so I'm now pretty sure that Google indexes much more than the URLs it finds in content, even though this isn't documented in any depth.
Google doesn't index the JavaScript code or the generated content. You will only see it in the cache because the cached page consists of the complete file including the JavaScript code and your browser renders it. Google does scan JavaScript for URLs to crawl, so if the code is pulling content from an external file via Ajax, etc., there's a chance that the external file will also be indexed, but separate from the parent page. If you want the content to be indexed, it's got to be in plain HTML. Good luck!
