How do I extract data from a website using JavaScript?

Hi complete newbie here so bear with me. Seems like a simple job but I can't seem to find an easy way to do this.
So I need to extract a particular text from a webpage "www.example.com/index.php". I know that the text would be available in p tag with certain id. How do I extract this data out using javascript?
What I'm trying currently is that I have my javascript file (trying.js) on my computer with the following code:
$(document).ready(function () {
  $.get("www.example.com/index.php", function (data) {
    console.log(data);
  });
});
and a html that runs the javascript file.
When I open this html page with firefox it doesn't show me anything in console. How do I get the website's data? Am I on the correct track here? Is there a better way to do this?

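What you're looking for is a page scraper. JavaScript running in a browser can't do this by itself: the same-origin policy only lets a script read responses from the page's own origin unless the other site explicitly allows it through CORS. Note also that "www.example.com/index.php" has no scheme, so $.get treats it as a path relative to your HTML file; it would need to be "http://www.example.com/index.php" to reach the site at all.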
You could build it in Ruby, for example, and use one of the many existing gems for this sort of task, like https://github.com/assaf/scrapi or http://nokogiri.org/

Please take a look at "Can Javascript read the source of any web page?"
There are multiple ways discussed. Hope it helps you.

How to scrape 'src' or 'href' value when it uses Javascript?

Perhaps this is a simple solution, but I'm just really stuck on this one.
Say when you would pull the value of 'href' from a webpage using BeautifulSoup, for example:
soup.find("a", {"id" : "home-page"})['href']
How would you do this if the element looked like this:
<a id="main_lnkWool" class="WhiteLinkText Canela-Medium-Web" href="javascript:__doPostBack('ctl00$main$lnkWool','')">Wool</a>
When the value of the url is pulled from a javascript query?
I can see the jquery.js file the site is using, I'm just not sure how to pull the url using all the pieces together. All I'm trying to do is to use requests to scrape the url's of certain ranges of products.
Here is a link for reference: https://www.kersaintcobb.co.uk/home
The links I'm trying to extract are under the tab 'Our Products'.
I know there are only 6 pages in total, and yes I could just copy and paste them at this point lol! But it's a question I need answering anyway as I've encountered this same problem on other projects so would really help me out if I knew how to solve it.
Thank you :)
Maybe not the best approach, but for JavaScript-driven sites I have used a WebDriver, which is a real web browser you control from code (it can run headless, i.e. without a visible window). Wait until the page has loaded, then pass the page source to BS4. For more info: https://chromedriver.chromium.org/getting-started

Display text value from Github Gist in Hugo site

I know I might be asking something quite simple, but for the life of me I can't get my head around this. I'm definitely overlooking something simple, but I don't know what. Any help would be much appreciated.
I'm generating a static site using Hugo. On one of my pages, I want to create something like a progress bar, using a variable which I need to get from a file from a Github Gist.
Say this is the gist: https://gist.github.com/bogdanbacila/c5a9683089c74d613ad17cdedc08f56b#file-thesis-words-txt
The file only has one number, that's it. What I'm asking is how to get that number from the gist and store it in hugo or at least just display it in some raw html. I want to mention that I'm not looking to use the provided embedded text, I'd rather just get the raw value. At the end of the day all I need is to read and display the number from the raw link here: https://gist.githubusercontent.com/bogdanbacila/c5a9683089c74d613ad17cdedc08f56b/raw/8380782afede80d234209293d4c5033a890e44b6/thesis-words.txt
I've asked this question on the Hugo forum and that wasn't very helpful, instead of providing me with some guidance I got sent here. Here was my original question: https://discourse.gohugo.io/t/get-raw-content-from-github-gist-to-a-variable/38781
Any help would be greatly appreciated, I know there's something very obvious which I'm not seeing, please guide me to the right direction, this doesn't feel like it should be that complicated.
Best,
Bogdan
You could fetch this data and store it in Hugo as a data file but I don't recommend it.
Since Hugo is a static site generator, you would need to not only modify the data files in your repo every time the value changes, but rebuild your site as well. Then you have to worry about running the script on a schedule, which means you can't be sure the value is current at the moment someone visits your site. This is more headache than it's worth in my opinion.
The better route would be to write some client-side JavaScript that makes a call to the raw URL of the gist to get the content. This is Hugo-agnostic which is why I suspect you were pointed here.
From the Gists API docs:
If you need the full contents of the file, you can make a GET request to the URL specified by raw_url.
You can use something like the Fetch API for this or any other JS client. Simply make a GET request to the URL, parse the value from the response body, and write some JavaScript to insert the value in the DOM when someone makes a request to the page it's on.
@wjh18
Cheers! I didn't know about GET requests so I had to dig around for that a little bit but I managed to get it going with this:
<script>
fetch('https://gist.githubusercontent.com/bogdanbacila/c5a9683089c74d613ad17cdedc08f56b/raw').then(function(response) {
  return response.json();
}).then(function(data) {
  console.log(data);
}).catch(function() {
  console.log("Booo");
});
</script>
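Building on the snippet above, here is a minimal sketch of turning the fetched number into a progress bar. It reads the body with response.text(), since the file is plain text, and assumes a hypothetical target of 80000 words and a hypothetical <progress id="thesis-progress"> element on the page; neither comes from the original question.

```javascript
// Raw gist URL from the question; TARGET_WORDS and the element id are hypothetical.
const GIST_RAW_URL =
  "https://gist.githubusercontent.com/bogdanbacila/c5a9683089c74d613ad17cdedc08f56b/raw";
const TARGET_WORDS = 80000;

// Turn the raw file body (e.g. "12345\n") into a whole number, or null if it is not one.
function parseWordCount(text) {
  const count = Number.parseInt(text.trim(), 10);
  return Number.isNaN(count) ? null : count;
}

// Percentage of the target reached, clamped to the range 0..100.
function progressPercent(count, target) {
  return Math.min(100, Math.max(0, Math.round((count / target) * 100)));
}

// Fetch the number and write it into the (hypothetical) <progress> element.
async function showProgress() {
  const response = await fetch(GIST_RAW_URL);
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  const count = parseWordCount(await response.text());
  if (count === null) throw new Error("Gist did not contain a number");
  const bar = document.getElementById("thesis-progress");
  bar.max = 100;
  bar.value = progressPercent(count, TARGET_WORDS);
  bar.textContent = `${count} words`;
}

// Only run in a browser, where `document` exists.
if (typeof document !== "undefined") {
  showProgress().catch((err) => console.error(err));
}
```

The script should be placed after the progress element in the page (or loaded with defer) so the element exists when the value arrives.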

Get data from another HTML page

I am making an on-line shop for selling magazines, and I need to show the image of the magazine. For that, I would like to show the same image that is shown in the website of the company that distributes the magazines.
For that, it would be easy with an absolute path, like this:
<img src="http://www.remotewebsite.com/image.jpg" />
But that is not possible in my case, because the name of the image changes every time there is a new magazine.
In Javascript, it is possible to get the path of an image with this code:
var strImage = document.getElementById('Image').src;
But, is it possible to use something similar to get the path of an image if it is in another HTML page?
Assuming that you know how to find the correct image in the magazine website's DOM (otherwise, forget it):
the magazine website must explicitly allow clients showing your website to fetch their content by enabling CORS
you fetch their HTML -> gets you a stream of text
parse it with DOMParser -> gets you a Document
using your knowledge of their layout (or good heuristics, if you're feeling lucky), use regular DOM navigation to find the image and get its src attribute
I'm not going to detail any of those steps (there are already lots of SO answers around), especially since you haven't described a specific issue you may have with the technical part.
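For readers who want a starting point anyway, the steps above could be sketched roughly as follows. This is only an illustration under assumptions: the function names are made up, the selector and page URL are whatever fits the magazine site, and it only works in a browser if that site sends CORS headers allowing your origin.

```javascript
// Find the image matching `selector` in a parsed Document and return its
// absolute URL, or null if there is no such image.
function resolveImageSrc(doc, selector, pageUrl) {
  const img = doc.querySelector(selector);
  const src = img ? img.getAttribute("src") : null;
  // Resolve relative paths against the remote page, not against our own site.
  return src ? new URL(src, pageUrl).href : null;
}

// Browser-only: fetch the remote page (requires CORS on their side) and parse it.
async function fetchImageSrc(pageUrl, selector) {
  const response = await fetch(pageUrl);
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, "text/html");
  return resolveImageSrc(doc, selector, pageUrl);
}
```

A call such as fetchImageSrc("http://www.remotewebsite.com/", "#Image") would then resolve to the image URL, which can be assigned to the src of an img element on your own page. Reading the attribute with getAttribute and resolving it against the remote URL matters because a relative path would otherwise be resolved against your own site.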
You can, but it is inefficient. You would have to do a request to load all the HTML of that other page and then in that HTML find the image you are looking for.
It can be achieved (using XMLHttpRequest or fetch), but I would maybe try to find a more efficient way.
What you are asking for is technically possible, and other answers have already gone into the details about how you could accomplish this.
What I'd like to go over in this answer is how you probably should architect this given the requirements that you described. Keep in mind that what I am describing is one way to do this, there are certainly other correct methods as well.
Create a database on the server where your app will live. A simple MySQL DB will work, but you could use anything. Create a table called magazine, with a column url. Your code would pull the url from this DB. Whenever the magazine URL changes, just update the DB and the code itself won't need to be changed.
Your front-end code needs some sort of way to access the DB. One possible solution is a REST API. This code would query the DB for the latest values (in your case magazine URLs), and make them accessible to your web page. This could be done in a myriad of different languages/frameworks, here's a good tutorial on doing something like this in Node.js and express (which is what I'd personally use).
Finally, your front-end code needs to call your REST API to get the updated URLs. This needs to be done with some kind of JavaScript based language. jQuery would make this really easy, something like this:
$(document).ready(function() {
  $.get("http://uri_to_your_rest_api", function(data) {
    // Point the <img> at the URL returned by the API
    $("#myImage").attr("src", data.url);
  });
});
Assuming you had HTML like this:
<img id="myImage" src="">
And there you go - You have a webpage that pulls the image sources dynamically from your database.
Now if you're just dipping your toes into web development, this may seem a bit overwhelming. But I promise you, in the long run it'll be easier than trying to parse code from an HTML page :)
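To make the REST API step a little more concrete, here is a minimal sketch using only Node's built-in http module rather than Express. The /magazine path, the port, and the in-memory object standing in for the magazine table are all assumptions for illustration; a real version would query the database instead.

```javascript
// Hypothetical in-memory stand-in for the `magazine` table; a real app would query the DB here.
const magazine = { url: "http://www.remotewebsite.com/image.jpg" };

// Answer GET /magazine with the current image URL as JSON; everything else is a 404.
function handleRequest(req, res) {
  if (req.method === "GET" && req.url === "/magazine") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ url: magazine.url }));
  } else {
    res.writeHead(404, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: "Not found" }));
  }
}

// Call startServer(3000) to serve the endpoint on port 3000.
function startServer(port) {
  const http = require("node:http");
  return http.createServer(handleRequest).listen(port);
}
```

The response shape { url: ... } matches what the jQuery callback above reads as data.url. If the web page is served from a different origin than this API, the handler would also need to send an Access-Control-Allow-Origin header.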

Extracting a specific set of values out of an HTML table

I'm in the process of teaching myself Javascript and I'm having a little trouble understanding something.
I'm trying to extract every one of the "Title" and "Instructor" values from this class registration page to make an enhanced scheduling tool for myself. However, the examples I'm looking at all use "getElementsByClassName(class)" and "getElementById(id)" to extract specific information from an HTML table. When I look at the page source in Chrome, I am not able to find either a unique class name or id to specify in these calls.
Would someone mind pointing me in the right direction? Am I using the page source code correctly or is there a better way of doing things?
EDIT: Here's the html of the page in question
view-source:https://admin.wwu.edu/pls/wwis/wwsktime.ListClass
You can use querySelectorAll to use CSS selectors.
document.querySelectorAll("tr>td:nth-child(3)") and document.querySelectorAll("tr>td:nth-child(8)") will give you all Title and Instructor elements
Here's a jsfiddle of it https://jsfiddle.net/n1fuo87p/
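As a rough sketch of how the two columns could be paired up row by row, assuming (as the selectors above do) that Title is the 3rd cell and Instructor the 8th cell of each row; the function name and the shape of the result are made up for illustration:

```javascript
// Collect { title, instructor } pairs from table rows.
// Indexes are zero-based, so 2 and 7 correspond to nth-child(3) and nth-child(8).
function extractCourses(rows, titleIndex = 2, instructorIndex = 7) {
  const courses = [];
  for (const row of rows) {
    const title = row.cells[titleIndex];
    const instructor = row.cells[instructorIndex];
    // Skip rows that are too short to contain both columns.
    if (!title || !instructor) continue;
    courses.push({
      title: title.textContent.trim(),
      instructor: instructor.textContent.trim(),
    });
  }
  return courses;
}

// In the browser console on the registration page:
if (typeof document !== "undefined") {
  console.log(extractCourses(document.querySelectorAll("tr")));
}
```

Header rows with enough cells will be included too, so you may want to filter those out afterwards.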
No you're not really doing anything wrong, but unfortunately the creators of the web page haven't made use of classes and ids in a way that will make them useful to you.
I'd recommend creating a Google Sheet to import the table. (See the importHTML function in Google sheets.) Then I'd retrieve the data as JSON and work with it that way. IMO you'll learn more valuable skills working with JSON than you will parsing HTML too. This article will take you through getting JSON out of your Google sheet: http://ctrlq.org/code/20004-google-spreadsheets-json

Does google robot index text from javascript document.write()?

Let's say I have this:
<script type="text/javascript">
var p = document.getElementById('cls');
p.firstChild.nodeValue = 'Some interesting information';
</script>
<div id="cls"> </div>
So, will Google's robots index the text "Some interesting information" or not?
Thanks!
AFAIK, Googlebot now indexes AJAX and JavaScript content. For reference, please see:
http://www.submitshop.com/2011/11/03/google-bot-now-indexing-ajax-javascript
Get google to index links from javascript generated content
Update
Search Engine Watch recently mentioned that Googlebot has been improved to read JavaScript. To quote exactly:
it can now read and understand certain dynamic comments implemented
through AJAX and JavaScript. This includes Facebook comments left
through services like the Facebook social plugin.
We've had a need to hide pieces of information on pages from GoogleBot. As the information wasn't extremely sensitive, we've used document.write()-s to avoid searchbots indexing content in question.
Later, in 2011 Q3, I found that GoogleBot did index the scripted content, so I'm now fairly sure that Google indexes much more than the URLs it extracts from content, even though this isn't documented in any depth.
Google doesn't index the JavaScript code or the generated content. You will only see it in the cache because the cached page consists of the complete file including the JavaScript code and your browser renders it. Google does scan JavaScript for URLs to crawl, so if the code is pulling content from an external file via Ajax, etc., there's a chance that the external file will also be indexed, but separate from the parent page. If you want the content to be indexed, it's got to be in plain HTML. Good luck!
