Scrapy: extracting data from script tag - javascript

I am new to Scrapy. I am trying to scrape contents from 'https://www.tysonprop.co.za/agents/' for work purposes.
In particular, the information I am looking for seems to be generated by a script tag.
The line: <%= branch.branch_name %>
resolves to: Tyson Properties Head Office
at run time.
I am trying to access the text generated inside the h2 element at run time.
However, the Scrapy response object seems to grab the raw source code. I.e. the data I want appears as <%= branch.branch_name %> and not "Tyson Properties Head Office".
Any help would be appreciated.
HTML response object extract:
<script type="text/html" id="id_branch_template">
<div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
<div class="branch-agents container_12 first last clearfix">
<div id="agents-list-left" class="agents-list left grid_6">
</div>
<div id="agents-list-right" class="agents-list right grid_6">
</div>
</div>
</div>
</script>
<script type="text/javascript">
Current Scrapy spider code:
import scrapy
from scrapy.crawler import CrawlerProcess


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        script = response.xpath('//script[@id="id_branch_template"]/text()').get()
        div = scrapy.Selector(text=script).xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('./h2[contains(@class,"branch-name")]')
Related to this question:
Scrapy xpath not extracting div containing special characters <%=

As the accepted answer on the related question suggests, consider using the AJAX endpoint.
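As a rough sketch of what consuming such an endpoint could look like: the endpoint URL and the JSON layout are not shown in the question, so the payload below is purely hypothetical. Only the field names id and branch_name are taken from the template (<%= branch.id %>, <%= branch.branch_name %>).

```python
import json

# Hypothetical payload: the real endpoint and its JSON layout are not shown in
# the question. Only the field names "id" and "branch_name" come from the
# template (<%= branch.id %>, <%= branch.branch_name %>).
SAMPLE_PAYLOAD = '[{"id": 1, "branch_name": "Tyson Properties Head Office"}]'


def branch_names(payload):
    """Return the branch_name of every branch object in a JSON array."""
    return [branch["branch_name"] for branch in json.loads(payload)]


names = branch_names(SAMPLE_PAYLOAD)
print(names)
```

In a Scrapy callback, the same helper could be fed response.text once the real endpoint is known and its structure has been checked.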
If that doesn't work for you, consider using Splash.
The data seems to be downloaded with AJAX and added to the page with JavaScript. Scrapy can use Splash to execute JS on the page.
For example, once the page is rendered through Splash, the original selector should work just fine:
h2 = div.xpath('./h2[contains(@class,"branch-name")]')
The docs have instructions for installing Splash but after getting it up and running, code changes to the actual crawler are pretty minimal.
Install scrapy-splash with
pip install scrapy-splash
Add the required configuration to settings.py (listed on the scrapy-splash GitHub page), and finally use SplashRequest instead of scrapy.Request.
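For reference, the settings.py additions typically look like the fragment below. This is a sketch based on the scrapy-splash README, not something from the question; the SPLASH_URL value is an assumption (a Splash instance running locally on port 8050), so adjust it to wherever Splash is actually running.

```python
# settings.py fragment (sketch based on the scrapy-splash README).

# Assumption: Splash is running locally, e.g. in Docker, on port 8050.
SPLASH_URL = 'http://localhost:8050'

# Middlewares that route requests through Splash and handle its cookies.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicates Splash arguments to save space in the request queue.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

With these in place, the spider would import SplashRequest from scrapy_splash and yield it in start_requests where it currently yields scrapy.Request.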
If that doesn't work for you, maybe check out Selenium or Pyppeteer.
Also, before the JavaScript runs, the HTML response doesn't contain "Tyson Properties Head Office" anywhere except as a dropdown menu item, which probably isn't that useful, so the heading text can't be extracted from the raw response.
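A quick way to see this for yourself is to look for the unrendered template tags in the raw markup. The sketch below uses only the standard library and a shortened excerpt of the template from the question; the helper name find_placeholders is just for illustration.

```python
import re

# Shortened excerpt of the raw response from the question: the template is
# inert text inside a <script type="text/html"> tag until JavaScript renders it.
RAW_TEMPLATE = '''<script type="text/html" id="id_branch_template">
<div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
</div>
</script>'''

# Matches template interpolation tags such as <%= branch.branch_name %>.
PLACEHOLDER = re.compile(r"<%=\s*(.*?)\s*%>")


def find_placeholders(html):
    """Return the expression inside every <%= ... %> tag, in document order."""
    return PLACEHOLDER.findall(html)


placeholders = find_placeholders(RAW_TEMPLATE)
print(placeholders)
# The rendered heading text is absent from this part of the raw markup.
print("Tyson Properties Head Office" in RAW_TEMPLATE)
```

If the placeholders are still present in what your spider receives, no JavaScript has run, and the values have to come from the AJAX endpoint or a rendering tool such as Splash.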

Related

Python Requests getting all html data from site

I am trying to get product data from Metal Mulisha. I have a list of product IDs that I need to find data on, so I use Python with the requests package and the search URL "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z".
I then use BeautifulSoup to find the class and data I need, but I get an error that says there was nothing there.
So I first went to the URL in Chrome then inspected the elements and all the information I needed was in the html on Chrome.
Here is a snippet of what Chrome showed.
<div class="col-md-10 col-md-push-2">
<div data-rfkid="rfkid_7" data-keyphrase="20M35518334Z M45518403Z M45518415Z" class="rfk_sp rfk-sp">
<div class="rfk_sp_container" data-nrp="2" data-ntp="2" data-pg="1" data-status="2" rfk_track_appear_once="f=sp,rfkid=rfkid_7,a=1,c=1">
<div class="rfk_header">
</div>
<div class="rfk_message">
<div class="rfk_msg_noresult">
</div>
<div class="rfk_msg_results">Top Results for "20m35518334z m45518403z m45518415z"</div>
It keeps going under the first div; the point is that there is a lot of information after <div data-rfkid=.
Once I ran my python script to find the first div, this is what I get.
<div class="col-md-10 col-md-push-2">
<div data-keyphrase="20M35518334Z M45518403Z M45518415Z" data-rfkid="rfkid_7"></div>
</div>
As if all the product information that I need is not there.
Here is my python code, so you can see what I did. I am using python 3.5.
import requests
from bs4 import BeautifulSoup
url = "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z"
html = requests.get(url).text
bs = BeautifulSoup(html, 'lxml')
possible_links = bs.find('div', attrs={'class': 'col-md-10 col-md-push-2'})
print(possible_links)
My question is why can't python find the html I need? If I inspect the site in Chrome I see it just fine, but when I use Python and request the site, it's not there. Is this to do with JavaScript? And if so how do I fix this?

BeautifulSoup Scraping: loading div instead of the content

Noob here.
I'm trying to scrape search results from this website: http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance
I'm using python's BeautifulSoup
import csv
import requests
from BeautifulSoup import BeautifulSoup
for numb in ('0', '69'):
    url = ('http://www.mastersportal.eu/search/?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1&start=' + numb + '0&order=tuition_eea&direction=asc')
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'id': 'StudySearchResults'})
    lista = []
    for i in table.findAll('h3'):
        lista.append(i.string)
    print(table.prettify())
I want to get clean data with the basic information about the Master (for now just the name).
The URL I'm using here is for a filtered research on the website and the loop to go on with pages should be fine.
However, the results are:
<div id="StudySearchResults">
<div style="display:none" id="TrackingSearchValue" class="TrackingSearchValue" data-search=""></div>
<div style="display:none" id="SearchViewEvent" class="TrackingEvent TrackingNoLocation" data-type="srch" data-action="view" data-id=""></div>
<div id="StudySearchResultsStudies" class="TrackingLinkedList" data-start="" data-list-type="study" data-type="rslts">
<!-- Wait pane, just here to make sure there is no white page -->
<div id="WaitPane" class="WaitPane">
<img src="http://www.mastersportal.eu/Modules/Results/Resources/Throbber.gif" />
<span>Loading search results...</span>
</div>
</div>
</div>
Why isn't the content displaying but only the loading div? Reading around I feel it has something to do with the way the website handles data with JavaScript, does something like an AJAX request exist for Python? (or any other way to tell the scraper to wait for the page to load?)
If you want only the text, use get_text() on the loop variable from your code:
lista.append(i.get_text())
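Regarding your second question, jsfan's answer is right. You should try Selenium and use its wait feature to wait for your search results, which appear in divs with the class names Result master premium:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'Result master premium')]"))
)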
You have basically answered your own question. Beautiful Soup is purely an HTML parser: together with requests, it only sees whatever the server returns for a specific URL.
If you want to render the page as it is shown in a browser, you will need to use something like Selenium WebDriver, which starts up an actual browser and controls it remotely.
While WebDriver is very powerful, it also has a much steeper learning curve than pure web scraping.
If you want to get into using Webdriver with Python, the official documentation is a good place to start.

How to automate selecting certain codes in an html?

Hi, I have a question about automatically selecting certain content in an HTML file. If we save a webpage as HTML only, we get the HTML along with stylesheets and JavaScript code. However, I only want to extract the HTML between <div class='post-content' itemprop='articleBody'> and </div>, and then create a new HTML file containing only that extracted HTML. Is there a way to do it? Example code is below:
<html>
<script src='.....'>
</script>
<style>
...
</style>
<div class='header-outer'>
<div class='header-title'>
<div class='post-content' itemprop='articleBody'>
<p>content we want</p>
</div>
</div></div>
<div class='footer'>
</div>
</html>
While I'm typing, I'm thinking about JavaScript, which seems to be able to manipulate HTML DOM elements. Is Ruby able to do that? Can I generate a new, clean HTML file that only contains the content between <div class='post-content' itemprop='articleBody'> and </div> by using JavaScript or Ruby? As for how to write the actual code, I don't have a clue.
Does anybody have any idea about it? Thank you so much!
I'm not quite sure what you're asking, but I'll take a crack at it.
Can Ruby modify the DOM on a webpage?
Short answer, no. Browsers don't know how to run Ruby. They do know how to run JavaScript, so that's what is usually used for real-time DOM manipulation.
Can I generate a new clean html
Yes? At the end of the day, HTML is just a specifically formatted string. If you want to download the source from that page and find everything in the <div class='post-content' itemprop='articleBody'> tag, there are a couple of ways to go about that. The best is probably the nokogiri gem, which is a ruby HTML parser. You'll be able to feed it a string (from a file or otherwise) that represents the old page and strip out what you want. Doing that would look something like this:
require 'nokogiri'
require 'open-uri'
# URI.open (from open-uri) fetches the page over HTTP
page = Nokogiri::HTML(URI.open("https://googleblog.blogspot.com"))
# finds the first element with the class "post-content"
text = page.css('.post-content')[0].text
I believe that gives you the text you're looking for. More detailed instructions can be found in the Nokogiri documentation.
You want to use a regular expression. For example:
// [\s\S]*? matches across line breaks; the "m" flag only changes how ^ and $ behave
var regEx = /<div class='post-content' itemprop='articleBody'>([\s\S]*?)<\/div>/m;
// The page content (put this script at the bottom of the page)
var bodyCode = document.body.innerHTML;
var match = bodyCode.match( regEx );
// Prints the match to the console
console.dir( match );
You can see this in action here: https://regex101.com/r/kJ5kW6/1

Trying to have an <h3> tag change based on URL

After a few days of searching and trial and error, I'm still unable to get this to work at all without crashing.
What I'm trying to achieve is changing the text displayed in an H3 tag in an .ejs file.
The reason is that the system we're building uses a partial from another file, so rather than creating more files, only this tag has to change its text.
Section in Question:
<%if (window.location.href === '/newWizard') { %>
<h3><strong>Step of a Wizard </strong> - Some Text</h3>
<% }else { %>
<h3>Same Text</h3>
<% } %>
This file is referenced by <%- partial %>, and that file is referenced one step back by <%- include %>.
We're building this software on a Node.js and Kendo Grid design.
Any help with this would be greatly appreciated. I'm going to continue researching and working on it, and will update this if I manage to get it working properly.
window.location.href, if I'm not mistaken, will give you the full URL. You probably want to check window.location.pathname in order to do the check in the way you've described.
You could also do some pattern matching on the href or pathname, depending on your needs.
Update
Since it's .ejs, it's getting executed on the server, so there is no window object.
Have a script run on the client side. Create the <h3> in the HTML and give it some id like <h3 id="title-heading"></h3> and then add a script doing something like
if (window.location.pathname === '/newWizard') {
    document.getElementById('title-heading').innerHTML = '<strong>Step of a Wizard </strong> - Some Text';
} else {
    document.getElementById('title-heading').innerHTML = 'Same Text';
}

template insertion in a editor

Describing a scenario:
I am going through the code shown below. Basically, I am trying to figure out how to program it so that
when a user clicks the "Use Template" button, the template gets inserted into an editor.
Page 1:
There are a lot of templates present.
When a user clicks the "Use Template" button, the template gets inserted into an editor that is present on
the next page (Page 2).
Please find below the code snippet for the first two templates I am going through:
<div id="templatesWrap">
<div class="template" data-templatelocation="templateone" data-templatename="Template ONE" data-templateid="" >
<div class="templateContainer">
<span>
<a href="https://app.abc.com/pqr/core/compose/message/create?token=c1564e8e3cd11bc4t546b587jan31&sMessageTemplateId=templateone&sHubId=&goalComplete=200" title="Use Template">
<img class="thumbnail" src="templatefiles/thumbnail_010.jpg" alt="templateone">
</a>
</span>
<div class="templateName">Template ONE</div>
<p>
Use Template
</p>
</div>
</div>
<div class="template" data-templatelocation="templatetwo" data-templatename="Template TWO" data-templateid="" >
<div class="templateContainer">
<span>
<a href="https://app.abc.com/pqr/core/compose/message/create?token=c1564e8e3cd11bc4t546b587jan31&sMessageTemplateId=templatetwo&sHubId=&goalComplete=200" title="Use Template">
<img class="thumbnail" src="templatefiles/thumbnail_011.jpg" alt="templatetwo">
</a>
</span>
<div class="templateName">Template TWO</div>
<p>
Use Template
</p>
</div>
</div>
And so on ....
How does the link "https://app.abc.com/pqr/core/compose/message/create?token=c1564e8e3cd11bc4t546b587jan31&sMessageTemplateId=templatetwo&sHubId=&goalComplete=200" insert the template into the editor located on the next page? I haven't understood the token part and the many IDs present in the link,
which I think are the reason behind inserting the template.
Has anyone come across such a link before? Please advise.
Thanks
MORE CLARIFICATIONS:
Thanks for your answer. It did help me somewhat. I have a few more questions:
Basically, I am using TinyMCE version 4.0.8 as my editor. The templates I am using are from here:
https://github.com/mailchimp/email-blueprints/blob/master/templates/2col-1-2-leftsidebar.html
Some questions based on Tivie's answer.
1) As you can see in the code for "2col-1-2-leftsidebar.html", it's not defined inside <div> tags, unlike your example. Do you think that I can still
use it under the name "2col-1-2-leftsidebar.html"?
2) I believe, for explanation purposes, you have included
<div contenteditable="true" id="myEditor">replaced stuff</div>
and
<button id="btn">Load TPL</button>
<script>
$("#btn").click(function() {
$("#myEditor").load("template.html");
});
</script>
in the same page. Am I right? (I understand you were trying to make an educated guess here, hence
just asking :) )
In my case, I have a separate page where I have written the code for the buttons, just like you wrote in editor.html:
<button id="btn">Load TPL</button>. My button is defined inside <div class="templateContainer">.
Also, my templates are defined in a separate folder. So I will have to grab the content (the HTML template) from
that folder and then insert it into the TinyMCE 4.0.8 editor. (Looks like a two-step process.) Could you elaborate
on how I should proceed here?
More Questions as of Dec 27
I have modified my code for the template as follows:
<div class="templateName">Template ONE</div>
<p>
Use Template
</p>
Please note, I have added an additional id attribute for the following purpose.
If I go by Tivie's answer, is the following correct?
<script>
$("#temp1").click(function() {
$("#sTextBody").load("FolderURL/template.html");
});
</script>
My editor is defined like the following on Page 2 (Editor Page).
<div class="field">
<textarea id="sTextBody" name="sTextBody" style="width:948px; max-width:948px; height: 70%"></textarea>
</div>
I am confused: the script tag is defined on Page 1, where I have defined all the template-related code,
while the editor (Page 2) is a different page. Clicking simply takes me to the editor page (Page 2), and hence it is not working.
Please advise where I am going wrong.
Thanks
MORE QUESTIONS as of Jan 2
The problem I am facing is as follows. Basically, for the first template, I have the following code.
Code Snippet #1, where the "Use Template" button is present:
<div class="templateName">Template ONE</div>
<p>
Use Template
</p>
And the function suggested in the answer is as follows:
Code Snippet #2 where Editor is present:
<script>
$("#temp1").click(function() {
$("#sTextBody").load("FolderURL/template.html");
});
</script>
Since I believe I first need to reach the page where the editor is located after the user clicks the "Use Template" button, I have defined Code Snippet #1 on Page 1, and have defined Code Snippet #2 and <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script> as the very first two script tags on Page 2 (the editor page). But still, when I click the "Use Template" button on Page 1, it just takes me to the next page and does not load the template into the editor.
Am I doing something wrong here? Please advise.
P.S. The problem, I feel, is that the click function on Page 2 is somehow not getting triggered by the button with the temp1 id on Page 1.
Thanks
Well, one can only guess without having access to the page itself (and its source code). I can, however, make an educated guess on how it works.
The URL params follow a pattern. First you have a token that is the same in all templates. This probably means the token has no relevance to the template mechanism itself. Maybe it's an authentication token or something. Not relevant though.
Then you have the template identification (templateone, templatetwo, etc.) followed by an sHubId that is empty. Lastly you have goalComplete=200, which might correspond to the HTTP success code 200 (OK).
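To make that breakdown concrete, here is a small sketch that splits the link from the question into its query parameters. It is written in Python purely as a neutral illustration (the page itself would not run this), and it says nothing about what the server actually does with each value.

```python
from urllib.parse import urlsplit, parse_qs

# The "Use Template" link for Template TWO, copied from the question.
url = ("https://app.abc.com/pqr/core/compose/message/create"
       "?token=c1564e8e3cd11bc4t546b587jan31&sMessageTemplateId=templatetwo"
       "&sHubId=&goalComplete=200")

# keep_blank_values=True keeps the empty sHubId parameter in the result.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

for name, values in params.items():
    print(name, "=", repr(values[0]))
```

Only sMessageTemplateId differs between the two links in the question, which supports the guess that it alone selects the template.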
Based on this, my guess would be that they are probably using AJAX in the background to fetch those templates from the server. Then, via JavaScript, those templates are inserted into the editor box itself.
Using jQuery, something like this is trivial. Here's an example:
template.html
<div>
<h1>TEST</h1>
<span>This is a template</span>
</div>
editor.html
<!DOCTYPE HTML>
<html>
<head>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
</head>
<body>
<div contenteditable="true" id="myEditor">
replaced stuff
</div>
<button id="btn">Load TPL</button>
<script>
$("#btn").click(function() {
$("#myEditor").load("template.html");
});
</script>
</body>
</html>
Edit:
1) Well, since those templates are quite complex and include CSS, you probably want to keep them separate from your editor page (or the template's CSS will mess up your page's CSS).
However, since you're using TinyMCE, it comes with a template manager built in, so you probably want to use that. Check http://www.tinymce.com/wiki.php/Configuration:templates for documentation.
2) I think 1 answers your question but, just in case, my method above works for any page in any directory, provided it lives on the same domain. Example:
<script>
$("#btn").click(function() {
$("#myEditor").load("someDirectory/template.html");
});
</script>
I recommend you check this page for the specifics on using TinyMCE: http://www.tinymce.com/wiki.php/Configuration:templates
EDIT2:
Let me explain the above code:
$("#btn").click(function() { });
This basically tells the browser to run the code inside the braces when you click the element with id="btn".
$("#myEditor").load("someDirectory/template.html");
This is an AJAX request (see the jQuery .load() documentation). It grabs the contents of someDirectory/template.html and places them inside the element whose id="myEditor".
