Python Requests getting all html data from site


I am trying to get product data from Metal Mulisha. I have a list of product IDs that I need to look up, so I use Python with the requests package and the search URL "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z"
I then use BeautifulSoup to find the class and data I need, but I get an error saying nothing was found.
So I opened the URL in Chrome and inspected the elements, and all the information I needed was in the HTML that Chrome showed.
Here is a snippet of what Chrome showed:
<div class="col-md-10 col-md-push-2">
<div data-rfkid="rfkid_7" data-keyphrase="20M35518334Z M45518403Z M45518415Z" class="rfk_sp rfk-sp">
<div class="rfk_sp_container" data-nrp="2" data-ntp="2" data-pg="1" data-status="2" rfk_track_appear_once="f=sp,rfkid=rfkid_7,a=1,c=1">
<div class="rfk_header">
</div>
<div class="rfk_message">
<div class="rfk_msg_noresult">
</div>
<div class="rfk_msg_results">Top Results for "20m35518334z m45518403z m45518415z"</div>
The markup continues well beyond this snippet. The point is that in Chrome there is a lot of content inside <div data-rfkid=.
When I ran my Python script to find the first div, this is what I got:
<div class="col-md-10 col-md-push-2">
<div data-keyphrase="20M35518334Z M45518403Z M45518415Z" data-rfkid="rfkid_7"></div>
</div>
As if all the product information that I need is not there.
Here is my python code, so you can see what I did. I am using python 3.5.
import requests
from bs4 import BeautifulSoup
url = "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z"
html = requests.get(url).text
bs = BeautifulSoup(html, 'lxml')
possible_links = bs.find('div', attrs={'class': 'col-md-10 col-md-push-2'})
print(possible_links)
My question is why can't python find the html I need? If I inspect the site in Chrome I see it just fine, but when I use Python and request the site, it's not there. Is this to do with JavaScript? And if so how do I fix this?
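One way to confirm the JavaScript suspicion is to check whether the container in the raw server response is just an empty placeholder. The sketch below uses only the standard library. `ContainerProbe` and `looks_js_rendered` are hypothetical helper names, and the two sample strings are trimmed copies of the snippets above. It only diagnoses the situation; it does not fetch the missing data.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerProbe(HTMLParser):
    """Counts child tags and text inside the first element that has attr_name."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.depth = 0        # 0 means "outside the target element"
        self.found = False
        self.child_tags = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.child_tags += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.found and self.attr_name in dict(attrs):
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text_chars += len(data.strip())


def looks_js_rendered(html, attr_name):
    """True if the element exists in the HTML but has no children and no text."""
    probe = ContainerProbe(attr_name)
    probe.feed(html)
    return probe.found and probe.child_tags == 0 and probe.text_chars == 0


# What requests received (placeholder only) versus what Chrome showed.
server_html = ('<div class="col-md-10 col-md-push-2">'
               '<div data-keyphrase="20M35518334Z" data-rfkid="rfkid_7"></div></div>')
browser_html = ('<div data-rfkid="rfkid_7" class="rfk_sp rfk-sp">'
                '<div class="rfk_msg_results">Top Results</div></div>')

print(looks_js_rendered(server_html, "data-rfkid"))
print(looks_js_rendered(browser_html, "data-rfkid"))
```

If the first check is true for the real response, the content is most likely filled in by a script after the page loads, which requests does not execute.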

Related

Get dynamically loaded source code from Python using mechanize and bs4

I want to get the source of a page that is loaded by JavaScript. The page is a LinkedIn profile page, and I want the job and education details.
I'm not using Selenium because I don't want a browser window to open. I know about headless mode, but then I run into a cookies problem.
I have logged in through mechanize and got some data: phone number, address, headline, emails, and full name. But because the rest is loaded by JavaScript, I can't get the whole page's data.
The data I get:
.....<code id="bpr-guid-892585" style="display: none">
{"data":{"entityUrn":"urn:li:collectionResponse:uPYuDSPXzooiHx+zPOguG1+f+JFMWTWFEfhiIQtEFMM=","elements":[],"paging":{"count":10,"start":0,"total":0,"links":[]},"$type":"com.linkedin.restli.common.CollectionResponse"},"included":[]}
</code>
<code id="datalet-bpr-guid-892585" style="display: none">
{"request":"/voyager/api/takeovers","status":200,"body":"bpr-guid-892585","method":"GET","headers":{"x-li-uuid":"AAXafRyXk/WxvhRuOZTrnA\u003D\u003D"}}
</code>
<img class="datalet-bpr-guid-892585" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" style="display: none"/><code id="bpr-guid-892586" style="display: none">
{"data":{"entityUrn":"urn:li:collectionResponse:nZx6/1e1AAbOHh075gv083zrunZT186/K+rx5FP70A4=","elements":[{"lixTracking":{"urn":"urn:li:member:865882626","segmentIndex":2,"experimentId":4358724,"treatmentIndex":1,"$type":"com.linkedin.voyager.common.ChameleonConfigLixTrackingInfo"},"data":{"namespace":"premium/templates/components/chooser/plan-card","locale":"en_US","message":"Learn more","key":"i18n_card_select_plan","$type":"com.linkedin.voyager.common.ChameleonConfigDataI18n"},"displayName":"card_select_plan","description":"testing 'Learn more' on SKU cards, vs control of 'Select plan' ","lixTreatment":"VAR_t20152_PR_1","lixKey":"chameleon.PREMIUM:us.copy.17654","creatorDisplayName":"cyount","status":"PERMANENT_RAMP","$type":"com.linkedin.voyager.common.ChameleonConfigItem"},{"lixTracking":{"urn":"urn:li:member:865882626","segmentIndex":3,"experimentId":4395729,"treatmentIndex":0,"$type":"com.linkedin.voyager.common.ChameleonConfigLixTrackingInfo"},"data":{"namespace":"onboarding/templates/components/widgets/people-you-may-know","locale":"en_US","message":"Connecting with people lets you see updates and keep in touch","key":"i18n_onboarding_pymk_page_header_phase_3","$type":"com.linkedin.voyager.common.ChameleonConfigDataI18n"},"displayName":"onboarding_pymk_page_header_phase_3","description":"Onboarding PEOPLE_YOU_MAY_KNOW widget header copy test","lixTreatment":"control","lixKey":"chameleon.ONBOARDING:global.copy.19060","creatorDisplayName":"zihliu","status":"MAX_RAMP","$type":"com.linkedin.voyager.common.ChameleonConfigItem"},{"lixTracking":{"urn":"urn:li:member:865882626","segmentIndex":3,"experimentId":4395707,"treatmentIndex":1,"$type":"com.linkedin.voyager.common.ChameleonConfigLixTrackingInfo"},"data":{"namespace":"onboarding/templates/components/widgets/profile-edit-common","locale":"en_US","message":"What’s your most recent 
experience?","key":"i18n_onboarding_profile_edit_work_header_v2","$type":"com.linkedin.voyager.common.ChameleonConfigDataI18n"},"displayName":"onboarding_profile_edit_work_header_v2","description":"Onboarding PROFILE_EDIT widget header copy test","lixTreatment":"VAR_t21697_PR_1","lixKey":"chameleon.ONBOARDING:global.copy.19063","$recipeTypes":["com.linkedin.voyager.dash.deco.relationships.ProfileWithEmailRequired","com.linkedin.voyager.dash.deco.identity.profile.WebTopCardCore"],"$type":"com.linkedin.voyager.dash.identity.profile.Profile","firstName":"Adarsh ","profilePicture":{"displayImageWithFrameReferenceUnion":{"vectorImage":{"$recipeTypes":["com.linkedin.voyager.dash.deco.common.VectorImageOnlyRootUrlAndAttribution"],"rootUrl":"https://media-exp1.licdn.com/dms/image/C4E35AQEIVkoUWgLFvw/profile-framedphoto-shrink_","artifacts":[{"width":200,"$recipeTypes":["com.linkedin.voyager.dash.deco.common.VectorArtifact"],"fileIdentifyingUrlPathSegment":"200_200/0/1597096649541?e=1647694800&v=beta&t=XVOK0upwO6V3NaJtUWLwy-yLMDa8cZICzYH0do67vhU","expiresAt":1647694800000,"height":200,"$type":"com.linkedin.common.VectorArtifact"},{"width":400,"$recipeTypes":["com.linkedin.voyager.dash.deco.common.VectorArtifact"],"fileIdentif
</code>
<img class="terminatorlet" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" style="display: none"/>
<div aria-live="polite" class="visually-hidden" id="a11y-notification" role="region"></div>
</body></html>
If there is a way to do this with Selenium, please guide me, but using the headless option.
The response contains all the data I mentioned above, but the page looks different when loaded in a browser after login.
Thanks for any help.
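The response shown above already carries JSON inside hidden <code> elements, so one option worth trying before reaching for a browser is to parse those blobs directly. This is a minimal standard-library sketch, assuming the wanted fields are somewhere in those payloads (the question does not confirm that). `HiddenCodeJSON` and `hidden_json_payloads` are hypothetical names, and `sample` is a trimmed copy of the output above.

```python
import json
from html.parser import HTMLParser


class HiddenCodeJSON(HTMLParser):
    """Collects the JSON body of every <code id="..."> element, keyed by id."""

    def __init__(self):
        super().__init__()
        self.current_id = None   # id of the <code> element being read, if any
        self.buffer = []
        self.payloads = {}

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.current_id = dict(attrs).get("id", "")
            self.buffer = []

    def handle_data(self, data):
        if self.current_id is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "code" and self.current_id is not None:
            text = "".join(self.buffer).strip()
            try:
                self.payloads[self.current_id] = json.loads(text)
            except ValueError:
                pass  # skip blocks that are not valid (or are truncated) JSON
            self.current_id = None


def hidden_json_payloads(html):
    parser = HiddenCodeJSON()
    parser.feed(html)
    return parser.payloads


sample = """
<code id="bpr-guid-892585" style="display: none">
{"data":{"elements":[],"paging":{"count":10,"start":0,"total":0,"links":[]}},"included":[]}
</code>
<code id="datalet-bpr-guid-892585" style="display: none">
{"request":"/voyager/api/takeovers","status":200,"body":"bpr-guid-892585","method":"GET"}
</code>
"""

payloads = hidden_json_payloads(sample)

# Each "datalet-..." block names the request and points at its payload via "body".
by_request = {
    meta["request"]: payloads[meta["body"]]
    for block_id, meta in payloads.items()
    if block_id.startswith("datalet-") and meta.get("body") in payloads
}
print(sorted(by_request))
```

With the real page, `hidden_json_payloads` would be given the HTML string that mechanize returned, and `by_request` could then be searched for the request paths that hold profile data.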

Scrapy: extracting data from script tag

I am new to Scrapy. I am trying to scrape contents from 'https://www.tysonprop.co.za/agents/' for work purposes.
In particular, the information I am looking for seems to be generated by a script tag.
The line: <%= branch.branch_name %>
resolves to: Tyson Properties Head Office
at run time.
I am trying to access the text generated inside the h2 element at run time.
However, the Scrapy response object seems to grab the raw source code. I.e. the data I want appears as <%= branch.branch_name %> and not "Tyson Properties Head Office".
Any help would be appreciated.
HTML response object extract:
<script type="text/html" id="id_branch_template">
<div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
<div class="branch-agents container_12 first last clearfix">
<div id="agents-list-left" class="agents-list left grid_6">
</div>
<div id="agents-list-right" class="agents-list right grid_6">
</div>
</div>
</div>
</script>
<script type="text/javascript">
Current Scrapy spider code:
import scrapy
from scrapy.crawler import CrawlerProcess
class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'
    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        script = response.xpath('//script[@id="id_branch_template"]/text()').get()
        div = scrapy.Selector(text=script).xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('./h2[contains(@class,"branch-name")]')
Related to this question:
Scrapy xpath not extracting div containing special characters <%=

As the accepted answer on the related question suggests, consider using the AJAX endpoint.
If that doesn't work for you, consider using Splash.
The data seems to be downloaded with AJAX and added to the page with JavaScript. Scrapy can use Splash to execute JS on the page.
For example, this should work just fine after that.
h2 = div.xpath('./h2[contains(@class,"branch-name")]')
The docs have instructions for installing Splash but after getting it up and running, code changes to the actual crawler are pretty minimal.
Install scrapy-splash with
pip install scrapy-splash
Add some configurations to settings.py (listed on the Github page) and finally use SplashRequest instead of scrapy.Request.
If that doesn't work for you, maybe check out Selenium or Pyppeteer.
Also, the HTML response doesn't have "Tyson Properties Head Office" in it before executing JS (i.e. inside a script) except as a dropdown menu item, which probably isn't that useful, so it can't be extracted from the response.
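For the AJAX-endpoint route mentioned at the top of this answer, the parsing side could look roughly like the sketch below. The payload shape is an assumption: the top-level "branches" key and the sample values are hypothetical, and only the `branch_name` field name comes from the template in the question. The real endpoint URL and structure would have to be read from the browser's network tab.

```python
import json


def branch_names(payload_text):
    """Return every non-empty branch_name from a JSON payload.

    Assumes (hypothetically) that the endpoint returns
    {"branches": [{"id": ..., "branch_name": ...}, ...]}.
    """
    data = json.loads(payload_text)
    return [
        branch["branch_name"]
        for branch in data.get("branches", [])
        if branch.get("branch_name")
    ]


# Hypothetical payload, shaped after the <%= branch.id %> and
# <%= branch.branch_name %> placeholders in the template.
example_payload = json.dumps({
    "branches": [
        {"id": 1, "branch_name": "Tyson Properties Head Office"},
        {"id": 2, "branch_name": ""},
    ]
})

print(branch_names(example_payload))
```

Inside a Scrapy callback the same function could be called as `branch_names(response.text)` once the request targets the JSON endpoint instead of the HTML page.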

Copy phonenumber with python

I'm struggling with Python 3. I taught myself the basics, and now I'm learning webdriver and bs4. Fun!
I want to scrape the phone number from a page. I have written working scripts for other pages, but this one is giving me a headache.
I think the problem is that the content is loaded dynamically (it is not in the page source).
This is the page: https://www.dnls.nl/locatie/diergaarde-blijdorp-rotterdam
There is an element with this text: Toon telefoonnummer (Dutch for "Show phone number"). I can click it with:
driver.find_element_by_link_text("Toon telefoonnummer").click()
The phone number is then visible, but only to my eyes! I have been trying for hours to grab the number with XPath and CSS selectors, but I can't get it.
In the page source I see this:
<div class="show_onclick">
<div class="text-center telephone-field"><a class="set-align" :href="'tel:'+project.contact_phone">{{project.contact_phone}}</a></div>
<div class="text-center">{{project.contact_textline}}</div>
</div>
This is my latest code:
from selenium import webdriver
try:
    # phone number
    driver.find_element_by_link_text("Toon telefoonnummer").click()  # This works
    driver.implicitly_wait(5)
    telefoonnummer = driver.find_element_by_xpath(".//*[@id='main-inner']/div[1]/div[1]/div/div[2]/ul[1]/li[3]/div/div[1]/a").text
    print(telefoonnummer)
except:
    print("")
Is there a way to scrape this kind of content?
UPDATE: I found the data in a JavaScript block in the head of the page source. It is a massive script, and it contains:
"contact_name":"Blijdorp Happenings","contact_phone":"010 4431415","contact_email":"sales@diergaardeblijdorp.nl"
My goal: I want to find the phone number, copy it, and save it in a variable 'telefoonnummer'.
Kind regards!
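Since the update shows the number sitting in a script block as a JSON-style key/value pair, one possible approach is to pull it straight out of the page source with a regular expression instead of locating the rendered element. This is a sketch under that assumption. `extract_contact_phone` is a hypothetical helper, and `page_source` here is just the fragment quoted above.

```python
import json
import re

# Matches "contact_phone":"..." and allows escaped characters inside the value.
CONTACT_PHONE = re.compile(r'"contact_phone"\s*:\s*"((?:[^"\\]|\\.)*)"')


def extract_contact_phone(page_source):
    """Return the first contact_phone value in the source, or None if absent."""
    match = CONTACT_PHONE.search(page_source)
    if match is None:
        return None
    # Re-wrap in quotes and let json decode any escape sequences.
    return json.loads('"' + match.group(1) + '"')


page_source = ('"contact_name":"Blijdorp Happenings",'
               '"contact_phone":"010 4431415",'
               '"contact_email":"sales@diergaardeblijdorp.nl"')

telefoonnummer = extract_contact_phone(page_source)
print(telefoonnummer)
```

With Selenium already running, the same function could be fed `driver.page_source`, and no click on the link would be needed.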

BeautifulSoup Scraping: loading div instead of the content

Noob here.
I'm trying to scrape search results from this website: http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance
I'm using python's BeautifulSoup
import csv
import requests
from BeautifulSoup import BeautifulSoup
for numb in ('0', '69'):
    url = ('http://www.mastersportal.eu/search/?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1&start=' + numb + '0&order=tuition_eea&direction=asc')
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'id': 'StudySearchResults'})
    lista = []
    # the loop variable must be h3, since that is the name used in the body
    for h3 in table.findAll('h3'):
        lista.append(h3.string)
    print(table.prettify())
I want to get clean data with the basic information about each Master's programme (for now just the name).
The URL I'm using here is for a filtered search on the website, and the loop that steps through the pages should be fine.
However, the results are:
<div id="StudySearchResults">
<div style="display:none" id="TrackingSearchValue" class="TrackingSearchValue" data-search=""></div>
<div style="display:none" id="SearchViewEvent" class="TrackingEvent TrackingNoLocation" data-type="srch" data-action="view" data-id=""></div>
<div id="StudySearchResultsStudies" class="TrackingLinkedList" data-start="" data-list-type="study" data-type="rslts">
<!-- Wait pane, just here to make sure there is no white page -->
<div id="WaitPane" class="WaitPane">
<img src="http://www.mastersportal.eu/Modules/Results/Resources/Throbber.gif" />
<span>Loading search results...</span>
</div>
</div>
</div>
Why isn't the content displaying but only the loading div? Reading around I feel it has something to do with the way the website handles data with JavaScript, does something like an AJAX request exist for Python? (or any other way to tell the scraper to wait for the page to load?)

If you want only the text, you should do this:
lista.append(h3.get_text())
Regarding your second question, jsfan's answer is right. You should try Selenium and use its wait feature to wait for your search results, which appear in divs with the class names Result master premium:
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'Result master premium')]"))
)

You have basically answered your own question. requests only downloads whatever the server returns for a specific URL, and Beautiful Soup only parses that HTML. Neither of them runs the page's JavaScript.
If you want to render the page as it is shown in a browser, you will need to use something like Selenium WebDriver, which starts an actual browser and remote-controls it.
WebDriver is very powerful, but it has a much steeper learning curve than plain web scraping.
If you want to get into using WebDriver with Python, the official documentation is a good place to start.
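Putting the two answers together, a headless Selenium run might look roughly like the sketch below. It assumes Selenium 4 with a local Chrome installation, and it assumes the results still use the class names Result master premium and sit under #StudySearchResults, which may no longer match the live site. `results_xpath` and `fetch_result_titles` are hypothetical helper names.

```python
def results_xpath(class_names):
    """XPath matching any div whose class attribute contains class_names."""
    return "//div[contains(@class, '{}')]".format(class_names)


def fetch_result_titles(url, timeout=10):
    """Load url in headless Chrome, wait for results, return the h3 texts."""
    # Imported here so results_xpath stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one result div exists, or raise after timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.XPATH, results_xpath("Result master premium"))
            )
        )
        headings = driver.find_elements(By.CSS_SELECTOR, "#StudySearchResults h3")
        return [heading.text for heading in headings]
    finally:
        driver.quit()  # always release the browser, even on timeout
```

Calling `fetch_result_titles` with the search URL from the question would return the programme names once the loading pane has been replaced by real results. If the wait times out, the class names in `results_xpath` are the first thing to re-check against the live page.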

I am trying to pass the size of a <DIV> back to google app engine (Python code) but can't seem to make it work

Many apologies if I'm being really dopey here - I've been searching most of the morning for the answer to this but maybe I'm searching in the wrong terms. I'm using Google App Engine, with Python code to set up my webapp. I've managed to set up the basic page structure using the following HTML code:
main_html = """
<HTML>
<HEAD>
<LINK rel="stylesheet" type="text/css" href="static/css/main.css">
</HEAD>
<BODY>
<DIV class="wspace">
<BR>
</DIV>
<DIV class="header">
<FONT class="logoFont">Logo Text</font>
</DIV>
<DIV class="maincontainer">
<DIV class="sidebar" id="leftsidebar">
<SCRIPT>
var lsCanWid = document.getElementsByTagName("div")["leftsidebar"].offsetWidth
var lsCanHei = document.getElementsByTagName("div")["leftsidebar"].offsetHeight
</SCRIPT>
</DIV>
<DIV class="sidebar" id="rightsidebar">
<SCRIPT>
var rsCanWid = document.getElementsByTagName("div")["rightsidebar"].offsetWidth
var rsCanHei = document.getElementsByTagName("div")["rightsidebar"].offsetHeight
</SCRIPT>
</DIV>
<DIV class="mainscreen" id="mainscr">
<SCRIPT>
var mainCanWid = document.getElementsByTagName("div")["mainscr"].offsetWidth
var mainCanHei = document.getElementsByTagName("div")["mainscr"].offsetHeight
</SCRIPT>
%(MAINCONTENT)s
</DIV>
</DIV>
<DIV class="footer">
Footer Text
</DIV>
<DIV class="wspace">
<FONT class="crnotice">Copyright Notice Text</FONT>
</DIV>
</BODY>
</HTML>
"""
I've tested the javascript variables, and these work exactly as planned (lsCanWid returns the left sidebar width for example).
I've then got the following code in my Python:
import webapp2
class MainHandler(webapp2.RequestHandler):
    def get(self):
        # each of these is an empty string unless the request URL carries it
        lsCanWid = self.request.get('lsCanWid')
        lsCanHei = self.request.get('lsCanHei')
        rsCanWid = self.request.get('rsCanWid')
        rsCanHei = self.request.get('rsCanHei')
        mainCanWid = self.request.get('mainCanWid')
        mainCanHei = self.request.get('mainCanHei')
        temp_str = str(lsCanWid)
        self.response.write(main_html % {"MAINCONTENT": temp_str})
app = webapp2.WSGIApplication([
    ('/', MainHandler)
], debug=True)
I've not tried to use self.request.get within a "get" before, so I'm not sure if this is the problem. The Python code works fine if you replace temp_str with some other string, so that's not where my issue lies. I'd be really grateful for any help here!

In short: you aren't sending these values back to your handler. You need to add them to the query string or POST them back.
JavaScript runs in the browser. Once the page has rendered in the browser, there is no connection with your back-end code. For your Python code to learn anything from the browser, the browser has to send it back in a new request. This is just how HTTP works: it's a request/response cycle, and there is no connection once the response (what you send to the browser) is finished.
So, to send these values back to your code you need to append them to the request as part of a query string, something like /foo?lsCanWid=3&lsCanHei=4 (and so on). You would need to build this URL using JavaScript and then add it to a link which you would have to click. On the next request (once this new link is clicked), your Python code will receive the values.
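The round trip described above can be illustrated with the standard library alone. In the real page the URL would be assembled by JavaScript, and webapp2's `self.request.get` would do the reading. The Python below only shows the shape of the data on each side, and `build_report_url` and `read_sizes` are hypothetical helpers.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_report_url(path, sizes):
    """Append the measured sizes to path as a query string."""
    return path + "?" + urlencode(sizes)


def read_sizes(url, names):
    """Read the named integer values back out of the URL's query string."""
    query = parse_qs(urlsplit(url).query)
    return {name: int(query[name][0]) for name in names if name in query}


# Browser side (conceptually): the measured sizes end up in the link target.
url = build_report_url("/foo", {"lsCanWid": 3, "lsCanHei": 4})
print(url)

# Server side: the next request carries the values, so the handler can see them.
print(read_sizes(url, ["lsCanWid", "lsCanHei", "rsCanWid"]))
```

Names that were never sent, such as rsCanWid here, are simply absent on the server side, which matches the empty values the handler in the question receives on the first page load.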
