Python web drivers - JavaScript

I'm trying to scrape products from a web store, similar to how Dropified scrapes items from AliExpress.
Current solution (the way it's set up, it will only try to access the first item):
from bs4 import BeautifulSoup
import requests
import time
import re

# Get search inputs from user
search_term = raw_input('Search Term:')

# Build URL to aliexpress.com
aliURL = 'https://www.aliexpress.com/wholesale?SearchText=%(q)s'
payload = {'q': search_term, }

# Get resulting webpage
r = requests.get(aliURL % payload)

# Build 'soup' from webpage and filter down to the results of search
soup = BeautifulSoup(r.text, "html5lib")
titles = soup.findAll('a', attrs={'class': 'product'})
itemURL = titles[0]["href"]

separator_marker = '?'
separatedURL = itemURL.split(separator_marker, 1)[0]
separatedURL = "http:" + separatedURL
print separatedURL

IR = requests.get(separatedURL)
Isoup = BeautifulSoup(IR.text, "html5lib")
productname = Isoup.findAll('h1')
print productname
This solution works as long as the item pages don't require JavaScript; if an item does, it will only retrieve the initial page, before the document is ready.
I realise I can use a Python web driver, but I was wondering whether there is any other solution to this problem that would allow for easy automation of a web scraping tool.

Check out Selenium with PhantomJS. Together they handle most of the issues related to JS-generated content on the page; you don't even need to think about these things anymore.
If you're scraping many pages and want to speed things up, you might want to do things asynchronously. For a small to mid-sized setup you can use RQ; for larger projects you can use Celery. What these tools allow you to do is scrape multiple pages at the same time, by distributing the work across separate worker processes (parallelism rather than in-process async concurrency).
Note that the tools I've mentioned so far have nothing to do with asyncio or other async frameworks.
I tried scraping some e-commerce pages and noticed that the program was spending 80% of its time waiting for HTTP calls to return something. Using the above tools you can reduce that 80% to 10% or less.
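Even without RQ or Celery, the "scrape multiple pages at the same time" idea can be sketched with only the standard library; the fetch function and URLs below are placeholders, not part of any real scraper:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=8):
    """Run `fetch` over every URL in a pool of worker threads.

    In a real scraper `fetch` would be something like
    `lambda u: requests.get(u).text`; it is injected here so the
    helper stays testable without network access.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `urls`
        return list(pool.map(fetch, urls))

# Example with a dummy fetch function instead of real HTTP calls:
pages = fetch_all(["/a", "/b", "/c"], lambda u: "page:" + u)
print(pages)  # -> ['page:/a', 'page:/b', 'page:/c']
```

Since most of the time is spent waiting on HTTP responses, worker threads (which overlap that waiting) recover much of the lost 80% even before reaching for a task queue.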

Related

Scraping fanduel with BeautifulSoup, can't find values visible in HTML

I'm trying to scrape lines for a typical baseball game from fanduel using BeautifulSoup but I found (as this person did) that much of the data doesn't show up when I try something standard like
import requests
from bs4 import BeautifulSoup
page = requests.get(<some url>)
soup = BeautifulSoup(page.content, 'html.parser')
I know I can go to Dev Tools -> Network tab -> XHR to get a JSON with data the site is using, but I'm not able to find the same values I'm seeing in the HTML.
I'll give an example but it probably won't be good after a day since the page will be gone. Here's the page on lines for the Rangers Dodgers game tomorrow. You can click and see that (as of right now) the odds for the Dodgers at -1.5 are -146. I'd like to scrape that number (-146) but I can't find it anywhere in the json data.
Any idea how I can find that sort of thing either in the json or in the HTML? Thanks!
Looks like I offered the solution in the reference link you have there. Those lines are in the JSON; they're just in "raw" form, so you need to calculate them out:
import requests

jsonData = requests.get('https://sportsbook.fanduel.com/cache/psevent/UK/1/false/1027510.3.json').json()
money_line = jsonData['eventmarketgroups'][0]['markets'][1]['selections']

def calc_spread_line(priceUp, priceDown, spread):
    if priceDown < priceUp:
        line = int((priceUp / priceDown) * 100)
        spread = spread * -1
    else:
        line = int((priceDown / priceUp) * -100)
    return line, spread

for each in money_line:
    priceUp = each['currentpriceup']
    priceDown = each['currentpricedown']
    team = each['name']
    spread = each['currenthandicap']
    line, spread = calc_spread_line(priceUp, priceDown, spread)
    print('%s: %s %s' % (team, spread, line))
Output:
Texas Rangers: 1.5 122
Los Angeles Dodgers: -1.5 -146
Otherwise you could use selenium as suggested and parse the html that way. It would be less efficient though.
This may be happening because some web pages load their elements using JavaScript, in which case the HTML source you receive via requests may not contain all the elements. You can check this by right-clicking on the page and selecting View Source; if the data you require is in that source file, you can parse it with Beautiful Soup. Otherwise, to get dynamically loaded content, I suggest Selenium.

Web scraping with BeautifulSoup won't work

Ultimately, I'm trying to open all articles of a news website and then make a top 10 of the words used in all the articles. To do this, I first wanted to see how many articles there are so I could iterate over them at some point, haven't really figured out how I want to do everything yet.
To do this, I wanted to use BeautifulSoup4. I think the class I'm trying to get is rendered by JavaScript, as I'm not getting anything back.
This is my code:
import requests
from bs4 import BeautifulSoup

url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")
print(titels)
for titel in titels:
    print(titel)
The article name is sometimes in an h2 and sometimes in an h3. It always has one and the same class, but I can't get anything through that class. It has some parents that use the same name but with an extension such as -wrapper. I don't even know how to use a parent to get what I want, but I think those classes are generated by JavaScript as well. There's also an href I'm interested in, but once again, that probably comes from JavaScript too, as it returns nothing.
Does anyone know how I could use anything (preferably the href, but the article name would be ok as well) by using BeautifulSoup?
In case you don't want to use selenium, this works for me. I've tried it on 2 PCs with different internet connections. Can you try?
from bs4 import BeautifulSoup
import requests

cookies = {"pwv": "2",
           "pws": "functional|analytics|content_recommendation|targeted_advertising|social_media"}
page = requests.get("https://www.ad.nl/", cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll("article")
Then follow kimbo's code to extract h2/h3.
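That h2/h3 extraction step can be tried first against a small inline fragment; the class name ankeiler__title is an assumption based on ad.nl's markup at the time and may have changed:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page; the ankeiler__title class name
# is an assumption taken from ad.nl's markup and may have changed.
html = """
<article>
  <h2 class="ankeiler__title">First headline</h2>
  <h3 class="ankeiler__title">Second headline</h3>
  <h2 class="other">Not a headline</h2>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all accepts a list of tag names, so h2 and h3 are handled together
titles = [t.get_text(strip=True)
          for t in soup.find_all(["h2", "h3"], class_="ankeiler__title")]
print(titles)  # -> ['First headline', 'Second headline']
```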
As @Sri mentioned in the comments, when you open up that URL, a page shows up where you have to accept the cookies first, which requires interaction.
When you need interaction, consider using something like selenium (https://selenium-python.readthedocs.io/).
Here's something that should get you started.
(Edit: you'll need to run pip install selenium before running this code below)
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://ad.nl'

# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)

# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()

# grab the html. It'll wait here until the page is finished loading
html = driver.page_source

# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")

for article in articles:
    # check for article titles in both h2 and h3 elems
    h2_titles = article.findAll('h2', {'class': 'ankeiler__title'})
    h3_titles = article.findAll('h3', {'class': 'ankeiler__title'})
    for t in h2_titles:
        # first I was doing print(t.text), but some of them had leading
        # newlines and things like '22:30', which I assume was the hour of the day
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)
    for t in h3_titles:
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)

# close the browser
driver.close()
This may or may not be exactly what you have in mind, but this is just an example of how to use selenium and beautiful soup. Feel free to copy/use/modify this as you see fit.
And if you're wondering about what selectors to use, read the comment by @JL Peyret.

Cheerio selector doesn't select some elements

I am trying to build a module that does some basic scraping on an official NBA box score page (e.g. https://stats.nba.com/game/0021800083) using request-promise and cheerio. I wrote the following test code:
const rp = require("request-promise");
const co = require("cheerio");

// the object to be exported
var stats = {};

const test = (gameId) => {
    rp(`http://stats.nba.com/game/${gameId}`)
        .then(response => {
            const $ = co.load(response);
            $('td.player').each((i, element) => {
                console.log(element);
            });
        });
};

// TESTING
test("0021800083");

module.exports = stats;
When I inspect the test webpage, there are multiple instances of td tags with class="player", but for some reason selecting them with cheerio doesn't work.
But cheerio does successfully select some elements, including a, script and div tags.
Help would be appreciated!
Using a scraper like request-promise will not work for a site built with AngularJS. The response does not consist of the rendered HTML, as you are probably expecting; you can confirm this by console-logging the response. In order to properly scrape this site you could use PhantomJS, Selenium WebDriver, and the like.
An easier approach is to identify the AJAX call that is providing the data you are after and scrape that instead. To do this, go to the site and open the Network tab in the developer tools. Look at the list of requests and identify which one has the data you are after.
Assuming you are after the player stats in the tables, the one I believe you are looking for is "0021800083_gamedetail.json"
Further reading:
Can you scrape a Angular JS website
Scraping Data from AngularJS loaded page
Best of luck!
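The "identify the data request" step is manual, but once you copy the Network tab's request URLs out, a rough filter can narrow them down; all the URLs below are made up except the feed file named in this answer:

```python
def looks_like_data_feed(url):
    # Heuristic only: JSON data feeds usually end in .json or sit under /api/
    return url.endswith(".json") or "/api/" in url

captured = [
    "https://stats.nba.com/game/0021800083",           # the page itself
    "https://example-cdn.com/app.bundle.js",           # hypothetical asset
    "https://example.com/0021800083_gamedetail.json",  # the feed named above
]

feeds = [u for u in captured if looks_like_data_feed(u)]
print(feeds)  # -> ['https://example.com/0021800083_gamedetail.json']
```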

Scraping Javascript-rendered page that requires login using Python

My issue is that I can't scrape a website that uses login when it renders the page using Javascript.
I can easily log in using this code:
import requests
from lxml import html

payload = {
    "username": "username",
    "password": "password"
}

session_requests = requests.session()
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
Then I can get some values using this code:
result = session_requests.get(agent_url, headers=dict(referer=agent_url))
tree = html.fromstring(result.content)
needed_info = tree.xpath("//div[@class='col-md-6']/div[@class='table-responsive']/table/tbody/tr[22]/td[2]")[0].text
However, not everything is rendered.
I've also tried dryscrape; however, it does not work on Windows.
Selenium is just too heavy for my needs, and I'm having issues installing Spynner (probably because it doesn't support Python 3.6?).
What would you recommend?
I just went and did it using selenium. Everything else was just too much of a hassle for this little project.

How to yield fragment URLs in scrapy using Selenium?

With my limited knowledge of web scraping, I've come to find an issue that is very complex for me, which I'll try to explain as best I can (so I'm open to suggestions or edits to my post).
I started using the web crawling framework Scrapy long ago for my web scraping, and it's still the one I use nowadays. Lately I came across this website, and found that my framework (Scrapy) was not able to iterate over the pages, since this website uses fragment URLs (#) to load the data (the next pages). Then I made a post about that problem (having no idea of the main problem yet): my post
After that, I realized that my framework can't manage it without a JavaScript interpreter or a browser imitation, so the Selenium library was mentioned. I read as much as I could about that library (i.e. example1, example2, example3 and example4). I also found this StackOverflow post that gives some leads on my issue.
So finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages of the website shown above, using Selenium along with Scrapy?
So far, this is the code I'm using, but it doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The required imports...

def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
        "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
    return browser

class MySpider(Spider):
    name = "myspider"
    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part goes through all available pages """
        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                        "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))
            if links:
                ids = self.get_ids(links)
                for link in links:
                    current_id = self.get_single_id(link)
                    print "\nThe Link -> ", link
                    # If the line below is commented out, the code works; otherwise it doesn't
                    yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)
            return ids
        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id
        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
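The fragment-encoding step in get_page_links can be checked in isolation; this Python 3 sketch mirrors it (note that b64encode needs bytes in Python 3, unlike in the Python 2 code above):

```python
import base64
import json

def build_page_url(page):
    # Same page-state payload the spider builds, base64-encoded into the fragment
    data = {"data": {"countryId": "ES", "regionId": "920",
                     "duration": 7, "minPersons": 1},
            "config": {"page": str(page)}}
    encoded = base64.b64encode(json.dumps(data).encode("utf-8")).decode("ascii")
    return "http://www.atraveo.com/es_es/islas_canarias#" + encoded

url = build_page_url(0)
# Round-trip check: decoding the fragment gives back the same page number
decoded = json.loads(base64.b64decode(url.split("#", 1)[1]))
print(decoded["config"]["page"])  # -> 0
```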
So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, which is why I ask my second question. And to avoid having to deal with this kind of issue in the future, I ask my third question.
2 - If the answer to the first question is negative, how could I tackle this issue? I'm open to other means as well.
3 - Can anyone tell me or show me pages where I can learn how to combine web scraping with JavaScript and Ajax? Nowadays, more and more websites use JavaScript and Ajax to load content.
Many thanks in advance!
Selenium is one of the best tools to scrape dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts; it works exactly like a browser's click operations. But I don't prefer it myself.
For getting dynamic data you can use the scrapy + splash combo. From Scrapy you will get all the static data, and Splash handles the other, dynamic content.
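If you try the scrapy + splash route, the glue is mostly configuration; here is a sketch of the usual settings.py additions, based on the scrapy-splash project's documented setup and assuming a Splash instance running on localhost:8050:

```python
# settings.py additions for the scrapy + splash combo (sketch; verify
# against the scrapy-splash README for your versions)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Requests are then issued with scrapy_splash.SplashRequest instead of scrapy.Request, so Splash renders the JavaScript before your spider's callback sees the response.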
Have you looked into BeautifulSoup? It's a very popular web scraping library for Python. As for JavaScript, I would recommend something like Cheerio (if you're asking for a scraping library in JavaScript).
If you mean that the website uses HTTP requests to load content, you could always try to manipulate that manually with something like the requests library.
Hope this helps.
You can definitely use Selenium as a standalone tool to scrape webpages with dynamic content (like AJAX loading).
Selenium will just rely on a WebDriver (basically a web browser) to seek content over the Internet.
Here are a few of them (the most often used):
ChromeDriver
PhantomJS (my favorite)
Firefox
Once you've started it, you can start your bot and parse the HTML content of the webpage.
I included a minimal working example below using Python and ChromeDriver:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')

# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()

driver.close()
See the documentation for more details!
