I'm trying to develop a simple web scraper. I want to extract text without the HTML code. It works on plain HTML, but not in some pages where JavaScript code adds text.
For example, if some JavaScript code adds some text, I can't see it, because when I call:
response = urllib2.urlopen(request)
I get the original text without the added one (because JavaScript is executed in the client).
So, I'm looking for some ideas to solve this problem.
EDIT Sept 2021: phantomjs isn't maintained any more, either
EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.
dryscape isn't maintained anymore and the library dryscape developers recommend is Python 2 only. I have found using Selenium's python library with Phantom JS as a web driver fast enough and easy to get the work done.
Once you have installed Phantom JS, make sure the phantomjs binary is available in the current path:
phantomjs --version
# result:
2.1.1
#Example
To give an example, I created a sample page with following HTML code. (link):
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Javascript scraping test</title>
</head>
<body>
<p id='intro-text'>No javascript support</p>
<script>
document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
</script>
</body>
</html>
without javascript it says: No javascript support and with javascript: Yay! Supports javascript
#Scraping without JS support:
import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text)
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>
#Scraping with JS support:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'
You can also use Python library dryscrape to scrape javascript driven websites.
#Scraping with JS support:
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
We are not getting the correct results because any javascript generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial, unmodified by javascript, DOM.
Therefore we need to render the javascript content before we crawl the page.
As selenium is already mentioned many times in this thread (and how slow it gets sometimes was mentioned also), I will list two other possible solutions.
Solution 1: This is a very nice tutorial on how to use Scrapy to crawl javascript generated content and we are going to follow just that.
What we will need:
Docker installed in our machine. This is a plus over other solutions until this point, as it utilizes an OS-independent platform.
Install Splash following the instruction listed for our corresponding OS.Quoting from splash documentation:
Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.
Essentially we are going to use Splash to render Javascript generated content.
Run the splash server: sudo docker run -p 8050:8050 scrapinghub/splash.
Install the scrapy-splash plugin: pip install scrapy-splash
Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update the settings.py:
Then go to your scrapy project’s settings.py and set these middlewares:
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
The URL of the Splash server(if you’re using Win or OSX this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):
SPLASH_URL = 'http://localhost:8050'
And finally you need to set these values too:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Finally, we can use a SplashRequest:
In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest(or SplashFormRequest) to render the page. Here’s a simple example:
class MySpider(scrapy.Spider):
name = "jsscraper"
start_urls = ["http://quotes.toscrape.com/js/"]
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(
url=url, callback=self.parse, endpoint='render.html'
)
def parse(self, response):
for q in response.css("div.quote"):
quote = QuoteItem()
quote["author"] = q.css(".author::text").extract_first()
quote["quote"] = q.css(".text::text").extract_first()
yield quote
SplashRequest renders the URL as html and returns the response which you can use in the callback(parse) method.
Solution 2: Let's call this experimental at the moment (May 2018)...
This solution is for Python's version 3.6 only (at the moment).
Do you know the requests module (well who doesn't)?
Now it has a web crawling little sibling: requests-HTML:
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
Install requests-html: pipenv install requests-html
Make a request to the page's url:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(a_page_url)
Render the response to get the Javascript generated bits:
r.html.render()
Finally, the module seems to offer scraping capabilities.
Alternatively, we can try the well-documented way of using BeautifulSoup with the r.html object we just rendered.
Maybe selenium can do it.
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)
htmlSource = driver.page_source
If you have ever used the Requests module for python before, I recently found out that the developer created a new module called Requests-HTML which now also has the ability to render JavaScript.
You can also visit https://html.python-requests.org/ to learn more about this module, or if your only interested about rendering JavaScript then you can visit https://html.python-requests.org/?#javascript-support to directly learn how to use the module to render JavaScript using Python.
Essentially, Once you correctly install the Requests-HTML module, the following example, which is shown on the above link, shows how you can use this module to scrape a website and render JavaScript contained within the website:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>' #This is the result.
I recently learnt about this from a YouTube video. Click Here! to watch the YouTube video, which demonstrates how the module works.
It sounds like the data you're really looking for can be accessed via secondary URL called by some javascript on the primary page.
While you could try running javascript on the server to handle this, a simpler approach to might be to load up the page using Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.
This seems to be a good solution also, taken from a great blog post
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
#Take this class for granted.Just use result of rendering.
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important.Converting QString to Ascii for lxml to process
# The following returns an lxml element tree
archive_links = html.fromstring(str(result.toAscii()))
print archive_links
# The following returns an array containing the URLs
raw_links = archive_links.xpath('//div[#class="campaign"]/a/#href')
print raw_links
Selenium is the best for scraping JS and Ajax content.
Check this article for extracting data from the web using Python
$ pip install selenium
Then download Chrome webdriver.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://www.python.org/")
nav = browser.find_element_by_id("mainnav")
print(nav.text)
Easy, right?
You can also execute javascript using webdriver.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)
driver.execute_script('document.title')
or store the value in a variable
result = driver.execute_script('var text = document.title ; return text')
I personally prefer using scrapy and selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites that almost all contain javascript in one form or another. Here's an example:
Use the scrapy startproject to create your scraper and write your spider, the skeleton can be as simple as this:
import scrapy
class MySpider(scrapy.Spider):
name = 'my_spider'
start_urls = ['https://somewhere.com']
def start_requests(self):
yield scrapy.Request(url=self.start_urls[0])
def parse(self, response):
# do stuff with results, scrape items etc.
# now were just checking everything worked
print(response.body)
The real magic happens in the middlewares.py. Overwrite two methods in the downloader middleware, __init__ and process_request, in the following way:
# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
class SampleProjectDownloaderMiddleware(object):
def __init__(self):
SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
desired_capabilities=chrome_options.to_capabilities())
def process_request(self, request, spider):
self.driver.get(request.url)
# sleep a bit so the page has time to load
# or monitor items on page to continue as soon as page ready
sleep(4)
# if you need to manipulate the page content like clicking and scrolling, you do it here
# self.driver.find_element_by_css_selector('.my-class').click()
# you only need the now properly and completely rendered html from your page to get results
body = deepcopy(self.driver.page_source)
# copy the current url in case of redirects
url = deepcopy(self.driver.current_url)
return HtmlResponse(url, body=body, encoding='utf-8', request=request)
Dont forget to enable this middlware by uncommenting the next lines in the settings.py file:
DOWNLOADER_MIDDLEWARES = {
'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,}
Next for dockerization. Create your Dockerfile from a lightweight image (I'm using python Alpine here), copy your project directory to it, install requirements:
# Use an official Python runtime as a parent image
FROM python:3.6-alpine
# install some packages necessary to scrapy and then curl because it's handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev
WORKDIR /my_scraper
ADD requirements.txt /my_scraper/
RUN pip install -r requirements.txt
ADD . /scrapers
And finally bring it all together in docker-compose.yaml:
version: '2'
services:
selenium:
image: selenium/standalone-chrome
ports:
- "4444:4444"
shm_size: 1G
my_scraper:
build: .
depends_on:
- "selenium"
environment:
- SELENIUM_LOCATION=samplecrawler_selenium_1
volumes:
- .:/my_scraper
# use this command to keep the container running
command: tail -f /dev/null
Run docker-compose up -d. If you're doing this the first time it will take a while for it to fetch the latest selenium/standalone-chrome and the build your scraper image as well.
Once it's done, you can check that your containers are running with docker ps and also check that the name of the selenium container matches that of the environment variable that we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1).
Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh , the command for me was docker exec -ti samplecrawler_my_scraper_1 sh, cd into the right directory and run your scraper with scrapy crawl my_spider.
The entire thing is on my github page and you can get it from here
A mix of BeautifulSoup and Selenium works very well for me.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "myDynamicElement"))) #waits 10 seconds until element is located. Can have other wait conditions such as visibility_of_element_located or text_to_be_present_in_element
html = driver.page_source
soup = bs(html, "lxml")
dynamic_text = soup.find_all("p", {"class":"class_name"}) #or other attributes, optional
else:
print("Couldnt locate element")
P.S. You can find more wait conditions here
Using PyQt5
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
import sys
import bs4 as bs
import urllib.request
class Client(QWebEnginePage):
def __init__(self,url):
global app
self.app = QApplication(sys.argv)
QWebEnginePage.__init__(self)
self.html = ""
self.loadFinished.connect(self.on_load_finished)
self.load(QUrl(url))
self.app.exec_()
def on_load_finished(self):
self.html = self.toHtml(self.Callable)
print("Load Finished")
def Callable(self,data):
self.html = data
self.app.quit()
# url = ""
# client_response = Client(url)
# print(client_response.html)
You'll want to use urllib, requests, beautifulSoup and selenium web driver in your script for different parts of the page, (to name a few).
Sometimes you'll get what you need with just one of these modules.
Sometimes you'll need two, three, or all of these modules.
Sometimes you'll need to switch off the js on your browser.
Sometimes you'll need header info in your script.
No websites can be scraped the same way and no website can be scraped in the same way forever without having to modify your crawler, usually after a few months. But they can all be scraped! Where there's a will there's a way for sure.
If you need scraped data continuously into the future just scrape everything you need and store it in .dat files with pickle.
Just keep searching how to try what with these modules and copying and pasting your errors into the Google.
Pyppeteer
You might consider Pyppeteer, a Python port of the Chrome/Chromium driver front-end Puppeteer.
Here's a simple example to show how you can use Pyppeteer to access data that was injected into the page dynamically:
import asyncio
from pyppeteer import launch
async def main():
browser = await launch({"headless": True})
[page] = await browser.pages()
# normally, you go to a live site...
#await page.goto("http://www.example.com")
# but for this example, just set the HTML directly:
await page.setContent("""
<body>
<script>
// inject content dynamically with JS, not part of the static HTML!
document.body.innerHTML = `<p>hello world</p>`;
</script>
</body>
""")
print(await page.content()) # shows that the `<p>` was inserted
# evaluate a JS expression in browser context and scrape the data
expr = "document.querySelector('p').textContent"
print(await page.evaluate(expr, force_expr=True)) # => hello world
await browser.close()
asyncio.get_event_loop().run_until_complete(main())
See Pyppeteer's reference docs.
Try accessing the API directly
A common scenario you'll see in scraping is that the data is being requested asynchronously from an API endpoint by the webpage. A minimal example of this would be the following site:
<body>
<script>
fetch("https://jsonplaceholder.typicode.com/posts/1")
.then(res => {
if (!res.ok) throw Error(res.status);
return res.json();
})
.then(data => {
// inject data dynamically via JS after page load
document.body.innerText = data.title;
})
.catch(err => console.error(err))
;
</script>
</body>
In many cases, the API will be protected by CORS or an access token or prohibitively rate limited, but in other cases it's publicly-accessible and you can bypass the website entirely. For CORS issues, you might try cors-anywhere.
The general procedure is to use your browser's developer tools' network tab to search the requests made by the page for keywords/substrings of the data you want to scrape. Often, you'll see an unprotected API request endpoint with a JSON payload that you can access directly with urllib or requests modules. That's the case with the above runnable snippet which you can use to practice. After clicking "run snippet", here's how I found the endpoint in my network tab:
This example is contrived; the endpoint URL will likely be non-obvious from looking at the static markup because it could be dynamically assembled, minified and buried under dozens of other requests and endpoints. The network request will also show any relevant request payload details like access token you may need.
After obtaining the endpoint URL and relevant details, build a request in Python using a standard HTTP library and request the data:
>>> import requests
>>> res = requests.get("https://jsonplaceholder.typicode.com/posts/1")
>>> data = res.json()
>>> data["title"]
'sunt aut facere repellat provident occaecati excepturi optio reprehenderit'
When you can get away with it, this tends to be much easier, faster and more reliable than scraping the page with Selenium, Pyppeteer, Scrapy or whatever the popular scraping libraries are at the time you're reading this post.
If you're unlucky and the data hasn't arrived via an API request that returns the data in a nice format, it could be part of the original browser's payload in a <script> tag, either as a JSON string or (more likely) a JS object. For example:
<body>
<script>
var someHardcodedData = {
userId: 1,
id: 1,
title: 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
body: 'quia et suscipit\nsuscipit recusandae con sequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'
};
document.body.textContent = someHardcodedData.title;
</script>
</body>
There's no one-size-fits-all way to obtain this data. The basic technique is to use BeautifulSoup to access the <script> tag text, then apply a regex or a parse to extract the object structure, JSON string, or whatever format the data might be in. Here's a proof-of-concept on the sample structure shown above:
import json
import re
from bs4 import BeautifulSoup
# pretend we've already used requests to retrieve the data,
# so we hardcode it for the purposes of this example
text = """
<body>
<script>
var someHardcodedData = {
userId: 1,
id: 1,
title: 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
body: 'quia et suscipit\nsuscipit recusandae con sequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'
};
document.body.textContent = someHardcodedData.title;
</script>
</body>
"""
soup = BeautifulSoup(text, "lxml")
script_text = str(soup.select_one("script"))
pattern = r"title: '(.*?)'"
print(re.search(pattern, script_text, re.S).group(1))
Check out these resources for parsing JS objects that aren't quite valid JSON:
How to convert raw javascript object to python dictionary?
How to Fix JSON Key Values without double-quotes?
Here are some additional case studies/proofs-of-concept where scraping was bypassed using an API:
How can I scrape yelp reviews and star ratings into CSV using Python beautifulsoup
Beautiful Soup returns None on existing element
Extract data from BeautifulSoup Python
Scraping Bandcamp fan collections via POST (uses a hybrid approach where an initial request was made to the website to extract a token from the markup using BeautifulSoup which was then used in a second request to a JSON endpoint)
If all else fails, try one of the many dynamic scraping libraries listed in this thread.
Playwright-Python
Yet another option is playwright-python, a port of Microsoft's Playwright (itself a Puppeteer-influenced browser automation library) to Python.
Here's the minimal example of selecting an element and grabbing its text:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto("http://whatsmyuseragent.org/")
ua = page.query_selector(".user-agent");
print(ua.text_content())
browser.close()
As mentioned, Selenium is a good choice for rendering the results of the JavaScript:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)
url = "https://www.example.com"
browser.get(url)
And gazpacho is a really easy library to parse over the rendered html:
from gazpacho import Soup
soup = Soup(browser.page_source)
soup.find("a").attrs['href']
I recently used requests_html library to solve this problem.
Their expanded documentation at readthedocs.io is pretty good (skip the annotated version at pypi.org). If your use case is basic, you are likely to have some success.
from requests_html import HTMLSession
session = HTMLSession()
response = session.request(method="get",url="www.google.com/")
response.html.render()
If you are having trouble rendering the data you need with response.html.render(), you can pass some javascript to the render function to render the particular js object you need. This is copied from their docs, but it might be just what you need:
If script is specified, it will execute the provided JavaScript at
runtime. Example:
script = """
() => {
return {
width: document.documentElement.clientWidth,
height: document.documentElement.clientHeight,
deviceScaleFactor: window.devicePixelRatio,
}
}
"""
Returns the return value of the executed script, if any is provided:
>>> response.html.render(script=script)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
In my case, the data I wanted were the arrays that populated a javascript plot but the data wasn't getting rendered as text anywhere in the html. Sometimes its not clear at all what the object names are of the data you want if the data is populated dynamically. If you can't track down the js objects directly from view source or inspect, you can type in "window" followed by ENTER in the debugger console in the browser (Chrome) to pull up a full list of objects rendered by the browser. If you make a few educated guesses about where the data is stored, you might have some luck finding it there. My graph data was under window.view.data in the console, so in the "script" variable passed to the .render() method quoted above, I used:
return {
data: window.view.data
}
Easy and Quick Solution:
I was dealing with same problem. I want to scrape some data which is build with JavaScript. If I scrape only text from this site with BeautifulSoup then I ended with tags in text.
I want to render this tag and wills to grab information from this.
Also, I dont want to use heavy frameworks like Scrapy and selenium.
So, I found that get method of requests module takes urls, and it actually renders the script tag.
Example:
import requests
custom_User_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
url = "https://www.abc.xyz/your/url"
response = requests.get(url, headers={"User-Agent": custom_User_agent})
html_text = response.text
This will renders load site and renders tags.
Hope this will help as quick and easy solution to render site which is loaded with script tags.
Related
I want to scan some websites and would like to get all the java script files names and content.I tried python requests with BeautifulSoup but wasn't able to get the scripts details and contents.am I missing something ?
I have been trying lot of methods to find but I felt like stumbling in the dark.
This is the code I am trying
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content)
You can get all the linked JavaScript code use the below code:
l = [i.get('src') for i in soup.find_all('script') if i.get('src')]
soup.find_all('script') returns a list of all the <script> tags in the page.
A list comprehension is used here to loop over all the elements in the list which returned by soup.find_all('script').
i is a dict like object, use .get('src') to check if it has src attribute. If not, ignore it. Otherwise, put it into a list (which's called l in the example).
The output, in this case looks like below:
['http://adserver.adtech.de/addyn/3.0/1602/5506153/0/6490/ADTECH;loc=700;target=_blank;grp=[group]',
'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
'http://js.genieessp.com/t/057/794/a1057794.js',
'http://ib.adnxs.com/ttj?id=5620689&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
'http://ib.adnxs.com/ttj?id=5531763',
'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
'http://xp2.zedo.com/jsc/xp2/fo.js',
'http://www.marunadanmalayali.com/js/mnmads.js',
'http://www.marunadanmalayali.com/js/jquery-2.1.0.min.js',
'http://www.marunadanmalayali.com/js/jquery.hoverIntent.minified.js',
'http://www.marunadanmalayali.com/js/jquery.dcmegamenu.1.3.3.js',
'http://www.marunadanmalayali.com/js/jquery.cookie.js',
'http://www.marunadanmalayali.com/js/swanalekha-ml.js',
'http://www.marunadanmalayali.com/js/marunadan.js?r=1875',
'http://www.marunadanmalayali.com/js/taboola_home.js',
'http://d8.zedo.com/jsc/d8/fo.js']
My code missed some links because they're not in the HTML source actually.
You can see them in the console:
But they're not in the source:
Usually, that's because these links were generated by JavaScript. And the requests module doesn't run any JavaScript in the page like a real browser - it only send a request to get the HTML source.
If you also need them, you have to use another module to run the JavaScript in that page, and you can see these links then. For that, I'd suggest use selenium - which runs a real browser so it can runs JavaScript in the page.
For example (make sure that you have already installed selenium and a web driver for it):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome() # use Chrome driver for example
driver.get('http://www.marunadanmalayali.com/')
soup = BeautifulSoup(driver.page_source, "html.parser")
l = [i.get('src') for i in soup.find_all('script') if i.get('src')]
__import__('pprint').pprint(l)
You can use a select with script[src] which will only find script tags with a src, you don't need to call .get multiple times:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content)
src = [sc["src"] for sc in soup.select("script[src]")]
You can also specify src=True with find_all to do the same:
src = [sc["src"] for sc in soup.find_all("script",src=True)]
Which will both give you the same output:
['http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://js.genieessp.com/t/052/954/a1052954.js', '//s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', 'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', 'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', 'http://www.marunadanmalayali.com/js/mnmcombined2.min.js']
Also if you use selenium, you can use it with PhantomJs for headless browsing, you don't need beautufulSoup at all if you use selenium, you can use the same css selector directly in selenium:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://www.marunadanmalayali.com/')
src = [sc.get_attribute("src") for sc in driver.find_elements_by_css_selector("script[src]")]
print(src)
Which gives you all the links:
u'https://pixel.yabidos.com/fltiu.js?qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&nci=&adtg=96331&nai=', u'http://gum.criteo.com/sync?c=72&r=2&j=TRC.getRTUS', u'http://b.scorecardresearch.com/beacon.js', u'http://cdn.taboola.com/libtrc/impl.201-1-RELEASE.js', u'http://p165.atemda.com/JSAdservingMP.ashx?pc=1&pbId=165&clk=&exm=&jsv=1.84&tsv=2.26&cts=1459160775430&arp=0&fl=0&vitp=0&vit=&jscb=&url=&fp=0;400;300;20&oid=&exr=&mraid=&apid=&apbndl=&mpp=0&uid=&cb=54613943&pId0=64056124&rank0=1&gid0=64056124:1c59ac&pp0=&clk0=[External%20click-tracking%20goes%20here%20(NOT%20URL-encoded)]&rpos0=0&ecpm0=&ntv0=&ntl0=&adsid0=', u'http://cdn.taboola.com/libtrc/marunadanaalayali-network/loader.js', u'http://s.atemda.com/Admeta.js', u'http://www.google-analytics.com/analytics.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://js.genieessp.com/t/052/954/a1052954.js', u'http://s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7219/9/fm.js?c=7219&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1936&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.025054786819964647&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://cas.criteo.com/delivery/ajs.php?zoneid=308686&nodis=1&cb=38688817829&exclude=undefined&charset=UTF-8&loc=http%3A//www.marunadanmalayali.com/', u'http://ads.pubmatic.com/AdServer/js/showad.js', u'http://showads.pubmatic.com/AdServer/AdServerServlet?pubId=135167&siteId=135548&adId=600924&kadwidth=300&kadheight=250&SAVersion=2&js=1&kdntuid=1&pageURL=http%3A%2F%2Fwww.marunadanmalayali.com%2F&inIframe=0&kadpageurl=marunadanmalayali.com&operId=3&kltstamp=2016-3-28%2011%3A26%3A13&timezone=1&screenResolution=1024x768&ranreq=0.8869257988408208&pmUniAdId=0&adVisibility=2&adPosition=999x664', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7213/9/fm.js?c=7213&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1948&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.08655649935826659&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://ib.adnxs.com/ttj?ttjb=1&bdc=1459160761&bdh=ZllBLkzcj2dGDVPeS0Sw_OTWjgQ.&tpuids=eyJ0cHVpZHMiOlt7InByb3ZpZGVyIjoiY3JpdGVvIiwidXNlcl9pZCI6Il9KRC1PUmhLX3hLczd1cUJhbjlwLU1KQ2VZbDQ2VVUxIn1dfQ==&view_iv=0&view_pos=664,2096&view_ws=400,300&view_vs=3&bdref=http%3A%2F%2Fwww.marunadanmalayali.com%2F&bdtop=true&bdifs=0&bstk=http%3A%2F%2Fwww.marunadanmalayali.com%2F&&id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', u'http://www.marunadanmalayali.com/js/mnmcombined2.min.js', u'http://pixel.yabidos.com/iftfl.js?ver=1.4.2&qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&adtg=96331&nci=&nai=&nsi=&cstm1=&cstm2=&cstm3=&kqt=&xc=&test=&od1=&od2=&co=0&tps=34&rnd=3m17uji8ftbf']
Ultimately, I'm trying to open all articles of a news website and then make a top 10 of the words used in all the articles. To do this, I first wanted to see how many articles there are so I could iterate over them at some point, haven't really figured out how I want to do everything yet.
To do this, I wanted to use BeautifulSoup4. I think the class I'm trying to get is Javascript as I'm not getting anything back.
This is my code:
url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")
print(titels)
for titel in titels:
print(titel)
The article name is sometimes an h2 or an h3. It always has one and the same class, but I can't get anything through that class. It has some parents but it uses the same name but with the extension -wrapper for example. I don't even know how to use a parent to get what I want but I think that those classes are Javascript as well. There's also an href which I'm interested in. But once again, that is probably also Javascript as it returns nothing.
Does anyone know how I could use anything (preferably the href, but the article name would be ok as well) by using BeautifulSoup?
In case you don't want to use selenium. This works for me. I've tried on 2 PCs with different internet connection. Can you try?
from bs4 import BeautifulSoup
import requests
cookies={"pwv":"2",
"pws":"functional|analytics|content_recommendation|targeted_advertising|social_media"}
page=requests.get("https://www.ad.nl/",cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll("article")
Then follow kimbo's code to extract h2/h3.
As #Sri mentioned in the comments, when you open up that url, you have a page show up where you have to accept the cookies first, which requires interaction.
When you need interaction, consider using something like selenium (https://selenium-python.readthedocs.io/).
Here's something that should get you started.
(Edit: you'll need to run pip install selenium before running this code below)
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ad.nl'
# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)
# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()
# grab the html. It'll wait here until the page is finished loading
html = driver.page_source
# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")
for article in articles:
# check for article titles in both h2 and h3 elems
h2_titles = article.findAll('h2', {'class': 'ankeiler__title'})
h3_titles = article.findAll('h3', {'class': 'ankeiler__title'})
for t in h2_titles:
# first I was doing print(t.text), but some of them had leading
# newlines and things like '22:30', which I assume was the hour of the day
text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
print(text)
for t in h3_titles:
text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
print(text)
# close the browser
driver.close()
This may or may not be exactly what you have in mind, but this is just an example of how to use selenium and beautiful soup. Feel free to copy/use/modify this as you see fit.
And if you're wondering about what selectors to use, read the comment by #JL Peyret.
from my poor knowledge about webscraping I've come about to find a very complex issue for me, that I will try to explain the best I can (hence I'm opened to suggestions or edits in my post).
I started using the web crawling framework 'Scrapy' long ago to make my webscraping, and it's still the one that I use nowadays. Lately, I came across this website, and found that my framework (Scrapy) was not able to iterate over the pages since this website uses Fragment URLs (#) to load the data (the next pages). Then I made a post about that problem (having no idea of the main problem yet): my post
After that, I realized that my framework can't make it without a JavaScript interpreter or a browser imitation, so they mentioned the Selenium library. I read as much as I could about that library (i.e. example1, example2, example3 and example4). I also found this StackOverflow's post that gives some tracks about my issue.
So Finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages from the website shown above, using Selenium along with scrapy?
So far, this is the code I'm using, but doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The require imports...
def getBrowser():
path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
"(KHTML, like Gecko) Chrome/15.0.87")
browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
return browser
class MySpider(Spider):
name = "myspider"
browser = getBrowser()
def start_requests(self):
the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)
def parse(self, response):
self.get_page_links()
def get_page_links(self):
""" This first part, goes through all available pages """
for i in xrange(1, 3): # 210
new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
"config": {"page": str(i)}}
json_data = json.dumps(new_data)
new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
self.browser.get(new_url)
print "\nThe new URL is -> ", new_url, "\n"
content = self.browser.page_source
self.get_item_links(content)
def get_item_links(self, body=""):
if body:
""" This second part, goes through all available items """
raw_links = re.findall(r'listclickable.+?>', body)
links = []
if raw_links:
for raw_link in raw_links:
new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"",
"")
links.append(str(new_link))
if links:
ids = self.get_ids(links)
for link in links:
current_id = self.get_single_id(link)
print "\nThe Link -> ", link
# If commented the line below, code works, doesn't otherwise
yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)
def get_ids(self, list1=[]):
if list1:
ids = []
for elem in list1:
raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
ids.append(raw_id)
return ids
else:
return []
def get_single_id(self, text=""):
if text:
raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
return raw_id
else:
return ""
def parse_room(self, response):
# More scraping code...
So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, so for that I did my second question. And to avoid having to do these kind of issues in the future, I did my third question.
2 - If the answer to the first question is negative, how could I tackle this issue? I'm opened to another means, otherwise
3 - Can anyone tell me or show me pages where I can learn how to solve/combine webscraping along javaScript and Ajax? Nowadays are more the websites that use JavaScript and Ajax scripts to load content
Many thanks in advance!
Selenium is one of the best tools to scrape dynamic data.you can use selenium with any web browser to fetch the data that is loading from scripts.That works exactly like the browser click operations.But I am not prefering it.
For getting dynamic data you can use scrapy + splash combo. From scrapy you wil get all the static data and splash for other dynamic contents.
Have you looked into BeautifulSoup? It's a very popular web scraping library for python. As for JavaScript, I would recommend something like Cheerio (If you're asking for a scraping library in JavaScript)
If you are meaning that the website uses HTTP requests to load content, you could always try to manipulate that manually with something like the requests library.
Hope this helps
You can definitely use Selenium as a standalone to scrap webpages with dynamic content (like AJAX loading).
Selenium will just rely on a WebDriver (basically a web browser) to seek content over the Internet.
Here are a few of them (but the most often used) :
ChromeDriver
PhantomJS (my favorite)
Firefox
Once your started, you can start your bot and parse the html content of the webpage.
I included a minimal working example below using Python and ChromeDriver :
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')
# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()
driver.close()
See the documentation for more details !
I want to batch export a list of flashcard sets/decks from quizlet. Rather than manually clicking on the menu, export, tick 'include pictures', copy, paste into new blank text file, save.... it would be easier to write a script to do this.
How can I do this? Can someone help give me head start (and I can do the rest, etc).
Javascript? JQuery? Python?
Need to parse a text file of URLs (the direct links to each deck).
eg.
https://quizlet.com/215441327/f1-u1a-making-friends-flash-cards/
https://quizlet.com/218503855/f1-u1b-making-friends-flash-cards/
and export.
UPDATE: Is there a way to fire the onclick for that "MORE" button (ellipsis dots), and fire click the "EXPORT"?
Then fire click the checkbox "INCLUDE PICTURES". Then grab the textarea?
My preference is python. for starting point see the code below. I am using BeautifulSoup package. See example below as a starting point.
from bs4 import BeautifulSoup
import requests
url = "https://quizlet.com/215441327/f1-u1a-making-friends-flash-cards/"
headers = {'User-Agent':'Mozilla/5.0'}
page = requests.get(url)
soup = BeautifulSoup(page.text, "html5lib")
To get the english words
for en in soup.select(".TermText.notranslate.lang-en"):
print(en.text.strip())
Outputs:
enjoy
cheerful
everyone
sporty
sometimes
practise
practice
friend
favourite
help
for the other languages
for ch in soup.select(".TermText.notranslate.lang-zh-TW"):
print(ch.text.strip())
Outputs:
請享用
高興的
每個人
運動型的
有時
練習
練習
朋友
最喜歡的
幫助
You can use selenium python library to interact with the webpage also:
from selenium import webdriver
import os
chromedriver = "C:\Users\pappuj\Downloads\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
url='http://www.zoover.nl/cyprus'
driver.get(url)
driver.find_element_by_class_name('next').click()
Im trying to scrape products from a web store similar to how Dropified scrapes items from Ali express,
Current Solution(the way it's set up it will only try and access the first item):
from bs4 import BeautifulSoup
import requests
import time
import re
# Get search inputs from user
search_term = raw_input('Search Term:')
# Build URL to imdb.com
aliURL = 'https://www.aliexpress.com/wholesale?SearchText=%(q)s'
payload = {'q': search_term, }
# Get resulting webpage
r = requests.get(aliURL % payload)
# Build 'soup' from webpage and filter down to the results of search
soup = BeautifulSoup(r.text, "html5lib")
titles = soup.findAll('a', attrs = {'class': 'product'})
itemURL = titles[0]["href"]
seperatemarker = '?'
seperatedURL = itemURL.split(seperatemarker, 1)[0]
seperatedURL = "http:" + seperatedURL
print seperatedURL
IR = requests.get(seperatedURL)
Isoup = BeautifulSoup(IR.text, "html5lib")
productname = Isoup.findAll('h1')
print productname
This solution works assuming that the items on the page don't require javascript if the item does it will only retrieve the initial page before the document is ready.
I realise I can use a python web driver, but I was wondering if there was any other solution to this problem that would allow for easy automation of a web scraping tool.
Checkout selenium with phantomjs. selenium and phantomjs handle most of the issues related to JS generated content on the page. You don't even need to think about these things anymore.
If you're scraping many pages and want to speed things up, you might want to do things asynchronously. For a small to mid-sized setup, you can use RQ. For larger projects you can use celery. What these tools allow you to do is scrape multiple pages at the same time(not concurrently though).
Note that the tools I've mentioned so far have nothing to do with asyncio or other async frameworks.
I tried scraping some e-commerce pages and noticed that the program was spending 80% of its time waiting for HTTP calls to return something. Using the above tools you can reduce that 80% to 10% or less.