Following my previous question :
how to fetch javascript contents in python
I tried to make another script which fetches the data from a javascript. After getting the webpage contents of course.
But, it's just not showing up the content I want. I want to find "content_id" from the javascript of the page. This is the page :- http://www.hulu.com/watch/815743
Here's what I have right now.
import re
import requests
from bs4 import BeautifulSoup
import os
import fileinput
Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text)
#print soup
subtitles = soup.findAll('script',{'type':'text/javascript'})
pattern = re.compile(r'"content_id":"(.*?)"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print pattern.search(script.text).group(1)
I get this error :
AttributeError: 'NoneType' object has no attribute 'text'
Any idea how to solve this issue..?
There are two problems in your regular expression pattern:
the quotes are escaped with backslashes in the script contents, take that into account
there is a whitespace after the colon
Here is the fixed version:
pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)
Works for me, getting 60585710 as a result.
FYI, here is the complete code that I'm executing:
import re
import requests
from bs4 import BeautifulSoup
Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text)
pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print pattern.search(script.text).group(1)
Related
I want to scan some websites and would like to get all the java script files names and content.I tried python requests with BeautifulSoup but wasn't able to get the scripts details and contents.am I missing something ?
I have been trying lot of methods to find but I felt like stumbling in the dark.
This is the code I am trying
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content)
You can get all the linked JavaScript code use the below code:
l = [i.get('src') for i in soup.find_all('script') if i.get('src')]
soup.find_all('script') returns a list of all the <script> tags in the page.
A list comprehension is used here to loop over all the elements in the list which returned by soup.find_all('script').
i is a dict like object, use .get('src') to check if it has src attribute. If not, ignore it. Otherwise, put it into a list (which's called l in the example).
The output, in this case looks like below:
['http://adserver.adtech.de/addyn/3.0/1602/5506153/0/6490/ADTECH;loc=700;target=_blank;grp=[group]',
'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
'http://js.genieessp.com/t/057/794/a1057794.js',
'http://ib.adnxs.com/ttj?id=5620689&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
'http://ib.adnxs.com/ttj?id=5531763',
'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
'http://xp2.zedo.com/jsc/xp2/fo.js',
'http://www.marunadanmalayali.com/js/mnmads.js',
'http://www.marunadanmalayali.com/js/jquery-2.1.0.min.js',
'http://www.marunadanmalayali.com/js/jquery.hoverIntent.minified.js',
'http://www.marunadanmalayali.com/js/jquery.dcmegamenu.1.3.3.js',
'http://www.marunadanmalayali.com/js/jquery.cookie.js',
'http://www.marunadanmalayali.com/js/swanalekha-ml.js',
'http://www.marunadanmalayali.com/js/marunadan.js?r=1875',
'http://www.marunadanmalayali.com/js/taboola_home.js',
'http://d8.zedo.com/jsc/d8/fo.js']
My code missed some links because they're not in the HTML source actually.
You can see them in the console:
But they're not in the source:
Usually, that's because these links were generated by JavaScript. And the requests module doesn't run any JavaScript in the page like a real browser - it only send a request to get the HTML source.
If you also need them, you have to use another module to run the JavaScript in that page, and you can see these links then. For that, I'd suggest use selenium - which runs a real browser so it can runs JavaScript in the page.
For example (make sure that you have already installed selenium and a web driver for it):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome() # use Chrome driver for example
driver.get('http://www.marunadanmalayali.com/')
soup = BeautifulSoup(driver.page_source, "html.parser")
l = [i.get('src') for i in soup.find_all('script') if i.get('src')]
__import__('pprint').pprint(l)
You can use a select with script[src] which will only find script tags with a src, you don't need to call .get multiple times:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content)
src = [sc["src"] for sc in soup.select("script[src]")]
You can also specify src=True with find_all to do the same:
src = [sc["src"] for sc in soup.find_all("script",src=True)]
Which will both give you the same output:
['http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://js.genieessp.com/t/052/954/a1052954.js', '//s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', 'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', 'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', 'http://www.marunadanmalayali.com/js/mnmcombined2.min.js']
Also if you use selenium, you can use it with PhantomJs for headless browsing, you don't need beautufulSoup at all if you use selenium, you can use the same css selector directly in selenium:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://www.marunadanmalayali.com/')
src = [sc.get_attribute("src") for sc in driver.find_elements_by_css_selector("script[src]")]
print(src)
Which gives you all the links:
u'https://pixel.yabidos.com/fltiu.js?qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&nci=&adtg=96331&nai=', u'http://gum.criteo.com/sync?c=72&r=2&j=TRC.getRTUS', u'http://b.scorecardresearch.com/beacon.js', u'http://cdn.taboola.com/libtrc/impl.201-1-RELEASE.js', u'http://p165.atemda.com/JSAdservingMP.ashx?pc=1&pbId=165&clk=&exm=&jsv=1.84&tsv=2.26&cts=1459160775430&arp=0&fl=0&vitp=0&vit=&jscb=&url=&fp=0;400;300;20&oid=&exr=&mraid=&apid=&apbndl=&mpp=0&uid=&cb=54613943&pId0=64056124&rank0=1&gid0=64056124:1c59ac&pp0=&clk0=[External%20click-tracking%20goes%20here%20(NOT%20URL-encoded)]&rpos0=0&ecpm0=&ntv0=&ntl0=&adsid0=', u'http://cdn.taboola.com/libtrc/marunadanaalayali-network/loader.js', u'http://s.atemda.com/Admeta.js', u'http://www.google-analytics.com/analytics.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://js.genieessp.com/t/052/954/a1052954.js', u'http://s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7219/9/fm.js?c=7219&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1936&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.025054786819964647&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://cas.criteo.com/delivery/ajs.php?zoneid=308686&nodis=1&cb=38688817829&exclude=undefined&charset=UTF-8&loc=http%3A//www.marunadanmalayali.com/', u'http://ads.pubmatic.com/AdServer/js/showad.js', u'http://showads.pubmatic.com/AdServer/AdServerServlet?pubId=135167&siteId=135548&adId=600924&kadwidth=300&kadheight=250&SAVersion=2&js=1&kdntuid=1&pageURL=http%3A%2F%2Fwww.marunadanmalayali.com%2F&inIframe=0&kadpageurl=marunadanmalayali.com&operId=3&kltstamp=2016-3-28%2011%3A26%3A13&timezone=1&screenResolution=1024x768&ranreq=0.8869257988408208&pmUniAdId=0&adVisibility=2&adPosition=999x664', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7213/9/fm.js?c=7213&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1948&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.08655649935826659&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://ib.adnxs.com/ttj?ttjb=1&bdc=1459160761&bdh=ZllBLkzcj2dGDVPeS0Sw_OTWjgQ.&tpuids=eyJ0cHVpZHMiOlt7InByb3ZpZGVyIjoiY3JpdGVvIiwidXNlcl9pZCI6Il9KRC1PUmhLX3hLczd1cUJhbjlwLU1KQ2VZbDQ2VVUxIn1dfQ==&view_iv=0&view_pos=664,2096&view_ws=400,300&view_vs=3&bdref=http%3A%2F%2Fwww.marunadanmalayali.com%2F&bdtop=true&bdifs=0&bstk=http%3A%2F%2Fwww.marunadanmalayali.com%2F&&id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', u'http://www.marunadanmalayali.com/js/mnmcombined2.min.js', u'http://pixel.yabidos.com/iftfl.js?ver=1.4.2&qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&adtg=96331&nci=&nai=&nsi=&cstm1=&cstm2=&cstm3=&kqt=&xc=&test=&od1=&od2=&co=0&tps=34&rnd=3m17uji8ftbf']
hello i have a script using Python and Selenium, and I don't understand why this can't retrieve the JS part of the website (the same script works perfectly fine on my other machine):
import chromedriver_binary
from bs4 import BeautifulSoup
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("window-size=1024,768")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--enable-javascript")
url = "https://deliveroo.co.uk/restaurants/london/holborn?geohash=gcpvj6kxet58&collection=pizza"
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
show_data = soup.find_all("script", id="__NEXT_DATA__")
mydata = json.loads( show_data[0].text )
I get the following error, meaning that it couldn't see this part of JSON:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Not too sure why this is working on my other machine and not on my current one.
The .text attribute doesn't work here. To get the right data, it worked for me to use encode_contents(), just changing the definition of mydata like this:
mydata = json.loads( show_data[0].encode_contents())
I am scraping a site with beautiful soup but all the content is hidden inside a script inside a js variable like this:
I can't seem to find any solution to this other than using selenium which in this case is not an option, I won't go into detail why but it just doesn't work. I can already scrape it by getting the insid eof the script tag and then using eval() on it but that introduces a few problems (unexpected indent, unwanted functions) I can use python, javascript and maybe C# if anything there helps.
Expected behaviour - whatever makes me get the info (the variable in the last line) into any of those 3 languages (preferably python).
The code (sorry for the formating but i cant since its so long, it isnt even the full variable, its huge):
barLoadGoogleFont('Open Sans'); barCssLoad('/global/pics/js/jquery/royalSlider/skins/universal/rs-universal.css?v=e449c4'); barCssLoad('/global/pics/css/material-icons.css?v=e6d856'); barCssLoad('/user/pics/css/user.css?v=eced9d');
barCssLoad('/user/pics/css/userIcons.css?v=6f9a03');
barCssLoad('/timeline/pics/css/timeline.css?v=8ec2ca'); barJsLibraryLoad('/global/pics/js/jquery/jquery.royalslider.min.js?v=515a43'); barJsLibraryLoad('/anketa/pics/js/utilsAnketa.js?v=9383d5'); barJsLibraryLoad('/znamky/pics/js/utilsZnamky.js?v=7afc9e'); barJsLibraryLoad('/exam/pics/js/utilsExam.js?v=033d55'); barJsLibraryLoad('/timeline/pics/js/utilsTimeline.js?v=29cf0e'); barJsLibraryLoad('/timeline/pics/js/timelineItemCreator.js?v=c37c99'); barJsLibraryLoad('/timeline/pics/js/timelineInputbox.js?v=2fde70'); barJsLibraryLoad('/timeline/pics/js/timelineViewer.js?v=f35e45');
barJsLibraryLoad('/user/pics/js/DailyPlan.js?v=e81fb9'); barJsLibraryLoad('/user/pics/js/userHomeEtest.js?v=6166f3');
$j(document).ready(function() { $j('#jwbcddd3da_md').userhome({"items":[{"timelineid":"2140963","timestamp":"2020-12-09 09:59:13","reakcia_na":"692638","typ":"h_clearplany","user":"Plan5077","target_user":null,"user_meno":"Kvarta aj2","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 09:59:13","cas_udalosti":null,"data":"null","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:13","posledna_reakcia_btc":null},{"timelineid":"2287814","timestamp":"2020-12-09 09:59:12","reakcia_na":"2290613","typ":"h_dailyplan","user":"Trieda8694210","target_user":null,"user_meno":"Kvarta A","ineid":"daily2020-12-09","text":"","cas_pridania":"2020-12-09 09:59:12","cas_udalosti":null,"data":"[]","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:12","posledna_reakcia_btc":null},{"timelineid":"1439827","timestamp":"2020-12-09 08:56:57","reakcia_na":null,"typ":"h_clearplany","user":"*","target_user":null,"user_meno":"Cel\u00e1 \u0161kola","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 08:56:57","cas_udalosti":null,"data":"null","vlastnik":"Ucitel16434","vlastnik_meno":"Ivor Dian","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":null,"removed":"0","cas_pridania_btc":"2020-12-09 08:56:57","posledna_reakcia_btc":null},{"timelineid":"2290324","timestamp":"2020-12-09 08:37:22","reakcia_na":null,"typ":"sprava","user":"CustPlan5075","target_user":null,"user_meno":"Kvarta A+Kvarta B - nj4 \u00b7 nemeck\u00fd jazyk","ineid":null,"text":"Ahojte, zajtra...
Ok, little tough to debug without actually working with it. But you'll need to pull out that json structure. You can do it with splits. So this is sort of a generic code.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
url = 'www.thesite.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
if '.userhome({' in script.text:
json_str = script.text
data = json_str.split('.userhome(')[-1]
loop=True
while loop == True:
try:
jsonData = json.loads(data)
loop = False
break
except:
data = data.rsplit(';',1)[0]
rows = []
for row in jsonData['items']:
rows.append(row)
table = pd.DataFrame(rows)
I'm learning how to build another scraper for another website, Reverb.com, after getting my scraper on another website to work properly. Reverb, however, has been more challenging to extract information from and the model with my old scraper isn't working the same. I did some research and using requests_html instead of requests seemed like the option most were using for Javascript like what Reverb.com has.
I'm essentially trying to scrape out text versions of the headline and price information and either paginate through the different pages or loop through a list of URLs to get all the content. I'm sort of there but hitting road blocks. Below are 2 versions of code I'm fiddling with.
The first version below prints out all of what looks like only 3 of many pages of content but it prints out all the instrument names and prices with the markup. In the CSV, however, all of those items are printed together on 3 rows only, not 1 item/price pair per row.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
#content scrape
b = soup.findAll("h4", class_="grid-card__title") #title
for i in b:
print(i)
p = soup.findAll("div", class_="grid-card__price") #price
for i in p:
print(i)
Conversely, this version prints out 3 lines only to a CSV but the name and price are stripped of all the markup. But it only happens when I changed the findAll to just find. I read that the for html in r.html was a way to loop through pages without having to make a list of urls.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
#make csv file
csv_file = open("rvscrape.csv", "w", newline='') #added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name","bass_price"])
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
for html in r.html:
#content scrape
bass_name = []
b = soup.find("h4", class_="grid-card__title").text.strip() #title
#for i in b:
# bass_name.append(i)
# for i in bass_name:
# print(i)
price = []
p = soup.find("div", class_="grid-card__price").text.strip() #price
#for i in p:
# print(i)
csv_writer.writerow([b, p])
In order to extract all the pages of search results, you need to extract the link of the next page and keep going until there is no next page available. We can do this using a while loop and checking the existence of the next anchor tag.
The following script performs the loop and also adds the results to the csv. It also prints the url of the page, so that we have an estimate of what page the program is on.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
# make csv file
# added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])
session = HTMLSession()
r = session.get(
"https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
stop = False
next_url = ""
while not stop:
print(next_url)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
titles = soup.findAll("h4", class_="grid-card__title") # titles
prices = soup.findAll("div", class_="grid-card__price") # prices
for i in range(len(titles)):
title = titles[i].text.strip()
price = prices[i].text.strip()
csv_writer.writerow([title, price])
next_link = soup.find("li", class_="pagination__page--next")
if not next_link:
stop = True
else:
next_url = next_link.find("a").get("href")
r = session.get("https://reverb.com/marketplace" + next_url)
r.html.render(sleep=5)
Such data output schema issues are highly common for target javascript websites. This can be also solved using dynamic scrapers.
Thanks to this script:
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io
url = "anUrl"
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
div = soup.find('div',id='content')
print div.prettify().encode(sys.stdout.encoding, 'ignore')
i've scraped some content that i want to print into another html page, trough javascript how can i handle the python output? Is it possible to print the content in the same way i've done in command line, with a browser page? I've got some encoding problems trying to do that.
If you are trying to write the div to an HTML file, then you basically do just that.
f = open('file.html', 'w')
f.write(div)
f.close()