How to scrape a JavaScript element from HTML source code?

How can I scrape the JavaScript element from http://www.asfinag.at/home? I think the JavaScript data starts at items: in the source code of the page.
My current code looks like this:
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.asfinag.at/home'
req = urllib.request.Request(url)
opener = urllib.request.build_opener()
f = opener.open(req)
soup = BeautifulSoup(f, 'html.parser')  # specify a parser to avoid bs4's "no parser" warning
tmp = soup.find_all(id="at-asfinag-pvis-pages-map-LayerPage-0")
print(tmp)
I don't know how to proceed. I basically need to scrape fields such as fahrtrichtung and meldequelle.
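If the data really is embedded in an inline script as an items: literal (rather than fetched by a separate background request, which is worth checking in the browser's network tab), one approach is to pull the script text out with BeautifulSoup and carve the value free with a regular expression. The sketch below is unverified against the live page: the regex is rough, and it assumes the embedded value is strict JSON with quoted keys; if it is a plain JavaScript object literal, a more tolerant parser would be needed.
import json
import re
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.asfinag.at/home'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script'):
    text = script.string or ''  # script.string is None for src-only tags
    match = re.search(r'items:\s*(\[.*\])', text, re.DOTALL)  # rough, greedy grab of the array
    if match:
        items = json.loads(match.group(1))
        for item in items:
            # fahrtrichtung / meldequelle are the fields the question asks for;
            # whether they sit at this level of the structure is an assumption
            print(item.get('fahrtrichtung'), item.get('meldequelle'))
        break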

Related

Selenium doesn't seem to load the JavaScript part of the website

Hello, I have a script using Python and Selenium, and I don't understand why it can't retrieve the JS part of the website (the same script works perfectly fine on my other machine):
import json
import chromedriver_binary  # side effect: puts the bundled chromedriver on PATH
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("window-size=1024,768")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--enable-javascript")
url = "https://deliveroo.co.uk/restaurants/london/holborn?geohash=gcpvj6kxet58&collection=pizza"
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
show_data = soup.find_all("script", id="__NEXT_DATA__")
mydata = json.loads(show_data[0].text)
I get the following error, meaning that it couldn't see this part of the JSON:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Not too sure why this is working on my other machine and not on my current one.
The .text attribute doesn't work here. To get the right data, using encode_contents() worked for me, just changing the definition of mydata like this:
mydata = json.loads(show_data[0].encode_contents())
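On Python 3.6+, json.loads() accepts the bytes that encode_contents() returns directly. A slightly more defensive variant (a sketch, not verified against this site) fails loudly when the page did not render far enough for the tag to exist:
show_data = soup.find("script", id="__NEXT_DATA__")
if show_data is None:
    raise RuntimeError("__NEXT_DATA__ not found; the page may not have rendered")
mydata = json.loads(show_data.encode_contents())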

Scrape a JavaScript variable from a webpage

I am scraping a site with Beautiful Soup, but all the content is hidden inside a script, in a JS variable, like this:
I can't seem to find any solution to this other than using Selenium, which in this case is not an option; I won't go into detail why, but it just doesn't work. I can already scrape it by getting the inside of the script tag and then using eval() on it, but that introduces a few problems (unexpected indent, unwanted functions). I can use Python, JavaScript, and maybe C# if anything there helps.
Expected behaviour: whatever gets me the info (the variable in the last line) into any of those 3 languages (preferably Python).
The code (sorry for the formatting, but I can't help it since it's so long; it isn't even the full variable, it's huge):
barLoadGoogleFont('Open Sans'); barCssLoad('/global/pics/js/jquery/royalSlider/skins/universal/rs-universal.css?v=e449c4'); barCssLoad('/global/pics/css/material-icons.css?v=e6d856'); barCssLoad('/user/pics/css/user.css?v=eced9d');
barCssLoad('/user/pics/css/userIcons.css?v=6f9a03');
barCssLoad('/timeline/pics/css/timeline.css?v=8ec2ca'); barJsLibraryLoad('/global/pics/js/jquery/jquery.royalslider.min.js?v=515a43'); barJsLibraryLoad('/anketa/pics/js/utilsAnketa.js?v=9383d5'); barJsLibraryLoad('/znamky/pics/js/utilsZnamky.js?v=7afc9e'); barJsLibraryLoad('/exam/pics/js/utilsExam.js?v=033d55'); barJsLibraryLoad('/timeline/pics/js/utilsTimeline.js?v=29cf0e'); barJsLibraryLoad('/timeline/pics/js/timelineItemCreator.js?v=c37c99'); barJsLibraryLoad('/timeline/pics/js/timelineInputbox.js?v=2fde70'); barJsLibraryLoad('/timeline/pics/js/timelineViewer.js?v=f35e45');
barJsLibraryLoad('/user/pics/js/DailyPlan.js?v=e81fb9'); barJsLibraryLoad('/user/pics/js/userHomeEtest.js?v=6166f3');
$j(document).ready(function() { $j('#jwbcddd3da_md').userhome({"items":[{"timelineid":"2140963","timestamp":"2020-12-09 09:59:13","reakcia_na":"692638","typ":"h_clearplany","user":"Plan5077","target_user":null,"user_meno":"Kvarta aj2","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 09:59:13","cas_udalosti":null,"data":"null","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:13","posledna_reakcia_btc":null},{"timelineid":"2287814","timestamp":"2020-12-09 09:59:12","reakcia_na":"2290613","typ":"h_dailyplan","user":"Trieda8694210","target_user":null,"user_meno":"Kvarta A","ineid":"daily2020-12-09","text":"","cas_pridania":"2020-12-09 09:59:12","cas_udalosti":null,"data":"[]","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:12","posledna_reakcia_btc":null},{"timelineid":"1439827","timestamp":"2020-12-09 08:56:57","reakcia_na":null,"typ":"h_clearplany","user":"*","target_user":null,"user_meno":"Cel\u00e1 \u0161kola","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 08:56:57","cas_udalosti":null,"data":"null","vlastnik":"Ucitel16434","vlastnik_meno":"Ivor Dian","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":null,"removed":"0","cas_pridania_btc":"2020-12-09 08:56:57","posledna_reakcia_btc":null},{"timelineid":"2290324","timestamp":"2020-12-09 08:37:22","reakcia_na":null,"typ":"sprava","user":"CustPlan5075","target_user":null,"user_meno":"Kvarta A+Kvarta B - nj4 \u00b7 nemeck\u00fd jazyk","ineid":null,"text":"Ahojte, zajtra...
OK, it's a little tough to debug without actually working with it, but you'll need to pull out that JSON structure. You can do it with splits, so this is sort of generic code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
url = 'https://www.thesite.com'  # placeholder for the real page URL
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
json_str = None
for script in scripts:
    if '.userhome({' in script.text:
        json_str = script.text
data = json_str.split('.userhome(')[-1]
# trim trailing JavaScript off the string until what is left parses as JSON
while True:
    try:
        jsonData = json.loads(data)
        break
    except ValueError:
        trimmed = data.rsplit(';', 1)[0]
        if trimmed == data:  # nothing left to trim, so give up
            raise
        data = trimmed
rows = []
for row in jsonData['items']:
    rows.append(row)
table = pd.DataFrame(rows)
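As a quick usage check, a few of the columns visible in the blob above can be printed (column names taken from the JSON shown in the question):
print(table[['timelineid', 'typ', 'user_meno', 'cas_pridania']].head())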

Scraping a JavaScript Website With BeautifulSoup 4 & requests_html

I'm learning how to build a scraper for another website, Reverb.com, after getting my scraper for a previous site to work properly. Reverb, however, has been more challenging to extract information from, and the approach from my old scraper isn't working the same way. I did some research, and using requests_html instead of requests seemed like the option most people were using for JavaScript pages like Reverb.com.
I'm essentially trying to scrape out text versions of the headline and price information and either paginate through the different pages or loop through a list of URLs to get all the content. I'm sort of there but hitting roadblocks. Below are 2 versions of the code I'm fiddling with.
The first version below prints what looks like only 3 of many pages of content, but it does print all the instrument names and prices, with the markup. In the CSV, however, all of those items land together on only 3 rows, not 1 item/price pair per row.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
# content scrape
b = soup.findAll("h4", class_="grid-card__title")  # title
for i in b:
    print(i)
p = soup.findAll("div", class_="grid-card__price")  # price
for i in p:
    print(i)
Conversely, this version prints only 3 lines to the CSV, but the name and price are stripped of all the markup. That only happened when I changed the findAll to just find. I read that for html in r.html was a way to loop through pages without having to make a list of URLs.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
# make csv file
csv_file = open("rvscrape.csv", "w", newline='')  # added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
for html in r.html:
    # content scrape
    bass_name = []
    b = soup.find("h4", class_="grid-card__title").text.strip()  # title
    # for i in b:
    #     bass_name.append(i)
    # for i in bass_name:
    #     print(i)
    price = []
    p = soup.find("div", class_="grid-card__price").text.strip()  # price
    # for i in p:
    #     print(i)
    csv_writer.writerow([b, p])
In order to extract all the pages of search results, you need to extract the link of the next page and keep going until there is no next page available. We can do this using a while loop and checking the existence of the next anchor tag.
The following script performs the loop and also adds the results to the csv. It also prints the url of the page, so that we have an estimate of what page the program is on.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
# make csv file
# added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])
session = HTMLSession()
r = session.get(
    "https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
stop = False
next_url = ""
while not stop:
    print(next_url)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    titles = soup.findAll("h4", class_="grid-card__title")  # titles
    prices = soup.findAll("div", class_="grid-card__price")  # prices
    for i in range(len(titles)):
        title = titles[i].text.strip()
        price = prices[i].text.strip()
        csv_writer.writerow([title, price])
    next_link = soup.find("li", class_="pagination__page--next")
    if not next_link:
        stop = True
    else:
        next_url = next_link.find("a").get("href")
        r = session.get("https://reverb.com/marketplace" + next_url)
        r.html.render(sleep=5)
csv_file.close()  # flush everything to disk once the last page is done
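A small variation on the inner loop, if you prefer iterating the pairs directly (same behavior, assuming titles and prices stay aligned one-to-one on the page):
for title_tag, price_tag in zip(titles, prices):
    csv_writer.writerow([title_tag.text.strip(), price_tag.text.strip()])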
Output-schema issues like this are very common when the target is a JavaScript-heavy website; they can also be solved with dynamic scrapers that render the page before extracting.

How to display Python output in the browser through JavaScript?

Thanks to this script:
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io
url = "anUrl"
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
div = soup.find('div',id='content')
print div.prettify().encode(sys.stdout.encoding, 'ignore')
I've scraped some content that I want to print into another HTML page through JavaScript. How can I handle the Python output? Is it possible to print the content in a browser page the same way I've done on the command line? I've got some encoding problems trying to do that.
If you are trying to write the div to an HTML file, then you basically do just that.
f = open('file.html', 'w')
f.write(str(div))  # convert the Tag to its markup text before writing
f.close()
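Since the question mentions encoding problems, a safer sketch (Python 3) writes with an explicit UTF-8 encoding and a matching charset declaration, so the browser decodes the file the same way it was written:
with open('file.html', 'w', encoding='utf-8') as f:
    f.write('<meta charset="utf-8">\n')
    f.write(div.prettify())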

Search for a string in JavaScript using Python

Following my previous question:
how to fetch javascript contents in python
I tried to make another script that fetches the data from the page's JavaScript after getting the webpage contents, of course. But it's just not showing the content I want. I want to find "content_id" in the JavaScript of the page. This is the page: http://www.hulu.com/watch/815743
Here's what I have right now.
import re
import requests
from bs4 import BeautifulSoup
import os
import fileinput
Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text)
#print soup
subtitles = soup.findAll('script',{'type':'text/javascript'})
pattern = re.compile(r'"content_id":"(.*?)"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print pattern.search(script.text).group(1)
I get this error:
AttributeError: 'NoneType' object has no attribute 'text'
Any idea how to solve this issue?
There are two problems in your regular expression pattern:
the quotes are escaped with backslashes in the script contents (the JSON is embedded inside a string literal, so the source literally reads \"content_id\"); take that into account
there is whitespace after the colon
Here is the fixed version:
pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)
Works for me, getting 60585710 as a result.
FYI, here is the complete code that I'm executing:
import re
import requests
from bs4 import BeautifulSoup
Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text)
pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print pattern.search(script.text).group(1)
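Since the pattern is matched against raw script text anyway, a shortcut sketch (assuming content_id appears nowhere else in the page) is to search the whole response body and skip the soup lookup:
match = pattern.search(q.text)
if match:
    print(match.group(1))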
