Values Hidden From Source Code - Web Scraping with Python/JavaScript

I'm trying to web-scrape a website and when I go through the source code, what I'm looking for is not there. Website is http://www.providentmetals.com/2016-1-oz-canadian-silver-cougar.html and what I'm looking for is the price in the table in the top-right. It says "1+" followed by prices. It's around $18.04 right now.
When I "inspect element" with web developer tools, I can see the price though.
Using BeautifulSoup I've tried to get the value, but it doesn't show up. Here's roughly the code; it doesn't return a value at all:
import requests, bs4
url = 'http://www.providentmetals.com/2016-1-oz-canadian-silver-cougar.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'lxml')
# table class found with the "inspect element" web dev tool
elems = soup.find_all('table', {'class': 'table table-striped border-light pricing data-table'})
Questions: How do I find the hidden data? Do you know of any ways to use bs4/requests to find it? I'm not great at coding and web scraping, so any help would be appreciated.

The values are not hidden, they are obtained with an Ajax request and then inserted into the page's DOM. That's why you see them in your browser, but not in the page's HTML.
You can directly access the Ajax request that obtains the data you require. The response is in JSON format, so it's really easy to use. You need to know the SKU which, for your example, is BBFS-04253.
The URL is:
http://www.providentmetals.com/services/products.php?type=product&sku=BBFS-04253
Using the requests module:
import requests
url = 'http://www.providentmetals.com/services/products.php'
params = {'type': 'product', 'sku': 'BBFS-04253'}
response = requests.get(url, params)
data = response.json()
>>> from pprint import pprint
>>> pprint(data)
[{u'as_low_as': {u'crypto_price': u'$18.22',
                 u'list_price': u'$18.78',
                 u'price': u'$18.03',
                 u'qty': 1,
                 u'to_tier': u' + '},
  u'crypto_price': u'$18.22',
  u'crypto_special_price': u'$0.00',
  u'id': u'6034',
  u'inStock': None,
  u'list_price': u'$18.78',
  u'list_special_price': u'$0.00',
  u'name': u'2016 1 oz Canadian Silver Cougar | Predator Series',
  u'price': u'$18.03',
  u'sell_to_us': u'$16.69',
  u'sku': u'BBFS-04253',
  u'special_price': None,
  u'status_allows_price': True,
  u'stock_status_code': u'pre-sale',
  u'tier_price': [{u'crypto_price': u'$18.22',
                   u'list_price': u'$18.78',
                   u'price': u'$18.03',
                   u'qty': 1,
                   u'to_tier': u' + '}]}]
Note that the response is a list, so you access the price and other details through the first element:
>>> data[0]['price']
u'$18.03'
>>> data[0]['name']
u'2016 1 oz Canadian Silver Cougar | Predator Series'
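Note that the price fields are strings like u'$18.03'. To work with them numerically, strip the currency formatting first; a small helper (assuming all prices are dollar-formatted like the output above):

```python
def to_float(price_str):
    # Drop the dollar sign and any thousands separators, e.g. '$1,234.56'.
    return float(price_str.replace('$', '').replace(',', ''))

print(to_float('$18.03'))  # -> 18.03
```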

Related

Webscraping Blockchain data seemingly embedded in Javascript through Python, is this even the right approach?

I'm referencing this url: https://tracker.icon.foundation/block/29562412
If you scroll down to "Transactions", it shows 2 transactions with separate links, that's essentially what I'm trying to grab. If I try a simple pd.read_csv(url) command, it clearly omits the data I'm looking for, so I thought it might be JavaScript based and tried the following code instead:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://tracker.icon.foundation/block/29562412')
r.html.links
r.html.absolute_links
and I get the result "set()"
even though I was expecting the following:
['https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632', 'https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e']
Is JavaScript even the right approach? I tried BeautifulSoup as well, but had no luck there either.
You're right. This page is populated asynchronously using JavaScript, so BeautifulSoup and similar tools won't be able to see the specific content you're trying to scrape.
However, if you log your browser's network traffic, you can see some (XHR) HTTP GET requests being made to a REST API, which serves its results in JSON. This JSON happens to contain the information you're looking for. It actually makes several such requests to various API endpoints, but the one we're interested in is called txList (short for "transaction list" I'm guessing):
def main():
    import requests

    url = "https://tracker.icon.foundation/v3/block/txList"
    params = {
        "height": "29562412",
        "page": "1",
        "count": "10"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    base_url = "https://tracker.icon.foundation/transaction/"
    for transaction in response.json()["data"]:
        print(base_url + transaction["txHash"])

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632
https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e

Google Finance API not working since 6 September 2017

I was using the Google Finance API to get stock quotes and display them on my site. All of a sudden, from 6 September 2017, this stopped working. The URL I used to get the stock quotes is https://finance.google.com/finance/info?client=ig&q=SYMBOL&callback=?.
Previously I was using the Yahoo Finance API, but it was inconsistent, so I switched over to the Google Finance API.
Could you please help me with this?
Thanks,
Ram
This URL works. I think just the host changed from www.google.com to finance.google.com:
https://finance.google.com/finance/getprices?q=ACC&x=NSE&p=15&i=300&f=d,c,o,h,l,v
In the end I went back to Yahoo Finance. The data is not live; there is a 20-minute delay. I thought this would be helpful to people who are facing the same issue.
The Yahoo API URL is https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%3D%22MSFT%22&env=store://datatables.org/alltableswithkeys
This will return the stock data in XML format. You can parse the XML to get your desired fields.
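To get at the fields, the XML can be parsed with the standard library. The element names below (quote, Symbol, LastTradePriceOnly) are assumptions based on the old YQL yahoo.finance.quotes schema, so verify them against an actual response:

```python
import xml.etree.ElementTree as ET

def parse_yql_quote(xml_text):
    # Pull every child element of the first <quote> into a dict.
    quote = ET.fromstring(xml_text).find(".//quote")
    if quote is None:
        return {}
    return {child.tag: child.text for child in quote}

# Hypothetical sample shaped like a YQL response, for illustration:
sample = """<query><results><quote>
<Symbol>MSFT</Symbol>
<LastTradePriceOnly>74.26</LastTradePriceOnly>
</quote></results></query>"""

fields = parse_yql_quote(sample)
print(fields["Symbol"], fields["LastTradePriceOnly"])
```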
Thanks,
Ram
We had the same issue and found the alternative API below, from Microsoft Bing, for stock markets. It returns the stock data in JSON format.
https://finance.services.appex.bing.com/Market.svc/ChartAndQuotes?symbols=139.1.500209.BOM&chartType=1d&isETF=false&iseod=False&lang=en-IN&isCS=false&isVol=true
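In Python, a call to that endpoint might look like the sketch below. Note that this Bing endpoint is undocumented; the parameter names are copied from the example URL above and their meanings are inferred, not taken from any official reference:

```python
BING_URL = "https://finance.services.appex.bing.com/Market.svc/ChartAndQuotes"

def chart_and_quotes_params(symbol):
    # Parameters copied from the example URL; meanings are inferred.
    return {
        "symbols": symbol,   # e.g. "139.1.500209.BOM"
        "chartType": "1d",
        "isETF": "false",
        "iseod": "False",
        "lang": "en-IN",
        "isCS": "false",
        "isVol": "true",
    }

# Usage (live network call):
# import requests
# data = requests.get(BING_URL, params=chart_and_quotes_params("139.1.500209.BOM")).json()
```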
Thanks, Shyamal
I was desperately looking for a thread like this yesterday when I hit the issue!
As Salketer said, the Google Finance API was officially "closed" in 2012. However, for some reason it kept working until September 5, 2017. I built a program to manage my portfolio that uses the GF API to get live quotes for US stocks. It stopped working on September 6, 2017, so I assume the engineers who had quietly kept the API alive have now actually stopped the service.
I found an alternative, https://www.alphavantage.co/documentation/ , and it seems like the best option for free live US equity quotes. They just require your email, nothing else. It's a bit slow because it doesn't have multi-symbol queries yet, but beggars can't be choosers.
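A minimal sketch of such a query; the GLOBAL_QUOTE function name and the "Global Quote" / "05. price" response fields are taken from Alpha Vantage's current documentation, so double-check them against the docs before relying on them:

```python
BASE_URL = "https://www.alphavantage.co/query"

def quote_params(symbol, api_key="demo"):
    # GLOBAL_QUOTE is Alpha Vantage's documented function for a
    # latest-price snapshot; swap in your own free API key.
    return {"function": "GLOBAL_QUOTE", "symbol": symbol, "apikey": api_key}

# Usage (live network call); the JSON nests fields under "Global Quote":
# import requests
# data = requests.get(BASE_URL, params=quote_params("MSFT")).json()
# print(data.get("Global Quote", {}).get("05. price"))
```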
I had to switch to Google Finance after using Yahoo Finance for a long time, after Verizon bought Yahoo this May and ended the free API service. I went back and re-researched this issue, and someone has created a new Yahoo Finance API call that works with the new Yahoo API: https://stackoverflow.com/a/44092983/8316350
The Python source and installer can be found here: https://github.com/c0redumb/yahoo_quote_download
The arguments are (ticker, start_date, end_date), where dates are in yyyymmdd format, and it returns a list of unicode strings. The following test downloads a couple of weeks of data and then extracts only the adjusted close prices into a list called adj_close:
from yahoo_quote_download import yqd

quote = yqd.load_yahoo_quote('AAPL', '20170515', '20170530')
print(quote[0])  # print the column headers
print(quote[1])  # print a row of data
print(quote[2])  # just to make sure it looks right
quote.pop()  # drop the blank string at the end of the data
quote = [row.split(',') for row in quote]  # split each CSV row into a list of fields
adj_close = [row[5] for row in quote]  # keep only the 'Adj Close' column
print(adj_close)
Returns:
Date,Open,High,Low,Close,Adj Close,Volume
2017-05-15,156.009995,156.649994,155.050003,155.699997,155.090958,26009700
2017-05-16,155.940002,156.059998,154.720001,155.470001,154.861862,20048500
['Adj Close', '155.090958', '154.861862', '149.662277', '151.943314', '152.461288', '153.387650', '153.198395', '152.740189', '153.268112', '153.009140', '153.068893']
I was manually scraping the Google Finance page for each stock before I found the ?info link. As that is not working anymore, I am going back to the webpage.
Here is my Python snippet:
import re
import urllib2
from bs4 import BeautifulSoup

def get_market_price(symbol):
    print "Getting market price: " + symbol
    base_url = 'http://finance.google.com/finance?q='
    retries = 2
    while True:
        try:
            response = urllib2.urlopen(base_url + symbol)
            html = response.read()
        except Exception, msg:
            if retries > 0:
                retries -= 1
                continue  # retry the request instead of parsing stale data
            else:
                raise Exception("Error getting market price!")
        soup = BeautifulSoup(html, 'lxml')
        try:
            price_change = soup.find("div", {"class": "id-price-change"})
            price_change = price_change.find("span").find_all("span")
            price_change = [x.string for x in price_change]
            price = soup.find_all("span", id=re.compile('^ref_.*_l$'))[0].string
            price = str(unicode(price).encode('ascii', 'ignore')).strip().replace(",", "")
            return (price, price_change)
        except Exception as e:
            if retries > 0:
                retries -= 1
            else:
                raise Exception("Can't get current rate for scrip: " + symbol)
Example:
Getting market price: NSE:CIPLA
('558.55', [u'+4.70', u'(0.85%)'])
You can simply parse the result of this request:
https://finance.google.com/finance/getprices?q=GOOG&x=NASD&p=1d&i=60&f=d,c,o,h,l,v
(GOOG at NASDAQ, one day, 60-second intervals; fields DATE,CLOSE,HIGH,LOW,OPEN,VOLUME)
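The getprices response isn't plain CSV: a handful of KEY=VALUE header lines come first, the DATE column holds an absolute Unix timestamp on rows prefixed with 'a', and subsequent rows give an offset measured in intervals. A parser sketch is below; the format details are reconstructed from observed responses, so treat them as assumptions:

```python
def parse_getprices(text):
    # Rows are DATE,CLOSE,HIGH,LOW,OPEN,VOLUME. A leading 'a' marks an
    # absolute Unix timestamp; bare numbers are offsets in intervals.
    interval, base_ts, rows = 60, 0, []
    for line in text.splitlines():
        if line.startswith("INTERVAL="):
            interval = int(line.split("=", 1)[1])
        elif line[:1].isdigit() or line.startswith("a"):
            date_field, close, high, low, open_, volume = line.split(",")
            if date_field.startswith("a"):
                base_ts = int(date_field[1:])
                ts = base_ts
            else:
                ts = base_ts + int(date_field) * interval
            rows.append((ts, float(close), float(high), float(low),
                         float(open_), int(volume)))
    return rows

# Hypothetical response for illustration:
sample = """EXCHANGE%3DNASD
INTERVAL=60
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
a1504795500,940.13,941.00,939.50,940.00,12345
1,940.50,941.20,940.10,940.20,6789"""

for row in parse_getprices(sample):
    print(row)
```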
I had the same problem in PHP.
I replaced the URL https://www.google.com/finance/converter?a=$amount&from=$from_Currency&to=$to_Currency
with
https://finance.google.com/finance/converter?a=1&from=$from_Currency&to=$to_Currency
and it works fine for me.

Python: Retrieve post parameter from javascript button

I am programming in python a script to obtain statistical data of the public schools of the city in which I live. With the following code I get the source code of a page that shows, by pages, the first 100 results of a total of 247 schools:
import requests
url = "http://www.madrid.org/wpad_pub/run/j/BusquedaAvanzada.icm"
post_call = {'Public title': 'S', 'cdMuni': '079', 'cdNivelEdu': '6545'}
r = requests.post(url, data = post_call)
The page can be viewed here.
On that page there is a button that activates a javascript function to download a csv file with all 247 results. I was thinking of using Selenium to download this file, but I have seen, using Tamper Data, that when the button is pressed a POST call occurs, in which the parameter codCentrosExp is sent with the codes of the 247 colleges. The parameter looks like this:
CodCentrosExp = 28077877%3B28077865%3B28063751%3B28018392%3B28018393%...(thus up to the 247 codes)
This makes my work easier, since I do not have to download the csv file, open it, select the code column, etc. And I could do it with Tamper Data, but my question is: how can I get that parameter with my Python script, without having to use Tamper Data?
I finally found the parameter in the page's source code, and extracted it as follows:
import re
from bs4 import BeautifulSoup

schools = BeautifulSoup(r.content, "lxml")
school_codes = schools.find_all(attrs={"name": "codCentrosExp", "value": re.compile("^.+$")})[0]["value"]
school_codes = school_codes.split(";")
Anyway, if anyone knows how to answer the original question, I would be grateful to hear how it could be done.

Python Flask data feed from Pandas Dataframe, dynamically define with unique endpoint

Hi, I am building a web app with Flask (Python). I've got a problem here:
@app.route('/analytics/signals/<ticker_url>')
def analytics_signals_com_page(ticker_url):
    all_ticker = full_list
    ticker_name = com_name
    ticker = ticker_url.upper()
    pricerec = sp500[ticker_url.upper()].tolist()
    timerec = sp500[ticker_url.upper()].index.tolist()
    return render_template('company.html', all_ticker=all_ticker, ticker_name=ticker_name, ticker=ticker, pricerec=pricerec, timerec=timerec)
Here I am defining company pages based on the ticker in the URL; each page will contain different content. The problem is that everything is fine up to ticker = ticker_url.upper(); that works perfectly. But pricerec and timerec cause problems.
sp500 is a pandas DataFrame whose columns are companies like "AAPL", "GOOG", "MSFT", and so forth (505 companies), the index is timestamps, and the values are the prices at each time.
So for pricerec I take the ticker_url, use it to select that company's prices, and turn them into a list; timerec is the index (the timestamps) as a list. I pass these two variables into the company.html page.
But it causes an internal server error, and I do not know why.
My expectation was that when a user clicks a button linking to "~/analytics/signals/aapl", the company.html page would receive pricerec and timerec so I can draw a graph. But it didn't work like that; it raises an internal server error. I defined those two variables in the JavaScript as well, like I did for the other variables (all_ticker, ticker_name, and ticker).
Can anyone help me with this issue?
Thanks!
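The data-extraction step described above can be sketched in isolation. In this sketch the frame contents, the unknown-ticker check, and the timestamp-to-string conversion are illustrative additions; an unhandled KeyError (unknown ticker) or non-serializable pandas Timestamps are common causes of a 500 error in a route like this:

```python
import pandas as pd

# Toy stand-in for the sp500 frame described in the question: columns are
# tickers, the index is timestamps, values are prices (data is made up).
sp500 = pd.DataFrame(
    {"AAPL": [153.0, 154.2], "GOOG": [940.1, 941.3]},
    index=pd.to_datetime(["2017-09-05", "2017-09-06"]),
)

def series_for(ticker_url):
    ticker = ticker_url.upper()
    if ticker not in sp500.columns:
        # An unknown ticker would otherwise raise a KeyError, which Flask
        # surfaces as a 500 internal server error.
        return None, None
    pricerec = sp500[ticker].tolist()
    # pandas Timestamps are awkward to serialize for JavaScript; hand the
    # template plain strings instead.
    timerec = [ts.isoformat() for ts in sp500.index]
    return pricerec, timerec

prices, times = series_for("aapl")
print(prices)
print(times)
```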

Dumbed down Powershell web client function to let me post form data easily

I've been using an Internet Explorer automation script found here:
http://www.pvle.be/2009/06/web-ui-automationtest-using-powershell/
That lets me easily post form data using commands (functions) like this:
NavigateTo "http://www.websiteURI/"
SetElementValueByName "q" "powershell variable scope"
SetElementValueByName "num" "30"
SetElementValueByName "lr" "lang_en"
ClickElementById "sb_form_go"
The above would let me post values to elements and click to submit the form.
I would like to do the equivalent with PowerShell's web client using helper functions. I haven't found such a script. The closest I could find was The Scripting Guys' Send-WebRequest:
http://gallery.technet.microsoft.com/scriptcenter/7e7b6bf2-d067-48c3-96b3-b38f26a1d143
which I'm not even sure does what I expect (since there are no working examples showing how to do what I want).
Anyway, I'd really appreciate some help to get me started to do the equivalent of what I showed up there with working examples (as simple as possible). A bonus would be to also be able to get a list of element names for a URI in order to know what form elements I want to submit.
PS: I also need to be able to specify user-agent and credentials; so, examples with these included would be ideal.
Have you taken a look at the Invoke-WebRequest command? (It requires PowerShell 3.0 or above.) I believe the following would work for submitting the data:
# POSTing data
Invoke-WebRequest http://www.websiteURI/ `
    -UserAgent 'My User Agent' `
    -Credential $cred `
    -Method Post `
    -Body @{
        q = 'powershell variable scope'
        num = 30
        lr = 'lang_en'
    }
For your bonus, the result of Invoke-WebRequest contains a collection of the InputFields on the page, which you can use to get a list of form elements to set.
#List input elements
Invoke-WebRequest http://www.websiteURI/ | select -ExpandProperty InputFields
