Beautiful Soup complicated query question - JavaScript

I want to scrape the table on the following website:
https://www.hkab.org.hk/DisplayInterestSettlementRatesAction.do
However, it has a very complicated query. I tried the following code, but I can't find the table I want.
url = "https://www.hkab.org.hk/DisplayInterestSettlementRatesAction.do"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table',{'class':'etxtmed'})
The table that comes back is:
<table border="0" cellpadding="4" cellspacing="0" class="etxtmed" width="100%">
<tr>
<td height="30" valign="top">Home
</td>
<td align="right" class="etxtsml" valign="top">
</td>
</tr>
</table>
How can I get the values in the table? I can't find them in the response.
Some comments say the table is generated by JavaScript. Any suggestions for getting the table values with something other than BeautifulSoup?

I've tracked down where the data is loaded from and found the URL it comes from :).
import requests
from bs4 import BeautifulSoup
import csv

# This is the endpoint the page fetches the rates from.
r = requests.get(
    'https://www.hkab.org.hk/hibor/listRates.do?lang=en&Submit=Detail')
soup = BeautifulSoup(r.text, 'html.parser')

# Maturities sit in right-aligned cells, the rates in middle-aligned cells.
mat = []
hk = []
for item in soup.findAll('td', {'align': 'right'})[2:]:
    item = item.text.strip()
    mat.append(item)
for item in soup.findAll('td', {'align': 'middle'})[3:11]:
    item = item.text
    hk.append(item)

data = []
for item in zip(mat, hk):
    data.append(item)

with open('output.csv', 'w+', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Maturity', 'HKD Interest\nSettlement Rate'])
    writer.writerows(data)

print("Operation Completed")
Output: the maturities and their HKD interest settlement rates are written to output.csv.
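As a shorter alternative, you could let pandas parse that same endpoint directly, since it serves the rates as a plain HTML table. This is only a sketch, assuming pandas and lxml are installed and that the endpoint still answers a plain GET; check which table index actually holds the rates:

import pandas as pd

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html('https://www.hkab.org.hk/hibor/listRates.do?lang=en&Submit=Detail')
rates = tables[0]  # pick the table that holds Maturity / Settlement Rate
rates.to_csv('output.csv', index=False)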

Related

Pull Data from Spreadsheet but only Display the Last Four Digits Using Apps Script

I use Apps Script to pull data from Google Spreadsheets and then display the data on an HTML page in a table format.
I am trying to hide the data that is pulled from the spreadsheet and displayed in the table, but keep the last four digits of each value visible.
I tried the JS code below and it works fine on plain HTML pages, but it doesn't work when I apply it in Apps Script. I suspect this is because on the plain HTML page the table values are typed directly into the cells as text, whereas in Apps Script the table values have to be pulled from the spreadsheet first, before the masking rule in the JS code can be applied to hide the data and show only the last four digits.
Any help understanding how to apply the script below in Apps Script to hide the data but show the last four digits would be appreciated.
JS and HTML code below:
var id_number = document.getElementById('data2');
id_number.innerHTML = new Array(id_number.innerHTML.length - 3).join('x') +
    id_number.innerHTML.substr(id_number.innerHTML.length - 4, 4);
console.log(id_number.innerHTML);

var phone = document.getElementById('data3');
phone.innerHTML = new Array(phone.innerHTML.length - 3).join('x') +
    phone.innerHTML.substr(phone.innerHTML.length - 4, 4);
console.log(phone.innerHTML);
<table>
<tr>
<td width = "120">ID NUMBER</td>
<td width = "15">:</td>
<td id="data2" class="hidetext">1234567890</td>
</tr>
<tr>
<td>PHONE</td>
<td>:</td>
<td id="data3" class="hidetext">0000000000</td>
</tr>
</table>
My Code.gs code in Apps Script below:
function doGet(e) {
  return HtmlService.createTemplateFromFile("index").evaluate()
    .setTitle("HTML DATA PAGE")
    .addMetaTag('viewport', 'width=device-width, initial-scale=1')
    .setXFrameOptionsMode(HtmlService.XFrameOptionsMode.ALLOWALL);
}

let ss = SpreadsheetApp.getActiveSpreadsheet();

function getData(keyword) {
  let sheet = ss.getSheetByName('Data');
  let data = sheet.getRange(1, 1, sheet.getLastRow(), sheet.getLastColumn()).getDisplayValues();
  let index = data.map(d => d[2]).indexOf(keyword);
  console.log(data);
  console.log(data.map(d => d[2]));
  console.log(index);
  if (index > -1) {
    return data[index];
  } else {
    return undefined;
  }
}

let url = ScriptApp.getService().getUrl();
My Apps Script table code in Index.html:
<table>
<tr>
<td width = "120">ID NUMBER</td>
<td width = "15">:</td>
<td id="data2"></td>
</tr>
<tr>
<td>PHONE</td>
<td>:</td>
<td id="data3"></td>
</tr>
</table>

How to access 'parent' from BeautifulSoup using select or find?

import requests
from bs4 import BeautifulSoup
# raw = requests.get("https://www.daum.net")
# raw = requests.get("http://127.0.0.1:5000/emp")
response = requests.get("https://vip.mk.co.kr/newSt/rate/item_all.php?koskok=KOSPI&orderBy=upjong")
response.raise_for_status()
response.encoding= 'EUC-KR'
html = response.text
bs = BeautifulSoup(html, 'html.parser')
result = bs.select("tr .st2")
<tr>
<td width='92' class='st2'><a href="javascript:goPrice('000020&MSid=&msPortfolioID=')" title='000020'>somethinbg</a></td>
<td width='60' align='right'>15,100</td>
<td width='40' align='right'><span class='t_12_blue'>▼300</span></td>
</tr>
I want to get data from this page using BeautifulSoup. The code above is what I have so far, and the HTML above is an example of one row.
To build each row I need to go from the cell that has class 'st2' up to its parent node, and I'm finding that really hard to do.
How can I get the data from the parent <tr> of each td with class 'st2'?
You can access the parent element in BeautifulSoup via the element's parent attribute. Since you asked for a list of elements, this has to be done in an iteration.
I'm assuming here that you want to extract each row in the table.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://vip.mk.co.kr/newSt/rate/item_all.php?koskok=KOSPI&orderBy=upjong")
response.raise_for_status()
response.encoding = 'EUC-KR'
html = response.text
bs = BeautifulSoup(html, 'html.parser')

result = [[child.text for child in elem.parent.findChildren('td', recursive=False)]
          for elem in bs.select('tr .st2')]
The result is:
[
['동화약품', '15,100', '▼300'],
['유한양행', '61,100', '▼400'],
['유한양행우', '58,900', '▲300'],
...
]
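Equivalently, working from the same bs object as above, you can climb to the row explicitly with find_parent, which reads a little more plainly than going through .parent inside the comprehension:

rows = []
for cell in bs.select('tr .st2'):
    tr = cell.find_parent('tr')  # the enclosing table row
    rows.append([td.get_text(strip=True) for td in tr.find_all('td', recursive=False)])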

How to get the webpage using pyqt5?

I want to get a webpage using PyQt5.
The URL is https://land.3fang.com/LandAssessment/b6d8b2c8-bd4f-4bd4-9d22-ca49a7a2dc1f.html.
The webpage generates two values with JavaScript: enter 5 in the text box and press the red button, and two values in red are returned.
The code below is what I use to get the webpage, but I wait a long time and get no response.
What should I change in my code?
Thank you very much.
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEngineView
from bs4 import BeautifulSoup
import pandas as pd

class Render(QWebEngineView):
    def __init__(self, url):
        self.html = None
        self.first_pass = True
        self.app = QApplication(sys.argv)
        QWebEngineView.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, result):
        if self.first_pass:
            self._first_finished()
            self.first_pass = False
        else:
            self._second_finished()

    def _first_finished(self):
        self.page().runJavaScript('document.getElementById("txtDistance").value = "5";')
        self.page().runJavaScript("void(0)")
        self.page().runJavaScript("CheckUserWhere();")

    def _second_finished(self):
        self.page().toHtml(self.callable)

    def callable(self, data):
        self.html = data
        self.app.quit()

url = "https://land.3fang.com/LandAssessment/b6d8b2c8-bd4f-4bd4-9d22-ca49a7a2dc1f.html"
web = Render(url)
soup = BeautifulSoup(web.html, 'html.parser')
element = soup.find('div', {'id': "divResult"})
df = pd.read_html(str(element))
It seems that you have a couple of misconceptions:
When the JavaScript is executed, the page is not reloaded, so the _second_finished function will never be called.
If you do not want to show a window, it is better to use QWebEnginePage.
Considering the above, the HTML that is obtained is:
<div class="p8-5" id="divResult" style="display:block;">
<div align="center" display="block" id="rsloading" style="display: block;">
<img src="//img2.soufunimg.com/qyb/loading.gif"/>
正在为您加载数据...
</div>
<table border="0" cellpadding="0" cellspacing="0" class="tablebox01" display="none" id="tbResult" style="display: none;" width="600">
<tbody><tr>
<td style="width:260px;"><span class="gray8">建设用地面积:</span>14748平方米</td>
<td style="width:340px;"><span class="gray8">所在城市:</span>山西省 长治市 </td>
</tr>
<tr>
<td><span class="gray8">规划建筑面积:</span>51617平方米</td>
<td><span class="gray8">土地评估楼面价:</span><b class="redc00 font14" id="_bpgj">867.61</b> 元/平方米</td>
</tr>
<tr>
<td><span class="gray8">容积率:</span>大于1并且小于或等于3.5</td>
<td><span class="gray8">土地评估总价:</span><b class="redc00 font14" id="_bSumPrice">4478.34</b> 万元</td>
</tr>
<tr>
<td><span class="gray8">规划用途:</span>住宅用地</td>
<td><span class="gray8">推出楼面价:</span>27.51元/平方米</td>
</tr>
</tbody></table>
</div>
So the simplest thing to do is to filter by the IDs "_bpgj" and "_bSumPrice":
import sys
from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets
from bs4 import BeautifulSoup

class Render(QtWebEngineWidgets.QWebEnginePage):
    def __init__(self, url):
        self.html = ""
        self.first_pass = True
        self.app = QtWidgets.QApplication(sys.argv)
        super(Render, self).__init__()
        self.loadFinished.connect(self._load_finished)
        self.loadProgress.connect(print)
        self.load(QtCore.QUrl(url))
        self.app.exec_()

    def _load_finished(self, result):
        if result:
            self.call_js()

    def call_js(self):
        self.runJavaScript('document.getElementById("txtDistance").value = "5";')
        self.runJavaScript("void(0)")
        self.runJavaScript("CheckUserWhere();")
        self.toHtml(self.callable)

    def callable(self, data):
        self.html = data
        self.app.quit()

url = "https://land.3fang.com/LandAssessment/b6d8b2c8-bd4f-4bd4-9d22-ca49a7a2dc1f.html"
web = Render(url)
soup = BeautifulSoup(web.html, 'html.parser')
_bpgj = soup.find('b', {'id': "_bpgj"}).string
_bSumPrice = soup.find('b', {'id': "_bSumPrice"}).string
print(_bpgj, _bSumPrice)
Output:
867.61 4478.34
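If you want the whole result table rather than just those two values, you could still hand the rendered HTML to pandas, in the spirit of the question's original pd.read_html call. A minimal sketch, assuming pandas and lxml are installed and that web.html contains the filled-in tbResult table shown above:

import io
import pandas as pd

table = soup.find('table', {'id': 'tbResult'})  # the rendered results table
df = pd.read_html(io.StringIO(str(table)))[0]   # read_html returns a list of DataFrames
print(df)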

Reading HTML response in Vuejs to display it in a Dialog box

I am getting a response from the server via a REST request in HTML format. I have stored it in data: [] and when I print it to the console it looks like the HTML below. The reply is a string, and my problem now is to parse it in JavaScript into an array of objects.
<table border='1' frame = 'void'>
<tr>
<th>name</th>
<th>age</th>
<th>date of birth</th>
</tr>
<tr>
<td>John</td>
<td>30</td>
<td>10.09.1987</td>
</tr>
</table>
My question is: how can I show this HTML data in a dialog box using Vue.js?
I want these values as an array of objects, like this:
[
name,
age,
data of birth,
john,
30,
10.09.1987
]
This is not a Vue.js problem, but an HTML/JavaScript one. You can iterate over the cells' text content and convert it into an array as shown below.
var stringFromREST = "<table border='1' frame='void'><tr><th>name</th><th>age</th><th>date of birth</th></tr><tr><td>John</td><td>30</td><td>10.09.1987</td></tr></table>";
var tempDiv = document.createElement('div');
tempDiv.innerHTML = stringFromREST;
var cells = tempDiv.querySelectorAll('th,td');
var contentArray = [];
for (var i = 0; i < cells.length; i++) {
  contentArray.push(cells[i].innerText);
}
console.log(contentArray);

Parse HTML table without IDs or CSS selectors in Node.js

This data is from an old system and the output is as-is; we cannot add CSS selectors or IDs. Most of the examples online for Node.js parsing involve tables, rows, and cells with some ID or CSS class, and so far I haven't run into anything that can help parse the page below. This includes the examples for JSDOM (AFAIK).
What I would like is to extract each of the rows into [fileName, link, size, dateTime] tuples, on which I can then run some queries (for example, finding the latest timestamp in the group) and then extract the filename and link; I was thinking of using YQL. The alternating table row attributes are also making it a bit challenging. I'm new to Node.js, so some of the terminology might be wrong. Any help will be appreciated.
Thanks.
<html>
<body>
<table width="100%" cellspacing="0" cellpadding="5" align="center">
<tr>
<td align="left"><font size="+1"><strong>Filename</strong></font></td>
<td align="center"><font size="+1"><strong>Size</strong></font></td>
<td align="right"><font size="+1"><strong>Last Modified</strong></font></td>
</tr>
<tr>
<td align="left">
<tt>file1.csv</tt></td>
<td align="right"><tt>86.6 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
<tr bgcolor="#eeeeee">
<td align="left">
<tt>file2.csv</tt></td>
<td align="right"><tt>20.7 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
<tr>
<td align="left">
<tt>file1.xml</tt></td>
<td align="right"><tt>266.5 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
<tr bgcolor="#eeeeee">
<td align="left">
<tt>file2.xml</tt></td>
<td align="right"><tt>27.2 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
</table>
</body>
</html>
Answer (thanks #Enragedmrt):
res.on('data', function(data) {
  $ = cheerio.load(data.toString());
  var data = [];
  $('tr').each(function(i, tr) {
    var children = $(this).children();
    var fileItem = children.eq(0);
    var linkItem = children.eq(0).children().eq(0);
    var lastModifiedItem = children.eq(2);
    var row = {
      "Filename": fileItem.text().trim(),
      "Link": linkItem.attr("href"),
      "LastModified": lastModifiedItem.text().trim()
    };
    data.push(row);
    console.log(row);
  });
});
I would suggest using Cheerio over JSDOM as it's significantly faster and more lightweight. That said, you'll need a for-each loop grabbing the 'tr' elements and then their 'td' elements. Here's a rough example (my Node.js/Cheerio is rusty, but if you dig around in jQuery you can find some decent examples):
var data = [];
$('tr').each(function(i, tr) {
  var children = $(this).children();
  var row = {
    // use .eq() so we get Cheerio objects; indexing with [] returns raw DOM nodes without .text()
    "Filename": children.eq(0).text(),
    "Size": children.eq(1).text(),
    "Last Modified": children.eq(2).text()
  };
  data.push(row);
});
I don't know JSDom, but it sounds like it can parse an HTML document into a DOM (Document Object Model). From there it should be very possible to loop through the nodes and recognise them by tag name, attributes, or position in the document, even if they don't have IDs.
Googling for 5 seconds, please hold on...
JSDom's documentation on GitHub seems to confirm this. It shows jQuery-like selectors, like window.$("a.the-link").text(). So instead of adding a class, you can select with selectors like td, th, or probably even td[align="left"]. Using selectors like that, plus convenient methods like .first and .each to traverse multiple results (like every row), you should be able to parse the document just fine, although it will of course be a bit more cumbersome than having convenient class names for every different kind of cell.
I still don't think I'm a JSDom expert, but reading their project's main page for a couple of minutes already shows the answers to your questions, and much more.
JSFiddle
var rawData = new Array();
var rows = document.getElementsByTagName('tr');
for (var cnt = 1; cnt < rows.length; cnt++) {
  var cells = rows[cnt].getElementsByTagName('tt');
  var row = [];
  for (var count = 0; count < cells.length; count++) {
    row.push(cells[count].innerText.trim());
  }
  rawData.push(row);
}
console.log(rawData);
Additional way
var cheerio = require('cheerio'),
    cheerioTableparser = require('cheerio-tableparser');

res.on('data', function(data) {
  $ = cheerio.load(data.toString());
  cheerioTableparser($);
  var data = [];
  var array = $("table").parsetable(false, false, false);
  array[0].forEach(function(d, i) {
    var firstColumnHTMLCell = $("<div>" + array[0][i] + "</div>");
    var fileItem = firstColumnHTMLCell.text().trim();
    var linkItem = firstColumnHTMLCell.find("a").attr("href");
    var lastModifiedItem = $("<div>" + array[2][i] + "</div>").text();
    var row = {
      "Filename": fileItem,
      "Link": linkItem,
      "LastModified": lastModifiedItem
    };
    data.push(row);
    console.log(row);
  });
});
