Information inside HTML node not visible - javascript

I am trying to grab a phone number from a node on a website. For some reason, when I inspect the node in Chrome, the actual number inside the element is not visible. Here is the website I am attempting to grab the number from: https://tempophone.com/ . Am I inspecting the wrong element, or is it just not possible to grab the phone number from the website by accessing the node? Here is my code; I am using HtmlAgilityPack:
string url = "https://tempophone.com/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var phoneNumber = doc.DocumentNode.SelectNodes("//*[@id=\"temporary - phone\"]")[0].InnerText;
if (phoneNumber != null)
    Console.WriteLine(phoneNumber);
else
    Console.WriteLine("null");
Here is a screenshot of the inspected element; as you can see, there is no phone number there:

Firstly, there is no text inside that node.
Second, what you want is this:
string s = doc.DocumentNode.SelectNodes("//*[@id='temporary-phone']")[0].GetAttributeValue("value", "false");
Third: this will always return "Loading...", because the 'value' attribute in that node is updated by JavaScript at runtime. When you use HtmlWeb or HttpWebRequest you will ALWAYS get the original source of the page. If you want dynamic content to be loaded into your HtmlDocument, you will need to use WebBrowser or Selenium with a WebDriver.
How-to with Selenium and FirefoxDriver
var driver = new FirefoxDriver();
driver.Navigate().GoToUrl("https://tempophone.com/");
Thread.Sleep(2000); // crude wait for the page's scripts to run; prefer an explicit wait
driver.FindElement(By.XPath("//button[@id='check-phone']")).Click();
string s = driver.FindElement(By.XPath("//h1[@id='phone-number']")).GetAttribute("data-phone-number");
Console.WriteLine("Here: " + s);
Or you could just call their API
https://tempophone.com/api/v1/phones/random
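For illustration, here is a minimal Python sketch of calling that endpoint. The response shape is an assumption here (a JSON object with a "phone" key); inspect the actual payload before relying on any key name:

```python
import json
from urllib.request import urlopen

def parse_phone(payload):
    # Assumed shape: a JSON object with a "phone" key; adjust to the real payload.
    return json.loads(payload).get("phone")

def fetch_random_phone(api_url="https://tempophone.com/api/v1/phones/random"):
    # Fetch the endpoint and parse the JSON body.
    with urlopen(api_url) as resp:
        return parse_phone(resp.read().decode("utf-8"))
```

Because the API hands you the data directly, no browser or JavaScript execution is needed at all.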

Related

VBA & Selenium | Access iframe within HTML containing #document

I am trying to access the HTML within two iframes using Selenium Basic in VBA, as IE has been blocked on our machines and Python etc. are not available to us.
Previously I could access the HTML with this:
Dim IE As InternetExplorerMedium
Set IE = New InternetExplorerMedium
' actual website excluded as it is a work hosted website which requires login, etc.
website = "..."
IE.navigate (website)
Dim IEDocument As HTMLDocument
Set IEDocument = IE.document.getElementById(id1).contentDocument.getElementById(id2).contentDocument
From there I would have access to all the HTML elements which I could work with.
Now I am trying the following with Selenium Basic:
Set cd = New Selenium.ChromeDriver
website = "..."
cd.Start baseUrl:=website
cd.Get "/"
Dim af1 As Selenium.WebElement, af2 As Selenium.WebElement
Set af1 = cd.FindElementById("CRMApplicationFrame")
Set af2 = af1.FindElementById("WorkAreaFrame1")
It works up to the last line: it is able to set af1 to the element with the "CRMApplicationFrame" id; however, I am unable to get inside of it.
I think the solution lies in executing a bit of JavaScript, similar to as in this video:
https://www.youtube.com/watch?v=phYGCGXGtEw
Although I don't have a #ShadowDOM line, I do have a #document line.
Based on and trying to adapt the video I have tried the following:
Set af2 = cd.ExecuteScript(Script:="return arguments[0].contentDocument", arguments:=af1)
However, that did not work.
I also tested:
Dim af1 As Selenium.WebElement
Set af1 = cd.FindElementById("CRMApplicationFrame")
call cd.SwitchToFrame (af1)
Debug.Print cd.PageSource
However, the SwitchToFrame line won't execute, with a 438 error: Object doesn't support this property or method.
Any advice or guidance on how I could succeed would be highly appreciated!
Replace:
call cd.SwitchToFrame (af1)
with:
cd.SwitchToFrame "CRMApplicationFrame"
You can find a relevant detailed discussion in Selenium VBA Excel - problem clicking a link within an iframe

Sending WhatsApp messages via Python/JS

I made a program which takes information from Excel and sends WhatsApp messages via Python.
I used Selenium and a span tag to find the element I needed.
Now WhatsApp has changed their HTML and there is no span anymore.
The old code is here:
import time
import xlrd
from selenium import webdriver

chrome_driver_binary = r"D:\pycharm\chromedriver.exe"
driver = webdriver.Chrome(chrome_driver_binary)
driver.get('http://web.whatsapp.com')
file_location = r"C:\Users\ErelNahum\Desktop\data.xlsx"
book = xlrd.open_workbook(file_location)
print "there is " + str(book.nsheets) + " sheets"
sheet = book.sheet_by_index(0)
cols = sheet.ncols - 1
print "the number of cols is " + str(cols)
raw_input('Enter anything after scanning QR code')
for i in range(cols):
    tel = sheet.cell_value((i + 1), 0)
    tel = tel.replace("\"", "")
    print tel
    messege = sheet.cell_value((i + 1), 1)
    messege = (messege + str(i + 1))  # was str(b + 1); b was never defined
    user = driver.find_element_by_xpath('//span[@title = "{}"]'.format(tel))
    user.click()
    msg_box = driver.find_element_by_class_name('_input-container')
    msg_box.send_keys(messege)
    driver.find_element_by_class_name('compose-btn-send').click()
    time.sleep(0.5)
If you have any idea how to change the program so it will work, please show me.
I know Python, JS, and C#, so any language is fine.
Thank you,
Erel.
Check whatever tag now surrounds the data you're trying to scrape after the span tag was removed, and adjust the code accordingly.
There is no general replacement for span. Can you provide the markup you're trying to scrape (at least the part where the span tag was)?
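Without seeing the new markup it is hard to be specific, but the usual fix is to retarget the locator at whatever tag and attribute now carry the contact name. A hedged sketch; the tag and attribute names in the usage comment are placeholders, not WhatsApp's actual markup:

```python
def xpath_for(tag, attr, value):
    # Build an XPath locator such as //span[@title="Alice"].
    # Assumes the value contains no double quotes.
    return '//{}[@{}="{}"]'.format(tag, attr, value)

# Usage with Selenium (tag/attr here are guesses; inspect the page first):
# user = driver.find_element_by_xpath(xpath_for("div", "title", tel))
```

This keeps the rest of the script unchanged: only the locator string has to track WhatsApp's HTML.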

How to read the generated source (html with DOM changes) of a webpage within javascript?

I want to read a webpage programmatically (with JavaScript/Angular) and search for some elements inside it. What I have so far is:
$http.get('http://.....').success(function(data) {
    var doc = new DOMParser().parseFromString(data, 'text/html');
    var result = doc.evaluate('//div[@class = \'xx\']/a', doc, null, XPathResult.STRING_TYPE, null);
    $scope.all = result.stringValue;
});
So in the example I can read the value of any HTML element.
Unluckily, the page I want to read uses some JavaScript, and the source code (HTML) returned by the HTTP GET is just a part of the entire HTML (including DOM changes) which the browser shows in the end. So the HTML returned from the HTTP GET does not necessarily contain the elements I need.
Is there a way of getting the entire HTML after the JavaScript has run?
Edit: Yes, the page is from another domain, and the provided API does not give me the info I need.
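As with the other questions here, a plain HTTP GET can only ever return the original source; to get the DOM after scripts have run, a real browser engine has to render the page first (e.g. Selenium's page_source). A Python sketch for illustration: the rendering step is shown in comments because it needs a browser, while the extraction half works on any already-rendered, well-formed fragment:

```python
import xml.etree.ElementTree as ET

def links_in_div(html, cls="xx"):
    # Parse a well-formed HTML fragment and collect the text of <a> elements
    # under <div class="xx">, mirroring the XPath //div[@class='xx']/a.
    root = ET.fromstring(html)
    out = []
    for div in root.iter("div"):
        if div.get("class") == cls:
            out.extend(a.text for a in div.iter("a"))
    return out

# Rendering step (requires a browser; shown as a comment):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get(url)
# rendered = driver.page_source  # HTML after the page's scripts ran
# print(links_in_div(rendered))
```

Note that xml.etree requires well-formed markup; real rendered pages usually need a tolerant HTML parser instead, so treat this as a sketch of the pipeline, not production code.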

How to solve error while parsing HTML

I'm trying to get the elements from a web page in a Google spreadsheet using:
function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  var elements = XmlService.parse(html);
}
However, I keep getting the error:
Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")
How do I solve this? I want to get the H1 text from this site, but for other sites I'll have to select other elements.
I know the method XmlService.parse(html) works for other sites, like Wikipedia, as you can see here.
The HTML isn't valid XML, so you can't parse it that way. And you don't need to: you can use string methods instead:
function pegarAsCoisas() {
  var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
  var html = urlFetchReturn.getContentText();
  Logger.log('html.length: ' + html.length);
  var index_OfH1 = html.indexOf('<h1');
  var endingH1 = html.indexOf('</h1>');
  Logger.log('index_OfH1: ' + index_OfH1);
  Logger.log('endingH1: ' + endingH1);
  var h1Content = html.slice(index_OfH1, endingH1);
  h1Content = h1Content.slice(h1Content.indexOf('>') + 1);
  Logger.log('h1Content: ' + h1Content);
}
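The same slice-out-the-tag idea, sketched in Python for comparison. It is deliberately naive, matching the Apps Script version: it assumes a single h1 and no nested markup inside it:

```python
def h1_text(html):
    # Find the opening <h1 ...> and closing </h1>, then slice out what is
    # between the end of the opening tag and the closing tag.
    start = html.find("<h1")
    end = html.find("</h1>")
    if start == -1 or end == -1:
        return None
    body = html[start:end]
    return body[body.find(">") + 1:]
```

Like the Apps Script original, this breaks on pages with several h1 elements or markup inside the heading; it trades robustness for not needing a strict XML parser.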
The XmlService service works only with well-formed XML content; it is not error-tolerant. Google Apps Script used to have a tolerant service called the Xml service. It has been deprecated, but it still works, and you can use it instead, as explained here: GAS-XML
Technically HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?
Regarding the OP code, the following works just fine
function pegarAsCoisas() {
var html = UrlFetchApp
.fetch('http://www.saosilvestre.com.br')
.getContentText();
Logger.log(html);
}
As was said in previous answers, other methods should be used instead of applying XmlService directly to the object returned by UrlFetchApp. You could first convert the web page source code from HTML to XHTML so that XmlService can parse it, use the deprecated Xml service, which can work directly with HTML pages, or handle the web page source code directly as text.
Related questions:
How to parse an HTML string in Google Apps Script without using XmlService?
What is the best way to parse html in google apps script
Try replacing itemscope with itemscope = '':
function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  // Note: with a string argument, String.replace only replaces the first occurrence.
  html = html.replace("itemscope", "itemscope = ''");
  var elements = XmlService.parse(html);
}
For more information, look here.

Issues in developing web scraper

I want to develop a platform where users can enter a URL, and my website will then open that webpage in an iframe. The user can then modify his website by simply right-clicking, and I will provide him options like "remove this element" and "copy this element". I am almost through: many websites open perfectly in the iframe, but for a few websites some errors have shown up. I could not identify the reason, so I am asking for your help.
I have solved other issues like XSS problem.
Here is the procedure I have followed :-
I use JavaScript to send the request to my Java server, which makes a connection to the URL specified by the user, fetches the HTML, uses the Jsoup HTML parser to convert relative URLs into absolute URLs, and then saves the HTML to disk. I then render the saved HTML in my iframe.
Is something wrong with this approach?
A few websites work perfectly, but a few do not.
For example:-
When I tried to open http://www.snapdeal.com it gave me the
Uncaught TypeError: Cannot read property 'paddingTop' of undefined
error. I don't understand why this is happening.
Update
I really wonder how this is implemented? # http://www.proxywebsites.in/browse.php?u=Oi8vd3d3LnNuYXBkZWFsLmNvbQ%3D%3D&b=13&f=norefer
Two issues; pick whichever you like:
your server-side proxy code contains bugs
plenty of sites have either explicit frame-busting code or at least expect to be the top-level frame
You can try one more thing. In your proxy script you are saving the webpage to disk and then loading it into an iframe. Instead of loading the saved page in an iframe, try opening that page directly in the browser. All the sites that restrict their pages from being loaded into an iframe will then open without any error.
Try this; I think it can work.
My Proxy Server side code :-
DateFormat df = new SimpleDateFormat("ddMMyyyyHHmmss");
String dirName = df.format(new Date());
String dirPath = "C:/apache-tomcat-7.0.23/webapps/offlineWeb/" + dirName;
String serverName = "http://localhost:8080/offlineWeb/" + dirName;
boolean directoryCreated = new File(dirPath).mkdir();
if (!directoryCreated)
    log.error("Error in creating directory");
String html = Jsoup.connect(url.toString()).get().html();
doc = Jsoup.parse(html, url.toString());
links = doc.select("link");
scripts = doc.select("script");
images = doc.select("img");
for (Element element : links) {
    String linkHref = element.attr("abs:href");
    if (!linkHref.isEmpty()) { // comparing strings with != "" does not work in Java
        element.attr("href", linkHref);
    }
}
for (Element element : scripts) {
    String scriptSrc = element.attr("abs:src");
    if (!scriptSrc.isEmpty()) {
        element.attr("src", scriptSrc);
    }
}
for (Element element : images) {
    String imgSrc = element.attr("abs:src");
    if (!imgSrc.isEmpty()) {
        element.attr("src", imgSrc);
        log.info(imgSrc);
    }
}
And now I am just returning the path where I saved my HTML file.
That's it for my server code.
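The attribute-rewriting loops above all do the same thing: resolve each relative href/src against the page's base URL, which is what Jsoup's "abs:" prefix computes. For comparison, the same step in Python is a single call to urllib.parse.urljoin:

```python
from urllib.parse import urljoin

def absolutize(base_url, attr_value):
    # Mirror of the Jsoup "abs:href"/"abs:src" lookups: resolve a relative
    # href/src value against the page's base URL; empty values stay empty,
    # matching the isEmpty() guard in the Java loops.
    return urljoin(base_url, attr_value) if attr_value else attr_value
```

Whatever language the proxy is written in, this resolution has to happen for every link, script, and image, or the saved copy will request its assets from the wrong host.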
