HtmlUnit: trying to scrape AngularJS webpage - javascript

I'm trying to use HtmlUnit to scrape the content of a Website.
I am able to log in to my account programmatically, but most of the content is generated client-side with JavaScript (AngularJS). So my question is:
Can HtmlUnit scrape a page whose content is rendered by AngularJS?
This is my code:
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
final WebClientOptions webClientOptions = webClient.getOptions();
webClientOptions.setJavaScriptEnabled(true);

final HtmlPage page1 = webClient.getPage(/* url */);
final HtmlForm form = page1.getFormByName("login");
final HtmlSubmitInput button = form.getInputByValue("Sign In");
final HtmlTextInput textField = form.getInputByName("email");
textField.setValueAttribute(/* email */);
final HtmlPasswordInput textField2 = form.getInputByName("login_password");
textField2.setValueAttribute(/* password */);
final HtmlPage page2 = button.click();
webClient.waitForBackgroundJavaScript(30_000);
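Since AngularJS fills in the DOM after the initial load, one fixed wait is often not enough; a common pattern is to poll until the expected content shows up or a deadline passes. A minimal generic sketch of such a wait loop in plain Java (the `Wait` class and its `until` method are names introduced here for illustration, not part of HtmlUnit):

```java
import java.util.function.BooleanSupplier;

/** Polls a condition until it holds or a deadline passes. */
final class Wait {

    /**
     * Re-checks {@code condition} every {@code intervalMillis} until it
     * returns true or {@code timeoutMillis} has elapsed.
     * Returns the final value of the condition.
     */
    static boolean until(BooleanSupplier condition, long timeoutMillis, long intervalMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(intervalMillis);
        }
        return condition.getAsBoolean();
    }
}
```

With HtmlUnit you would poll something like `() -> page2.asXml().contains("Dashboard")`, where "Dashboard" stands in for text you expect on the post-login page, interleaving the checks with further `webClient.waitForBackgroundJavaScript(...)` calls.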
Thank you in advance for your help.

Related

Creating API service to load processed data with html on an external website (require logic)

I need to figure out how chat providers like tawk.to or Zopim work.
These sites give you a small JS snippet like this:
var Tawk_API = Tawk_API || {},
    Tawk_LoadStart = new Date();
(function() {
    var s1 = document.createElement("script"),
        s0 = document.getElementsByTagName("script")[0];
    s1.async = true;
    s1.src = 'https://embed.tawk.to/5b1a548e10b99c7b36d4bf04/default';
    s1.charset = 'UTF-8';
    s1.setAttribute('crossorigin', '*');
    s0.parentNode.insertBefore(s1, s0);
})();
You are supposed to place this code on your website, and the chat widget loads.
Suppose I have created a chatbot that replies "Hello" whenever you send "Hi", and I have exposed it through a RESTful API like this:
domain.com/query?token=11223344&message=hi
I can certainly call this URL with an AJAX request from any of my own pages. BUT, how can I turn it into an embeddable service?
The ideal scenario is that I hand out a small JS snippet with a token; anyone can place it on their website, the HTML is injected, and the widget performs the example behavior (send "Hi", get "Hello" back).
Is there any framework that provides this functionality out of the box?
Note: Currently using Laravel as the Backend.
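The backend contract itself is small. Since the question is framework-agnostic on the server side, here is a sketch of the `/query` endpoint in plain Java using the JDK's built-in `com.sun.net.httpserver` (in practice you would implement the same route in Laravel; the class name `ChatApi`, the port, and the fallback reply are illustrative assumptions):

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ChatApi {

    /** Splits a raw query string like "token=11223344&message=hi" into a map. */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new HashMap<>();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                params.put(URLDecoder.decode(kv[0], StandardCharsets.UTF_8),
                           URLDecoder.decode(kv[1], StandardCharsets.UTF_8));
            }
        }
        return params;
    }

    /** The bot logic from the question: send "hi", get "Hello" back. */
    static String reply(String message) {
        return "hi".equalsIgnoreCase(message) ? "Hello" : "Sorry, I did not understand that.";
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/query", exchange -> {
            Map<String, String> params = parseQuery(exchange.getRequestURI().getRawQuery());
            // A real service would validate params.get("token") against its database here.
            byte[] body = reply(params.get("message")).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

The embeddable snippet would then be a loader like the tawk.to one above: it injects a script tag carrying the token, which builds the chat UI and calls this endpoint.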

How to trigger jQuery script on site by Java parser

I'm trying to parse vacancies from https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine
But I don't get anything except plain text like "Job Listings Global/English Deutschland/Deutsch Россия/Русский".
The problem is that when the page loads, the browser runs a script that fetches the vacancies, and as far as I understand, Jsoup can't "simulate" a browser and run that script. I tried HtmlUnit, but it also did nothing.
Question: what should I do? Am I doing something wrong with HtmlUnit?
Jsoup
Document page = Jsoup.connect("https://www.epam.com/careers/job-listings?sort=best_match&query=java&department=all&city=all&country=Poland").get();
HtmlUnit
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {
    HtmlPage page = webClient.getPage("https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine");
}
I think I need to run some script manually with
result = page.executeJavaScript("function aa()");
But which one?
You just need to wait a little, as hinted here. You can use:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    String url = "https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine";
    HtmlPage page = webClient.getPage(url);
    Thread.sleep(3_000);
    System.out.println(page.asXml());
}

load webpage completely in C# (contains page-load scripts)

I'm trying to load a webpage in my application's background. The following code shows how I load the page:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream st = response.GetResponseStream();
    StreamReader sr = new StreamReader(st);
    string responseString = sr.ReadToEnd();
    sr.Close();
    st.Close();
}
As you know, the server responds with HTML and some JavaScript code, but a lot of content is added to the page afterwards by JavaScript functions, so I have to interpret or execute the first HTTP response.
I tried using a System.Windows.Forms.WebBrowser object to load the webpage completely, but its engine is too weak for this.
So I tried CefSharp (the Chromium Embedded Framework for .NET); it's great and works fine, but I have trouble with it. This is how I use CefSharp to load a webpage:
ChromiumWebBrowser MainBrowser = new ChromiumWebBrowser("http://example.com/");
MainBrowser.FrameLoadEnd += MainBrowser_FrameLoadEnd;
panel1.Controls.Add(MainBrowser);
MainBrowser.LoadHtml(responseString, "http://example.com");
This works fine when I use the code in Form1.cs and add MainBrowser to a panel. But I want to use it in another class: ChromiumWebBrowser is actually part of a custom object that works in the background, and 10 or 20 of these custom objects may be working at the same time. In that situation ChromiumWebBrowser no longer works.
The second problem is threading: when I call MainBrowser.LoadHtml(responseString, "http://example.com"), it doesn't return any result, so I have to pause execution with a semaphore and wait for the result in the MainBrowser.FrameLoadEnd event.
I wish my code could be something like this:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
string responseString = "";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream st = response.GetResponseStream();
    StreamReader sr = new StreamReader(st);
    responseString = sr.ReadToEnd();
    sr.Close();
    st.Close();
}
string FullPageContent = SomeBrowserEngine.LoadHtml(responseString);
// Do stuff
Can you please show me how to do this? Do you know any other browser engines that work the way I want?
Please tell me if I'm doing anything wrong with CefSharp or misunderstanding any concepts.

HTMLUnit Angular JS Seo not populating UI-VIEW from state change

I am using HtmlUnit headless browsing to create static content for my page automatically.
But for some reason the page does not populate the sub-child of ui-view.
Below is the code:
try (final WebClient webclient = new WebClient(BrowserVersion.CHROME)) {
    webclient.getOptions().setCssEnabled(true);
    webclient.setCssErrorHandler(new SilentCssErrorHandler());
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setTimeout(5000);
    webclient.getOptions().setJavaScriptEnabled(true);
    webclient.getOptions().setPopupBlockerEnabled(true);
    webclient.getOptions().setPrintContentOnFailingStatusCode(false);
    webclient.getOptions().setRedirectEnabled(true);

    final HtmlPage page = webclient.getPage("http://societyfocus.com/portal/#/login/signin");
    // Wait for background JavaScript AFTER requesting the page; calling it
    // before getPage() waits on an empty job queue and does nothing.
    webclient.waitForBackgroundJavaScript(50_000);

    String finalXmlString = page.asXml();
    assertTrue(finalXmlString.contains("Sign in to your account"));
    System.out.println(page.asXml());
}
Any suggestions would be much appreciated.
This is a basic JUnit test case with a reference URL. I am trying to generate snapshots automatically and serve them to the Google bot for SEO.
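For the snapshot-serving side, Google's (now deprecated) AJAX crawling scheme had crawlers request hash-bang URLs in an `_escaped_fragment_` form, and the server was expected to answer with the pre-rendered snapshot. A sketch of that URL mapping in plain Java (assuming the app is switched from `#/` to `#!/` routes; the class name is illustrative):

```java
/**
 * Maps hash-bang SPA URLs to the "_escaped_fragment_" form used by
 * Google's (deprecated) AJAX crawling scheme.
 */
final class SnapshotUrls {

    static String toEscapedFragment(String url) {
        int idx = url.indexOf("#!");
        if (idx < 0) {
            return url;  // not a hash-bang URL; nothing to rewrite
        }
        String base = url.substring(0, idx);
        String fragment = url.substring(idx + 2);
        // A full implementation would also percent-encode the fragment value.
        String sep = base.contains("?") ? "&" : "?";
        return base + sep + "_escaped_fragment_=" + fragment;
    }
}
```

A crawler seeing `http://societyfocus.com/portal/#!/login/signin` would request the rewritten query-string form, which your server can route to the HtmlUnit-generated snapshot.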

How to use Python to get elements that do not appear in HTML, but appear in "Inspect Element" tool of Chrome?

Dear Python experts out there!
I am totally new to Python and am writing a small program to fetch information from a web page. There would be nothing to ask if the page returned all the information in its HTML source, which is easy to view in Chrome. The problem is that the elements I want after submitting an IP address to https://www.maxmind.com/en/geoip-demo do not appear in the HTML body, only in Chrome's "Inspect Element" tool. I used the following code to post to the page and print the response, but the elements I want are not there:
import requests

url = 'https://www.maxmind.com/en/geoip-demo'
data = {'addresses': '162.237.72.200'}
post = requests.post(url, data=data)
print(post.content)
With this code, I hope to get some information related to the IP address in the body of HTML such as
162.237.72.200
US
Pittsburg,California,United States,North America
94565
38.0051,
-121.8387
AT&T U-verse
AT&T U-verse
sbcglobal.net
807
But this information is not in the HTML body, so I would be really grateful if anyone could give me a hint on how to solve this. Thank you so much!
A working solution that simulates the browser navigation and form interaction to retrieve the data, using Scrapy and Selenium WebDriver:
import time

from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.webdriver import ActionChains


class MaxSpider(CrawlSpider):
    name = "max"
    allowed_domains = ["maxmind.com"]
    start_urls = ["https://www.maxmind.com/en/geoip-demo"]

    def __init__(self):
        super().__init__()
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # Type the IP to look up into the address field.
        address_field = self.driver.find_element_by_id('addresses')
        address_field.click()
        address_field.send_keys("62.237.72.200")
        # Submit the form (note: @id, not #id, in the XPath) and wait for the AJAX call.
        submit = self.driver.find_element_by_xpath('//*[@id="geoip-demo-form"]/button')
        ActionChains(self.driver).click(submit).perform()
        time.sleep(3)
        for element in self.driver.find_elements_by_id('geoip-demo-results-tbody'):
            print(element.text)
        self.driver.close()
excerpt from output:
2015-01-13 13:27:18+0100 [max] DEBUG: Crawled (200) https://www.maxmind.com/en/geoip-demo> (referer: http://www.bing.com)
62.237.72.200 FI Finland, Europe 60.1708,
24.9375 Tele Danmark Tele Danmark
