HtmlUnit AngularJS SEO: ui-view not populated after state change - javascript

I am using HtmlUnit headless browsing to generate static content for my page automatically.
However, for some reason the rendered page does not contain the child content of the ui-view element.
Below is the code:
try (final WebClient webclient = new WebClient(BrowserVersion.CHROME)) {
    webclient.getOptions().setCssEnabled(true);
    webclient.setCssErrorHandler(new SilentCssErrorHandler());
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setTimeout(5000);
    webclient.getOptions().setJavaScriptEnabled(true);
    webclient.getOptions().setPopupBlockerEnabled(true);
    webclient.getOptions().setPrintContentOnFailingStatusCode(false);
    // Note: this wait runs before any page has been requested.
    webclient.waitForBackgroundJavaScript(50000);
    webclient.getOptions().setRedirectEnabled(true);
    final HtmlPage page = webclient.getPage("http://societyfocus.com/portal/#/login/signin");
    String finalXmlString = page.asXml();
    assertTrue(finalXmlString.contains("Sign in to your account"));
    System.out.println(page.asXml());
}
Please suggest what might be wrong.
This is a basic JUnit test case using the reference URL above. I am trying to generate snapshots automatically and serve them to Googlebot for SEO.
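One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the snippet above, waitForBackgroundJavaScript is called before getPage, so there are no background scripts to wait for yet, and asXml() may run before AngularJS has filled ui-view. A common workaround is to load the page first and then poll until the expected text appears. The helper below uses only the JDK; the class name RenderWait and the method waitUntil are hypothetical, and the condition in main is a stand-in that simulates content appearing after a delay.

```java
import java.util.function.BooleanSupplier;

final class RenderWait {

    // Polls the condition until it returns true or the timeout elapses.
    // Returns true if the condition was met, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the rendered text is present": becomes true after about 200 ms.
        final long readyAt = System.nanoTime() + 200_000_000L;
        boolean rendered = waitUntil(() -> System.nanoTime() - readyAt >= 0, 2_000, 50);
        System.out.println("rendered = " + rendered);
    }
}
```

With HtmlUnit, the condition would be something like `() -> page.asXml().contains("Sign in to your account")`, evaluated after `webclient.getPage(...)` has returned. HtmlUnit's own `webclient.waitForBackgroundJavaScript(timeout)` can serve a similar purpose, but only when it is called after the page has been loaded.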

Related

How to trigger jQuery script on site by Java parser

I'm trying to parse vacancies from https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine
But I don't get anything except plain text like "Job Listings Global/English Deutschland/Deutsch Россия/Русский"
The problem is that when the page loads, the browser runs a script that loads the vacancies. As far as I understand, Jsoup can't "simulate" a browser and run that script. I tried HtmlUnit, but it did nothing either.
Question: What should I do? Am I doing something wrong with HtmlUnit?
Jsoup
Document page = Jsoup.connect("https://www.epam.com/careers/job-listings?sort=best_match&query=java&department=all&city=all&country=Poland").get();
HtmlUnit
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {
page = webClient.getPage("https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine");
}
I think I need to run some script manually with
result = page.executeJavaScript("function aa()");
But which one?
You just need to wait a little as hinted here.
You can use:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    String url = "https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine";
    HtmlPage page = webClient.getPage(url);
    // give the page's background scripts time to load the vacancies
    Thread.sleep(3_000);
    System.out.println(page.asXml());
}

Load a webpage completely in C# (contains page-load scripts)

I'm trying to load a webpage in the background of my application. The following code shows how I am loading a page:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream st = response.GetResponseStream();
    StreamReader sr = new StreamReader(st);
    string responseString = sr.ReadToEnd();
    sr.Close();
    st.Close();
}
As you know, the server responds with HTML and some JavaScript, but much of the page's content is added later by JavaScript functions, so the first HTTP response has to be interpreted by a browser engine.
I tried using the System.Windows.Forms.WebBrowser control to load the webpage completely, but its engine is too weak for this.
So I tried CefSharp (Chromium embedded browser), which is great and works fine, but I have run into trouble with it. This is how I use CefSharp to load a webpage:
ChromiumWebBrowser MainBrowser = new ChromiumWebBrowser("http://Example/");
MainBrowser.FrameLoadEnd += MainBrowser_FrameLoadEnd; // handler name assumed; the original subscribed the event to itself
panel1.Controls.Add(MainBrowser);
MainBrowser.LoadHtml(responseString,"http://example.com");
It works fine when I use this code in Form1.cs and add MainBrowser to a panel. But I want to use it in another class: ChromiumWebBrowser is actually part of a custom object that works in the background, and 10 or 20 such objects may be working at the same time. In this situation ChromiumWebBrowser doesn't work any more!
The second problem is threading: when I call MainBrowser.LoadHtml(responseString,"http://example.com");
it doesn't return any result, so I have to pause execution with a Semaphore and wait for the result in the MainBrowser.FrameLoadEnd event.
So I wish my code could be something like this:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
string responseString="";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream st = response.GetResponseStream();
    StreamReader sr = new StreamReader(st);
    responseString = sr.ReadToEnd();
    sr.Close();
    st.Close();
}
string FullPageContent = SomeBrowserEngine.LoadHtml(responseString);
// Do stuff
Can you please show me how to do this? Do you know of any other web browser engines that work the way I want?
Please tell me if I'm doing anything wrong with CefSharp or with other concepts.
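The blocking SomeBrowserEngine.LoadHtml call wished for above amounts to turning a "load finished" event into a return value. The sketch below shows that general pattern in Java using only the JDK; it is not CefSharp code. FakeBrowser is a hypothetical stand-in that fires its callback on another thread, and loadAndWait (also a hypothetical name) blocks the caller until the callback arrives or a timeout expires. In .NET, TaskCompletionSource plays a role similar to CompletableFuture here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

final class BlockingLoad {

    // Hypothetical stand-in for an event-driven browser engine.
    static final class FakeBrowser {
        // "Loads" on a background thread and reports the final HTML through the callback.
        void loadHtml(String html, Consumer<String> onLoadEnd) {
            Thread worker = new Thread(() -> onLoadEnd.accept(html + "<!-- scripts ran -->"));
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Blocks until the load-finished callback fires; throws IllegalStateException on timeout.
    static String loadAndWait(FakeBrowser browser, String html, long timeoutMillis) {
        CompletableFuture<String> done = new CompletableFuture<>();
        browser.loadHtml(html, done::complete); // the event handler completes the future
        try {
            return done.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for load", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("load did not finish", e);
        }
    }

    public static void main(String[] args) {
        String full = loadAndWait(new FakeBrowser(), "<p>hello</p>", 2_000);
        System.out.println(full);
    }
}
```

Giving each request its own future, rather than sharing one semaphore, is one way to keep several background loads from interfering with each other. Whether a real embedded browser can run without being attached to a visible control is a separate question that this sketch does not address.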

HtmlUnit: trying to scrape AngularJS webpage

I'm trying to use HtmlUnit to scrape the content of a Website.
I am able to log in to my account programmatically, but most of the client-side content is generated by JavaScript (AngularJS). So my question is:
Can HtmlUnit be used to scrape a page whose content is rendered by AngularJS?
This is my code:
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
WebClientOptions webClientOptions = webClient.getOptions();
webClientOptions.setJavaScriptEnabled(true);
// "//url", "//email" and "//password" are placeholders from the original post
final HtmlPage page1 = webClient.getPage("//url");
final HtmlForm form2 = page1.getFormByName("login");
final HtmlSubmitInput button = form2.getInputByValue("Sign In");
final HtmlTextInput textField = form2.getInputByName("email");
textField.setValueAttribute("//email");
final HtmlPasswordInput textField2 = form2.getInputByName("login_password");
textField2.setValueAttribute("//password");
final HtmlPage page3 = button.click();
// wait for the scripts started by the click before reading the page
webClient.waitForBackgroundJavaScript(30000);
Thank you in advance for your help.

Convert dynamic html to pdf with HiQPdf

I've been trying to get my HTML to accurately translate into a PDF for some time now but I can't see what I'm doing wrong.
Here's my code for the page:
Imports HiQPdf
Imports System.Text
Imports System.IO
Imports System.Web.UI
Partial Class MODULES_CostCalculator_CostCalculator
    Inherits System.Web.UI.Page
    Dim convertToPdf As Boolean = False
    Protected Sub printClick()
        convertToPdf = True
    End Sub
    Protected Overrides Sub Render(writer As System.Web.UI.HtmlTextWriter)
        If (convertToPdf) Then
            System.Diagnostics.Debug.Write("overriding render")
            Dim tw As TextWriter = New StringWriter()
            Dim htw As HtmlTextWriter = New HtmlTextWriter(tw)
            'render the html markup into the TextWriter
            MyBase.Render(htw)
            'get the current page html code
            Dim htmlCode As String = tw.ToString()
            System.Diagnostics.Debug.Write(htmlCode)
            'convert the html to PDF
            'create html to pdf converter
            Dim htmlToPdfConv As HtmlToPdf = New HtmlToPdf()
            'htmlToPdfConv.MediaType = "print"
            'base url used to resolve images, css and script files
            Dim currentPageUrl As String = HttpContext.Current.Request.Url.AbsoluteUri
            'convert html to a pdf memory buffer
            Dim pdfBuffer As Byte() = htmlToPdfConv.ConvertHtmlToMemory(htmlCode, currentPageUrl)
            'inform the browser about the binary data format
            HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf")
            'let the browser know how to open the pdf doc
            HttpContext.Current.Response.AddHeader("Content-Disposition",
                String.Format("attachment; filename=ConvertThisHtmlWithState.pdf; size={0}",
                pdfBuffer.Length.ToString()))
            'write the pdf buffer to http response
            HttpContext.Current.Response.BinaryWrite(pdfBuffer)
            'call End() method of http response to stop ASP.NET page processing
            HttpContext.Current.Response.End()
        Else
            MyBase.Render(writer)
        End If
    End Sub
End Class
Does anyone see what I might be doing wrong? A lot of the HTML is linked to a Knockout ViewModel, so I'm not sure if that would be causing an issue.
To be clear, I can create PDFs of the page, but only with the HTML in the state it was in when the page first loaded. If I change any of the data-bound HTML, the change isn't reflected when I try to make another PDF.
Please try the following:
Add a clear:
Response.Clear()
Response.ClearHeaders()
After the MyBase.Render call:
htw.Flush()
Just before Response.End:
Response.Flush()
If none of the above works:
Call support :)
I think the problem is that you're changing the state of the page after it has rendered (using JavaScript), and you're expecting this: -
MyBase.Render(htw)
'get the current page html code
to give you the current state of the page. It won't: it will give you the state of the page as it was rendered on the server. If you're using Knockout or any other scripting to manipulate the DOM after the page has loaded, the server-side model of the page knows nothing of these changes.

How to create a PDF file that will have initial view=Fit

I am trying to use the iText PdfStamper to change a PDF file so that it always opens with full-page display. I tried:
PdfStamper stamper = new PdfStamper(new PdfReader(src), new FileOutputStream(dest));
PdfWriter writer = stamper.getWriter();
PdfAction action = PdfAction.gotoLocalPage(1, new PdfDestination(PdfDestination.FIT), writer);
writer.setAdditionalAction(PdfWriter.DOCUMENT_OPEN, action);
but DOCUMENT_OPEN is not defined. How can I do this? Should I use stamper.addJavascript instead? If so, what JS code would set up the initial view?
I could use setPageAction(PAGE_OPEN, action, 1) and that works, but I think it might be annoying to the user if every time they look at page 1, the view changes.
BTW, initially I tried to use the PDF Open Parameters, but they are very unreliable. I displayed the pdf using
<embed src='myfile.pdf#view=Fit'>
and Adobe Reader often ignores the view for no apparent reason. That is why I am trying to set the initial view within the PDF itself.
Try this instead:
writer.setOpenAction(action);
Also see the documentation for setOpenAction.
