How to trigger a jQuery script on a site from a Java parser - javascript

I'm trying to parse vacancies from https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine
But I don't get anything except plain text like "Job Listings Global/English Deutschland/Deutsch Россия/Русский"
The problem is that when you load the page, the browser runs a script that loads the vacancies. As far as I understand, Jsoup can't "simulate" a browser and run a script. I tried HtmlUnit, but it also did nothing.
Question: What should I do? Am I doing something wrong with HtmlUnit?
Jsoup
Element page = Jsoup.connect("https://www.epam.com/careers/job-listings?sort=best_match&query=java&department=all&city=all&country=Poland").get();
HtmlUnit
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {
    HtmlPage page = webClient.getPage("https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine");
}
I think I need to manually run some script with
result = page.executeJavaScript("function aa()");
But which one?

You just need to wait a little as hinted here.
You can use:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    String url = "https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine";
    HtmlPage page = webClient.getPage(url);
    Thread.sleep(3_000);
    System.out.println(page.asXml());
}
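If you would rather not hard-code a sleep, HtmlUnit can also be asked to wait for background JavaScript explicitly. Below is a minimal self-contained sketch rather than a guaranteed fix: the class name, the 10-second timeout, and the NicelyResynchronizingAjaxController are my own choices; only the URL comes from the question.
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class EpamVacancies {
    public static void main(String[] args) throws Exception {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Replay AJAX calls synchronously so their results are in the DOM when getPage returns
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(
                    "https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine");
            // Give any remaining background scripts up to 10 seconds to finish
            webClient.waitForBackgroundJavaScript(10_000);
            System.out.println(page.asXml());
        }
    }
}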

Related

How to execute a JS Script stored in a string on Html code stored in a string

I don't even know if what I'm asking is possible; I'm fairly new to web development.
Here is what I'd like to achieve:
I need to execute a JS script on a web page. The script returns some keywords related to that page.
I get the JS script from a remote server and I store it in a string.
The only thing I know about the web page is its url, so I get the content of the page with jQuery.
Here is what I tried (I know it looks stupid, but it illustrates what I'm trying to achieve):
// Let's say I want to execute that script on www.google.com's page
$.get("www.google.com", null, function(data) {
    let myScript = localStorage.getItem('my_script')
    // What I tried so far (it doesn't work, of course)
    var temp = document.createElement('div')
    temp.innerHTML = data
    var resultsOfMyScript = temp.firstChild.eval(myScript) // not a function
})
Do you have any idea on how I could do that?
Try this:
const script = document.createElement('script')
script.type = 'text/javascript'
script.charset = 'utf-8'
script.text = "console.log('this is my script')"
document.body.appendChild(script)
Assign the script string you mentioned to .text in place of the example string.
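If you need to do the same thing from Java rather than in the browser (as in the HtmlUnit question at the top of this page), a loaded page can evaluate a script string directly. A rough sketch, assuming the usual HtmlUnit imports; the URL and the script string here are placeholders, not values from the question:
try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    HtmlPage page = webClient.getPage("https://www.example.com");
    String myScript = "document.title";                 // the script you stored as a string
    ScriptResult result = page.executeJavaScript(myScript);
    Object value = result.getJavaScriptResult();        // whatever the script evaluates to
    System.out.println(value);
}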

load webpage completely in C# (contains page-load scripts)

I'm trying to load a webpage in my application's background. The following code shows how I am loading the page:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream st = response.GetResponseStream();
    StreamReader sr = new StreamReader(st);
    string responseString = sr.ReadToEnd();
    sr.Close();
    st.Close();
}
As you know, the server responds with HTML and some JavaScript, but a lot of content is added to the webpage by JavaScript functions, so I have to interpret (execute) the scripts in the first HTTP response.
I tried to use the System.Windows.Forms.WebBrowser object to load the webpage completely, but it is too weak an engine for this.
So I tried CEFSharp (the Chromium Embedded Framework browser); it's great and works fine, but I have trouble with it. The following is how I use CEFSharp to load a webpage:
ChromiumWebBrowser MainBrowser = new ChromiumWebBrowser("http://Example/");
MainBrowser.FrameLoadEnd += MainBrowser_FrameLoadEnd; // your FrameLoadEnd handler method
panel1.Controls.Add(MainBrowser);
MainBrowser.LoadHtml(responseString, "http://example.com");
It works fine when I use this code in Form1.cs and add MainBrowser to a panel, but I want to use it in another class; ChromiumWebBrowser is actually part of another custom object, and that custom object works in the background. It is also possible that 10 or 20 of these custom objects run at the same time. In this situation ChromiumWebBrowser doesn't work any more!
The second problem is a threading issue: when I call MainBrowser.LoadHtml(responseString, "http://example.com"),
it doesn't return any result, so I have to pause execution with a Semaphore and wait for the result in the MainBrowser.FrameLoadEnd event.
So I wish my code could be something like this:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
string responseString="";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream st = response.GetResponseStream();
    StreamReader sr = new StreamReader(st);
    responseString = sr.ReadToEnd();
    sr.Close();
    st.Close();
}
string FullPageContent = SomeBrowserEngine.LoadHtml(responseString);
//Do stuffs
Can you please show me how to do this? Do you know any other web browser engines that work the way I want?
Please tell me if I'm doing anything wrong with CEFSharp or missing any other concepts.

HTMLUnit Angular JS Seo not populating UI-VIEW from state change

I am using HtmlUnit headless browsing to create static content for my page automatically.
But for some reason my page is not getting the sub-child of ui-view inside it.
Below is the code:
try (final WebClient webclient = new WebClient(BrowserVersion.CHROME)) {
    webclient.getOptions().setCssEnabled(true);
    webclient.setCssErrorHandler(new SilentCssErrorHandler());
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setTimeout(5000);
    webclient.getOptions().setJavaScriptEnabled(true);
    webclient.getOptions().setPopupBlockerEnabled(true);
    webclient.getOptions().setPrintContentOnFailingStatusCode(false);
    webclient.waitForBackgroundJavaScript(50000);
    webclient.getOptions().setRedirectEnabled(true);
    final HtmlPage page = webclient.getPage("http://societyfocus.com/portal/#/login/signin");
    String finalXmlString = page.asXml();
    assertTrue(finalXmlString.contains("Sign in to your account"));
    System.out.println(page.asXml());
}
Any suggestions would be much appreciated.
This is a basic JUnit test case with a reference URL. I am trying to generate snapshots automatically and send them to the Google bot for SEO.
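One detail that may matter here: waitForBackgroundJavaScript only waits for scripts that have already been scheduled, so calling it before getPage (as in the snippet above) does nothing useful for the page loaded afterwards. Below is a rough sketch of the usual pattern, reusing the URL and expected text from the question; the polling loop, the 1-second step, and the NicelyResynchronizingAjaxController are assumptions on my part, not a verified fix for this site:
try (final WebClient webclient = new WebClient(BrowserVersion.CHROME)) {
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.setAjaxController(new NicelyResynchronizingAjaxController());
    final HtmlPage page = webclient.getPage("http://societyfocus.com/portal/#/login/signin");
    // Poll until the Angular view has rendered the expected text, waiting up to ~20 seconds in total
    for (int i = 0; i < 20 && !page.asXml().contains("Sign in to your account"); i++) {
        webclient.waitForBackgroundJavaScript(1_000);
    }
    System.out.println(page.asXml());
}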

C# Faster way to get javascript DOM than EO.WebBrowser

I have code in place where I'm using EO.WebBrowser to get the HTML from a page using an EO.WebView Request:
var cookie = new EO.WebBrowser.Cookie("cookie", "value");
cookie.Path = path;
cookie.Domain = domain;
var options = new BrowserOptions();
options.EnableWebSecurity = false;
Runtime.SetDefaultOptions(options);
var request = new Request(url);
request.Cookies.Add(cookie);
webView.LoadRequestAndWait(request);
Finally, I use the following to get the HTML I need:
webView.GetDOMWindow().document.body.outerHTML
My issue is that this is very slow, and although I can get it to run locally, I cannot get it to run in server code on Azure. Is there a way to do the same thing using HttpWebRequest?
You can use JavaScript:
var data = (string)webView.EvalScript("document.body.outerHTML");
No, HttpWebRequest (and other similar "get me the HTML response" methods) will only give you the HTML itself and will not run the JavaScript on the page.
For server-side processing of dynamic HTML, consider using a proper headless browser instead of trying to convince regular IE to work correctly without a UI.
EO.WebBrowser runs multi-process like Chrome and is unsupported by many cloud service environments.
Just use WebClient, HttpWebRequest, RestSharp, or something similar that can make HTTP requests and get the response HTML.

Issues in developing web scraper

I want to develop a platform where users can enter a URL and my website will then open that webpage in an iframe. The user can then modify the page by simply right-clicking, and I will provide options like "remove this element" and "copy this element". I am almost done. Many websites open perfectly in the iframe, but for a few websites some errors show up. I could not identify the reason, so I'm asking for your help.
I have solved other issues, like the XSS problem.
Here is the procedure I have followed:
I use JavaScript to send the request to my Java server, which connects to the URL specified by the user, fetches the HTML, uses the Jsoup HTML parser to convert relative URLs into absolute URLs, and then saves the HTML to disk. Then I render the saved HTML in my iframe.
Am I doing something wrong somewhere?
A few websites work perfectly, but a few do not.
For example, when I tried to open http://www.snapdeal.com it gave me the
Uncaught TypeError: Cannot read property 'paddingTop' of undefined
error. I don't understand why this is happening.
Update
I really wonder how this is implemented: http://www.proxywebsites.in/browse.php?u=Oi8vd3d3LnNuYXBkZWFsLmNvbQ%3D%3D&b=13&f=norefer
Two issues; pick whichever applies:
your server-side proxy code contains bugs
plenty of sites have either explicit frame-busting code or at least expect to be the top-level frame.
You can try one more thing. In your proxy script you save the webpage to your disk and then load it into the iframe. Instead of loading the saved page in the iframe, try opening that page directly in the browser. All the sites that restrict their page from being loaded into an iframe will then open without any error.
Try this; I think it can work.
My proxy server-side code:
DateFormat df = new SimpleDateFormat("ddMMyyyyHHmmss");
String dirName = df.format(new Date());
String dirPath = "C:/apache-tomcat-7.0.23/webapps/offlineWeb/" + dirName;
String serverName = "http://localhost:8080/offlineWeb/" + dirName;
boolean directoryCreated = new File(dirPath).mkdir();
if (!directoryCreated)
    log.error("Error in creating directory");
String html = Jsoup.connect(url.toString()).get().html();
Document doc = Jsoup.parse(html, url.toString());
Elements links = doc.select("link");
Elements scripts = doc.select("script");
Elements images = doc.select("img");
for (Element element : links) {
    String linkHref = element.attr("abs:href");
    if (!linkHref.isEmpty()) {
        element.attr("href", linkHref);
    }
}
for (Element element : scripts) {
    String scriptSrc = element.attr("abs:src");
    if (!scriptSrc.isEmpty()) {
        element.attr("src", scriptSrc);
    }
}
for (Element element : images) {
    String imgSrc = element.attr("abs:src");
    if (!imgSrc.isEmpty()) {
        element.attr("src", imgSrc);
        log.info(imgSrc);
    }
}
And now I just return the path where I saved my HTML file.
That's it for my server code.
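For reference, the snippet above rewrites the URLs but never actually writes the file, even though the question says the HTML is saved to disk and its path returned. A minimal sketch of that missing step, assuming the doc, dirPath, and serverName variables from the code above plus the usual java.io and java.nio.charset imports; the index.html file name is my own choice:
// Write the rewritten document into the newly created directory
File outputFile = new File(dirPath, "index.html");
try (Writer writer = new OutputStreamWriter(new FileOutputStream(outputFile), StandardCharsets.UTF_8)) {
    writer.write(doc.outerHtml());
}
// URL handed back to the client, to be loaded into the iframe
String savedPageUrl = serverName + "/index.html";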
