load webpage completely in C# (contains page-load scripts) - javascript

I'm trying to load a webpage in my application background. following code shows How I am loading a page:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
Stream st = response.GetResponseStream();
StreamReader sr = new StreamReader(st);
string responseString = sr.ReadToEnd();
sr.Close();
st.Close();
}
as you know, the server responses HTML codes or some javascript codes, but there are many codes which added to the webpage by javascripts functions. so I have to interpret or compile the first HTTP response.
I tried to use System.Windows.Forms.WebBrowser object to load the webpage completely, but this is a weak engine to do this.
so I tried to use CEFSharp (Chromium embedded Browser), it's great and works fine but I have trouble with that. following is how I use CEFSharp to load a webpage:
ChromiumWebBrowser MainBrowser = new ChromiumWebBrowser("http://Example/");
MainBrowser.FrameLoadEnd+=MainBrowser.FrameLoadEnd;
panel1.Controls.Add(MainBrowser);
MainBrowser.LoadHtml(responseString,"http://example.com");
it works fine when I use this code in Form1.cs and when I add MainBrowser to a panel. but I want to use it in another class, actually ChromiumWebBrowser is part of another custom object and the custom object works in background. also it would possible 10 or 20 custom objects work in a same time. in this situation ChromiumWebBrowser doesn't work any more!
second problem is the threading issue, when I call this function MainBrowser.LoadHtml(responseString,"http://example.com");
it doesn't return any results, so I have to pause the code execution by using Semaphore and wait for the result at this event: MainBrowser.FrameLoadEnd
so I wish my code be some thing like this:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
string responseString="";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
Stream st = response.GetResponseStream();
StreamReader sr = new StreamReader(st);
responseString = sr.ReadToEnd();
sr.Close();
st.Close();
}
string FullPageContent = SomeBrowserEngine.LoadHtml(responseString);
//Do stuffs
Can you please show me how to do this? do you know any other web browser engines that work like what I want?
please tell me if I'm doing any things wrong with CEFSharp or other concepts.

Related

Creating API service to load processed data with html on an external website (require logic)

Need to figure out how chat providers like tawk.to or zopim works.
These site give you a small js code like this:
var Tawk_API = Tawk_API || {},
Tawk_LoadStart = new Date();
(function() {
var s1 = document.createElement("script"),
s0 = document.getElementsByTagName("script")[0];
s1.async = true;
s1.src = 'https://embed.tawk.to/5b1a548e10b99c7b36d4bf04/default';
s1.charset = 'UTF-8';
s1.setAttribute('crossorigin', '*');
s0.parentNode.insertBefore(s1, s0);
})();
And you are supposed to place the code on your website and the chat is loaded.
Suppose I have created a chatbot where when you send Hi it sends back Hello
And, I have created a Restful API like this:
domain.com/query?token=11223344&message=hi
I can definitely execute the URL with an ajax call on any of my page.
BUT, how can I create an Service API for this?
Best case scenario is where I provide a small js code with a token to anyone and they place the code on their website and the html is parsed.
It should also be able to perform the example functionality (say Hi get Hello)
Is there any framework which helps to achieve this functionality out of the box?
Note: Currently using Laravel as the Backend.

How to trigger jQuery script on site by Java parser

I'm trying to parse a vacancies from https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine
But I dont get anything execept plain text like "Job Listings Global/English Deutschland/Deutsch Россия/Русский"
The problem is when you load a page - browser runs a script that load some vacancies, but how can I undesrstand JSOUP cant "simulate" browser and run a script. I tried HtmlUnit, but it also done nothing.
Question: What should i do? Am I doing something wrong with HtmlUnit?
Jsoup
Element page = = Jsoup.connect("https://www.epam.com/careers/job-listings?sort=best_match&query=java&department=all&city=all&country=Poland").get();
HtmlUnit
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {
page = webClient.getPage("https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine");
}
I think i need manualy run some script with
result = page.executeJavaScript("function aa()");
But which one?
You just need to wait a little as hinted here.
You can use:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
String url = "https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine";
HtmlPage page = webClient.getPage(url);
Thread.sleep(3_000);
System.out.println(page.asXml());
}

Invoking javascript file with dom access

I have a javascript file that has code with DOM access
var a=document.getElementById("abc").value
I have html file, that contains all DOM information
<html>...<input id="abc" ...></html>
Is there anyway to get C# invoke the javascript file, and return the value of a, back to the c# program?
In reality the JavaScript can be much more complex, and I need to channel those "interested values" back to C#, but let's just consider the simple example mentioning here.
Possible directions I could think is using https://jint.codeplex.com/, or Web browser control. The challenge here is that it not only involves the JavaScript, it also involving the HTML file.
What I want to know is:
Is there a way to channel variable value from JavaScript back to C#?
How to get JavaScript evaluate DOM elements from a HTML file?
Using WebBrowser and Jint:
using (WebBrowser browser = new WebBrowser())
{
browser.ScriptErrorsSuppressed = true;
browser.DocumentText = #"<html><head/><body><input id=""abc"" value=""this is the value in input""></body></html>";
// Wait for control to load page
while (browser.ReadyState != WebBrowserReadyState.Complete)
Application.DoEvents();
dynamic d = (dynamic)browser.Document.DomDocument;//get de activex dom
var jengine = new Jint.Engine();
jengine.SetValue("document", d);
try
{
string val=jengine.Execute(#"var a=document.getElementById('abc').value;").GetValue("a").ToString();
Console.WriteLine(val);
}
catch (Jint.Runtime.JavaScriptException je)
{
Console.WriteLine(je);
}
}

C# Faster way to get javascript DOM than EO.WebBrowser

I have code in place where I'm using EO.WebBrowser to get the html from a page using the EO.WebView Request:
var cookie = new EO.WebBrowser.Cookie("cookie", "value");
cookie.Path = path;
cookie.Domain = domain;
var options = new BrowserOptions();
options.EnableWebSecurity = false;
Runtime.SetDefaultOptions(options);
var request = new Request(url);
request.Cookies.Add(cookie);
webView.LoadRequestAndWait(request);
Finally I use the following to get the HTML I need:
webView.GetDOMWindow().document.body.outerHTML
My issue is that this is very slow and although I can get it to run it locally, I can not get it to run on Azure server code. Is there a way to do the same thing using HttpWebRequest?
You can use JavaScript:
var data = (string)webView.EvalScript("document.body.outerHTML");
No, HttpWebRequest (and other similar "get me HTML response") methods will only give you HTML itself and will not run JavaScript on the page.
For server side processing of dynamic HTML consider using proper headless internet browser? instead of trying to convince regular IE to work correctly without UI.
the eo.webbrowser runs multi-process like chrome and unsupported by many cloud service environment.
just use WebClient or HttpWebRequest or RestSharp or something like that can do http requests to get the response html.

Issues in developing web scraper

I want to develop a platform where users can enter a URL and then my website will open the webpage in an iframe. Now the user can modify his website by simply right clicking and I will provide him options like "remove this element", "copy this element". I am almost through. Many of the websites are opening perfectly in iframe but for a few websites some errors have shown up. I could not identify the reason so asking for your help.
I have solved other issues like XSS problem.
Here is the procedure I have followed :-
Used JavaScript and sent the request to my Java server which makes connection to the URL specified by the user and fetches the HTML and then use Jsoup HTML parser to convert relative URLs into absolute URLs and then save the HTML to my disk in Java. And then I render the saved HTML into my iframe.
Is somewhere wrong ?
A few websites are working perfectly but a few are not.
For example:-
When I tried to open http://www.snapdeal.com it gave me the
Uncaught TypeError: Cannot read property 'paddingTop' of undefined
error. I don't understand why this is happening..
Update
I really wonder how this is implemented? # http://www.proxywebsites.in/browse.php?u=Oi8vd3d3LnNuYXBkZWFsLmNvbQ%3D%3D&b=13&f=norefer
2 issues, pick any you like:
your server side proxy code contains bugs
plenty of sites have either explicit frame-break code or at least expect to be top level frame.
You can try one more thing. In your proxy script you are saving your webpage on your disk and then loading into iframe. I think instead of loading the page you saved on disk in iframe try to open that page in browser. All those sites that restirct their page to be loaded into iframe will now get opened without any error.
Try this I think it an work
My Proxy Server side code :-
DateFormat df = new SimpleDateFormat("ddMMyyyyHHmmss");
String dirName = df.format(new Date());
String dirPath = "C:/apache-tomcat-7.0.23/webapps/offlineWeb/" + dirName;
String serverName = "http://localhost:8080/offlineWeb/" + dirName;
boolean directoryCreated = new File(dirPath).mkdir();
if (!directoryCreated)
log.error("Error in creating directory");
String html = Jsoup.connect(url.toString()).get().html();
doc = Jsoup.parse(html, url);
links = doc.select("link");
scripts = doc.select("script");
images = doc.select("img");
for (Element element : links) {
String linkHref = element.attr("abs:href");
if (linkHref != "") {
element.attr("href", linkHref);
}
}
for (Element element : scripts) {
String scriptSrc = element.attr("abs:src");
if (scriptSrc != "") {
element.attr("src", scriptSrc);
}
}
for (Element element : images) {
String imgSrc = element.attr("abs:src");
if (imgSrc != "") {
element.attr("src", imgSrc);
log.info(imgSrc);
}
}
And Now i am just returning the path where i saved my html file
That's it about my server code

Categories

Resources