Most of the answers I have read concerning this subject point to either the System.Windows.Forms.WebBrowser class or the COM interface mshtml.HTMLDocument from the Microsoft HTML Object Library assembly.
The WebBrowser class did not lead me anywhere. The following code fails to retrieve the HTML code as rendered by my web browser:
[STAThread]
public static void Main()
{
WebBrowser wb = new WebBrowser();
wb.Navigate("https://www.google.com/#q=where+am+i");
wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
{
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument;
foreach (IHTMLElement element in doc.all)
{
System.Diagnostics.Debug.WriteLine(element.outerHTML);
}
};
Form f = new Form();
f.Controls.Add(wb);
Application.Run(f);
}
The above is just an example. I'm not really interested in finding a workaround for figuring out the name of the town where I am located. I simply need to understand how to retrieve that kind of dynamically generated data programmatically.
(Call new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"), save the resulting text somewhere, search for the name of the town where you are currently located, and let me know if you were able to find it.)
Yet when I open "https://www.google.com/#q=where+am+i" in my web browser (IE or Firefox), I see the name of my town written on the web page. In Firefox, if I right-click on the name of the town and select "Inspect Element (Q)", I clearly see the name of the town in the HTML code, which looks quite different from the raw HTML returned by WebClient.
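One likely contributor to the difference, offered as a sketch rather than a diagnosis: everything after the # in a URL is the fragment, and HTTP clients such as WebClient do not send it to the server. Only script running inside a browser can act on it. The helper names below (FragmentDemo, SentToServer, ClientOnly) are made up for illustration; System.Uri is the only real API used.

```csharp
using System;

public static class FragmentDemo
{
    // Path and query: the part of the URL that an HTTP request carries.
    public static string SentToServer(string url)
    {
        return new Uri(url).PathAndQuery;
    }

    // Fragment (including the leading '#'): stays on the client and is
    // only visible to script running in the page.
    public static string ClientOnly(string url)
    {
        return new Uri(url).Fragment;
    }
}
```

For the URL in the question, SentToServer returns "/" and ClientOnly returns "#q=where+am+i", so a plain download requests the bare home page and never mentions the search terms.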
After I got tired of playing with System.Windows.Forms.WebBrowser, I decided to give mshtml.HTMLDocument a shot, only to end up with the same useless raw HTML:
public static void Main()
{
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument();
doc.write(new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"));
foreach (IHTMLElement e in doc.all)
{
System.Diagnostics.Debug.WriteLine(e.outerHTML);
}
}
I suppose there must be an elegant way to obtain this kind of information. Right now all I can think of is to add a WebBrowser control to a form, have it navigate to the URL in question, send the keys CTRL+A, copy whatever happens to be displayed on the page to the clipboard, and attempt to parse it. That's a horrible solution, though.
I'd like to contribute some code to Alexei's answer. A few points:
Strictly speaking, it may not always be possible to determine when the page has finished rendering with 100% probability. Some pages are quite complex and use continuous AJAX updates. But we can get quite close, by polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what LoadDynamicPage does below.
Some time-out logic has to be present on top of the above, in case the page rendering is never-ending (note CancellationTokenSource).
Async/await is a great tool for coding this, as it gives a linear code flow to our asynchronous polling logic, which greatly simplifies it.
It's important to enable HTML5 rendering using Browser Feature Control, as WebBrowser runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.
This is a WinForms app, but the concept can be easily converted into a console app.
This logic works well on the URL you've specifically mentioned: https://www.google.com/#q=where+am+i.
using Microsoft.Win32;
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;
namespace WbFetchPage
{
public partial class MainForm : Form
{
public MainForm()
{
SetFeatureBrowserEmulation();
InitializeComponent();
this.Load += MainForm_Load;
}
// start the task
async void MainForm_Load(object sender, EventArgs e)
{
try
{
var cts = new CancellationTokenSource(10000); // cancel in 10s
var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
MessageBox.Show(html.Substring(0, 1024) + "..." ); // it's too long!
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
// navigate and download
async Task<string> LoadDynamicPage(string url, CancellationToken token)
{
// navigate and await DocumentCompleted
var tcs = new TaskCompletionSource<bool>();
WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
tcs.TrySetResult(true);
using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
{
this.webBrowser.DocumentCompleted += handler;
try
{
this.webBrowser.Navigate(url);
await tcs.Task; // wait for DocumentCompleted
}
finally
{
this.webBrowser.DocumentCompleted -= handler;
}
}
// get the root element
var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];
// poll the current HTML for changes asynchronously
var html = documentElement.OuterHtml;
while (true)
{
// wait asynchronously, this will throw if cancellation requested
await Task.Delay(500, token);
// continue polling if the WebBrowser is still busy
if (this.webBrowser.IsBusy)
continue;
var htmlNow = documentElement.OuterHtml;
if (html == htmlNow)
break; // no changes detected, end the poll loop
html = htmlNow;
}
// consider the page fully rendered
token.ThrowIfCancellationRequested();
return html;
}
// enable HTML5 (assuming we're running IE10+)
// more info: https://stackoverflow.com/a/18333982/1768303
static void SetFeatureBrowserEmulation()
{
if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
return;
var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
appName, 10000, RegistryValueKind.DWord);
}
}
}
Your web-browser code looks reasonable: wait for something, then grab the current content. Unfortunately, there is no official "I'm done executing JavaScript, feel free to steal content" notification from either the browser or JavaScript.
Some sort of active wait (not Sleep but a Timer) may be necessary, and it will be page-specific. Even if you use a headless browser (e.g. PhantomJS), you'll have the same issue.
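A minimal sketch of such a timer-based active wait, assuming the page-specific "done" test can be expressed as a simple predicate. ActiveWait and WaitForAsync are hypothetical names, not part of any library; Task.Delay stands in for the timer so that no thread is blocked the way Thread.Sleep would block it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ActiveWait
{
    // Re-checks a page-specific condition at a fixed interval.
    // Task.Delay throws TaskCanceledException once the token is cancelled,
    // which doubles as the time-out for pages that never settle.
    public static async Task WaitForAsync(
        Func<bool> condition, TimeSpan interval, CancellationToken token)
    {
        while (!condition())
        {
            await Task.Delay(interval, token);
        }
    }
}
```

With a WinForms WebBrowser, the predicate might test whether some element has appeared, for example () => webBrowser.Document.GetElementById("result") != null, where "result" is a made-up id. When awaited on the UI thread, the predicate keeps running on that thread, which is where the control has to be accessed.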
I am using WPF's WebBrowser control to load a simple web page. On this page I have an anchor or a button. I want to capture the click event of that button in my application's code behind (i.e. in C#).
Is there a way for the WebBrowser control to capture click events on the loaded page's elements?
In addition, is it possible to communicate event-triggered data between the page and the WebBrowser? All of the above should be possible, am I right?
Edit: Probable solution:
I have found the following link that might be a solution. I haven't tested it yet but it's worth the shot. Will update this question depending on my test results.
http://support.microsoft.com/kb/312777
Link taken from: Source
Ok Answer found - tested and it works:
Add a reference from the COM tab called: Microsoft HTML Object Library
The following is an example code:
You will need two components: WebBrowser (webBrowser1) and a TextBox (textBox1)
public partial class MainWindow : Window
{
public MainWindow()
{
InitializeComponent();
webBrowser1.LoadCompleted += new LoadCompletedEventHandler(webBrowser1_LoadCompleted);
}
private void webBrowser1_LoadCompleted(object sender, NavigationEventArgs e)
{
mshtml.HTMLDocument doc;
doc = (mshtml.HTMLDocument)webBrowser1.Document;
mshtml.HTMLDocumentEvents2_Event iEvent;
iEvent = (mshtml.HTMLDocumentEvents2_Event)doc;
iEvent.onclick += new mshtml.HTMLDocumentEvents2_onclickEventHandler(ClickEventHandler);
}
private bool ClickEventHandler(mshtml.IHTMLEventObj e)
{
textBox1.Text = "Item Clicked";
return true;
}
}
Here is another example.
I was trying to inject a remote JavaScript file and execute some code when it is ready, by adding a DOM element <script src="{path to remote file}" /> to the header; essentially the same idea as jQuery.getScript(url, callback).
The code below works fine.
HtmlElementCollection head = browser.Document.GetElementsByTagName("head");
if (head != null)
{
HtmlElement scriptEl = browser.Document.CreateElement("script");
IHTMLScriptElement element = (IHTMLScriptElement)scriptEl.DomElement;
element.src = url;
element.type = "text/javascript";
head[0].AppendChild(scriptEl);
// Listen for readyState changes
((mshtml.HTMLScriptEvents2_Event)element).onreadystatechange += delegate
{
if (element.readyState == "complete" || element.readyState == "loaded")
{
Callback.execute(callbackId);
}
};
}
Good day! :) .NET 4.0, console application. I need to write a page parser that runs in console mode and gets the page's code in the form in which it is displayed to the user after loading, without button clicks, scrolling, or other events. I use the code below, but it does not return what I need at all. How can I get the desired result? Should I use other methods or components?
wb = new WebBrowser();
wb.Navigate(linkNorm);
wb.ScriptErrorsSuppressed = true;
wb.DocumentCompleted += new
WebBrowserDocumentCompletedEventHandler(w_DocumentCompleted);
while (wb.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}
originalText = wb.DocumentText;
wb.Dispose();
void w_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
Trace.WriteLine(wb.DocumentText);
}
Changing the call to wb.Document.Body.OuterHtml did not help; the result is just as bad.
I think this is a pretty simple task that someone has probably already solved.
Note: I need the full HTML text in a string variable after ALL the JavaScript has run.
Whenever I try to execute JavaScript through C# using CefSharp (Stable 57.0), I get an error. I am simply trying to execute the alert function, so I can make sure that works and later test it out with my own function. However, I seem to be getting errors trying to do so.
public partial class WebBrowserWindow : Window
{
public WebBrowserWindow()
{
InitializeComponent();
webBrowser.MenuHandler = new ContextMenuHandler();
webBrowser.RequestHandler = new RequestHandler();
}
//Trying to execute this with either method gives me an error.
public void ExecuteJavaScript()
{
//webBrowser.GetMainFrame().ExecuteJavaScriptAsync("alert('test')");
//webBrowser.ExecuteScriptAsync("alert('test');");
}
}
I have tried both ways of executing the script.
The first one:
webBrowser.GetMainFrame().ExecuteJavaScriptAsync("alert('test')");
Gives me this error:
The second:
webBrowser.ExecuteScriptAsync("alert('test');");
Gives me this error:
My objective is to create a C# function that can execute a JavaScript function in my CefSharp Browser.
I tried many links and references, and there weren't that many on Stack Overflow. I also read the FAQ for CefSharp and couldn't find any simple examples that allow me to execute JavaScript at will through C#.
In addition, I've verified the events where the Frame is loaded (it finishes loading), and unloaded (it does not unload), and if the webbrowser is null (which it's not), and the message from the:
webBrowser.GetMainFrame().ExecuteJavaScriptAsync("alert('test')");
still causes the first error to occur.
I tested for GetMainFrame(). It always returns null. ALWAYS. Doesn't matter how long I wait, or what conditions I check for.
IMPORTANT
I forgot to add one crucial piece of information, I have 2 assemblies in my project. Both of them compile into separate executables:
Helper.exe
Main.exe
Main.exe has a window "CallUI" that, when a button is clicked, executes the method I created, "ExecuteJavaScript()", which is inside my window "BrowserWindow". The CallUI window is declared and initialized in Helper.exe.
So basically I am trying to use a separate program to open a window, click a button that calls the method and execute javascript. So I think because they are different processes, it tells me the browser is null. However, when I do it all in Main.exe it works fine. Is there a workaround that allows me to use the separate process to create the window from Helper.exe and execute the Javascript from Main.exe?
It has come to my attention that I was handling the problem the wrong way.
My problem, in fact, doesn't exist if it's just a single process holding all the code together. However, the fact that my project has an executable that was trying to communicate with another was the problem. I actually never had a way for my helper.exe to talk to my main.exe appropriately.
What I learned from this is that the processes were trying to talk to each other without any sort of shared address access. They live in separate address spaces, so whenever my helper.exe tried to execute that javascript portion that belonged in Main.exe, it was trying to execute the script in an uninitialized version of a browser that belonged in its own address space and not main.exe.
So how did I solve that problem? I had to include an important piece that allowed the helper.exe process to talk to the main.exe process. As I googled how processes can talk to each other, I found out about MemoryMappedFiles. So I decided to implement a simple example into my program that allows Helper.exe to send messages to Main.exe.
Here is the example. This is a file I created called "MemoryMappedHandler.cs"
public class MemoryMappedHandler
{
MemoryMappedFile mmf = MemoryMappedFile.CreateOrOpen("mmf1", 512);
MemoryMappedViewStream stream;
MemoryMappedViewAccessor accessor;
BinaryReader reader;
public static Message message = new Message();
public MemoryMappedHandler()
{
stream = mmf.CreateViewStream();
accessor = mmf.CreateViewAccessor();
reader = new BinaryReader(stream);
new Thread(() =>
{
while (stream.CanRead)
{
Thread.Sleep(500);
message.MyStringWithEvent = reader.ReadString();
accessor.Write(0, 0);
stream.Position = 0;
}
}).Start();
}
public static void PassMessage(string message)
{
try
{
using (MemoryMappedFile mmf = MemoryMappedFile.OpenExisting("mmf1"))
{
using (MemoryMappedViewStream stream = mmf.CreateViewStream(0, 512))
{
BinaryWriter writer = new BinaryWriter(stream);
writer.Write(message);
}
}
}
catch (FileNotFoundException)
{
MessageBox.Show("Cannot Send a Message. Please open Main.exe");
}
}
}
This is compiled into a dll that both Main.exe and Helper.exe can use.
Helper.exe uses the method PassMessage() to send the message to a memory-mapped file called "mmf1". Main.exe, which must be open at all times, takes care of creating the file that receives the messages from Helper.exe. It passes each message to a class that holds it, and every time a message is received, that class raises an event.
Here is what the Message class looks like:
[Serializable]
public class Message
{
public event EventHandler HasMessage;
public string _myStringWithEvent;
public string MyStringWithEvent
{
get { return _myStringWithEvent; }
set
{
_myStringWithEvent = value;
if (value != null && value != String.Empty)
{
if (HasMessage != null)
HasMessage(this, EventArgs.Empty);
}
}
}
}
Finally, I had to initialize Message in my WebBrowserWindow class like this:
public partial class WebBrowserWindow : Window
{
public WebBrowserWindow()
{
InitializeComponent();
webBrowser.MenuHandler = new ContextMenuHandler();
webBrowser.RequestHandler = new RequestHandler();
MemoryMappedHandler.message.HasMessage += Message_HasMessage;
}
private void Message_HasMessage(object sender, EventArgs e)
{
ExecuteJavaScript(MemoryMappedHandler.message.MyStringWithEvent);
}
public void ExecuteJavaScript(string message)
{
//webBrowser.GetMainFrame().ExecuteJavaScriptAsync("alert('test')");
//webBrowser.ExecuteScriptAsync("alert('test');");
}
}
And now it allows me to execute the javascript I need by sending a message from the Helper.exe to the Main.exe.
Have you tried this link? Contains a snippet that checks if the browser is initialised first.
cefsharp execute javascript
private void OnIsBrowserInitializedChanged(object sender, IsBrowserInitializedChangedEventArgs args)
{
if(args.IsBrowserInitialized)
{
browser.ExecuteScriptAsync("alert('test');");
}
}
I'm scraping dynamic data from a website. For some reason the PageSource that I get() is partial. However, it is not partial when I view the page source directly from Chrome or Firefox browsers. I would like to know an answer that will enable me to completely scrape the data from the page.
For my application, I want to scrape programmatically using a .Net web browser or similar. I've tried using Selenium WebDriver 2.48.2 with ChromeDriver; I've also tried PhantomJSDriver; I've also tried WebClient; and also HttpWebRequest. All with .Net 4.6.1.
The url: http://contests.covers.com/KingOfCovers/Contestant/PendingPicks/ARTDB
None of the following are working...
Attempt #1: HttpWebRequest
var urlContent = "";
try
{
var request = (HttpWebRequest) WebRequest.Create(url);
request.CookieContainer = new CookieContainer();
if (cookies != null)
{
foreach (Cookie cookie in cookies)
{
request.CookieContainer.Add(cookie);
}
}
var responseTask = Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse,request.EndGetResponse,null);
using (var response = (HttpWebResponse)await responseTask)
{
if (response.Cookies != null)
{
foreach (Cookie cookie in response.Cookies)
{
cookies.Add(cookie);
}
}
using (var sr = new StreamReader(response.GetResponseStream()))
{
urlContent = sr.ReadToEnd();
}
}
Attempt #2: WebClient
// requires async method signature
using (WebClient client = new WebClient())
{
var task = await client.DownloadStringTaskAsync(url);
return task;
}
Attempt #3: PhantomJSDriver
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.HideCommandPromptWindow = true;
using (var driver = new PhantomJSDriver(driverService))
{
driver.Navigate().GoToUrl(url);
WaitForAjax(driver);
string source = driver.PageSource;
return source;
}
public static void WaitForAjax(PhantomJSDriver driver)
{
while (true) // Handle timeout somewhere
{
var ajaxIsComplete = (bool)(driver as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0");
if (ajaxIsComplete)
break;
Thread.Sleep(100);
}
}
I also tried ChromeDriver using page object model. That code is too long to paste here; nonetheless: it has the exact same result as the other 3 attempts.
Expected Results
The data table from the URL is complete, without any missing data. For example, here is a screenshot to compare with the screenshot below. The thing to observe is that there is NOT a "..."; the data appears instead. This can be reproduced by opening the URL in Firefox or Chrome, right-clicking, and choosing View Page Source.
Actual Results
Observe the big gap where the "..." is, as the arrow indicates in the screenshot. There should be many rows of content in place of that "...". This can be reproduced using any of the attempts above.
Please note that the URL serves dynamic data, so you will likely not see exactly the same results as in the screenshots. Nonetheless, the exercise can be repeated; it will simply look different from the screenshots. A quick test for missing data is to compare the Page Source line count: the "complete" data set will have nearly twice as many rows in the HTML.
OK, as requested. Glad to have helped. :)
But in your C#, where are you copying from? In your code you have urlContent = sr.ReadToEnd();. How are you viewing or copying the result of this? Are you copying from the debugger? If so, it may be the debugger's object inspector that is trimming the text. Have you tried taking the result from urlContent and saving it to a file? E.g. System.IO.File.WriteAllText(@"temp.txt", urlContent);
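A small sketch of that check, assuming the downloaded text is already held in a string. ContentDump and SaveAndMeasure are illustrative names only. Writing the string to disk and reading the length back shows whether the content itself is short or whether only the debugger's display was truncated.

```csharp
using System;
using System.IO;

public static class ContentDump
{
    // Writes the text to a file and returns the number of characters read
    // back, so the full content can be inspected outside the debugger.
    public static int SaveAndMeasure(string content, string path)
    {
        File.WriteAllText(path, content);
        return File.ReadAllText(path).Length;
    }
}
```

If the returned count matches the line and character counts seen in the browser's View Page Source, the download was complete and the "..." came from the tool used to look at the string.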
private void button1_Click(object sender, EventArgs e)
{
HtmlDocument doc = webBrowser1.Document;
HtmlElement from = doc.GetElementById("fromStation");
HtmlElement to = doc.GetElementById("toStation");
HtmlElement d = doc.GetElementById("journeyDateInputDate");
HtmlElement s = doc.GetElementById("ticketType");
HtmlElement ticket = doc.GetElementById("ticketType");
HtmlElement submit = doc.GetElementById("jpsubmit");
HtmlElement hcab = doc.GetElementById("handicapPassengers");
from.SetAttribute("value", textBox3.Text);
to.SetAttribute("value", textBox4.Text);
d.SetAttribute("value", textBox5.Text);
ticket.SetAttribute("value", Properties.Settings.Default["ticket"].ToString());
string com = "true";
if (Properties.Settings.Default["check"].ToString() == com)
hcab.InvokeMember("click");
submit.InvokeMember("click");
}
I am making a project in C# where I have to execute code when the web page navigates from one page to another and when the page has loaded completely.
I have used a button to execute the code once the web page has loaded completely, but now I want it to execute without using the button.
To provide a substantial and correct answer, you might want to provide more specifics on the environment. But assuming a whole bunch of details, I'm making an (un)educated guess here.
If it's an MVC project, you can execute the code as you're presenting the next view. If the page is navigated to from JS (which is on the client) or the user simply navigates away from your site, it might be much trickier.
In any case, since it's an operation on the client, you'll need to manage that from JS on the client. The server has let the contents go, and the page stays visible in the browser even if the server goes down.
$(function(){
alert("Page loaded.");
// do other stuff
});