C# + WebBrowser + JavaScript - full load - javascript

Hello! I'm on .NET 4.0, in a console application. I need to write a page parser that runs in console mode and retrieves the page's HTML in the form in which it is displayed to the user after loading, without any button clicks, scrolling, or other events. I use the code below, but it returns something quite different from what I need. How can I get the desired result? Should I use other methods or components?
wb = new WebBrowser();
wb.ScriptErrorsSuppressed = true;
// Subscribe before navigating so the event cannot be missed.
wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(w_DocumentCompleted);
wb.Navigate(linkNorm);
while (wb.ReadyState != WebBrowserReadyState.Complete)
{
    Application.DoEvents();
}
originalText = wb.DocumentText;
wb.Dispose();
void w_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    Trace.WriteLine(wb.DocumentText);
}
Switching to wb.Document.Body.OuterHtml did not help; the result is just as bad.
I think this is a fairly simple task that someone has probably already solved.
Note: I need the full HTML text in a string variable after ALL the JavaScript has run.

Related

Cannot get ExecuteScriptAsync() to work as expected

I'm trying to set an HTML input to read-only using ExecuteScriptAsync. I can make it work, but not in an ideal way, so I'm wondering if anyone knows why it doesn't work the way I would expect it to.
I'm using Cef3, version 63.
I checked whether it's a timing issue, and it doesn't appear to be.
I tried invalidating the view of the browser, but that doesn't seem to help.
The code I currently have, which works:
public void SetReadOnly()
{
    var script = @"
        (function(){
            var labelTags = document.getElementsByTagName('label');
            var searchingText = 'Notification Initiator';
            var found;
            for (var i = 0; i < labelTags.length; i++)
            {
                if (labelTags[i].textContent == searchingText)
                {
                    found = labelTags[i];
                    break;
                }
            }
            if (found)
            {
                found.innerHTML = 'Notification Initiator (Automatic)';
                var input;
                input = found.nextElementSibling;
                if (input)
                {
                    input.setAttribute('readonly', 'readonly');
                }
            }
        })()
    ";
    // The script is deliberately executed twice; a single call does not give the desired result.
    _viewer.Browser.ExecuteScriptAsync(script);
    _viewer.Browser.ExecuteScriptAsync(script);
}
Now, if I remove
found.innerHTML='Notification Initiator (Automatic)';
the input is no longer shown as read-only. The HTML source of the loaded webpage does show it as read-only, but it seems like the frame doesn't get re-rendered once that attribute is set.
Another issue is that I'm executing the script twice. If I run it only once, I don't get the desired result. I'm thinking this could be a problem with the V8 context that is required for the script to run. Apparently running the script creates the context, so that could be the reason why running it twice works.
I have been trying to figure this out for hours and haven't found anything that would explain this weird behaviour. Does anyone have a clue?
Thanks!

How to completely download page source, instead of partial download?

I'm scraping dynamic data from a website. For some reason the PageSource that I get() is partial. However, it is not partial when I view the page source directly from Chrome or Firefox browsers. I would like to know an answer that will enable me to completely scrape the data from the page.
For my application, I want to scrape programmatically using a .Net web browser or similar. I've tried using Selenium WebDriver 2.48.2 with ChromeDriver; I've also tried PhantomJSDriver; I've also tried WebClient; and also HttpWebRequest. All with .Net 4.6.1.
The url: http://contests.covers.com/KingOfCovers/Contestant/PendingPicks/ARTDB
None of the following are working...
Attempt #1: HttpWebRequest
var urlContent = "";
try
{
    var request = (HttpWebRequest) WebRequest.Create(url);
    request.CookieContainer = new CookieContainer();
    if (cookies != null)
    {
        foreach (Cookie cookie in cookies)
        {
            request.CookieContainer.Add(cookie);
        }
    }
    var responseTask = Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);
    using (var response = (HttpWebResponse)await responseTask)
    {
        if (response.Cookies != null)
        {
            foreach (Cookie cookie in response.Cookies)
            {
                cookies.Add(cookie);
            }
        }
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            urlContent = sr.ReadToEnd();
        }
    }
}
catch (WebException)
{
    // The original post cuts off before the end of the try block; its error handling is not shown.
    throw;
}
Attempt #2: WebClient
// requires async method signature
using (WebClient client = new WebClient())
{
var task = await client.DownloadStringTaskAsync(url);
return task;
}
Attempt #3: PhantomJSDriver
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.HideCommandPromptWindow = true;
using (var driver = new PhantomJSDriver(driverService))
{
driver.Navigate().GoToUrl(url);
WaitForAjax(driver);
string source = driver.PageSource;
return source;
}
public static void WaitForAjax(PhantomJSDriver driver)
{
while (true) // Handle timeout somewhere
{
var ajaxIsComplete = (bool)(driver as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0");
if (ajaxIsComplete)
break;
Thread.Sleep(100);
}
}
I also tried ChromeDriver using page object model. That code is too long to paste here; nonetheless: it has the exact same result as the other 3 attempts.
Expected Results
The data table from the URL is complete, without any missing data. For example, here is a screenshot to compare with the screenshot below. The thing to observe is that there is NOT a "..."; instead, there is the data. This can be reproduced by opening the URL in Firefox or Chrome, right-clicking, and choosing View Page Source.
Actual Results
Observe that there is a big gap where the "..." is, as the arrow in the screenshot indicates. There should be many rows of content in place of that "...". This can be reproduced using any of the attempts above.
Please note that the URL serves dynamic data, so you will likely not see exactly the same results as in the screenshots. Nonetheless, the exercise can be repeated; the page will simply look different from the screenshots. A quick way to confirm that data is missing is to compare the page source line counts: the "complete" data set has nearly twice as many rows in the HTML.
OK, as requested. Glad to have helped. :)
But where in your C# are you copying from? Your code has urlContent = sr.ReadToEnd(); how are you viewing or copying the result from this? Are you copying from the debugger? If so, it may be the debugger's object inspector that is trimming the text. Have you tried taking the result from urlContent and saving it to a file? E.g. System.IO.File.WriteAllText(@"temp.txt", urlContent);

How to execute code when webpage navigate from one page to another in C#

private void button1_Click(object sender, EventArgs e)
{
HtmlDocument doc = webBrowser1.Document;
HtmlElement from = doc.GetElementById("fromStation");
HtmlElement to = doc.GetElementById("toStation");
HtmlElement d = doc.GetElementById("journeyDateInputDate");
HtmlElement s = doc.GetElementById("ticketType");
HtmlElement ticket = doc.GetElementById("ticketType");
HtmlElement submit = doc.GetElementById("jpsubmit");
HtmlElement hcab = doc.GetElementById("handicapPassengers");
from.SetAttribute("value", textBox3.Text);
to.SetAttribute("value", textBox4.Text);
d.SetAttribute("value", textBox5.Text);
ticket.SetAttribute("value", Properties.Settings.Default["ticket"].ToString());
string com = "true";
if (Properties.Settings.Default["check"].ToString() == com)
hcab.InvokeMember("click");
submit.InvokeMember("click");
}
I am making a project in C# in which I have to execute code when the web page navigates from one page to another and when the page has loaded completely.
I have used a button to execute the code once the page has completely loaded, but now I want it to execute without using the button.
To provide a substantial and correct answer, you might want to provide more specifics on the environment. But assuming a whole bunch of details, I'm making an (un)educated guess here.
If it's an MVC project, you can execute the code as you're presenting the next view. If the page is navigated to from JS (which runs on the client), or the user simply navigates away from your site, it might be much trickier.
In any case, since it's an operation on the client, you'll need to manage that from JS on the client. The server has let the contents go, and the page stays visible in the browser even if the server goes down.
$(function(){
alert("Page loaded.");
// do other stuff
});
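For readers without jQuery, the same "run once the page is ready" idea can be sketched in plain JavaScript. This is only an illustration, assuming a modern browser; onReady and its doc parameter are hypothetical names, not something from the answer above.

```javascript
// Runs callback once the document has been parsed, or immediately if that
// has already happened. `doc` defaults to the global document in a browser;
// it is a parameter only so the function can be exercised with a stub.
function onReady(callback, doc = globalThis.document) {
  if (doc.readyState === 'loading') {
    // Still parsing: wait for the DOMContentLoaded event, then run once.
    doc.addEventListener('DOMContentLoaded', () => callback(), { once: true });
  } else {
    // 'interactive' or 'complete': the DOM is already available.
    callback();
  }
}
```

In a page this would be called as onReady(function () { /* do other stuff */ });. Unlike the jQuery version, this sketch runs the callback synchronously when the document is already loaded.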

C# WebBrowser control - document does not contain html input control [duplicate]

Most of the answers I have read concerning this subject point to either the System.Windows.Forms.WebBrowser class or the COM interface mshtml.HTMLDocument from the Microsoft HTML Object Library assembly.
The WebBrowser class did not lead me anywhere. The following code fails to retrieve the HTML code as rendered by my web browser:
[STAThread]
public static void Main()
{
WebBrowser wb = new WebBrowser();
wb.Navigate("https://www.google.com/#q=where+am+i");
wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
{
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument;
foreach (IHTMLElement element in doc.all)
{
System.Diagnostics.Debug.WriteLine(element.outerHTML);
}
};
Form f = new Form();
f.Controls.Add(wb);
Application.Run(f);
}
The above is just an example. I'm not really interested in finding a workaround for figuring out the name of the town where I am located. I simply need to understand how to retrieve that kind of dynamically generated data programmatically.
(Call new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"), save the resulting text somewhere, search for the name of the town where you are currently located, and let me know if you were able to find it.)
And yet, when I access "https://www.google.com/#q=where+am+i" from my web browser (IE or Firefox), I see the name of my town written on the web page. In Firefox, if I right-click on the name of the town and select "Inspect Element (Q)", I clearly see the name of the town in the HTML code, which looks quite different from the raw HTML returned by WebClient.
After I got tired of playing with System.Windows.Forms.WebBrowser, I decided to give mshtml.HTMLDocument a shot, only to end up with the same useless raw HTML:
public static void Main()
{
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument();
doc.write(new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"));
foreach (IHTMLElement e in doc.all)
{
System.Diagnostics.Debug.WriteLine(e.outerHTML);
}
}
I suppose there must be an elegant way to obtain this kind of information. Right now all I can think of is to add a WebBrowser control to a form, have it navigate to the URL in question, send the keys CTRL+A, copy whatever happens to be displayed on the page to the clipboard, and attempt to parse it. That's a horrible solution, though.
I'd like to contribute some code to Alexei's answer. A few points:
Strictly speaking, it may not always be possible to determine when the page has finished rendering with 100% probability. Some pages are quite complex and use continuous AJAX updates. But we can get quite close, by polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what LoadDynamicPage does below.
Some time-out logic has to be present on top of the above, in case the page rendering is never-ending (note CancellationTokenSource).
Async/await is a great tool for coding this, as it gives a linear code flow to our asynchronous polling logic, which greatly simplifies it.
It's important to enable HTML5 rendering using Browser Feature Control, as WebBrowser runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.
This is a WinForms app, but the concept can be easily converted into a console app.
This logic works well on the URL you've specifically mentioned: https://www.google.com/#q=where+am+i.
using Microsoft.Win32;
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;
namespace WbFetchPage
{
public partial class MainForm : Form
{
public MainForm()
{
SetFeatureBrowserEmulation();
InitializeComponent();
this.Load += MainForm_Load;
}
// start the task
async void MainForm_Load(object sender, EventArgs e)
{
try
{
var cts = new CancellationTokenSource(10000); // cancel in 10s
var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
MessageBox.Show(html.Substring(0, 1024) + "..." ); // it's too long!
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
// navigate and download
async Task<string> LoadDynamicPage(string url, CancellationToken token)
{
// navigate and await DocumentCompleted
var tcs = new TaskCompletionSource<bool>();
WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
tcs.TrySetResult(true);
using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
{
this.webBrowser.DocumentCompleted += handler;
try
{
this.webBrowser.Navigate(url);
await tcs.Task; // wait for DocumentCompleted
}
finally
{
this.webBrowser.DocumentCompleted -= handler;
}
}
// get the root element
var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];
// poll the current HTML for changes asynchronously
var html = documentElement.OuterHtml;
while (true)
{
// wait asynchronously, this will throw if cancellation requested
await Task.Delay(500, token);
// continue polling if the WebBrowser is still busy
if (this.webBrowser.IsBusy)
continue;
var htmlNow = documentElement.OuterHtml;
if (html == htmlNow)
break; // no changes detected, end the poll loop
html = htmlNow;
}
// consider the page fully rendered
token.ThrowIfCancellationRequested();
return html;
}
// enable HTML5 (assuming we're running IE10+)
// more info: https://stackoverflow.com/a/18333982/1768303
static void SetFeatureBrowserEmulation()
{
if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
return;
var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
appName, 10000, RegistryValueKind.DWord);
}
}
}
Your web-browser code looks reasonable: wait for something, then grab the current content. Unfortunately there is no official "I'm done executing JavaScript, feel free to steal content" notification from either the browser or JavaScript.
Some sort of active wait (not Sleep but a Timer) may be necessary, and it will be page-specific. Even if you use a headless browser (e.g. PhantomJS) you'll have the same issue.
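The timer-based "active wait" can also be sketched on the JavaScript side. This is only a minimal illustration of the idea, assuming the script runs inside the page; it is not code from either answer. waitForStableSnapshot and its parameters are hypothetical names, and "unchanged between two polls" is a heuristic, not a guarantee that the page has finished.

```javascript
// Resolves with the snapshot once two consecutive polls return the same value,
// and rejects if the snapshot is still changing when the timeout elapses.
// getSnapshot, intervalMs and timeoutMs are hypothetical names for this sketch.
function waitForStableSnapshot(getSnapshot, intervalMs = 500, timeoutMs = 10000) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    let previous = getSnapshot();

    // A timer rather than a blocking sleep, so the page keeps running scripts.
    const timer = setInterval(() => {
      let current;
      try {
        current = getSnapshot();
      } catch (err) {
        clearInterval(timer);
        reject(err);
        return;
      }
      if (current === previous) {
        clearInterval(timer);
        resolve(current); // no change since the last poll: treat as settled
      } else if (Date.now() >= deadline) {
        clearInterval(timer);
        reject(new Error('page did not settle before the timeout'));
      } else {
        previous = current;
      }
    }, intervalMs);
  });
}
```

In a page, the snapshot function could be something like () => document.documentElement.outerHTML. Pages with continuous AJAX updates may never settle, which is why the timeout is part of the sketch.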

IE extension to inject javascript in the webpage

I have implemented an IE extension using C++. Its function is to inject JavaScript into the webpage's head tag whenever the extension icon is clicked. I have used the execScript method for script injection.
It works fine, but when I refresh the webpage, click any link on it, or enter another URL, the injected script vanishes.
I don't want the script to vanish; I want it to persist inside the web browser.
How can I achieve that? I am new to IE extension development, any help would be highly appreciated.
Thanks.
STDMETHODIMP CBlogUrlSnaggerAddIn::Exec(
    const GUID *pguidCmdGroup, DWORD nCmdID,
    DWORD nCmdExecOpt, VARIANTARG *pvaIn, VARIANTARG *pvaOut)
{
    HRESULT hr = S_OK;
    // Get the document currently loaded in the browser.
    CComPtr<IDispatch> spDispDoc;
    hr = m_spWebBrowser->get_Document(&spDispDoc);
    if (SUCCEEDED(hr) && spDispDoc)
    {
        CComPtr<IHTMLDocument2> spHTMLDoc;
        hr = spDispDoc.QueryInterface<IHTMLDocument2>(&spHTMLDoc);
        if (SUCCEEDED(hr) && spHTMLDoc)
        {
            VARIANT vrt = {0};
            CComQIPtr<IHTMLWindow2> win;
            hr = spHTMLDoc->get_parentWindow(&win);
            if (SUCCEEDED(hr) && win)
            {
                CComBSTR bstrScript = L"function fn() {alert('helloooo');}var head = document.getElementsByTagName('head')[0],script = document.createElement('script');script[script.innerText ? 'innerText' : 'textContent'] = '(' + fn + ')()';head.appendChild(script);head.parentNode.replaceChild(script,'script');";
                CComBSTR bstrLanguage = L"javascript";
                // Run the injection script in the page's window.
                hr = win->execScript(bstrScript, bstrLanguage, &vrt);
            }
        }
    }
    return hr;
}
Instead of calling execScript in the Exec handler, try moving that code into an OnDocumentComplete method. Use a sink map, which sets up the event handling. A sample is provided below.
BEGIN_SINK_MAP(CMyClass)
SINK_ENTRY_EX(1, DIID_DWebBrowserEvents2,DISPID_DOCUMENTCOMPLETE , OnDocumentComplete)
END_SINK_MAP()
Implement the DocumentComplete in your class file.
void STDMETHODCALLTYPE CMyClass::OnDocumentComplete(IDispatch *pDisp,VARIANT *pvarURL)
{
//Inject the scripts here
}
Update:
I haven't tried this, but I guess the DownloadBegin event would serve your purpose. It is mapped in the same way as the DocumentComplete event; the only difference is the DISPID, DISPID_DOWNLOADBEGIN. Map a corresponding handler method to that DISPID and give it a try.
BEGIN_SINK_MAP(CMyClass)
SINK_ENTRY_EX(1,DIID_DWebBrowserEvents2,DISPID_DOWNLOADBEGIN, OnDocumentLoad)
END_SINK_MAP()
The handler is similar to the DocumentComplete one, except that the DownloadBegin event passes no arguments:
void STDMETHODCALLTYPE CMyClass::OnDocumentLoad()
{
//Inject scripts here
}
http://msdn.microsoft.com/en-us/library/cc136547(v=vs.85).aspx
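If the injection is re-run from an event handler on every navigation, the script passed to execScript may end up running more than once for the same document. As a rough sketch only (not from the answer above), the JavaScript side could guard against that with a marker id. injectOnce and markerId are hypothetical names, and the code sticks to ES5 syntax on the assumption that it runs in IE's script engine.

```javascript
// Appends an inline <script> that immediately invokes `fn` in the page,
// unless a script carrying the same marker id is already present.
// `markerId` is a hypothetical name chosen for this sketch.
function injectOnce(doc, fn, markerId) {
  markerId = markerId || 'injected-by-extension';
  if (doc.getElementById(markerId)) {
    return false; // already injected into this document
  }
  var script = doc.createElement('script');
  script.id = markerId;
  // Wrap the function source so it is called as soon as the script is added.
  script.text = '(' + fn.toString() + ')();';
  var head = doc.getElementsByTagName('head')[0] || doc.documentElement;
  head.appendChild(script);
  return true;
}
```

In the page this would be called as injectOnce(document, fn). A freshly loaded document has no marker element, so the script is injected again after each navigation, while a repeated call on the same document becomes a no-op.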
