I'm trying to build a universal RSS application for Windows 10 that can download the full content of an article's page for offline reading.
After spending a lot of time on Stack Overflow, I found this code:
HttpClientHandler handler = new HttpClientHandler { UseDefaultCredentials = true, AllowAutoRedirect = true };
HttpClient client = new HttpClient(handler);
HttpResponseMessage response = await client.GetAsync(ni.Url);
response.EnsureSuccessStatusCode();
string html = await response.Content.ReadAsStringAsync();
However, this solution doesn't work on web pages whose content is loaded dynamically.
So the remaining alternative seems to be this: load the web page into the WinRT WebView control and somehow copy and paste the rendered text.
But the WebView doesn't implement any copy/paste method or similar, so there is no easy way to do it.
Finally, I found this post on Stack Overflow (Copying the content from a WebView under WinRT) that seems to deal with exactly the same problem as mine, with the following solution:
Use the InvokeScript method of the WebView to copy and paste the content through a JavaScript function.
It says: "First, this javascript function must exist in the HTML loaded in the webview."
function select_body() {
var range = document.body.createTextRange();
range.select();
}
and then "use the following code:"
// call the select_body function to select the body of our document
MyWebView.InvokeScript("select_body", null);
// capture a DataPackage object
DataPackage p = await MyWebView.CaptureSelectedContentToDataPackageAsync();
// extract the RTF content from the DataPackage
string RTF = await p.GetView().GetRtfAsync();
// SetText of the RichEditBox to our RTF string
MyRichEditBox.Document.SetText(Windows.UI.Text.TextSetOptions.FormatRtf, RTF);
But what it doesn't say is how to inject the JavaScript function if it doesn't exist in the page I'm loading?
If you have a WebView like this:
<WebView Source="http://kiewic.com" LoadCompleted="WebView_LoadCompleted"></WebView>
Use InvokeScriptAsync in combination with eval() to get the document content:
private async void WebView_LoadCompleted(object sender, NavigationEventArgs e)
{
WebView webView = sender as WebView;
string html = await webView.InvokeScriptAsync(
"eval",
new string[] { "document.documentElement.outerHTML;" });
// TODO: Do something with the html ...
System.Diagnostics.Debug.WriteLine(html);
}
Related
I need to change a web page source in GeckoFX web browser including html, css and js.
This is my code:
geckoWebBrowser1.Navigate("http://example.com/");
geckoWebBrowser1.DocumentCompleted += GeckoWebBrowser1_DocumentCompleted;
private void GeckoWebBrowser1_DocumentCompleted(object sender, Gecko.Events.GeckoDocumentCompletedEventArgs e)
{
WebClient w = new WebClient();
string s = (w.DownloadString("http://example.com/"));
// after making changes to (s)
geckoWebBrowser1.LoadHtml(s, "http://example.com/");
}
But this doesn't work with JavaScript. Can anyone help me?
The problem is that geckoWebBrowser1.LoadHtml also triggers GeckoWebBrowser1_DocumentCompleted(). So you will loop endlessly.
Move the LoadHtml to another function, or change the content live as below.
Also, are you using WebClient to download the same page? There is no need; the source is already available.
GeckoHtmlElement element = null;
var geckoDomElement = e.Window.Document.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
element = (GeckoHtmlElement)geckoDomElement;
element.InnerHtml = element.InnerHtml.Replace("Google", "Göggel");
}
Javascript is most easily executed using the following:
using (AutoJSContext context = new AutoJSContext(ActiveBrowser.Window.DomWindow))
{
var result = context.EvaluateScript("testFunction();");
}
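The endless loop described above (LoadHtml re-triggering DocumentCompleted) can also be avoided with a small "skip the next event" flag. The following is only a language-neutral sketch written in Java, not GeckoFX code: FakeBrowser, onDocumentCompleted and loadHtml are hypothetical stand-ins for the real control and its event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a browser control: every load fires the
// "document completed" listeners, including loads started by a listener.
class FakeBrowser {
    private final List<Consumer<String>> listeners = new ArrayList<>();
    int loads = 0;
    String current = "";

    void onDocumentCompleted(Consumer<String> listener) {
        listeners.add(listener);
    }

    void loadHtml(String html) {
        loads++;
        current = html;
        for (Consumer<String> listener : listeners) {
            listener.accept(html);
        }
    }
}

class ReloadGuardDemo {
    static FakeBrowser run() {
        FakeBrowser browser = new FakeBrowser();
        // One-element array so the lambda can mutate the flag.
        boolean[] skipNext = { false };

        browser.onDocumentCompleted(html -> {
            if (skipNext[0]) {
                // This event was caused by our own loadHtml call: ignore it once.
                skipNext[0] = false;
                return;
            }
            skipNext[0] = true;
            browser.loadHtml(html.replace("Google", "Example"));
        });

        browser.loadHtml("<p>Google</p>");
        return browser;
    }

    public static void main(String[] args) {
        FakeBrowser browser = run();
        System.out.println(browser.loads + " loads, content: " + browser.current);
    }
}
```

The flag is cleared inside the handler rather than right after the reload call, so the same idea should still hold when the completed event is raised asynchronously.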
I'm trying to parse vacancies from https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine
But I don't get anything except plain text like "Job Listings Global/English Deutschland/Deutsch Россия/Русский"
The problem is that when you load the page, the browser runs a script that loads the vacancies, but as far as I understand, Jsoup can't "simulate" a browser and run the script. I tried HtmlUnit, but it also did nothing.
Question: What should I do? Am I doing something wrong with HtmlUnit?
Jsoup
Document page = Jsoup.connect("https://www.epam.com/careers/job-listings?sort=best_match&query=java&department=all&city=all&country=Poland").get();
HtmlUnit
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {
page = webClient.getPage("https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine");
}
I think I need to manually run some script with
result = page.executeJavaScript("function aa()");
But which one?
You just need to wait a little as hinted here.
You can use:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
String url = "https://www.epam.com/careers/job-listings?query=java&department=all&city=Kyiv&country=Ukraine";
HtmlPage page = webClient.getPage(url);
Thread.sleep(3_000);
System.out.println(page.asXml());
}
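A fixed three-second sleep can be too short on a slow connection and wasteful on a fast one. A possible alternative is to poll for a condition with a timeout. The helper below is a minimal sketch using only the standard library; PollUntil and waitUntil are hypothetical names, and the condition you pass in (for example, "the page now contains a vacancy element") is up to you. HtmlUnit also offers webClient.waitForBackgroundJavaScript(timeoutMillis), which may be enough on its own.

```java
import java.util.function.BooleanSupplier;

class PollUntil {
    // Checks the condition every intervalMillis until it holds or the timeout passes.
    // Returns true if the condition became true, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 200 ms.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 3_000, 50);
        System.out.println("ready: " + ready);
    }
}
```

With HtmlUnit, the lambda would typically inspect the already loaded page object again on each poll, since background scripts update it in place.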
I'm trying to program my first website scraper, and my first step is to save the HTML to a string. However, from what I can tell, the data that I need to get is not in the HTML code per se, but rather is added after JavaScript executes some stuff.
My current code is this:
let myURLString = "Example URL"
let myURL = URL(string: myURLString)
var myHTMLString = ""
do {
myHTMLString = try String(contentsOf: myURL!)
} catch let error {
print("Error: \(error)")
}
But this doesn't seem to execute the JavaScript and instead just gives me the 'unprocessed' HTML.
I read this answer here, but it's written in Swift 2.0, and since, to be honest, I didn't really understand what was going on (I don't have much programming experience), I couldn't get it to work in Swift 3.
So, is there a way to take the HTML from a website, run the JavaScript, and then save the result as a String in Swift 3? And if so, how do you do it?
Thanks!
After some digging I got something that worked:
import Cocoa
import WebKit
class ViewController: NSViewController, WebFrameLoadDelegate {
@IBOutlet var myWebView: WebView!
override func viewDidLoad() {
super.viewDidLoad()
// Do any additional setup after loading the view.
self.myWebView.frameLoadDelegate = self
let urlString = "YOUR HTTPS URL"
self.myWebView.mainFrame.load(URLRequest(url: URL(string: urlString)!))
}
override var representedObject: Any? {
didSet {
// Update the view, if already loaded.
}
}
func webView(_ sender: WebView!, didFinishLoadFor frame: WebFrame!) {
let doc = myWebView.stringByEvaluatingJavaScript(from: "document.documentElement.outerHTML")! //get it as html
//doc now has the 'processed HTML'
}
}
I'm trying to load a web page in the background of my application. The following code shows how I am loading a page:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
Stream st = response.GetResponseStream();
StreamReader sr = new StreamReader(st);
string responseString = sr.ReadToEnd();
sr.Close();
st.Close();
}
As you know, the server responds with HTML and some JavaScript code, but much of the page content is added afterwards by JavaScript functions, so I have to interpret or execute the first HTTP response.
I tried to use the System.Windows.Forms.WebBrowser object to load the web page completely, but its engine is too weak for this.
So I tried CefSharp (an embedded Chromium browser). It's great and works fine, but I have trouble with it. The following is how I use CefSharp to load a web page:
ChromiumWebBrowser MainBrowser = new ChromiumWebBrowser("http://Example/");
MainBrowser.FrameLoadEnd += MainBrowser_FrameLoadEnd;
panel1.Controls.Add(MainBrowser);
MainBrowser.LoadHtml(responseString,"http://example.com");
It works fine when I use this code in Form1.cs and add MainBrowser to a panel. But I want to use it in another class: ChromiumWebBrowser is actually part of another custom object, and the custom object works in the background. It is also possible that 10 or 20 custom objects work at the same time. In this situation ChromiumWebBrowser doesn't work any more!
The second problem is a threading issue. When I call this function: MainBrowser.LoadHtml(responseString,"http://example.com");
it doesn't return any result, so I have to pause code execution using a Semaphore and wait for the result in this event: MainBrowser.FrameLoadEnd
So I wish my code could be something like this:
request = (HttpWebRequest)WebRequest.Create("http://example.com");
request.CookieContainer = cookieContainer;
string responseString="";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
Stream st = response.GetResponseStream();
StreamReader sr = new StreamReader(st);
responseString = sr.ReadToEnd();
sr.Close();
st.Close();
}
string FullPageContent = SomeBrowserEngine.LoadHtml(responseString);
//Do stuffs
Can you please show me how to do this? Do you know any other web browser engines that work the way I want?
Please tell me if I'm doing anything wrong with CefSharp or other concepts.
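The Semaphore-based wait described in this question amounts to turning a "load finished" event into a value the caller can block on. The sketch below shows that general pattern in Java with a CompletableFuture; it is not CefSharp code. loadHtmlAsync is a hypothetical engine that reports completion through a callback on another thread, and in .NET the analogous building block would be TaskCompletionSource.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

class LoadAwaitDemo {
    // Hypothetical engine: "renders" on a background thread and reports
    // the result through a callback, like a FrameLoadEnd-style event.
    static void loadHtmlAsync(String html, Consumer<String> onFrameLoadEnd) {
        Thread worker = new Thread(
                () -> onFrameLoadEnd.accept("<rendered>" + html + "</rendered>"));
        worker.setDaemon(true);
        worker.start();
    }

    // Blocking wrapper: completes a future from the callback and waits for it.
    static String loadHtml(String html, long timeoutMillis) {
        CompletableFuture<String> done = new CompletableFuture<>();
        loadHtmlAsync(html, done::complete);
        try {
            return done.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the load", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("page did not finish loading", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadHtml("<p>hello</p>", 2_000));
    }
}
```

Because each call creates its own future, several loads can wait independently, which may matter if 10 or 20 objects run at the same time.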
I am trying to convert HTML to PDF on Linux, and I also have to use this in a web app. Please let me know what tools are available for this.
So far I have tried:
html2ps htmlfilename > a.ps
ps2pdf a.ps > a.pdf
But the above doesn't convert images and ignores CSS. My development environment is Linux (RHEL5).
I have also tried http://www.webupd8.org/2009/11/convert-html-to-pdf-linux.html, but I get this error:
[root@localhost bin]# ./wkhtmltopdf www.example.com a.pdf
./wkhtmltopdf: error while loading shared libraries: libQtWebKit.so.4: cannot open shared object file: No such file or directory
You are on the right path: wkhtmltopdf is the easiest way to do this. Note that the package in the repositories might be outdated (I'm not sure how up to date it is); you may need to compile it from source, or get the statically linked version (which is huge, but has the Qt library and other dependencies already included).
Also, in your case, you may just be missing a library - installing libqt4-webkit-dev might do the trick here.
Two ways that are easy to implement and suitable for converting HTML+CSS to PDF are:
1) Using the jsPDF JavaScript plugin with the html2canvas plugin (web app).
Insert a stable version of the jsPDF plugin.
var script = document.createElement('script');
script.type = 'text/javascript';
script.src ='https://cdnjs.cloudflare.com/ajax/libs/jspdf/1.0.272/jspdf.min.js';
document.head.appendChild(script);
Insert html2canvas plugin
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://cdnjs.cloudflare.com/ajax/libs/html2canvas/0.4.1/html2canvas.js';
document.head.appendChild(script);
Insert the following script
var html2obj = html2canvas($('your div class here'));
var queue = html2obj.parse();
var canvas = html2obj.render(queue);
var img = canvas.toDataURL("image/jpeg");
console.log(img);
var doc=new jsPDF("p", "mm", "a4");
var width = doc.internal.pageSize.width;
var height = doc.internal.pageSize.height;
doc.addImage(canvas, 'JPEG', 15, 35, 180, 240,'SLOW');
doc.save("save.pdf");
Special Case for IE 11
document.getElementById("your div here").style.backgroundColor = "#FFFFFF";
2) Using wkhtmltopdf
Install wkhtmltopdf from here
We can use wkhtmltopdf directly from the terminal/command line; for Java, however, there is a wrapper we can use.
Code Example using wkhtmltopdf wrapper
import com.github.jhonnymertz.wkhtmltopdf.wrapper.Pdf;
import com.github.jhonnymertz.wkhtmltopdf.wrapper.page.PageType;
import com.github.jhonnymertz.wkhtmltopdf.wrapper.params.Param;
public class PofPortlet extends MVCPortlet {
@Override
public void render(RenderRequest request , RenderResponse response) throws PortletException , IOException
{ super.render(request, response);
Pdf pdf = new Pdf();
pdf.addPage("http://www.google.com", PageType.url);
// Add a Table of contents
pdf.addToc();
// The "wkhtmltopdf" shell command accepts different types of options such as global, page, headers and footers, and toc. Please see "wkhtmltopdf -H" for a full explanation.
// All options are passed as array, for example:
// Save the PDF
try {
pdf.saveAs("E:\\output.pdf");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
3) Other tools include PhantomJS, iText, and GrabzIt.
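For option 2, if you would rather not depend on the wrapper, the command-line tool can also be started from Java with ProcessBuilder. This is a minimal sketch, assuming wkhtmltopdf is installed and on the PATH; WkhtmltopdfCommand, build and run are hypothetical names. Options go before the input, and the output file comes last.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class WkhtmltopdfCommand {
    // Builds the argument list: wkhtmltopdf [options...] <source> <output.pdf>
    static List<String> build(String source, String outputPdf, String... options) {
        List<String> command = new ArrayList<>();
        command.add("wkhtmltopdf");
        for (String option : options) {
            command.add(option);
        }
        command.add(source);
        command.add(outputPdf);
        return command;
    }

    // Runs the command and returns its exit code (0 normally means success).
    // Assumes the wkhtmltopdf binary can be found on the PATH.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // show the tool's progress and errors in our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command = build("http://www.example.com", "a.pdf", "--page-size", "A4");
        System.out.println(String.join(" ", command));
        // To actually convert, call run(command) and check the exit code.
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids shell quoting problems when URLs or file names contain spaces or ampersands.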
Probably the easiest way would be to launch any modern browser, go to the site, and then use the browser's "print" capability to print to a pdf (assuming your system has a pdf printer set up). I don't know if that's an option in your case, though, and this sort of thing won't work from within a web app. Still, you may want to try it.