Scrape HTML after JavaScript has modified it in Swift? - javascript

I'm trying to program my first website scraper, and my first step is to save the HTML to a string. However, from what I can tell, the data that I need to get is not in the HTML code per se, but rather is added after JavaScript executes some stuff.
My current code is this:
let myURLString = "Example URL"
let myURL = URL(string: myURLString)
var myHTMLString = ""
do {
myHTMLString = try String(contentsOf: myURL!)
} catch let error {
print("Error: \(error)")
}
But this doesn't seem to execute the JavaScript and instead just gives me the 'unprocessed' HTML.
I read this answer here, but it's written in Swift 2.0. To be honest, I didn't really understand what was going on (I don't have much programming experience), so I couldn't get it to work in Swift 3.
So, is there a way to take the HTML from a website, run the JavaScript and then save the result as a String in Swift 3? And if so, how do you do it?
Thanks!

After some digging I got something that worked:
import Cocoa
import WebKit
class ViewController: NSViewController, WebFrameLoadDelegate {
@IBOutlet var myWebView: WebView!
override func viewDidLoad() {
super.viewDidLoad()
// Do any additional setup after loading the view.
self.myWebView.frameLoadDelegate = self
let urlString = "YOUR HTTPS URL"
self.myWebView.mainFrame.load(URLRequest(url: URL(string: urlString)!))
}
override var representedObject: Any? {
didSet {
// Update the view, if already loaded.
}
}
func webView(_ sender: WebView!, didFinishLoadFor frame: WebFrame!) {
let doc = myWebView.stringByEvaluatingJavaScript(from: "document.documentElement.outerHTML")! //get it as html
//doc now has the 'processed HTML'
}
}

Related

how can i change webpage source including(html,css,js) geckofx c#

I need to change a web page source in GeckoFX web browser including html, css and js.
This is my code:
geckoWebBrowser1.Navigate("http://example.com/");
geckoWebBrowser1.DocumentCompleted += GeckoWebBrowser1_DocumentCompleted;
private void GeckoWebBrowser1_DocumentCompleted(object sender, Gecko.Events.GeckoDocumentCompletedEventArgs e)
{
WebClient w = new WebClient();
string s = (w.DownloadString("http://example.com/"));
//after do changes on (s)
geckoWebBrowser1.LoadHtml(s, "http://example.com/");
}
But it doesn't work with JavaScript. Can anyone help me?
The problem is that geckoWebBrowser1.LoadHtml also triggers GeckoWebBrowser1_DocumentCompleted(). So you will loop endlessly.
Move the LoadHtml to another function, or change the content live as below.
Also, are you using WebClient to download the same page? There is no need; the source is already available.
GeckoHtmlElement element = null;
var geckoDomElement = e.Window.Document.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
element = (GeckoHtmlElement)geckoDomElement;
element.InnerHtml = element.InnerHtml.Replace("Google", "Göggel");
}
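The endless loop described above can also be avoided with a guard flag that skips the one event caused by the handler's own reload. The following is only a minimal JavaScript sketch of that idea: FakeBrowser, attachRewriter and their methods are hypothetical stand-ins written for this illustration and are not part of GeckoFX.

```javascript
// FakeBrowser is a hypothetical stand-in for a browser control (not GeckoFX).
class FakeBrowser {
  constructor() {
    this.handlers = [];
    this.html = "";
  }
  onDocumentCompleted(handler) {
    this.handlers.push(handler);
  }
  // Loading HTML always raises the "completed" event, as LoadHtml is said to do.
  loadHtml(html) {
    this.html = html;
    this.handlers.forEach((handler) => handler(this));
  }
}

function attachRewriter(browser, rewrite) {
  let skipNext = false; // true when the next event comes from our own reload
  let calls = 0;
  browser.onDocumentCompleted((b) => {
    calls += 1;
    if (skipNext) {
      skipNext = false; // swallow the event we caused ourselves
      return;
    }
    skipNext = true;
    b.loadHtml(rewrite(b.html));
  });
  return () => calls; // lets the caller see how often the handler ran
}

const browser = new FakeBrowser();
const handlerCalls = attachRewriter(browser, (html) => html.replace("Google", "Göggel"));
browser.loadHtml("<p>Google</p>");
console.log(browser.html, handlerCalls());
```

The handler runs twice (once for the original load, once for the rewritten page) and then stops, instead of reloading forever.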
JavaScript is most easily executed using the following:
using (AutoJSContext context = new AutoJSContext(ActiveBrowser.Window.DomWindow))
{
var result = context.EvaluateScript("testFunction();");
}

Download WebView content in WinRT application

I'm trying to build a universal RSS application for Windows 10 that can download the content of the full article's page for offline reading.
So after spending a lot of time on stackoverflow I've found some code:
HttpClientHandler handler = new HttpClientHandler { UseDefaultCredentials = true, AllowAutoRedirect = true };
HttpClient client = new HttpClient(handler);
HttpResponseMessage response = await client.GetAsync(ni.Url);
response.EnsureSuccessStatusCode();
string html = await response.Content.ReadAsStringAsync();
However, this solution doesn't work on web pages where the content is loaded dynamically.
So the remaining alternative seems to be this: load the web page into the WebView control of WinRT and somehow copy and paste the rendered text.
But the WebView doesn't implement any copy/paste method or similar, so there is no easy way to do it.
Finally, I found this post on Stack Overflow (Copying the content from a WebView under WinRT) that seems to deal with exactly the same problem as mine, with the following solution:
Use the InvokeScript method of the WebView to copy and paste the content through a JavaScript function.
It says: "First, this javascript function must exist in the HTML loaded in the webview."
function select_body() {
var range = document.body.createTextRange();
range.select();
}
and then "use the following code:"
// call the select_body function to select the body of our document
MyWebView.InvokeScript("select_body", null);
// capture a DataPackage object
DataPackage p = await MyWebView.CaptureSelectedContentToDataPackageAsync();
// extract the RTF content from the DataPackage
string RTF = await p.GetView().GetRtfAsync();
// SetText of the RichEditBox to our RTF string
MyRichEditBox.Document.SetText(Windows.UI.Text.TextSetOptions.FormatRtf, RTF);
But what it doesn't say is how to inject the JavaScript function if it doesn't exist in the page I'm loading.
If you have a WebView like this:
<WebView Source="http://kiewic.com" LoadCompleted="WebView_LoadCompleted"></WebView>
Use InvokeScriptAsync in combination with eval() to get the document content:
private async void WebView_LoadCompleted(object sender, NavigationEventArgs e)
{
WebView webView = sender as WebView;
string html = await webView.InvokeScriptAsync(
"eval",
new string[] { "document.documentElement.outerHTML;" });
// TODO: Do something with the html ...
System.Diagnostics.Debug.WriteLine(html);
}
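The same eval route could, in principle, also answer the injection question: instead of reading outerHTML, the string passed to eval can define the missing function on the page. This is only a sketch under that assumption. The name injectSelectBody and the fakeWindow check are made up for this illustration, and the body of select_body is copied from the quoted post.

```javascript
// Script text that a host app could pass as the argument to eval.
// It defines select_body only when the page does not already provide it.
const injectSelectBody = `
  if (typeof window.select_body !== "function") {
    window.select_body = function () {
      var range = document.body.createTextRange();
      range.select();
    };
  }
`;

// Stand-in check outside a browser: run the script against a fake window object.
const fakeWindow = {};
new Function("window", injectSelectBody)(fakeWindow);
console.log(typeof fakeWindow.select_body);
```

In the C# handler shown above, the content of injectSelectBody would take the place of the "document.documentElement.outerHTML;" string. Whether the WebView then lets InvokeScript call select_body by name has not been verified here.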

Is there a better way than eval() in this scenario?

It is a web app, using Google Apps Script, running as the user accessing the app.
We have custom data and code for some users.
That custom information is in a text file within the developer's Google Drive, with only View access from the specific user.
The content of that text file could be like below dummy code:
var oConfig = {
some : "OK",
getinfo : function (s) {
return this.some + s;
}
}
In order to get that custom data / code into the app, we can use eval() as shown below:
var rawjs = DriveApp.getFileById(jsid).getBlob().getDataAsString();
eval(rawjs);
Logger.log(oConfig.getinfo("?")); // OK?
My questions are:
Is there a better way to achieve this goal than eval()?
Is eval() secure enough in this case, considering that the text file is only editable by the developer?
Thanks, Fausto
Well, it looks secure enough. But using eval has other drawbacks, such as making your code harder to debug.
If you're generating such custom data within your code, I imagine the variety of such customizations is enumerable. If so, I'd leave the code within your script and save in Drive just data and use indicators (like function variants names) of how to rebuild the config object in your script. For example:
function buildConfig(data) {
var config = JSON.parse(data); //only data, no code
config.getInfo = this[config.getInfo]; //hook code safely
return config;
}
function customInfo1(s) { return this.some + s; }
function customInfo2(s) { return s + this.some; }
function testSetup() {
//var userData = DriveApp.getFileById(jsid).getBlob().getDataAsString();
var userData = '{"some":"OK", "getInfo":"customInfo1"}'; //just for easier testing
var config = buildConfig(userData); //no eval
//let's test it
Logger.log(config.getInfo('test'));
}
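One caveat with this[config.getInfo] is that it can resolve to any global function whose name appears in the data file. A slightly stricter variant, sketched here in plain JavaScript, keeps the allowed variants in an explicit table and rejects unknown names. The names infoVariants and buildConfigStrict are hypothetical, and on an older Apps Script runtime const may need to be replaced with var.

```javascript
// Explicit table of allowed variants; names not listed here are rejected.
const infoVariants = {
  customInfo1: function (s) { return this.some + s; },
  customInfo2: function (s) { return s + this.some; },
};

function buildConfigStrict(data) {
  const config = JSON.parse(data); // data only, no code
  const name = config.getInfo;
  if (!Object.prototype.hasOwnProperty.call(infoVariants, name)) {
    throw new Error("Unknown getInfo variant: " + name);
  }
  config.getInfo = infoVariants[name]; // hook up code from the table only
  return config;
}

const strictConfig = buildConfigStrict('{"some":"OK", "getInfo":"customInfo2"}');
console.log(strictConfig.getInfo("test")); // testOK
```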
It seems secure. But, it will make your execution process slower if you have large data in your text file.
I would still suggest using JSON.parse() instead of eval() to parse your custom data/code.
{
"some" : "OK",
"getinfo" : "function(s){return this.some +\" \"+ s;}"
}
var rawjs = DriveApp.getFileById(jsid).getBlob().getDataAsString();
var oConfig = JSON.parse(rawjs, function (k, v) {
// put your code here to turn the function string into a function
return v;
}); // avoid eval()
Logger.log(oConfig.getinfo("?"));
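To make the reviver argument of JSON.parse more concrete, here is a minimal sketch. It deliberately differs from the file format above: turning function source text back into a function would itself need eval or something equivalent, so this version assumes the file stores only a handler name. The names handlers, appendToSome and reviveConfig are hypothetical.

```javascript
// Allowed handlers, looked up by the name stored in the JSON text.
const handlers = {
  appendToSome: function (s) { return this.some + " " + s; },
};

// JSON.parse calls the reviver once for every key/value pair it produces.
function reviveConfig(key, value) {
  if (key === "getinfo") {
    if (!Object.prototype.hasOwnProperty.call(handlers, value)) {
      throw new Error("Unknown handler: " + value);
    }
    return handlers[value]; // swap the name for the real function
  }
  return value; // leave every other value untouched
}

const rawText = '{"some": "OK", "getinfo": "appendToSome"}';
const revived = JSON.parse(rawText, reviveConfig);
console.log(revived.getinfo("?")); // OK ?
```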

Using jQuery to access a Flash Function

So I'm trying to interact with Flash variables using jQuery. The original author of the Flash-based program has not got back to me yet, so I thought I'd ask here. I'm not that strong in AS3, so forgive me.
Within the original action script, I added a new import statement:
import flash.external.*;
There's a function called ini that initializes the program, and I added this towards the bottom:
//MODS===========
ExternalInterface.addCallback("gotoLastPage", gotoLastPage);
//===============
For all intents and purposes, just know that there is an existing, working function called gotoLastPage. It is declared as private with a void return type and works in the default application. All seemed fine there; I got no errors when I recompiled the swf file.
Now the swf object is initialized like this
var flashvars = {};
flashvars.pages = "reader_fl/pages.xml";
flashvars.settings = "reader_fl/settings.xml";
var params = {};
params.quality = "high";
params.scale = "noscale";
params.wmode = "transparent";
var attributes = {};
attributes.align = "middle";
attributes.allowFullscreen = "true";
swffit.showScrollV();
swfobject.embedSWF("reader_fl/PageFlip_v6.swf", "Reader_Window_player", "100%", "100%",
"10.0.0", false, flashvars, params, attributes);
As a note, I'm using swfobject. The reader comes up fine and is wrapping around a div called Reader_Window_player.
Now when I go to jQuery, I tried:
$("#Floating_CtrlStart").click(function(){
var Reader = $('#Reader_Window_player')[0];
Reader.gotoLastPage();
})
However, I still can’t seem to access the gotoLastPage. Console says that gotoLastPage is not defined.
Any help here?
Are you opening the html page from the file system and not served from a web server? If so, that would explain why it's not working.
Calls to ExternalInterface fail if the content (html and swf) is in the local-with-networking or local-with-filesystem sandbox (source: http://help.adobe.com/en_US/ActionScript/3.0_ProgrammingAS3/WS5b3ccc516d4fbf351e63e3d118a9b90204-7c9b.html).
I love jQuery, but I usually do that the old-fashioned way:
var getSwf = function (swfName) {
var isIE = navigator.appName.indexOf("Microsoft") != -1;
return (isIE) ? window[swfName] : document[swfName];
}
getSwf("Reader_Window_player").gotoLastPage();
Also make sure you have the following in your JS:
attributes.id = "Reader_Window_player";
attributes.name = "Reader_Window_player";
and as @Cherniv stated in the comments:
params.allowScriptAccess = "always";

How can I get an ASP.NET Session variable into a javascript file?

I am refactoring a legacy web app. This app uses the onload event inside the body tag (on the Master page) to run this JavaScript function. Note that this script tag comes after the form element in the document. I know the syntax looks hideous (or at least Visual Studio tells me it is, judging by the squiggles), but I'll be darned, the thing DOES indeed work.
function DisplayPDF()
{
var strPDF
strPDF = "<%=SESSION("PDF")%>";
if (strPDF.length != 0)
{
window.open(strPDF);
<%Session("PDF") = ""%>
}
}
My question is I'm trying to develop a more elegant solution. I have ASP.NET ajax and jQuery both available to me. I wrote a tiny asp.net ajax component that I want to use to handle this.
Type.registerNamespace("ck");
ck.pdfOpener = function() {
ck.pdfOpener.initializeBase(this);
}
ck.pdfOpener.prototype = {
initialize: function() {
ck.pdfOpener.callBaseMethod(this, 'initialize');
},
dispose: function() {
ck.pdfOpener.callBaseMethod(this, 'dispose');
},
openPDF: function(){
//HOW CAN I RETRIEVE A SESSION VARIABLE HERE???
}
}
ck.pdfOpener.registerClass('ck.pdfOpener', Sys.Component);
if (typeof(Sys) !== 'undefined') Sys.Application.notifyScriptLoaded();
Can/Should I be doing it this way? Or should I create a WebService that returns said variable. Thanks for any advice.
Cheers,
~ck in San Diego
Use the Page.ClientScript.RegisterClientScriptBlock(typeof(YOURPAGECLASS), "KEY", "ACTUALJSCODE"); in your code behind (i.e. the Page_Load event handler)
Alternatively, you could send the PDF file directly from the ASP.NET page instead of from within the client script.
string pdfFile = Session["PDF"] as string;
if (!string.IsNullOrEmpty(pdfFile))
{
Session.Remove("PDF");
Response.ContentType = "application/pdf";
FileInfo fi = new FileInfo(pdfFile);
Response.AddHeader("Content-Disposition", "attachment; filename=\"" + Path.GetFileName(pdfFile) + "\"");
Response.AddHeader("Content-Length", fi.Length.ToString());
Response.WriteFile(pdfFile);
Response.Flush();
return;
}
A simple way to do this would be, as you said, to make an Ajax call to a page method, MVC action, HttpHandler or web service that returns the required value from the session. The bigger question here is why you are storing the path to a PDF file in your session state.
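A rough client-side sketch of that Ajax approach might look like the following. The endpoint /PdfPath.ashx, its JSON shape { "pdf": "..." } and the function name openPendingPdf are all assumptions made up for this illustration, and fetch is used here only as a stand-in for whichever Ajax helper (jQuery or ASP.NET Ajax) the app already uses. The fetch and open functions are passed in so that the sketch can run outside a browser.

```javascript
// fetchFn and openFn are injected so the sketch can run without a browser.
function openPendingPdf(fetchFn, openFn) {
  return fetchFn("/PdfPath.ashx") // hypothetical endpoint returning { "pdf": "..." }
    .then((response) => response.json())
    .then((data) => {
      if (data.pdf && data.pdf.length !== 0) {
        openFn(data.pdf); // open the PDF only when the session held a path
        return true;
      }
      return false;
    });
}

// Demo with fakes standing in for the server response and for window.open.
const opened = [];
const fakeFetch = () =>
  Promise.resolve({ json: () => Promise.resolve({ pdf: "/files/report.pdf" }) });
const pending = openPendingPdf(fakeFetch, (url) => opened.push(url));
pending.then((didOpen) => console.log(didOpen, opened));
```

In a page, the call would be openPendingPdf(window.fetch.bind(window), function (url) { window.open(url); }). The server-side handler would also need to clear the session value after returning it, as the original inline script did.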
