I would like to read the content of a website into a string.
I started by using jsoup as follows:
private void getWebsite() {
new Thread(new Runnable() {
@Override
public void run() {
final StringBuilder builder = new StringBuilder();
try {
String query = "https://merhav.nli.org.il/primo-explore/search?tab=default_tab&search_scope=Local&vid=NLI&lang=iw_IL&query=any,contains,הארי פוטר";
Document doc = Jsoup.connect(query).get();
String title = doc.title();
Elements links = doc.select("div");
builder.append(title).append("\n");
for (Element link : links) {
builder.append("\n").append("Link : ").append(link.attr("href"))
.append("\n").append("Text : ").append(link.text());
}
} catch (IOException e) {
builder.append("Error : ").append(e.getMessage()).append("\n");
}
runOnUiThread(new Runnable() {
@Override
public void run() {
tv_result.setText(builder.toString());
}
});
}
}).start();
}
However, the problem is that when I open this site in a web browser such as Chrome, one of the lines in the page source reads:
window.appPerformance.timeStamps['index.html']= Date.now();</script><primo-explore><noscript>JavaScript must be enabled to use the system</noscript><style>.init-message {
From what I have read, jsoup does not execute JavaScript, so it has no good solution for this case.
Is there a good way to get the elements of this page even though it relies on JavaScript?
EDIT:
After trying the suggestions below, I used a WebView to load the URL and then parsed the result using jsoup as follows:
wb_result.getSettings().setJavaScriptEnabled(true);
MyJavaScriptInterface jInterface = new MyJavaScriptInterface();
wb_result.addJavascriptInterface(jInterface, "HtmlViewer");
wb_result.setWebViewClient(new WebViewClient() {
@Override
public void onPageFinished(WebView view, String url) {
wb_result.loadUrl("javascript:window.HtmlViewer.showHTML ('<head>'+document.getElementsByTagName('html')[0].innerHTML+'</head>');");
}
});
It did the job and indeed showed me the elements. However, unlike a browser, it still shows some attributes as unevaluated expressions rather than their results. For example:
ng-href="{{::$ctrl.getDeepLinkPath()}}"
Is there a way to parse and display the result like in the browser?
Thank you
I'd suggest opening the Network tab in Chrome Developer Tools and then loading the URL; you'll see a lot of requests going back and forth.
Two that seem to contain relevant content are:
https://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs?blendFacetsSeparately=false&getMore=0&inst=NNL&lang=iw_IL&limit=10&newspapersActive=false&newspapersSearch=false&offset=0&pcAvailability=true&q=any,contains,%D7%94%D7%90%D7%A8%D7%99+%D7%A4%D7%95%D7%98%D7%A8&qExclude=&qInclude=&refEntryActive=false&rtaLinks=true&scope=Local&skipDelivery=Y&sort=rank&tab=default_tab&vid=NLI
which requires a token to access; the token comes from:
https://merhav.nli.org.il/primo_library/libweb/webservices/rest/v1/guestJwt/NNL?isGuest=true&lang=iw_IL&targetUrl=https%253A%252F%252Fmerhav.nli.org.il%252Fprimo-explore%252Fsearch%253Ftab%253Ddefault_tab%2526search_scope%253DLocal%2526vid%253DNLI%2526lang%253Diw_IL%2526query%253Dany%252Ccontains%252C%2525D7%252594%2525D7%252590%2525D7%2525A8%2525D7%252599%252520%2525D7%2525A4%2525D7%252595%2525D7%252598%2525D7%2525A8&viewId=NLI
.. which likely requires the JSESSIONID, which comes from:
https://merhav.nli.org.il/primo_library/libweb/webservices/rest/v1/configuration/NLI
.. so in order to replicate the chain of calls you could use jsoup to make these (and any other relevant) HTTP GET requests and carry over the relevant HTTP headers (typically session, referer, accept, and possibly some other cookie values).
It's not going to be straightforward, but you're essentially looking for a URL on the page that appears in one of the JSON responses from the network requests.
Once you know which request you want to recreate, you just have to work back up the list of requests and try to recreate them.
This one is not easy and would take a lot of time to recreate. My advice, if you're going to attempt it: forget trying to parse the HTML; rebuild the chain of three or so HTTP requests to the back end, get the relevant JSON, and parse that. You can often pick a website apart, but this one's a big job.
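As a rough illustration of the last link in that chain, here is a minimal Java sketch that builds the search URL and prepares (without sending) the request. The class name PrimoSearchSketch, the trimmed-down parameter list, the Bearer-style Authorization header, and the way the session cookie is passed are all assumptions; check the real request headers in the Network tab before relying on any of them.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

class PrimoSearchSketch {
    static final String BASE =
            "https://merhav.nli.org.il/primo_library/libweb/webservices/rest";

    // Percent-encode a query-string value as UTF-8 (spaces become '+').
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Build the search URL. Parameter names are copied from the captured
    // request; dropping the remaining parameters is an untested assumption.
    static String searchUrl(String term) {
        return BASE + "/primo-explore/v1/pnxs?inst=NNL&lang=iw_IL&limit=10&offset=0"
                + "&q=any,contains," + encode(term)
                + "&scope=Local&sort=rank&tab=default_tab&vid=NLI";
    }

    // Prepare the request; nothing is sent until the caller reads the response.
    static HttpURLConnection prepareSearch(String term, String token, String sessionCookie)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(searchUrl(term)).openConnection();
        conn.setRequestProperty("Accept", "application/json");
        // Assumption: the guest token is sent as a Bearer token.
        conn.setRequestProperty("Authorization", "Bearer " + token);
        // Assumption: the session is carried in a cookie such as "JSESSIONID=...".
        conn.setRequestProperty("Cookie", sessionCookie);
        return conn;
    }
}
```

The token and cookie would come from the two earlier requests in the chain; the JSON body can then be read from conn.getInputStream() on a background thread.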
Background
This may seem to be a duplicate to many other questions. Trust me that it isn't.
I'm trying to load html data into a WebView, being able to capture user hyperlink requests. In the process I've found this answer which does exactly what I want to do, except it captures other requests to things like CSS files and images:
// you tell the webclient you want to catch when a url is about to load
@Override
public boolean shouldOverrideUrlLoading(WebView view, WebResourceRequest request){
return true;
}
// here you execute an action when the URL you want is about to load
@Override
public void onLoadResource(WebView view, String url){
if( url.equals("http://cnn.com") ){
// do whatever you want
}
}
I've shut off automatic image loading, network loads, and Javascript execution:
settings.setBlockNetworkLoads(true);
settings.setBlockNetworkImage(true);
settings.setJavaScriptEnabled(false);
But none of these prevents those requests from being captured.
Maybe there is a different way to capture the link click, but my options seemed to be either this or stopping the loading of external resources.
Question
How do I prevent WebView from capturing (or attempting to load) resource requests like CSS, JS, or images?
Otherwise if I can't prevent capturing or attempting to load, how can I differentiate between links clicked and web resources?
Thanks ahead!
You could override WebViewClient's shouldInterceptRequest and return some non-null response instead of the CSS, JS, images, etc. being fetched.
Example:
@Override
public WebResourceResponse shouldInterceptRequest(WebView view, String url) {
Log.d(TAG, "shouldInterceptRequest: " + url);
if (url.contains(".css")
|| url.contains(".js")
|| url.contains(".ico")) { // add other specific resources..
return new WebResourceResponse(
"text/css",
"UTF-8",
getActivity().getResources().openRawResource(R.raw.some_css));
} else {
return super.shouldInterceptRequest(view, url);
}
}
where R.raw.some_css is:
body {
font-family: sans-serif;
}
Note:
I'm not sure what pages you're loading, but this approach may ruin the look of the page.
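One caveat with url.contains(".js") is that it also matches URLs such as data.json or page.jsp. A minimal sketch of a stricter check, which looks only at the extension of the last path segment, could look like this. The class name ResourceFilter and the list of blocked extensions are hypothetical choices for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ResourceFilter {
    // Hypothetical list; extend it with whatever resource types you want to block.
    private static final Set<String> BLOCKED =
            new HashSet<>(Arrays.asList("css", "js", "ico", "png", "jpg", "gif"));

    // True if the URL's path ends in one of the blocked extensions.
    static boolean isStaticResource(String url) {
        String path;
        try {
            path = URI.create(url).getPath(); // query string and fragment are excluded
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URI, so do not treat it as a resource
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return false; // the last path segment has no extension
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BLOCKED.contains(extension);
    }
}
```

Inside shouldInterceptRequest, a call to ResourceFilter.isStaticResource(url) could then stand in for the chain of url.contains(...) checks.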
I've found a way to ignore automated WebView resource requests.
By ignoring requests in the first second of WebView initialization, I am able to isolate user based clicks from the rest:
final Long time = System.currentTimeMillis()/1000;
//load up a WebView, define a WebViewClient for capturing link clicking
WebView webview = new WebView(this);
WebViewClient webviewClient = new WebViewClient() {
@Override
public boolean shouldOverrideUrlLoading(WebView view, WebResourceRequest request){
return true;
}
@Override
public void onLoadResource(WebView view, String url){
Long currentTime = System.currentTimeMillis()/1000;
if (currentTime - time > 1) {
//do stuff here
}
}
};
I have not tested this solution without blocking JavaScript execution and automatic image loading, but it should work regardless:
WebSettings settings = webview.getSettings();
settings.setBlockNetworkLoads(true);
settings.setBlockNetworkImage(true);
settings.setJavaScriptEnabled(false);
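Because the snippet above divides by 1000 before subtracting, the cut-off is measured in whole-second ticks and can fall anywhere between one and two seconds after start-up. A small sketch that keeps the comparison in milliseconds is shown here; the class name StartupWindow and its method are hypothetical, and the one-second window is simply the value used in this answer.

```java
class StartupWindow {
    private final long startMillis;
    private final long windowMillis;

    // startMillis: when the WebView was created; windowMillis: how long to ignore loads.
    StartupWindow(long startMillis, long windowMillis) {
        this.startMillis = startMillis;
        this.windowMillis = windowMillis;
    }

    // True once more than windowMillis have passed since startMillis.
    boolean isAfterStartup(long nowMillis) {
        return nowMillis - startMillis > windowMillis;
    }
}
```

In onLoadResource you would create the window once with new StartupWindow(System.currentTimeMillis(), 1000) and then check isAfterStartup(System.currentTimeMillis()) before acting on a URL.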
Short answer is, you can't.
A longer answer could be like this: you won't be able to do that because it is designed to be "capture all or capture nothing". Web requests are a general concept, not tied to a particular resource type like images or CSS; in fact, the WebView has no clue what those are. That's why you won't find a setting for it.
Do it like this: in shouldOverrideUrlLoading, instead of returning true all the time, return true only for the URLs you want to handle yourself. For all other cases, like CSS and so forth, return false, so the WebView will take care of them for you.
For example:
@Override
public boolean shouldOverrideUrlLoading(WebView view, String url) {
// Ignore css and js
if (url.endsWith(".css") || url.endsWith(".js")) {
return false;
}
return true;
}
I'm currently working on graphing data via d3 into a webview. Naturally, things are breaking as soon as I try to reload the graph and feed it new data. This lovely line keeps popping up: W/cr_BindingManager: Cannot call determinedVisibility() - never saw a connection for the pid.
I've scoured SO for an explanation, but there doesn't seem to be anything conclusive. People are just suggesting to turn on DOM storage in webview settings (which obviously doesn't fix the issue). I'm suspecting there is a race condition between reloading the graph and feeding it new data. I've overridden onPageFinished() in my WebViewClient to call the listener to load the data into the chart, thinking it would resolve the race condition, but to no avail.
Can someone please explain to me what W/cr_BindingManager: Cannot call determinedVisibility() - never saw a connection for the pid means? Am I off in my assessment? How can I debug it?
Any tips are appreciated.
EDIT: I've solved the original issue, but I would still love to learn what that line means. Bounty up.
Consecutive calls to loadUrl cause a race condition. The problem is that loadUrl("file://..") doesn't complete immediately, and so when you call loadUrl("javascript:..") it will sometimes execute before the page has loaded.
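One way to avoid that race is to hold scripts back until the page reports that it has finished loading. The sketch below is plain Java so it can be read on its own; ScriptQueue and its method names are hypothetical, the runner argument stands in for something like script -> webView.loadUrl(script), and it assumes every call happens on the UI thread (it is not thread safe).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

class ScriptQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Consumer<String> runner; // executes one script, e.g. via loadUrl
    private boolean pageReady = false;

    ScriptQueue(Consumer<String> runner) {
        this.runner = runner;
    }

    // Call when a new page load begins; scripts are queued again from here on.
    void onPageStarted() {
        pageReady = false;
    }

    // Run the script now if the page is loaded, otherwise keep it for later.
    void run(String script) {
        if (pageReady) {
            runner.accept(script);
        } else {
            pending.add(script);
        }
    }

    // Call from WebViewClient.onPageFinished; flushes queued scripts in order.
    void onPageFinished() {
        pageReady = true;
        String script;
        while ((script = pending.poll()) != null) {
            runner.accept(script);
        }
    }
}
```

With this in place, code that feeds new data to the page calls queue.run("javascript:...") at any time, and the script is only executed once the page is ready.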
This is how I set up my WebView:
wv = (CustomWebView) this.findViewById(R.id.webView1);
WebSettings wv_settings = wv.getSettings();
//this is where you fixed your code I guess
//And also by setting a WebClient to catch javascript's console messages :
wv.setWebChromeClient(new WebChromeClient() {
public boolean onConsoleMessage(ConsoleMessage cm) {
Log.d(TAG, cm.message() + " -- From line "
+ cm.lineNumber() + " of "
+ cm.sourceId() );
return true;
}
});
wv_settings.setDomStorageEnabled(true);
wv.setWebViewClient(new WebViewClient() {
@Override
public void onPageFinished(WebView view, String url) {
super.onPageFinished(view, url);
setTitle(view.getTitle());
//do your stuff ...
}
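@Override
public boolean shouldOverrideUrlLoading(WebView view, String url) {
if (url.startsWith("file"))
{
// Keep local assets in this WebView.
return false;
}
// Assumed fallback (the snippet had no return here): do not load other URLs in this WebView.
return true;
}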
});
//wv.setWebViewClient(new HelpClient(this));//
wv.clearCache(true);
wv.clearHistory();
wv_settings.setJavaScriptEnabled(true);//XSS vulnerable
wv_settings.setJavaScriptCanOpenWindowsAutomatically(true);
wv.loadUrl("file:///android_asset/connect.php.html");
NOTE this line wv.setWebChromeClient(new WebChromeClient());
In API level 19 (Android 4.4 KitKat), the WebView engine switched from Android's own WebKit to a Chromium-based implementation, with almost all of the original WebView APIs wrapped around their Chromium counterparts.
This is the method that gives the error (BindingManagerImpl.java), from Chromium source:
@Override
public void determinedVisibility(int pid) {
ManagedConnection managedConnection;
synchronized (mManagedConnections) {
managedConnection = mManagedConnections.get(pid);
}
if (managedConnection == null) {
Log.w(TAG, "Cannot call determinedVisibility() - never saw a connection for the pid: "
+ "%d", pid);
return;
}
It's a rendering warning from content.
You can dig around forever in that source code; it might be nice to see where the method determinedVisibility (in BindingManagerImpl.java) is called from (the suffix "Impl" stands for implementation).
Hope this helps ;O)
This usually pops up when you are overriding the method shouldOverrideUrlLoading().
From my use of WebView in earlier apps, it comes from content the WebView tries to render that is caught by the above method and, in turn, ignored.
I see this a lot when the websites that I load attempt to load scripts outside of the allowed domain.
I'd like to detect when my WebView loads a certain page, for example, an incorrect login page. I've tried using onLoadResource and shouldOverrideUrlLoading, but I can't get either to work, and I'm thinking a better way would be to parse the HTML whenever the WebView starts loading a page and, if a certain string is found within the HTML, do whatever is needed.
Is there a method to do this? I've tried using TagSoup, but I have no clue how to relate it into my webview. Here's what my code looks like now:
String fullpost = "pass=" + passwordt + "&user=" + usernamet + "&uuid=" + UUID;
String url = "mydomain.com";
mWebview.postUrl(url, EncodingUtils.getBytes(fullpost, "BASE64"));
mWebview.setWebViewClient(new WebViewClient() {
public void onPageFinished(WebView mWebview, String url) {
String webUrl = mWebview.getUrl();
if (webUrl.contains("/loginf")) {
MainActivity.this.mWebview.stopLoading();
MainActivity.this.setContentView(R.layout.preweb);
}
}
});
Basically, the postUrl is initiated from a user click on a button in a layout, and that's what starts the WebView, and then I call setContentView to the layout that contains the webview.
From there, if the login info is correct, the web page goes to XXX, and if it's incorrect, it goes to YYY. So I want to detect immediately (and on every page load from there on out) whether YYY is loaded, and then //domagic. Hope that makes sense. Because the redirect from url to XXX or YYY is automatic and not initiated by the user, shouldOverrideUrlLoading doesn't work, and I can't figure out how to use onLoadResource, so I'm completely lost.
My current thought is loading everything in a separate thread and then using the WebView to display the content (that way I can parse the HTML), but I'm not sure how that'd work or even how to do it.
Anyone have any ideas or suggestions?
I think I've read a way to get a text string of a webview's content. Then, you could use jsoup to parse it. [neh, don't even need jsoup; just indexOf string check]
I'll suggest that you do consider handling the login with an HTTP client. It gives you flexibility, and seems the more proper way to go. I've been using the loopj library for HTTP get and post requests. It allows for simpler code. For me, anyway, a relative Android newbie. Here's some code from my project, to get you thinking. I've left out stuff like progress bar, and cookie management.
private void loginFool() {
String urlString = "http://www.example.com/login/";
// username & password
RequestParams params = new RequestParams();
params.put("username", username.getText().toString());
params.put("password", password.getText().toString());
// send the request
loopjClient.post(urlString, params, new TextHttpResponseHandler() {
@Override
public void onStart() {
// called before request is started
//System.err.println("Starting...");
}
@Override
public void onSuccess(int statusCode, Header[] headers, String responseString) {
// called when response HTTP status is "200 OK"
// see if 'success' was a failed login...
int idx = responseString.indexOf("Please try again!");
if(idx > -1) {
makeMyToast("Sorry, login failed!");
}
// or actual success-ful login
else {
// manage cookies here
// put extractData in separate thread
final String responseStr = responseString;
new Thread(new Runnable() {
public void run(){
extractData(responseStr);
selectData(defaultPrefs.getInt("xyz_display_section", 0));
// start the next activity
Intent intent = new Intent(MainActivity.this, PageViewActivity.class);
startActivity(intent);
finish();
}
}).start();
}
}
@Override
public void onFailure(int statusCode, Header[] headers, String responseString, Throwable throwable) {
// called when response HTTP status is "4XX" (eg. 401, 403, 404)
makeMyToast("Whoops, network error!");
}
@Override
public void onFinish() {
// done
}
});
}
You can see that, in the response handler's onSuccess callback, I can test for a string, to see if the login failed, and, in the onFailure callback, I give a network error message.
I'm not experienced enough to know what percent of web servers this type of post login works on.
The loopj client receives and manages cookies. If you will be accessing pages from the site via a webview you need to copy cookies from the loopj client, over to the webview. I cobbled code from a few online posts, to do that:
// get cookies from the generic http session, and copy them to the webview
CookieSyncManager.createInstance(getApplicationContext());
CookieManager.getInstance().removeAllCookie();
CookieManager cookieManager = CookieManager.getInstance();
List<Cookie> cookies = xyzCookieStore.getCookies();
for (Cookie eachCookie : cookies) {
String cookieString = eachCookie.getName() + "=" + eachCookie.getValue();
cookieManager.setCookie("http://www.example.com", cookieString);
//System.err.println(">>>>> " + "cookie: " + cookieString);
}
CookieSyncManager.getInstance().sync();
// holy crap, it worked; I am automatically logged in, in the webview
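The loop above copies only each cookie's name and value. CookieManager.setCookie takes its value in the format of a Set-Cookie header, so the domain and path can be carried over as well. A minimal sketch of a helper that builds such a string follows; CookieStrings and toSetCookie are hypothetical names, and whether your site needs these attributes is an assumption to verify.

```java
class CookieStrings {
    // Build a Set-Cookie style string; domain and path are optional (may be null).
    static String toSetCookie(String name, String value, String domain, String path) {
        StringBuilder sb = new StringBuilder();
        sb.append(name).append('=').append(value);
        if (domain != null) {
            sb.append("; domain=").append(domain);
        }
        if (path != null) {
            sb.append("; path=").append(path);
        }
        return sb.toString();
    }
}
```

In the loop, the cookie string could then be built with CookieStrings.toSetCookie(eachCookie.getName(), eachCookie.getValue(), eachCookie.getDomain(), eachCookie.getPath()).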
EDIT: And, I should have included the class variable definitions and initializations:
private AsyncHttpClient loopjClient = new AsyncHttpClient();
private PersistentCookieStore xyzCookieStore;
xyzCookieStore = new PersistentCookieStore(this);
loopjClient.setCookieStore(Utility.xyzCookieStore);
I have been able to get content out of a WebView using JavaScript and the loadUrl() method, having specified an interface that is called from a JavaScript string injected into the WebView. The problem is that this only works for me when the loadUrl() call is made in the onPageFinished() method of the WebView client. What I want to do is get the content out of the WebView when the content is already loaded. The WebView is in an activity instrumentation test case, and I can, for instance, use the findAll() method and that works fine. For some reason I cannot use loadUrl() and get the desired behaviour (which is injecting JavaScript and getting content out of the WebView with the help of an interface).
Please help.
Thanks
Pawel
EDIT:
Just adding code to show what I am doing exactly:
Yes I understand that but my problem is that I am trying to do it within a test case this way:
public void testWebView() throws Exception {
solo.sleep(3000); // wait for views to load on the screen
WebView a=null;
ArrayList<View> views = solo.getCurrentViews(); // I am using solo object to get views for the screen currently loaded
for(View s:views)
{
if (s instanceof WebView)
{
a = (WebView)s; // this is where I get my WebView
}
}
Instrumentation inst = getInstrumentation();
inst.runOnMainSync(new Runnable()
{
public void run()
{
int d =a.findAll("something"); // this method runs fine on the object and i get the desired result
WebSettings settings = a.getSettings();
settings.setJavaScriptEnabled(true);
a.loadUrl("javascript:document.location = document.getElementById('google').getAttribute('href')"); // this javascript is never executed and that is my problem
}
});
}
You can inject javascript in a loaded page much the same way you can do it in desktop browsers - via inline javascript entered into navigation bar.
Bind some Java object so that it can be called from Javascript with WebView:
addJavascriptInterface(javaObjectExposed, "JSname")
Force execute javascript within an existing page by
webView.loadUrl("javascript:window.JSname.passData('some data from page');");
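When the argument comes from a variable, quotes inside it can break the injected script. The following sketch builds the javascript: string with the argument escaped as a single-quoted JavaScript literal. JsUrl, quote and call are hypothetical helper names, and only backslashes, single quotes and line breaks are handled here; on API level 19 and later, WebView.evaluateJavascript is an alternative to loadUrl for running a script.

```java
class JsUrl {
    // Wrap s in single quotes, escaping characters that would end the literal early.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }

    // Build a javascript: URL that calls function with one string argument.
    static String call(String function, String argument) {
        return "javascript:" + function + "(" + quote(argument) + ");";
    }
}
```

For example, webView.loadUrl(JsUrl.call("window.JSname.passData", text)) passes text to the bound object without hand-written quoting.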
I need for a server side process to be able to produce the same view of the HTML dom for a web page as a web browser (I am aware that the dom representation is browser specific, so don't mind a non cross browser solution).
I need to be able to work my way back to a user selection on a web page at a later date. Since there is no firm relationship between the raw HTML for a page, and the Dom that a browser constructs, this is proving very difficult to say the least!
My thinking is now that if I can produce the same view of the document in a server side process, then I may be able to achieve this.
Does anyone have experience of this?
Thanks
OK, different angle.
What about using the WebBrowser Control?
As far as I know, there's nothing preventing a web application from adding a reference to the System.Windows.Forms assembly and using it.
A bit of a long shot, but IMO worth trying!
OK... for what it's worth, I was able to successfully use the WebBrowser control (yes, the one from System.Windows.Forms) to load a remote page and iterate over its DOM freely.
The obstacles I ran into and got past are listed below.
Full code, which for the sake of example shows all images in the remote page:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Threading;
using System.Reflection;
using System.Windows.Forms;
using System.Text;
namespace TestZone
{
public partial class _Default : System.Web.UI.Page
{
private bool waiting = false;
private WebBrowser browser = null;
protected void Page_Load(object sender, EventArgs e)
{
Thread thread = new Thread(new ParameterizedThreadStart(LoadRemotePage));
thread.SetApartmentState(ApartmentState.STA);
waiting = true;
thread.Start(this);
while (waiting)
{
Thread.Sleep(10);
}
}
private void LoadRemotePage(object sender)
{
try
{
browser = new WebBrowser();
browser.Tag = sender;
browser.Navigate("http://stackoverflow.com/questions/4082249/in-a-net-application-is-it-possible-to-get-a-representation-of-the-dom-as-a-web/4085520");
browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(browser_DocumentCompleted);
while (browser.ReadyState != WebBrowserReadyState.Complete)
System.Windows.Forms.Application.DoEvents();
browser.Dispose();
}
catch (Exception ex)
{
litDebug.Text = "Error while initializing browser control: " + ex.ToString().Replace("\n", "<br />");
(sender as _Default).waiting = false;
}
finally
{
}
//hgcDebug.GetType().InvokeMember("InnerHtml", BindingFlags.SetProperty, null, hgcDebug, new object[] { "done" });
}
void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
try
{
HtmlElementCollection collection = browser.Document.GetElementsByTagName("img");
StringBuilder sb = new StringBuilder();
sb.AppendFormat("Total of {0} images:<br />", collection.Count);
for (int i = 0; i < collection.Count; i++)
sb.AppendFormat("name: {0}, src: {1}<br />", collection[i].GetAttribute("name"), collection[i].GetAttribute("src"));
litDebug.Text = sb.ToString();
}
catch (Exception ex)
{
litDebug.Text = "Error while analyzing remote page: " + ex.ToString().Replace("\n", "<br />");
}
finally
{
((sender as WebBrowser).Tag as _Default).waiting = false;
}
}
}
}
Bumps along the way, if anyone is curious:
Exception while creating the WebBrowser control: the thread was in the wrong state. Fixed by moving the code to a new thread and explicitly setting its ApartmentState to STA.
The Document property of the WebBrowser was null. The first step of the fix was to use the DocumentCompleted event instead of trying to access the Document right after navigating.
Still no luck, though: DocumentCompleted never fired. To fix that I added the loop that waits until ReadyState is Complete. Done and working, but..
With all this done, changing the literal from within the new thread had no effect on the actual GUI, so I had to wait in the main thread until everything was done.
Hope this will come in handy someday for someone, if not for the OP here. :)
The best you can achieve is to use WebRequest to read the raw response (the HTML output) of the page and, assuming it's valid XHTML, load it into an XmlReader; then you have a kind of DOM at hand, or at least the nodes.
I've previously used an HTML parsing library called SgmlReader, which worked well for getting HTML tag soup into a workable DOM. I would be surprised if it always produces a DOM identical to what a browser would produce though.