I am trying to scrape a page to get data for web analytics. I'm on an ecommerce site and have made a dummy purchase. There's a transaction ID on the page, but the HTML surrounding it is not ideal for scraping: the ID sits in a tag with no classes, IDs, or other useful attributes, and neither its parent nor its grandparent has any either.
So I wanted to view the source and Ctrl+F the transaction ID "123456" to see if it exists anywhere else in the DOM.
But when I view source, I get a "confirm form submission" page and don't get to see the HTML behind the page.
Adding the JavaScript tag too, in case there's a magical way of searching through all global variables for the value "123456". If I found the ecommerce data in a global variable object, it would be far more convenient than scraping the HTML, which in this case has few attributes to drill down into.
You can just save the whole page to disk as an HTML file. In Chrome, press Ctrl+S, choose a destination, and then open the saved file in a text editor.
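As for searching the global variables: below is a minimal sketch of one way to do it from the browser console. The function name findValuePaths, its parameters, and the depth limit are made up for illustration. It walks an object graph breadth-first and reports the property paths whose string or number value contains the ID. Whether the site exposes the transaction data in a global at all (for example in a dataLayer array) is an assumption you would have to check.

```javascript
// Hypothetical helper: collect the property paths under `root` whose
// string/number value contains `needle`. `rootName` is only used for labels.
function findValuePaths(root, needle, rootName = "window", maxDepth = 6) {
  const target = String(needle);
  const seen = new Set(); // guards against cycles such as window.self
  const hits = [];
  const queue = [{ value: root, path: rootName, depth: 0 }];

  while (queue.length > 0) {
    const { value, path, depth } = queue.shift();
    const isContainer =
      value !== null && (typeof value === "object" || typeof value === "function");
    if (!isContainer || seen.has(value) || depth > maxDepth) continue;
    seen.add(value);

    let keys;
    try {
      keys = Object.keys(value); // can throw on cross-origin frames
    } catch (e) {
      continue;
    }

    for (const key of keys) {
      let child;
      try {
        child = value[key]; // some getters throw when read
      } catch (e) {
        continue;
      }
      const childPath = `${path}.${key}`;
      const isScalar = typeof child === "string" || typeof child === "number";
      if (isScalar && String(child).includes(target)) {
        hits.push(childPath);
      } else {
        queue.push({ value: child, path: childPath, depth: depth + 1 });
      }
    }
  }
  return hits;
}
```

In the console you would call findValuePaths(window, "123456") on the confirmation page. The depth limit keeps the walk from wandering through the whole DOM, so raise it if nothing turns up.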
I am trying to create an editor. My site has several input fields, and a separately hosted HTML page, rendered in an iframe, contains a title, a subtitle, and a few other pieces of content. When the user types in the title input field, I want to update the title in the page rendered in the iframe. What is the proper way to set up this scenario?
I tried manipulating it via DOM IDs, but that is not very scalable and is also laggy.
I am currently looking into WebSockets.
I would assign every user an ID, which is stored alongside their changes in a database. The iframe would display a PHP document that loads the values from the database and displays them as intended. On every change, you simply reload the PHP document with the corresponding ID.
In this case, the question has already been answered here: Loading a string if HTML into an iframe using JavaScript
You can do it with
// escape() is deprecated and mangles non-ASCII text; use encodeURIComponent()
document.getElementById('iframe').src =
    "data:text/html;charset=utf-8," + encodeURIComponent(html);
I have a web page where the user enters their details. On clicking Submit, I invoke a service through an AJAX call to save them in the database. I then capture the response from the service, which is written using Java REST web services, and display it on the page. The message displayed is "Details saved successfully". It is read from a properties file in my Java service layer and passed back to the web page. I now need to change the font color of the returned text.
I tried adding <font color="red">Details saved successfully</font> to the properties file, and that same text is passed back from the service layer. On the web page, however, the font color is not applied; the entire text, HTML tags included, is displayed as-is. In the JS file I use $scope.status=response string, and in the HTML page I render it as <div id="test">{{status}}</div>
Is there any way to have the HTML tag rendered when it is passed as a string from the properties file, through the service layer, to the JS page? My goal is to leave the existing HTML code unchanged and have the tag rendered when it is read from the properties file and passed as a string from the services.
Thanks.
Letting users submit data containing HTML that is rendered when displayed can be a security risk. Users can be very creative with the HTML they "inject" into your application. There are a lot of sites where you can find more information about this vulnerability (cross-site scripting, or XSS).
If you need to add some markup, you might look at alternative markup formats, for example BBCode or Markdown.
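To illustrate the safer direction: a minimal sketch, assuming the service could return the message as plain text plus a separate severity level instead of raw HTML. The names escapeHtml, renderStatus, STATUS_LEVELS, and the status-* CSS classes are all hypothetical. The text is escaped, and only the level, checked against a fixed list, influences the markup.

```javascript
// Characters that must be escaped before text is placed inside HTML.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Turn arbitrary text into a string that is safe to embed in HTML.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Hypothetical allowlist of severity levels the page knows how to style.
const STATUS_LEVELS = ["success", "error", "info"];

// Build the markup for a message shaped like { text, level }.
// Unknown levels fall back to "info" rather than being trusted.
function renderStatus(message) {
  const level = STATUS_LEVELS.includes(message.level) ? message.level : "info";
  return `<span class="status-${level}">${escapeHtml(message.text)}</span>`;
}
```

The color then lives in the stylesheet (for example a rule for .status-error), and the properties file only needs to hold the text. Note that the {{status}} interpolation in the question already treats the value as plain text, which is why the tags show up literally.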
I want to scrape data from a website within my Java application. The data I want to collect is inside an HTML table element. I tried two different methods:
1. I tried to load the website with a BufferedReader into a String and collect the data from the String.
2. I tried to use Jsoup to get access to the exact HTML element, but it's empty.
It turns out that the table exists, but it stays empty until the user presses a button (labeled "load raw data"). I inspected the source code of the web page: when the user presses the button, a load_table() function is called, which loads the data into the table. The URL remains the same; otherwise I could have just used the other URL where the data is already loaded into the table. Does anyone have an idea how to scrape data from a website when the data only appears after the user presses a button once the page has loaded?
I'm not really a trained JavaScript coder, but I tried to look through the script that is executed after the user presses the button. It's kind of hard for me to understand, but I made a pastebin of the script, highlighting where I think the rows are added to the table, if that helps. The code for the button is:
Load raw data
The code I use to access the html element with Jsoup would be (all the child(x) methods are called on different div-elements to go deeper into the html-document until I finally reach the table-element):
Jsoup.connect(url).get().body().children().get(5).child(0).child(4).child(1).child(1);
As I stated above, the element is empty. I hope the description of my problem is detailed enough and somebody has at least an idea of what I'm trying to say. Sorry for my clumsy expressions. Not a native speaker.
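If you are familiar with Selenium WebDriver, you could use Selenium to load the page and click the button, and then pass the page source to BeautifulSoup (or, from Java, to Jsoup.parse).
html = driver.page_source  # driver is a Selenium WebDriver that has already loaded the page
soup = BeautifulSoup(html, "html.parser")
You could then parse the page this way, I guess.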
I am adding a help text model to my Rails app, which allows admins to create help text for pages on my site. The form that admins use to create help text requires text, location, and admin name. I would like the location attribute to have two input fields: the first for the particular page, and the second for the exact HTML element that the help text should appear in.
How can I generate a list of the elements on the specified page?
Once the location is selected, I will build a helper to convert that into an xpath. From there, the help text and icon(s) will be added to the page via JavaScript.
If this approach isn't the most effective, what might be another method? RubyGems, scripts, Rails magic, etc.
TL;DR: How can I generate a list of HTML elements knowing only the page?
I have a web page that draws data from several other local (same-origin) web pages. I collect the data from these other web pages using XMLHttpRequest. I then use the DOM to parse out the needed data from each page. There is one piece of data that I would like to include in each of the other local pages (i.e., in the DOM for each of the other local pages); however, I don't want that data visible when the web page is viewed. (Visible in the source code is OK, just not in the rendered HTML.) I can think of a couple of ways of doing that, but I am not enamored with any of them. I'm wondering what suggestions others might have. Thanks for any input.
Some options:
The hidden attribute:
All HTML elements may have the hidden content attribute
set. The hidden attribute is a boolean attribute. When
specified on an element, it indicates that the element is not yet, or
is no longer, directly relevant to the page's current state, or that
it is being used to declare content to be reused by other parts of the
page as opposed to being directly accessed by the user. User agents
should not render elements that have the hidden attribute
specified.
The template element
The template element is used to declare fragments of HTML that
can be cloned and inserted in the document by script.
In a rendering, the template element represents nothing.
Comments
Depending on the semantics, you can choose one or another. Or even combine them:
<template hidden><!-- Hidden data --></template>
Since you are fetching the pages through an AJAX request, it is under your control what gets shown and what does not.
Once you get the result through AJAX, you can keep it in your script for further manipulation, or put it in the HTML page itself inside a parent tag that is not visible, so that the end user cannot see it (except by viewing the source).
What's wrong with a simple hidden div?
<div id="hiddenData" style="display:none;">...</div>
To be honest, it seems like the way you are passing around data is kind of a hack already, so I don't see any real need to be fancy.
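For the hidden div variant, reading the value back out of a fetched page could look roughly like this. It is a browser-side sketch under stated assumptions: readHiddenData and loadHiddenData are made-up names, the element ID comes from the example above, and fetch is swapped in for the XMLHttpRequest mentioned in the question.

```javascript
// Return the trimmed text of the element with the given id, or null if it
// is missing. display:none does not affect textContent, so hidden data is
// still readable from the parsed document.
function readHiddenData(doc, id) {
  const el = doc.getElementById(id);
  return el ? el.textContent.trim() : null;
}

// Browser-only: fetch a same-origin page, parse it into a detached
// Document, and pull the hidden value out of it.
async function loadHiddenData(url, id) {
  const response = await fetch(url);
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, "text/html");
  return readHiddenData(doc, id);
}
```

A call such as loadHiddenData("/other-page.html", "hiddenData") would resolve to the div's text; the URL here is only a placeholder. If you go with a template element instead, the data lives in the element's content fragment, so it would need to be read from there rather than from textContent.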