I'm trying to figure out how to get all the elements of an HTML page. For example, if I load this Google search, I'll see this result:
Looking at the source code for that particular section of the page, I saw this:
<a href="https://www.macworld.com/article/3331839/iphone-2019-rumors-everything-you-need-to-know.html" onmousedown="return rwt(this,'','','','38','AOvVaw07dY5FgPEzcYsd8enm-9gs','','2ahUKEwicoNi4yPjhAhVdCTQIHVxICj4QFjAlegQIABAB','','',event)">
<h3 class="LC20lb">iPhone 2019 rumors: Everything you need to know | Macworld</h3><br><div class="TbwUpd">
<cite class="iUh30">https://www.macworld.com/.../iphone-2019-rumors-everything-you-need-to-know.ht...</cite></div></a>
But if I use document.documentElement.innerHTML, I see this:
<div class="g"><h3 class="r">
<a href="/url?q=https://www.macworld.com/article/3331839/iphone-2019-rumors-everything-you-need-to-know.html&sa=U&ved=0ahUKEwiU__rUy_jhAhWIHzQIHTrGBzIQFghLMAo&usg=AOvVaw2C3PdwxIaeNuukMVSwC-5g">
<b>iPhone 2019</b> rumors: Everything you need to know | Macworld</a>
</h3><div class="s"><div class="hJND5c" style="margin-bottom:2px">
My question: why is there a difference between the source code and the output from document.documentElement.innerHTML?
Also, it looks like this when using JavaScript:
<a href="https://www.macworld.com/article/3331839/iphone-2019-rumors-everything-you-need-to-know.html" onmousedown="return rwt(this,'','','','38','AOvVaw07dY5FgPEzcYsd8enm-9gs','','2ahUKEwicoNi4yPjhAhVdCTQIHVxICj4QFjAlegQIABAB','','',event)">
<h3 class="LC20lb">iPhone 2019 rumors: Everything you need to know | Macworld</h3><br><div class="TbwUpd">
<cite class="iUh30">https://www.macworld.com/.../iphone-2019-rumors-everything-you-need-to-know.ht...</cite></div></a>
I wasn't able to reproduce your problem; in my case the source code was exactly the same as document.documentElement.innerHTML. So I don't really know why you see this difference in this particular example.
Even so, the source code of a page frequently has little to do with the document's innerHTML.
innerHTML can differ from the source in at least 2 ways:
It shows the result of JS execution, which might modify the DOM.
For example, here you have the source code of a sample React App.
<body>
<div id="app"></div>
<script src="main.js"></script>
</body>
And here's the output it produces:
In this case, the source is completely different from the innerHTML since we generate new things with JS.
However, it would also be different if we modified existing markup with JS, and it's probable that this is the case with Google's results page.
innerHTML shows what the browser has parsed, not the content that was sent from the server.
For example, if I sent bad HTML from the server like this:
<head>...</head>
<!DOCTYPE html>
<html lang="en">
<body>...</body>
</html>
Then document.documentElement.innerHTML will output a cleaned-up version of my bad markup, like this:
<head>...</head>
<body>...</body>
This one probably doesn't affect Google's page, but it is also worth considering when you build something on the basis of the document's innerHTML.
So, if what you really want is the source code of the page, you probably just need to fetch it from the server directly and get the text out of the response.
In client-side JS you can do so with the fetch API. The only problem is that you might not be able to do so from an origin other than google.com, since you might run into CORS policy restrictions.
On the server side, you will certainly have a tool for making a GET request. So you might use something like http.get in Node.js or file_get_contents() in PHP.
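As a rough illustration of the fetch approach described above, the sketch below requests a page and returns the response body as text, that is, the markup as the server sent it rather than the parsed DOM. The function name fetchPageSource and its optional fetchImpl parameter are made up for this example. It assumes an environment with a global fetch (a modern browser, or Node.js 18 or later), and in a browser it is still subject to the CORS restrictions mentioned above.

```javascript
// Hypothetical helper: returns the raw markup the server sends for `url`.
// `fetchImpl` is an illustrative parameter so a stub can be swapped in;
// by default the global fetch is used.
async function fetchPageSource(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  // response.text() yields the body as a string, before any HTML parsing
  // or script execution has touched it.
  return response.text();
}
```

It could be called as fetchPageSource("https://example.com/").then(console.log), where the URL is only a placeholder.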
Google's HTML tags are way more complex than what you're looking for, but I assume you want something like this:
// Select every search result block and print its full markup.
const x = document.querySelectorAll('.g');
x.forEach(function (element) {
  console.log(element.outerHTML);
});
To me, it looks like a certain part of the page is generated dynamically by a script on the client, and that this script is hosted on a server other than Google's. Therefore you might run into CORS policy problems. So document.documentElement.innerHTML will only show the static elements of the page that were originally written to be shown on the client, leaving out the elements that the script generated dynamically.
The returned HTML or XML fragment is generated based on the current contents of the element, so the markup and formatting of the returned fragment is likely not to match the original page markup.
for more detail
Related
I have an aspx page with an IFrame in it. I need to set a string as the IFrame content and have IFrame (or the client, or the server) interpret and display the string as if it were an actual HTML page.
The string would be something like
<!DOCTYPE html><html><head><meta charset="utf-8" /><title></title></head><body><div>Some Text Here</div></body></html>
I can't write the string to an actual HTML page because I'm afraid if two users happened to hit the same page at the same time, they might get each other's content due to latency, etc...
I don't speak Perl/PHP and I've never used JSON, Ajax, JQuery, or anything fancier than HTML, Javascript, .asp, .aspx, CSS, VB/C#, and XML (willing to learn, but time does not permit right now).
Does anyone have any ideas?
Any help would be gratefully accepted and highly appreciated!
You could use the srcdoc property, like below:
document.getElementById("myFrame").srcdoc = "<p>Some new content inside the iframe!</p>";
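To tie this back to the question, here is a minimal sketch of building a per-user document string in memory and handing it to the iframe through srcdoc, so nothing is written to a shared file. The helper names escapeHtml, buildPage and showInFrame are invented for this example, and the frame is passed in as a parameter rather than looked up by id.

```javascript
// Escape text so it is displayed literally instead of being parsed as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: wrap the given text in a complete HTML document.
function buildPage(text) {
  return '<!DOCTYPE html><html><head><meta charset="utf-8" /><title></title></head>' +
    '<body><div>' + escapeHtml(text) + '</div></body></html>';
}

// Assign the document string to the frame; the browser renders it as a page.
function showInFrame(frame, text) {
  frame.srcdoc = buildPage(text);
}
```

In a page this might be called as showInFrame(document.getElementById("myFrame"), "Some Text Here"). If the string is already a complete HTML document, as in the question, it can be assigned to srcdoc directly without the escaping step.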
My task at hand is to download a file through VBA. The problem is that the page is mostly generated via JavaScript. Sorry that I cannot just share the page with you, because I don't own it, but I will try to make things as clear as possible.
The HTML from the IE source viewer looks similar to this:
<head>
css stuff
jscript link
more jscript links
more css stuff
</head>
<body>
divs and links and so on
<div magic inside that div that shows on browser but not in source code></div>
</body>
I very much believe that the JavaScript generates an iframe and fills it with HTML code.
Do you think that it is possible to retrieve the finished iframe from the JavaScript? Because I can literally see the HTML code when I use the Chrome DOM explorer, but I cannot fetch the HTML data in VBA. It drives me crazy that I don't understand this :D
Thank you for your time
What you have described looks like typical DHTML that could be generated by JS after an XHR request. So open the web page in, e.g., Chrome and check the Network tab. After the target content has been generated on the page, you will see all requests on that tab. Examine them; usually all the data you need to retrieve is shown there (note that some conversion of the data may be necessary). If you find it, you may just make an XHR with the same parameters to retrieve the result. Alternatively, you can retrieve the generated HTML content by accessing the DOM, if the iframe is same-origin, as was mentioned above.
On my website I have a menu button that goes on every page and also a comments section. Instead of copying and pasting this into every single HTML file I created a JavaScript file that creates all of the HTML via the document.write function. This works fine, but as it is getting more and more lengthy and complicated it is also getting harder and harder to find elements and attributes since they are all squashed in one line.
I want to know if there is a better way to do this, because I feel this is not the correct way given how messy and disorganized it is.
I am just using a JavaScript file. It would look something like this:
document.write('<div id="id"></div>');
but with a lot more HTML.
I would suggest templating with a server side language such as PHP. This will allow you to format your different sections so that they are easily readable. Also it will work even if JavaScript is turned off on the browser.
<html>
<head></head>
<?php require("menu.php"); ?>
<!-- HTML body content -->
<?php require("comments.php"); ?>
</html>
If you want to stick with a client side approach then you can just put your menu and comments into separate html files and use jQuery to load it using
$('#Menu').load('menu.html');
$('#CommentSection').load('comments.html');
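If jQuery is not otherwise needed, the same idea can be sketched with the fetch API in place of jQuery's .load(). The function name loadFragment and its injectable fetchImpl parameter are assumptions made for illustration; menu.html is the file from the example above.

```javascript
// Hypothetical helper: fetch an HTML fragment and place it inside `element`.
// This uses fetch instead of jQuery's .load(). Note that script elements
// inserted through innerHTML are not executed.
async function loadFragment(element, url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Could not load ${url}: status ${response.status}`);
  }
  element.innerHTML = await response.text();
  return element;
}
```

In a page this might be called as loadFragment(document.querySelector("#Menu"), "menu.html").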
You can use jQuery.
Put your button in its own .html file, like button.html, and load it with .load() in the main HTML file.
$('#WhereYouWantItID').load('whatfolder/button.html');
This will load the button.html file into a specific target on your page.
I am looking to have a chunk of HTML containing a heading which I want to reuse across multiple HTML pages.
I have tried the EXACT code but it doesn't seem to work. It displays the script as text in the page rather than running it.
index.html:
<html>
<script src="https://code.jquery.com/jquery-1.10.2.js"></script>
<h1>This is a test</h1>
<script>$("#content").load("commonContent.html");</script>
</html>
commonContent.html:
<div id="content"><h2>If this shows my test worked!</h2></div>
Any suggestions would be much appreciated. Please note I am a newbie to JavaScript!
You need:
To include the jQuery library since your script depends on it
To put your script inside a <script> element
To put an element in the document in which you will load the content (you are trying to use one with id="content" but no such element exists).
I'd recommend using a server side or build time template system instead though. They are more reliable and better food for search engines.
For this type of thing I usually use php like this:
<?php include("youhtmlfile.html"); ?>
It is an advantage because this way you don't have to worry about browser support.
I have partial control of a web page whereby I can enter snippets of code at various places, but I cannot remove any preexisting code.
There is a script reference midway through the page
<script src="/unwanted.js" type="text/javascript"></script>
but I do not want the script to load. I cannot access the unwanted.js file. Is there any way I can use JavaScript executing above this reference to cause the unwanted.js file not to load?
Edit: To answer the comments asking what and why:
I'm setting up a Stack Exchange site and the WMD* js file loads halfway down the page. SE will allow you to insert HTML in various parts of the page - so you can have your custom header and footer etc. I want to override the standard WMD code with my own version of it.
I can get around the problem by just loading javascript after the original WMD script loads and replacing the functions with my own - but it would be nice not to have such a large chunk of JS load needlessly.
*WMD = the mark down editor used here at SO, and on the SE sites.
In short, you can't. Even if there is a hack, it would heavily depend on the way browsers parse the HTML and load the scripts and hence wouldn't be compatible with all browsers.
Please tell us exactly what you can and cannot do, and (preferably; this sounds fascinating) why.
If you can, try inserting <!-- before the script include and --> afterwards to comment it out.
Alternatively, look through the script file and see if there's any way that you could break it or nullify its effects. (This would depend entirely on the script itself; if you want more specific advice, please post more details or, preferably, the script itself.)
Could you start an HTML comment above it and end below it in another block?
What do the contents of unwanted.js look like?
You can remove a script from the DOM after it is called by using something simple such as:
const s = document.getElementById("my_script");
s.parentNode.removeChild(s);
This removes the script element from the page but will not take the file out of the user's cache. Note, however, that anything the script has already run or defined stays in effect, so this does not prevent it from loading.
Basically you can't unless you have access to the page content before you render it.
If you can manipulate the HTML before you send it off to the browser, you can write a regular expression that will match the desired piece of code, and remove it.
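A minimal sketch of that regular-expression approach, written as a plain string function that could run wherever the HTML is assembled before it is sent. The function names escapeRegExp and removeScriptTag are made up for this example, and it assumes the tag is a simple script element with a quoted src attribute and an empty body, as in the question. A regular expression is not a general HTML parser, so unusual markup may slip past it.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: remove <script ... src="<src>" ...></script> from `html`.
function removeScriptTag(html, src) {
  const pattern = new RegExp(
    '<script\\b[^>]*\\bsrc=["\']' + escapeRegExp(src) + '["\'][^>]*>\\s*</script>',
    'gi'
  );
  return html.replace(pattern, '');
}
```

For the tag in the question, the call would be removeScriptTag(html, "/unwanted.js"), where html is the page markup as a string.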