I have been trying to save the source code of a section of a webpage using PHP. When I extract the content of the whole webpage, the source code order is preserved, but when I try to get part of the document using:
$dom = new DOMDocument;
$dom->loadHTML($webpage);
$xpath = new DOMXPath($dom);
$query_tag = "//div[contains(@class, 'class-name')]";
$result = $dom->saveHTML($xpath->query($query_tag)->item(0));
the script tag gets messed up. Until now, this is the only website where this issue has occurred. Are there some limitations of the saveHTML function that I am not aware of?
This is what I should be receiving:
<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
$('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onClick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96" /></a>');
$('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
$('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
$('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);
});</script> </div>
This is what I actually get:
<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></script>
</div>');
$('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onclick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96"></a>');
$('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
$('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
$('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);
}); </div>
In case you missed it, the ending script tag has moved up a few lines.
Just to be clear, I am not talking about rendered HTML. I am talking about the actual source code that I get after making the request. Any help on how to resolve this issue will be appreciated.
I know that the function saveHTML is causing the issue because when I echo the whole page through PHP, every tag is in the right place.
First of all, your code should be triggering a good bunch of warnings like these:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity
Warning: DOMDocument::loadHTML(): Unexpected end tag : strong in Entity
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity
This is to be expected with in-the-wild HTML (and this page's code is not particularly bad), but you haven't even mentioned it, which makes me suspect that you might not have error reporting enabled on your development box.
Additionally, the page has huge amounts of JavaScript and DOMDocument is just an HTML parser.
With that, we can get a clear picture of what's happening. Since DOMDocument is not a full-fledged browser, it doesn't understand JavaScript code. That means that it detects the <script> tag but it doesn't handle its contents as JavaScript; it merely looks for a closing tag, and the first one it finds is this:
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
^^^^^^
It doesn't know that it's a JavaScript string and should be ignored. Instead, it thinks the wrong tag is being closed so it attempts to fix what's technically invalid HTML and adds the missing </script> tag.
For this precise reason, the <script>...</script> tag set has traditionally been written this way:
<script type="text/javascript"><!--
var foo = '<p>Escaped end tag<\/p>';
//--></script>
... so user agents that are unaware of JavaScript can safely ignore the whole tag (hey, it's nothing but a good old HTML comment). However, nowadays it's almost universally considered bad practice because "all browsers understand JavaScript".
Final note: the DOM extension is probably aware of the <script> tag and knows it isn't allowed to have other tags inside. That explains why inner opening tags are not considered.
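The escaped end tag shown earlier (`<\/p>`) can also be produced programmatically when you are the one writing the page. As a minimal sketch (the helper name `escapeForInlineScript` is made up for illustration and is not part of any library), the following replaces every `</` with `<\/` in a string before it is embedded in an inline script, so that a parser which merely scans for a closing tag does not find one inside the string:

```javascript
// Hypothetical helper: escape "</" so an HTML parser scanning a <script>
// block for an end tag does not stop inside a JavaScript string literal.
function escapeForInlineScript(source) {
  // Inside a JavaScript string literal, "<\/" means the same as "</".
  return source.replace(/<\//g, "<\\/");
}

const markup = '<div class="sponsored-category-logo"></div>';
console.log(escapeForInlineScript(markup));
// <div class="sponsored-category-logo"><\/div>
```

This only helps when you control the markup. For a third-party page like the one in the question, the inline script cannot be changed, so the parser will still stumble over it.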
Hi, I have a question about automating the selection of certain content in an HTML file. If we save a webpage as HTML only, we get the HTML along with stylesheets and JavaScript code. However, I only want to extract the HTML between <div class='post-content' itemprop='articleBody'> and </div>, and then create a new HTML file that contains the extracted HTML. Is there a way to do it? Example code is below:
<html>
<script src='.....'>
</script>
<style>
...
</style>
<div class='header-outer'>
<div class='header-title'>
<div class='post-content' itemprop='articleBody'>
<p>content we want</p>
</div>
</div></div>
<div class='footer'>
</div>
</html>
While I'm typing, I'm thinking about JavaScript, which seems to be able to manipulate HTML DOM elements. Is Ruby able to do that? Can I generate a new, clean HTML file that only contains the content between <div class='post-content' itemprop='articleBody'> and </div> by using JavaScript or Ruby? As for how to write the actual code, I don't have a clue.
Does anybody have any idea about it? Thank you so much!
I'm not quite sure what you're asking, but I'll take a crack at it.
Can Ruby modify the DOM on a webpage?
Short answer, no. Browsers don't know how to run Ruby. They do know how to run JavaScript, so that's what's usually used for real-time DOM manipulation.
Can I generate a new clean html
Yes? At the end of the day, HTML is just a specifically formatted string. If you want to download the source from that page and find everything in the <div class='post-content' itemprop='articleBody'> tag, there are a couple of ways to go about that. The best is probably the nokogiri gem, which is a ruby HTML parser. You'll be able to feed it a string (from a file or otherwise) that represents the old page and strip out what you want. Doing that would look something like this:
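require 'nokogiri'
require 'open-uri'
# Fetch and parse the page (URI.open is provided by open-uri)
page = Nokogiri::HTML(URI.open("https://googleblog.blogspot.com"))
# Take the first element with class "post-content" and read its text;
# use .inner_html instead of .text to keep the markup
text = page.css('.post-content')[0].text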
I believe that gives you the text you're looking for. More detailed nokogiri instructions can be found here.
You want to use a regular expression. For example:
//The "m" flag is multi-line mode; [\s\S] is what lets the match span line breaks
var regEx = /<div class='post-content' itemprop='articleBody'>([\s\S]*?)<\/div>/m;
//The content (you'll put the JavaScript at the bottom of the page)
var bodyCode = document.body.innerHTML;
var match = bodyCode.match( regEx );
//Prints to the console
console.dir( match );
You can see this in action here: https://regex101.com/r/kJ5kW6/1
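One caveat worth checking before relying on the regular expression: the non-greedy group stops at the first closing </div> it meets. The sketch below runs the same pattern against two made-up input strings (both are assumptions for illustration, as is the helper name `extractPostContent`) to show that it works for flat content but cuts nested content short:

```javascript
// Non-greedy pattern from the answer above: capture everything up to the
// FIRST closing </div> after the opening post-content tag.
const postContentPattern =
  /<div class='post-content' itemprop='articleBody'>([\s\S]*?)<\/div>/;

// Returns the captured inner HTML, or null when the opening tag is absent.
function extractPostContent(html) {
  const match = html.match(postContentPattern);
  return match ? match[1] : null;
}

const flat =
  "<div class='post-content' itemprop='articleBody'><p>content we want</p></div>";
const nested =
  "<div class='post-content' itemprop='articleBody'><div>a</div><p>b</p></div>";

console.log(extractPostContent(flat));   // <p>content we want</p>
console.log(extractPostContent(nested)); // <div>a  (cut off at the first </div>)
```

If the post content can contain its own div elements, a real HTML parser is probably the safer choice.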
So this is actually a very tricky concept to portray so here is my attempt.
I am utilizing an HTML form template in LANDesk Service Desk - tool is irrelevant but important to note that there is back-end code that I cannot touch that is generating HTML.
So basically, the tool is pulling data from a back-end database containing a list of objects. It then inputs this data into an HTML form template that I have created using variables as placeholders for the objects. The HTML is then built on the fly with however many objects are in the database. Thus, I have no way of accessing the head (which means I am limited to native JS and inline CSS).
My template looks like this...
<div class="my-template">
<a class="my-template my-link">My Link</a>
</div>
<script>
var myLinks = document.getElementsByClassName('my-link');
for (var i = 0 ; i < myLinks.length ; i++) {
myLinks[i].style.display = "none";
}
</script>
When I view the source on the loaded page it looks something like this...
<body>
<!--misc. page stuff-->
<!--First Item-->
<div class="auto-create">
<div class="my-template">
<a class="my-template my-link">My-Link</a>
</div>
</div>
<!--Second Item-->
<div class="auto-create">
<div class="my-template">
<a class="my-template my-link">My-Link</a>
</div>
</div>
</body>
All of the elements are formatted the way I want them to be, except the last element on each page. I have determined that this is because the tool runs the script each time it passes an object through the template. The issue is that there is a broken default button that they place at the bottom of each object. (This is why I have the script changing the style to display: none; I should have mentioned this earlier.) Basically, I want to delay the execution of the script until not only the object has been run through the template but the entire page has loaded. However, I can't seem to get the last button to go away.
I know that this is a lot of poorly written words trying to form an explanation. I really think this is impossible, but I am convinced there has to be a way. (Also, the company isn't providing us with any help in finding a workaround, so I basically had to MacGyver this one.)
I am using the LiquidSlider framework and in each tab there is lots of HTML. So I decided to put the HTML into separate .html files to make the main page index.html cleaner.
Here is my HTML:
..
<head>
.. <!-- Import jQuery, slider files, etc. -->
<!-- Import HTML from other files into divs -->
<script>
$(function(){
$("#about-content").load("about.html");
$("#other-content").load("other.html");
$("#help-content").load("help.html");
$("#contact-content").load("contact.html");
});
</script>
</head>
<body>
<section id="navigation">
..
</section>
<div class="liquid-slider" id="main-slider">
<!-- About -->
<div>
<h2 class="title">about</h2>
<div id="about-content"></div>
</div>
<!-- Other -->
<div>
<h2 class="title">other</h2>
<div id="other-content"></div>
</div>
<!-- Help -->
<div>
<h2 class="title">help</h2>
<div id="help-content"></div>
</div>
<!-- Contact -->
<div>
<h2 class="title">contact</h2>
<div id="contact-content"></div>
</div>
</div>
<section id="footer">
..
</section>
</body>
..
So when the document is loaded, theoretically the HTML would be loaded in via the .load calls right? It seems to work fine, until it gets to the very last tab (contact), where it just fails to load any content..
Odd right? I tried moving the divs around to see if it was a problem with my html files, but the last element always fails to load. Then I tried adding another tab, and the last two fail to load. This leads me to believe there is an upper-limit to the number of .load calls, capped at 3?
Anyone have any ideas or see any obvious problems? Or even suggest any better ways of achieving the same thing?
Thanks.
RTM, there's nothing there about a max number of calls, but there's a lot of information (and examples) of what kinds of callbacks you can use, which might just help you to diagnose the problem itself, for example:
$("#contact-content").load("contact.html", function( response, status, xhr )
{
if ( status == "error" )
{
var msg = "Sorry but there was an error: ";
console.log(xhr);//<-- check this
$( "#error" ).html( msg + xhr.status + " " + xhr.statusText );
}
});
As an alternative, just go for the old-school $.get call, since you don't seem to be passing any data to the server:
$.get( "contact.html", function( data )
{
$("#contact-content").html(data);
});
Another thing to consider might be: given that you're using liquidSlider, I take it not all of the content is visible from the off. Why not register a click handler that .load's that content when the user actually clicks something? That does away with that series of load calls. Perhaps it's a concurrency issue of sorts. By that I mean: browsers restrict the number of concurrent AJAX requests that can be made. Perhaps you're running into that restriction, and have to wait for the requests to be completed? It's a long shot, but you never know. If you want to, check your browser here
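The click-handler idea can be sketched as follows. This is only an illustration under assumptions: `makeLazyLoader` is a made-up helper, and the loader passed to it here just records what it would request, so the snippet runs without a browser. In the actual page the loader might instead call something like $("#" + name + "-content").load(name + ".html").

```javascript
// Hypothetical helper: wraps a loader so each tab's content is requested
// only once, the first time that tab is opened.
function makeLazyLoader(loadFn) {
  const requested = new Set();
  return function loadOnce(name) {
    if (requested.has(name)) {
      return false; // already requested, nothing to do
    }
    requested.add(name);
    loadFn(name);
    return true;
  };
}

// Stand-in loader that only records which file would be fetched.
const calls = [];
const loadTab = makeLazyLoader(function (name) {
  calls.push(name + ".html");
});

loadTab("contact");
loadTab("contact"); // a second click does not trigger another request
loadTab("help");
// calls now holds "contact.html" and "help.html", one entry per tab
```

Wiring `loadTab` to each tab's click event would spread the requests out over time instead of firing them all at page load.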
But either way, using JS to fetch parts of the content dynamically is all well and good, but remember that I can switch off JS support in my browser. Or that, if your JS contains a syntax error, the script execution grinds to a halt, leaving me with a (half) empty page to gaze at.
Just using any server-side scripting language seems to me to be a better fit:
//index.php -- using PHP as an example
<div id="contact-content"><?php include 'contact.html'; ?></div>
After this gets processed by PHP, the response from the server will be a fully-fledged html page, that doesn't require any JS-on-the-fly loading. It'll almost certainly perform better, and still allows for cleaner html code on your server...
Server Side Includes would seem to me to be a better way of achieving the same thing. Use the right tool for the right job and all that.
<script>
// Each target div is named "<page>-content", e.g. #about-content
var pages = ['about', 'other', 'contact', 'help'];
pages.forEach(function (page) {
$('#' + page + '-content').load(page + '.html');
});
</script>
Describing a scenario:
I am going through the code mentioned below. Basically, I am trying to figure out how to program it so that
when a user clicks on the "Use Template" button, the template gets inserted into an editor.
Page 1:
There are a lot of templates present.
When a user clicks on the "Use Template" button, the template gets inserted into an editor that is present on
the next page (Page 2).
Please find the code snippet below for the first two templates I am going through:
<div id="templatesWrap">
<div class="template" data-templatelocation="templateone" data-templatename="Template ONE" data-templateid="" >
<div class="templateContainer">
<span>
<a href="https://app.abc.com/pqr/core/compose/message/create?token=c1564e8e3cd11bc4t546b587jan31&sMessageTemplateId=templateone&sHubId=&goalComplete=200" title="Use Template">
<img class="thumbnail" src="templatefiles/thumbnail_010.jpg" alt="templateone">
</a>
</span>
<div class="templateName">Template ONE</div>
<p>
Use Template
</p>
</div>
</div>
<div class="template" data-templatelocation="templatetwo" data-templatename="Template TWO" data-templateid="" >
<div class="templateContainer">
<span>
<a href="https://app.abc.com/pqr/core/compose/message/create?token=c1564e8e3cd11bc4t546b587jan31&sMessageTemplateId=templatetwo&sHubId=&goalComplete=200" title="Use Template">
<img class="thumbnail" src="templatefiles/thumbnail_011.jpg" alt="templatetwo">
</a>
</span>
<div class="templateName">Template TWO</div>
<p>
Use Template
</p>
</div>
</div>
And so on ....
How does the link "https://app.abc.com/pqr/core/compose/message/create?token=c1564e8e3cd11bc4t546b587jan31&sMessageTemplateId=templatetwo&sHubId=&goalComplete=200" insert the template into the editor, which is located on the next page? I haven't understood the token part and the many IDs present in the link,
which I think are the reason the template gets inserted.
Has anyone come across such a link before? Please advise.
Thanks
MORE CLARIFICATIONS:
Thanks for your answer. It did help me somewhat. I have a few more questions:
Basically, I am using TinyMCE 4.0.8 version as my editor. The templates, I am using are from here:
https://github.com/mailchimp/email-blueprints/blob/master/templates/2col-1-2-leftsidebar.html
Some questions based on Tivie's answer:
1) As you can see in the code for "2col-1-2-leftsidebar.html", it's not wrapped in <div> tags, unlike your template. Do you think that I can still
use it under the name "2col-1-2-leftsidebar.html"?
2) I believe that, for explanation purposes, you have included
<div contenteditable="true" id="myEditor">replaced stuff</div>
and
<button id="btn">Load TPL</button>
<script>
$("#btn").click(function() {
$("#myEditor").load("template.html");
});
</script>
in the same page. Am I right? ( I understand you were trying to make an educated guess here, hence
just asking :) )
In my case, I have a separate page, where I have written code for buttons just like you wrote in editor.html like the following:
<button id="btn">Load TPL</button>. My button is defined inside <div class="templateContainer">.
Also, my templates are defined in a separate folder. So, I will have to grab the content (the HTML template) from
that folder and then insert it into the TinyMCE 4.0.8 editor. (It looks like a two-step process.) Could you elaborate
on how I should proceed here?
More Questions As of Dec 27
I have modified my code for the template as follows:
<div class="templateName">Template ONE</div>
<p>
Use Template
</p>
Please note, I have added an additional id attribute for the following purpose.
If I go by the answer in Tivie's post, is the following correct?
<script>
$("#temp1").click(function() {
$("#sTextBody").load("FolderURL/template.html");
});
</script>
My editor is defined like the following on Page 2 (Editor Page).
<div class="field">
<textarea id="sTextBody" name="sTextBody" style="width:948px; max-width:948px; height: 70%"></textarea>
</div>
I am confused, like, the script tag I have defined is in Page 1 where I have defined all the template related code
and the Page 2(Editor) page is a different page. It's simply taking me to Editor page (Page 2) and hence not working.
Please advise where I am wrong.
Thanks
MORE QUESTIONS AS of Jan 2
The problem I am facing is as follows. Basically, for the first template, I have the following code.
Code Snippet #1, where the "Use Template" button is present:
<div class="templateName">Template ONE</div>
<p>
Use Template
</p>
And the function suggested in the answer is as follows:
Code Snippet #2 where Editor is present:
<script>
$("#temp1").click(function() {
$("#sTextBody").load("FolderURL/template.html");
});
</script>
Since I believe the user first needs to reach the page where the editor is located after clicking the "Use Template" button, I have defined Code Snippet #1 on Page 1 and have defined Code Snippet #2 and <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script> as the very first two script tags on Page 2 (Editor Page). But still, when I click the "Use Template" button on Page 1, it just takes me to the next page and does not load the template into the editor.
Am I doing something wrong here? Please advise.
P.S. The problem, I feel, is that the click function on Page 2 is somehow not getting activated by the button with the temp1 id on Page 1.
Thanks
Well, one can only guess without having access to the page itself (and its source code). I can, however, make an educated guess about how it works.
The URL parameters follow a pattern. First you have a token that is the same in all templates. This probably means the token does not have any relevance to the template mechanism itself. Maybe it's an authentication token or something. Not relevant though.
Then you have the template identification (templateOne, templateTwo, etc...) followed by a HubId that is empty. Lastly you have a goalComplete=200 which might correspond to the HTTP success code 200 (OK).
Based on this, my guess would be that they are probably using AJAX in the background to fetch those templates from the server. Then, via JavaScript, those templates are inserted into the editor box itself.
Using jQuery, something like this is trivial. Here's an example:
template.html
<div>
<h1>TEST</h1>
<span>This is a template</span>
</div>
editor.html
<!DOCTYPE HTML>
<html>
<head>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
</head>
<body>
<div contenteditable="true" id="myEditor">
replaced stuff
</div>
<button id="btn">Load TPL</button>
<script>
$("#btn").click(function() {
$("#myEditor").load("template.html");
});
</script>
</body>
</html>
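To connect this example with the link in the question: if the editor page receives the template name in its query string, it could read that value and decide which file to load. The sketch below is a guess, not a description of how that site actually works. The helper `templatePathFromQuery`, the "templates/" folder and the ".html" suffix are all assumptions made for illustration; only the URL and its parameter names come from the question.

```javascript
// Parse the link from the question with the standard URL API.
const link = new URL(
  "https://app.abc.com/pqr/core/compose/message/create" +
    "?token=c1564e8e3cd11bc4t546b587jan31" +
    "&sMessageTemplateId=templatetwo&sHubId=&goalComplete=200"
);

// Hypothetical helper: map the template id in the query string to a file.
// The "templates/" folder and ".html" suffix are assumptions.
function templatePathFromQuery(searchParams) {
  const id = searchParams.get("sMessageTemplateId");
  return id ? "templates/" + id + ".html" : null;
}

console.log(link.searchParams.get("sHubId") === "");   // true (present but empty)
console.log(templatePathFromQuery(link.searchParams)); // templates/templatetwo.html
```

On the editor page itself, `new URLSearchParams(window.location.search)` could be passed to the helper, and the resulting path handed to whatever loads the template.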
Edit:
1) Well, since those templates are quite complex and include CSS, you probably want to keep them separated from your editor page (or the template's CSS will mess up your page's CSS).
However, since you're using TinyMCE, it comes with a template manager built in, so you probably want to use that. Check this link here http://www.tinymce.com/wiki.php/Configuration:templates for documentation.
2) I think 1 answers your question but, just in case, my method above works for any page in any directory, provided it lives on the same domain. Example:
<script>
$("#btn").click(function() {
$("#myEditor").load("someDirectory/template.html");
});
</script>
I recommend you check this page for the specifics on using TinyMCE: http://www.tinymce.com/wiki.php/Configuration:templates
EDIT2:
Let me explain the above code:
$("#btn").click(function() { });
This basically tells the browser to run the code inside the curly braces when you click the element with id="btn".
$("#myEditor").load("someDirectory/template.html");
This is an AJAX request (check the documentation here). It grabs the contents of someDirectory/template.html and places them inside the element whose id="myEditor".
The code I want to run upon triggering the redirect should go to another web page (or local HTML file; either is possible in this situation), but also pass some JavaScript to run on that page, as that page works by embedding content in iframes. This is needed so that I can specify the content of the iframe upon redirect.
To put it more simply: how can I make it so that when you go to website.com/about/, it redirects to website.com/ with the content for /about/ loaded in an iframe?
<head>
<title> CodeBundle </title>
<script>
function home() {document.getElementById("loadedpage").src="home.html";}
function about() {document.getElementById("loadedpage").src="about.html";}
function reviews() {document.getElementById("loadedpage").src="reviews.html";}
function tutorials() {document.getElementById("loadedpage").src="tutorials.html";}
function blog() {document.getElementById("loadedpage").src="blog.html";}
</script>
</head>
<body>
<header>
<br><hr><font size=27><a onClick="home();">Code Bundle</a></font><br><hr>
<div ALIGN=RIGHT>
<font size=6> | <a onClick="about();">About</a> | <a onClick="reviews();">Reviews</a> | <a onClick="tutorials();">Tutorials</a> | <a onClick="blog();">Blog</a> |</font> <hr>
</div>
<iframe id="loadedpage" src=home.html width=100% height=100% frameborder=0>Iframe Failed to Load</iframe>
</header>
</body>
</body>
this is my index.html for website.com/
I want to write a page so that when you go to website.com/about/ it redirects to website.com/ running the javascript function about(), so as to display the about page.
You will have to pass some data using either a query parameter or a fragment identifier.
See:
http://en.wikipedia.org/wiki/Query_string
http://en.wikipedia.org/wiki/Fragment_identifier
In either case you will have something present in the url and it will look like:
http://www.example.com/?page=about
or:
http://www.example.com/#about
or - this would be best:
http://www.example.com/#!/about
because it could let you make the website crawlable. See:
Making AJAX Applications Crawlable
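As a rough sketch of the fragment-identifier approach, the index page could translate the fragment into the file to show in the iframe. The helper name `pageFromHash` and the fallback to the home page are assumptions made for illustration; the page names are taken from the functions in the question.

```javascript
// Hypothetical helper: turn a fragment such as "#about" or "#!/about"
// into the file to show in the iframe; unknown values fall back to home.
const pages = ["home", "about", "reviews", "tutorials", "blog"];

function pageFromHash(hash) {
  // Strip the leading "#", an optional "!" and an optional "/".
  const name = hash.replace(/^#!?\/?/, "");
  return (pages.indexOf(name) !== -1 ? name : "home") + ".html";
}

console.log(pageFromHash("#!/about")); // about.html
console.log(pageFromHash("#blog"));    // blog.html
console.log(pageFromHash(""));         // home.html
```

In index.html this could be used once the page has loaded, for example by setting document.getElementById("loadedpage").src to pageFromHash(window.location.hash). The page at website.com/about/ would then only need to redirect to website.com/#!/about.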
Now after reading your comment to the answer by theredled that you "add new content regularly and loading that in embeded iframes is quicker than writing new html every time" I have to ask this: aren't you using a templating system in your website?
Keep in mind that making AJAX-loaded content and using fragment identifiers to display the right content is not done because the page creation is easier (it isn't) but because the user experience is faster and more responsive. See for example the website for the SoundJS library:
http://www.createjs.com/#!/SoundJS
When you click the link to PreloadJS at the top you go to:
http://www.createjs.com/#!/PreloadJS
The content is reloaded and the address bar changes, but the page is actually not reloaded. (You can see that it is properly crawlable because it shows in the results if you google for PreloadJS.)
Pass the content via a user session?
However, that's quite a dirty approach; maybe you already know that :)