How to automate extracting certain code from an HTML file? - javascript

Hi, I have a question about automating the selection of certain content in an HTML file. If we save a webpage as HTML only, we get the HTML along with stylesheets and JavaScript. However, I only want to extract the HTML between <div class='post-content' itemprop='articleBody'> and its closing </div>, and then create a new HTML file containing just the extracted HTML. Is there a way to do this? Example code is below:
<html>
<script src='.....'>
</script>
<style>
...
</style>
<div class='header-outer'>
<div class='header-title'>
<div class='post-content' itemprop='articleBody'>
<p>content we want</p>
</div>
</div></div>
<div class='footer'>
</div>
</html>
While typing this, I thought of JavaScript, which can manipulate HTML DOM elements. Is Ruby able to do that too? Can I generate a new, clean HTML file that contains only the content between <div class='post-content' itemprop='articleBody'> and its closing </div> using JavaScript or Ruby? I don't have a clue how to write the actual code.
Does anybody have any ideas? Thank you so much!

I'm not quite sure what you're asking, but I'll take a crack at it.
Can Ruby modify the DOM on a webpage?
Short answer: no. Browsers don't know how to run Ruby. They do know how to run JavaScript, so that's what is usually used for real-time DOM manipulation.
Can I generate a new clean html
Yes? At the end of the day, HTML is just a specifically formatted string. If you want to download the source from that page and find everything in the <div class='post-content' itemprop='articleBody'> tag, there are a couple of ways to go about that. The best is probably the nokogiri gem, which is a ruby HTML parser. You'll be able to feed it a string (from a file or otherwise) that represents the old page and strip out what you want. Doing that would look something like this:
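require 'nokogiri'
require 'open-uri' # provides URI.open for fetching a URL
page = Nokogiri::HTML(URI.open("https://googleblog.blogspot.com"))
# first element with the class "post-content"
post = page.css('.post-content')[0]
text = post.text       # text only, markup stripped
html = post.inner_html # markup between the opening and closing tags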
I believe that gives you the text you're looking for. More detailed nokogiri instructions can be found here.

You want to use a regular expression. For example:
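// [\s\S]*? matches across line breaks; the "m" flag only changes ^ and $, so it is not needed.
// innerHTML serializes attribute values with double quotes, so accept either quote style.
var regEx = /<div class=["']post-content["'] itemprop=["']articleBody["']>([\s\S]*?)<\/div>/;
// The page content (put this script at the bottom of the page so the body is already parsed)
var bodyCode = document.body.innerHTML;
var match = bodyCode.match( regEx );
// match[1] holds the captured content up to the first </div>; match is null if nothing was found
console.dir( match );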
You can see this in action here: https://regex101.com/r/kJ5kW6/1
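One caveat with the regular expression above: the lazy match stops at the first </div>, so it cuts the content short if the post contains nested divs. Here is a minimal sketch that counts opening and closing div tags to find the matching close, then wraps the result in a new page. The helper names extractDivContent and wrapInPage and the sample string are made up for illustration, and this is still text matching, so a real HTML parser will be more robust on messy pages.

```javascript
// Hypothetical helper: returns the inner HTML of the first <div> whose opening
// tag matches `openTagPattern` (a regex without the "g" flag), counting nested
// <div> tags so the matching </div> is found rather than the first one.
function extractDivContent(html, openTagPattern) {
  const open = openTagPattern.exec(html);
  if (!open) return null;
  const start = open.index + open[0].length;
  const tag = /<div\b[^>]*>|<\/div\s*>/gi;
  tag.lastIndex = start; // begin scanning right after the opening tag
  let depth = 1;
  let m;
  while ((m = tag.exec(html)) !== null) {
    depth += m[0][1] === '/' ? -1 : 1; // closing tag lowers depth, opening raises it
    if (depth === 0) return html.slice(start, m.index);
  }
  return null; // unbalanced markup
}

// Hypothetical helper: wraps a fragment in a bare HTML document.
function wrapInPage(content) {
  return '<!DOCTYPE html>\n<html>\n<body>\n' + content + '\n</body>\n</html>\n';
}

// Sample input, loosely based on the markup in the question.
const sample =
  "<div class='header-outer'><div class='post-content' itemprop='articleBody'>" +
  "<p>content we want</p><div>nested</div></div></div>";
const postOpen = /<div\s+class=["']post-content["'][^>]*>/i;
const content = extractDivContent(sample, postOpen);
console.log(wrapInPage(content));
```

The string returned by wrapInPage could then be written to a file, for example with Node's fs module, or offered as a download from the browser.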

Related

I can't find a way to delete an HTML script from a Header element

I am a beginner in HTML, CSS and JavaScript, and I am trying to remove a plain inline script (one with no source file) from a page via a WebExtension, probably with JavaScript, but I can't find a solution to my problem. I have looked everywhere on Stack Exchange and other similar blogs and forums, but nothing worked.
HTML Code:
<script>
const SUPPORT_BASE = "https://support.aternos.org/hc/";
const SUPPORT_ARTICLES = {"countdown":360026950972,"uploadworld":360027235751,"connect":360026805072,"size":360035144691,"adblock":360034748092,"email":360039498492,"pending":360041686352,"domains":360044623491,"deprecated":360033339752,"backups":360044837012};
</script>
I'm lost. Please help me. The complete HTML element is:
<header class="header" style="">
<script>
const SUPPORT_BASE = "https://support.aternos.org/hc/";
const SUPPORT_ARTICLES = {"countdown":360026950972,"uploadworld":360027235751,"connect":360026805072,"size":360035144691,"adblock":360034748092,"email":360039498492,"pending":360041686352,"domains":360044623491,"deprecated":360033339752,"backups":360044837012};
</script>
</header>
Please help me.
The website is https://aternos.org/server/
Please define what you mean by "remove". If you just want to remove the script from the HTML file, why not simply delete the script text?
If you are trying to write code that removes it while the page is running, please provide the JS as well. It won't be hard with some DOM manipulation.
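As a rough idea of what that DOM manipulation could look like in a content script, here is a minimal sketch. The function name removeInlineScripts is made up, and the header.header selector is taken from the markup in the question. Keep in mind that removing a script element after it has already run does not undo what it did: the constants it declared stay defined. To stop it from running at all, the extension would probably have to act before the page is parsed, for example with "run_at": "document_start" in the manifest.

```javascript
// Removes every inline <script> (one without a src attribute) inside `root`
// and returns how many were removed. `root` is any element or document.
function removeInlineScripts(root) {
  const scripts = Array.from(root.querySelectorAll('script:not([src])'));
  scripts.forEach((script) => script.remove());
  return scripts.length;
}

// In a content script this runs against the live page; the guard lets the
// file load outside a browser without throwing.
if (typeof document !== 'undefined') {
  const header = document.querySelector('header.header');
  if (header) {
    console.log('removed', removeInlineScripts(header), 'inline script(s)');
  }
}
```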

Embed object to display PDF showing up as invisible

I am using JavaScript to insert an embed tag into my HTML to display a PDF. The element takes up space but doesn't display anything and is essentially invisible. I tried putting it inside an object tag as well, but that doesn't work either.
// Invisible
var pdfObj = document.createElement("embed");
pdfObj.setAttribute("src", "./test.pdf");
content?.appendChild(pdfObj);
The pdf file does exist and when I simply put this code in the HTML it displays fine but doesn't when I use javascript.
// works fine
<div class="content" id="main_div">
<embed src="../test.pdf" width="500" height="375" />
</div>
Here is how it shows in the HTML when I use JavaScript: (screenshot of the embed element)
Thanks!
First, welcome to Stack Overflow. The path in your JavaScript does not match the one in your HTML.
`pdfObj.setAttribute("src", "./test.pdf");` this is your code.
`pdfObj.setAttribute("src", "../test.pdf");` this is what it needs to be.
Also, make sure the element is created first, then call setAttribute on it, and only append it after that.
If this doesn't work, let me know and I'll help as much as I can.
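Putting that together, a minimal sketch might look like the following. It assumes the container is the main_div element from the question and copies the width and height from the working HTML version; the function name appendPdfEmbed is made up. Note that a relative src resolves against the page's URL, not against the location of the script file, so the JavaScript should use the same path that works in the HTML.

```javascript
// Creates an <embed> for a PDF, sets its attributes, then appends it.
// `doc` is the document to create the element in; `container` receives it.
function appendPdfEmbed(doc, container, src, width, height) {
  const embed = doc.createElement('embed');
  embed.setAttribute('src', src);
  embed.setAttribute('type', 'application/pdf');
  embed.setAttribute('width', String(width));
  embed.setAttribute('height', String(height));
  container.appendChild(embed); // append only after all attributes are set
  return embed;
}

// Only runs in a browser; same path and size as the HTML that works.
if (typeof document !== 'undefined') {
  const content = document.getElementById('main_div');
  if (content) {
    appendPdfEmbed(document, content, '../test.pdf', 500, 375);
  }
}
```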

Formatting dynamically formed HTML elements created after Script is run

So this is actually a very tricky concept to portray so here is my attempt.
I am utilizing an HTML form template in LANDesk Service Desk - tool is irrelevant but important to note that there is back-end code that I cannot touch that is generating HTML.
So basically, the tool is pulling data from a back-end database containing a list of objects. It then inputs this data into an HTML form template that I have created, using variables as placeholders for the objects. The HTML is then built on the fly with however many objects are in the database. Thus, I have no way of accessing the head (which means I am limited to native JS and inline CSS).
My template looks like this...
<div class="my-template">
<a class="my-template my-link">My Link</a>
</div>
<script>
var myLinks = document.getElementsByClassName('my-link');
for (var i = 0 ; i < myLinks.length ; i++) {
myLinks[i].style.display = "none";
}
</script>
When I view the source on the loaded page it looks something like this...
<body>
<!--misc. page stuff-->
<!--First Item-->
<div class="auto-create">
<div class="my-template">
<a class="my-template my-link">My-Link</a>
</div>
</div>
<!--Second Item-->
<div class="auto-create">
<div class="my-template">
<a class="my-template my-link">My-Link</a>
</div>
</div>
</body>
All of the elements are formatted the way I want them to be...besides the last element on each page. I have determined that this is because each time the tool is running the object through the template, it is running the script. The issue is, there is a stupid default button that they place at the bottom of each object that is broken. (This is why I have the script changing the style to display: none..should have mentioned this earlier). Basically I want to delay the execution of the script until not only the object has been run through the template...but the entire page has loaded...but I can't seem to get the last button to go away.
I know that this is a lot of poorly written words trying to form an explanation. I half think this is impossible, but I am convinced there has to be a way. (Also, the company isn't providing us with any help in finding a workaround, so I basically had to MacGyver this one.)
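One way to delay the script until the entire page has loaded is to register the work on the window's load event instead of running it inline. This is a minimal sketch, assuming the broken button carries the my-link class as in the template above; hideAll is a made-up helper name. Because the template is repeated, the listener is registered once per item, which is harmless since hiding an already hidden element changes nothing. If the tool inserts the last button after the load event, this will not catch it and something like a MutationObserver would be needed instead.

```javascript
// Hides every element in an array-like collection and returns the count.
function hideAll(elements) {
  const list = Array.from(elements);
  for (const el of list) {
    el.style.display = 'none';
  }
  return list.length;
}

// Only runs in a browser. 'load' fires after the whole page has been parsed,
// so copies of the template further down the page are included as well.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
  window.addEventListener('load', function () {
    hideAll(document.getElementsByClassName('my-link'));
  });
}
```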

JavaScript: How should I generate a lot of HTML? [duplicate]

Possible Duplicate:
Is there a best practice for generating html with javascript
I want to generate large parts of a website with JavaScript.
The straightforward way is to form one large string containing all the HTML:
'<div>'
+ '<span>some text</span>'
+ '<form>'
+ '<input type="text" />'
...
But this gets quite annoying when one has to write a few hundred lines in this style. And the pain when such code has to be changed later on...
Can you think of an easier way?
Create snippets as templates, put them into an invisible <div>:
<div style="display: none">
<div id="template1">
<h2 class="header_identifyingClass">Hello from template</h2>
</div>
<div id="template2">
<span class="content">Blah blah</span>
</div>
</div>
Then find it,
document.getElementById("template1");
fill in its internal values (for example, find the inner elements with XPath or jQuery and set them using element.innerHTML = "Hello from new value"), and move or copy it to the visible part of the DOM.
Create multiple templates, and copy each one as many times as needed to generate many elements.
Don't forget to change the ID for copies to keep it working.
PS: I think I used this approach in the code of JUnitDiff project. But it's buried in XSLT which serves another purpose.
By far the best way to do this is to use some kind of JavaScript templating system. The reason why this is better than hiding HTML with CSS is that if (for example) someone has CSS disabled, they'll be able to see your templates, which is obviously not ideal.
With a templating system, you can put the templates in a <script> tag, meaning that they're totally hidden from everything except JavaScript.
My favourite is the jQuery templating system, mostly because jQuery is so ubiquitous these days. You can get it from here: http://api.jquery.com/category/plugins/templates/
An example (taken from the jQuery docs):
<ul id="movieList"></ul>
<!-- the template is in this script tag -->
<script id="movieTemplate" type="text/x-jquery-tmpl">
<li><b>${Name}</b> (${ReleaseYear})</li>
</script>
<!-- this script will fill out the template with the values you assign -->
<script type="text/javascript">
var movies = [
{ Name: "The Red Violin", ReleaseYear: "1998" },
{ Name: "Eyes Wide Shut", ReleaseYear: "1999" },
{ Name: "The Inheritance", ReleaseYear: "1976" }
];
// Render the template with the movies data and insert
// the rendered HTML under the "movieList" element
$( "#movieTemplate" ).tmpl( movies )
.appendTo( "#movieList" );
</script>
It's a simple example, but you could put all of the HTML you'd like to generate in the <script>, making it much more flexible (use the same HTML snippet for various jobs, just fill out the gaps), or even use many templates to build up a larger HTML snippet.
Use a dialect of JavaScript such as CoffeeScript. It has heredocs:
'''
<div>
<span>some text</span>
<form>
<input type="text" />
'''
If you need to throw in an occasional expression, you can use interpolations:
"""
<title>#{title}</title>
"""
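Plain JavaScript has had something similar since ES2015: template literals, which allow multi-line strings and ${...} interpolation. Here is a minimal sketch reusing the movie data from the jQuery answer above. The helper names escapeHtml and renderMovieList are made up for illustration, and the escaping is there because interpolated values go into the markup as-is.

```javascript
// Escapes the characters that have special meaning in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Builds the list markup with template literals instead of string concatenation.
function renderMovieList(movies) {
  const items = movies
    .map((m) => `  <li><b>${escapeHtml(m.Name)}</b> (${escapeHtml(m.ReleaseYear)})</li>`)
    .join('\n');
  return `<ul id="movieList">\n${items}\n</ul>`;
}

const movies = [
  { Name: 'The Red Violin', ReleaseYear: '1998' },
  { Name: 'Eyes Wide Shut', ReleaseYear: '1999' },
];
console.log(renderMovieList(movies));
```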
If it's static content that you're just adding to the page on a javascript event, you could consider simply having it in your main HTML page all along, but style with display:none;.
Then it's just a case of changing its style to make it appear on the page. Much easier.
Even if it's dynamic, you could use this technique: have the shell HTML content there hidden in your page, and populate the dynamic bits before making it visible.
hope that helps.

jQuery: Parse/Manipulate HTML without executing scripts

I'm loading some HTML via Ajax with this format:
<div id="div1">
... some content ...
</div>
<div id="div2">
...some content...
</div>
... etc.
I need to iterate over each div in the response and handle it separately. Having a separate string for the HTML content of each div mapped to the id would satisfy my requirements. However, the divs may contain script tags, which I need to preserve but not execute (they'll execute later when I stick the HTML into the document, so executing during parsing would be bad). My first thought was to do something like this:
// data being the result from $.get
// [\s\S] lets a match span line breaks; "g" replaces every script tag, not just the first
var clean = data.replace(/<script[\s\S]*?<\/script>/g,function() {
// insert some unique token, save the tag, put it back while I'm processing
});
$('<div/>').html(clean).children().each( /* ... process here ... */);
But I worry that some stupid dev is going to come along and put something like this in one of the divs:
<script> var foo = '</script>'; // ... </script>
Which would screw it all up. Not to mention, the whole thing feels like a hack to begin with. Does anyone know a better way?
EDIT: Here's the solution I've come up with:
var divSplitRegex = /(?:^|<\/div>)\s*<div\s+id="prefix-(.+?)">/g,
idReplacement = preDelimeter+'$1'+postDelimeter;
var r = data.replace(/<\/div>\s*$/,'').
replace(divSplitRegex,idReplacement).split(preDelimeter);
// $.each boxes `this` into a String object, which is always truthy, so test the value argument instead
$.each(r,function(i, chunk) {
if(chunk) {
callback.apply(null,chunk.split(postDelimeter));
}
});
Where preDelimeter and postDelimeter are just unique strings like "###I'd have to be an idiot to embed this string in my content unescaped because it would break everything###", and callback is a function expecting the div id and the div content. This only works because I know that the divs will have only an id attribute, and the id will have a special prefix. I suppose someone could put a div in their content with an id having the same prefix, and that would screw things up too.
So, I still don't love this solution. Anyone have a better one?
FYI, using an unescaped </script> in any JavaScript script causes this issue in a browser. Developers have to escape it anyway, so there is no excuse. So you can "trust" that it would break in any case.
<body>
<div>
<script>
alert('<script> tags </script> are not '+
'valid in regular old HTML without being escaped.');
</script>
</body>
See http://jsbin.com/itevu to see it break. :)
In some cases removing script tags results in invalid html:
<html>
<head>
</head>
<body>
<p>This should be
<script type="text/javascript">
document.writeln("<b");
</script>>bolded</b>.
</body>
</html>
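For the original question, one alternative to regular expressions is to let the browser parse the response with DOMParser. Scripts in a document created by DOMParser are not executed, and they are still present in each div's innerHTML, so they can run later when the HTML is inserted into the page. Here is a minimal sketch, assuming the response consists of top-level divs with ids as in the question; the function name splitDivsById is made up.

```javascript
// Parses `html` without executing its scripts and returns an object that
// maps each top-level div's id to its inner HTML (script tags preserved).
function splitDivsById(html, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, 'text/html');
  const byId = {};
  for (const el of Array.from(doc.body.children)) {
    if (el.tagName === 'DIV' && el.id) {
      byId[el.id] = el.innerHTML;
    }
  }
  return byId;
}

// Only runs where DOMParser exists (browsers). The sample string is made up.
if (typeof DOMParser !== 'undefined') {
  const parts = splitDivsById(
    '<div id="div1"><script>alert(1)<\/script></div><div id="div2">text</div>'
  );
  console.log(Object.keys(parts));
}
```

This still cannot rescue a script whose source contains an unescaped </script>, since, as noted above, that breaks in any HTML parser.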
