Modifying a JS browser extension's HTML parser to exclude images

I'll say up front that, although I have a programming background, I have no experience with any of the tools or languages I'm working with here. Sorry if I've misunderstood something simple or explained something unclearly.
I'm using an extension to generate EPUBs from website links. The extension has the option to customize the parsing logic from the options dialog. Doing so yields this code for the parser it's currently using:
var parser = new DOMParser();
var dom = parser.parseFromString(source, "text/html");
var new_link = null;
var subchaps = [];
// Wordpress content
var main_cont = dom.querySelector(".entry-content")
if(main_cont != null){
var ancs = main_cont.querySelectorAll("a");
ancs.forEach((element) => {
if(RegExp(/click here to read|read here|continue reading/i).test(element.innerText)){
new_link = helpers["link_fixer"](element.href, url);
} else if (RegExp(/chapter|part/i).test(element.innerText)) {
subchaps.push(helpers["link_fixer"](element.href, url))
}
});
}
if (new_link != null) {
var res = await fetch(new_link);
var r_txt = await res.text();
dom = parser.parseFromString(r_txt, "text/html");
var out = helpers["readability"](dom);
return {title: out.title, html: out.content};
} else if (subchaps.length > 0) {
var html = "";
for(var subc in subchaps){
console.log(subchaps[subc]);
var cres = await fetch(subchaps[subc]);
var c_txt = await cres.text();
var cdom = parser.parseFromString(c_txt, "text/html");
var out = helpers["readability"](cdom);
html += "<h1>"+out.title+"</h1>"+ out.content
}
return {title: title, html: html};
}
var out = helpers["readability"](dom);
return {title: out.title, html: out.content};
I've inspected this code and gathered that it handles three cases: two in which it needs to follow links deeper before parsing, and the simple case where it is already in the right place. The lion's share of the code deals with the first two cases, and it's largely the third that I'm interested in. Unfortunately, it appears that the second-to-last line is where the parsing actually happens:
var out = helpers["readability"](dom);
And this is a magic box to me. I can't for the life of me figure out what this is referencing.
I've searched the full file for a definition of 'helpers' or even 'readability' but come up blank. I was under the impression that the part I was editing was the 'readability' parser. I thought I'd be able to pop into the parser logic, add a line to exclude <img> nodes, and live happily ever after. What am I mistaken about? Or is what I want to do impossible, given what the extension lets me modify?
To be clear, I am not asking for a full guide on how to write parser logic. I considered just parsing and repackaging the document before the line in question, but I didn't want to write the same code 3 times, and I don't like that I can't tell what it's doing. I couldn't even begin to search for documentation, given that I can't find the definition in the first place. Even just explaining what that line does and pointing to any relevant documentation would be a great help.
Thanks in advance.
(And here is the full file, if you feel like verifying that I didn't miss anything.)
main_parser: |
var link = new URL(url);
var parser = new DOMParser();
var dom = parser.parseFromString(source, "text/html");
switch (link.hostname) {
case "www.novelupdates.com":
var paths = link.pathname.split("/");
if (paths.length > 1 && paths[1] == "series") {
return {page_type:"toc", parser:"chaps_nu"};
}
}
// Default to all links
return {page_type:"toc", parser:"chaps_all_links"};
toc_parsers:
chaps_nu:
name: Novel Update
code: |
var parser = new DOMParser();
var dom = parser.parseFromString(source, "text/html");
var chap_popup = dom.querySelector("#my_popupreading");
if (chap_popup == null) {
return []
}
var chap_lis = chap_popup.querySelectorAll("a");
var chaps = [];
chap_lis.forEach((element) => {
if (element.href.includes("extnu")) {
chaps.unshift({
url_title: element.innerText,
url: helpers["link_fixer"](element.href, url),
});
}
});
var tit = dom.querySelector(".seriestitlenu").innerText;
var desc = dom.querySelector("#editdescription").innerHTML;
var auth = dom.querySelector("#authtag").innerText;
var img = dom.querySelector(".serieseditimg > img");
if (img == null){
img = dom.querySelector(".seriesimg > img");
}
return {"chaps":chaps,
meta:{title:tit, description: desc, author: auth, cover: img.src, publisher: "Novel Update"}
};
chaps_name_search:
name: Chapter Links
code: |
var parser = new DOMParser();
var dom = parser.parseFromString(source, "text/html");
var ancs = dom.querySelectorAll("a");
var chaps = []
ancs.forEach((element) => {
if(RegExp(/chap|part/i).test(element.innerText)){
chaps.push({
url_title: element.innerText,
url: helpers["link_fixer"](element.href, url),
});
}
});
return {"chaps":chaps,
meta:{}
};
chaps_all_links:
name: All Links
code: |
var parser = new DOMParser();
var dom = parser.parseFromString(source, "text/html");
var ancs = dom.querySelectorAll("a");
var chaps = []
ancs.forEach((element) => {
chaps.push({
url_title: element.innerText,
url: helpers["link_fixer"](element.href, url),
});
});
return {"chaps":chaps,
meta:{}
};
chap_main_parser: |
var url = new URL(url);
var parser = new DOMParser();
var dom = parser.parseFromString(source, "text/html");
// Generic parser
return {chap_type: "chap", parser:"chaps_readability"};
chap_parsers:
chaps_readability:
name: Readability
code: |
var parser = new DOMParser();
var dom = parser.parseFromString(source, "text/html");
var new_link = null;
var subchaps = [];
// Wordpress content
var main_cont = dom.querySelector(".entry-content")
if(main_cont != null){
var ancs = main_cont.querySelectorAll("a");
ancs.forEach((element) => {
if(RegExp(/click here to read|read here|continue reading/i).test(element.innerText)){
new_link = helpers["link_fixer"](element.href, url);
} else if (RegExp(/chapter|part/i).test(element.innerText)) {
subchaps.push(helpers["link_fixer"](element.href, url))
}
});
}
if (new_link != null) {
var res = await fetch(new_link);
var r_txt = await res.text();
dom = parser.parseFromString(r_txt, "text/html");
var out = helpers["readability"](dom);
return {title: out.title, html: out.content};
} else if (subchaps.length > 0) {
var html = "";
for(var subc in subchaps){
console.log(subchaps[subc]);
var cres = await fetch(subchaps[subc]);
var c_txt = await cres.text();
var cdom = parser.parseFromString(c_txt, "text/html");
var out = helpers["readability"](cdom);
html += "<h1>"+out.title+"</h1>"+ out.content
}
return {title: title, html: html};
}
var out = helpers["readability"](dom);
return {title: out.title, html: out.content};
chaps_raw:
name: No Parse
code: |
return {title: title, html: source}
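
The preprocessing idea raised in the question (altering the document before the readability call) can be sketched without knowing what helpers["readability"] does internally. Since helpers is not defined anywhere in the file, it is presumably an object the extension supplies to this code at run time; that is an inference, not something the source confirms. The function name stripImages below is hypothetical, and the sketch assumes the readability helper accepts a Document that has already been modified.

```javascript
// Hypothetical helper (not part of the extension): remove every <img>
// element from a parsed document, or from any element inside it.
function stripImages(root) {
  // querySelectorAll returns a static list, so removing nodes while
  // iterating over it is safe.
  root.querySelectorAll("img").forEach(function (img) {
    img.remove();
  });
  return root;
}

// Intended use inside the parser code (assumption: the "readability"
// helper treats the modified document like any other):
//   var out = helpers["readability"](stripImages(dom));
//   return {title: out.title, html: out.content};
```

Defining the function once at the top of the parser code and wrapping each of the three helpers["readability"](...) calls with it avoids writing the same removal logic three times.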

Related

How to fix string saving as code to word?

I have the following code:
function downloadNotes() {
var information = document.getElementById("text").innerHTML;
var textToBLOB = new Blob([information], { type: 'text/plain' });
var sFileName = 'formData = document.doc';
var newLink = document.createElement("a");
newLink.download = sFileName;
if (window.webkitURL != null) {
newLink.href = window.webkitURL.createObjectURL(textToBLOB);
}
else {
newLink.href = window.URL.createObjectURL(textToBLOB);
newLink.style.display = "none";
document.body.appendChild(newLink);
}
newLink.click();
}
When I save my notes, the file is saved successfully as a Word document, but when I open it, it shows the markup all compressed together rather than the text output.

Change this line:
var information = document.getElementById("text").innerHTML;
To this:
var information = document.getElementById("text").innerText;
Your code was reading the HTML content of the element (innerHTML, tags included) instead of its text value (innerText). If this doesn't work, you may need to write code that strips the HTML tags out yourself.
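
If innerText is not an option and the tags have to be stripped by hand, a rough sketch could look like the following. The function name htmlToPlainText is hypothetical, and the approach is deliberately crude: it removes anything that looks like a tag and decodes only a handful of common entities, so it is a fallback for simple note text and not a general HTML-to-text converter.

```javascript
// Crude fallback (an assumption-laden sketch): turn simple HTML into text.
function htmlToPlainText(html) {
  return html
    .replace(/<br\s*\/?>/gi, "\n")      // line breaks become newlines
    .replace(/<\/(p|div)>/gi, "\n")     // closing block tags end a line
    .replace(/<[^>]*>/g, "")            // drop every remaining tag
    .replace(/&nbsp;/g, " ")            // decode a few common entities
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");            // decode &amp; last to avoid double decoding
}

var text = htmlToPlainText("<p>Hello <b>world</b></p><p>a &amp; b</p>");
console.log(JSON.stringify(text));
```

The result could then be passed to the Blob constructor in place of the innerHTML string.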

replace all values using javascript in an xml payload

I have an XML string in which I want to replace all values of specific tags using JavaScript.
This is the code:
function replaceDomainName (xmlPayload,domainId)
{
var oldDomain = '<DOMAIN_NAME>OOO';
var oldDomain2 = '<DomainName>OOO';
var newDomain = '<DOMAIN_NAME>'+domainId ;
var newDomain2 = '<DomainName>'+domainId ;
var xmlString = xmlPayload.toString();
var x = xmlString.replace(/oldDomain/g,newDomain)
x = x.replace(/oldDomain2/g,newDomain2)
console.log(x);
return x ;
}
When I try to invoke the function with the following XML, it throws an error:
<TransmissionHeader xmlns:tran="http://xmlns.oracle.com/apps/otm/TransmissionService" xmlns="">
<Version>20b</Version>
<TransmissionCreateDt>
<GLogDate>20200819124057</GLogDate>
<TZId>UTC</TZId>
<TZOffset>+00:00</TZOffset>
</TransmissionCreateDt>
<TransactionCount>1</TransactionCount>
<SenderHostName>https://xxx</SenderHostName>
<SenderSystemID>https:xxx</SenderSystemID>
<UserName>OOO</UserName>
<SenderTransmissionNo>404836</SenderTransmissionNo>
<ReferenceTransmissionNo>0</ReferenceTransmissionNo>
<GLogXMLElementName>PlannedShipment</GLogXMLElementName>
<NotifyInfo>
<ContactGid>
<Gid>
<DomainName>OOO</DomainName>
<Xid>SYS</Xid>
</Gid>
</ContactGid>
<ExternalSystemGid>
<Gid>
<DOMAIN_NAME>OOO</DOMAIN_NAME>
<Xid>IOT_SYSTEM</Xid>
</Gid>
</ExternalSystemGid>
</NotifyInfo>
</TransmissionHeader>
error: unknown: Unexpected token (14:23)

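Use a regex literal that contains the actual text rather than a variable name, and keep the tag name the same on both sides so the closing tag still matches:
x = x.replace(/<DOMAIN_NAME>OOO/g, '<DOMAIN_NAME>' + domainId)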
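
The pattern /oldDomain/g in the question matches the literal text "oldDomain", not the contents of the variable. When the text to match lives in a variable, the pattern has to be built with the RegExp constructor. The sketch below shows one way to do that; escapeRegExp is a helper introduced here for illustration, and the three-argument signature is an assumption, not the asker's original interface. It does not address the reported "Unexpected token" error, whose cause the post does not show.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace the value of every <DOMAIN_NAME> and <DomainName> element
// whose content is exactly oldDomainId.
function replaceDomainName(xmlPayload, oldDomainId, newDomainId) {
  var result = String(xmlPayload);
  ["DOMAIN_NAME", "DomainName"].forEach(function (tag) {
    var pattern = new RegExp(
      "<" + tag + ">" + escapeRegExp(oldDomainId) + "</" + tag + ">", "g");
    // A function replacement keeps "$" in newDomainId from being
    // interpreted as a replacement pattern.
    result = result.replace(pattern, function () {
      return "<" + tag + ">" + newDomainId + "</" + tag + ">";
    });
  });
  return result;
}

console.log(replaceDomainName(
  "<Gid><DomainName>OOO</DomainName><Xid>SYS</Xid></Gid>", "OOO", "new.com"));
```

Matching the opening and closing tag together leaves other elements that happen to contain the same value, such as UserName in the sample payload, untouched.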

While you can get a lot done with Regex, it can get really complicated when parsing XML.
See this example of using DOMParser and XMLSerializer:
https://jsfiddle.net/o1cenvs3/
const XML = `<TransmissionHeader xmlns:tran="http://xmlns.oracle.com/apps/otm/TransmissionService" xmlns="">
<Version>20b</Version>
<TransmissionCreateDt>
<GLogDate>20200819124057</GLogDate>
<TZId>UTC</TZId>
<TZOffset>+00:00</TZOffset>
</TransmissionCreateDt>
<TransactionCount>1</TransactionCount>
<SenderHostName>https://xxx</SenderHostName>
<SenderSystemID>https:xxx</SenderSystemID>
<UserName>OOO</UserName>
<SenderTransmissionNo>404836</SenderTransmissionNo>
<ReferenceTransmissionNo>0</ReferenceTransmissionNo>
<GLogXMLElementName>PlannedShipment</GLogXMLElementName>
<NotifyInfo>
<ContactGid>
<Gid>
<DomainName>OOO</DomainName>
<Xid>SYS</Xid>
</Gid>
</ContactGid>
<ExternalSystemGid>
<Gid>
<DOMAIN_NAME>OOO</DOMAIN_NAME>
<Xid>IOT_SYSTEM</Xid>
</Gid>
</ExternalSystemGid>
</NotifyInfo>
</TransmissionHeader>`;
if(typeof(String.prototype.trim) === "undefined")
{
String.prototype.trim = function()
{
return String(this).replace(/^\s+|\s+$/g, '');
};
}
function replaceDomainName (xmlPayload, oldValue, newValue)
{
const parser = new DOMParser();
const xmlDoc = parser.parseFromString(xmlPayload,"text/xml");
for(let tagName of ['DOMAIN_NAME', 'DomainName']) {
const instances = xmlDoc.getElementsByTagName(tagName);
for (let instance of instances) {
if(instance.innerHTML.trim() == oldValue )
instance.innerHTML = newValue;
}
};
const s = new XMLSerializer();
const d = document;
const result = s.serializeToString(xmlDoc);
return result;
}
const resultXML = replaceDomainName(XML, 'OOO', 'new.com');
console.log('resultXML', resultXML);
const textarea = document.createElement("textarea");
textarea.innerHTML = resultXML;
textarea.cols = 80;
textarea.rows = 24;
document.body.appendChild(textarea);

Explanation of phantomjs code

I was asked to work on web crawler using phantomjs. However, as I read through the example, I was puzzled by some of the code:
Is this a loop? $("table[id^='post']").each(function(index)
What does this line of code mean? var entry = $(this);
How is the id captured? var id = entry.attr('id').substring(4);
This line var poster = entry.find('a.bigusername'); tries to get a username from the page. Is there a tutorial on how to use entry.find to scrape data off the page?
page.open(url1, function (status) {
// Check for page load success
if (status !== "success") {
console.log("Unable to access network");
phantom.exit(231);
} else {
if (page.injectJs("../lib/jquery-2.1.0.min.js") && page.injectJs("../lib/moment-with-langs.js") && page.injectJs("../lib/sugar.js") && page.injectJs("../lib/url.js")){
allResults = page.evaluate(function(url) {
var arr = [];
var title = $("meta[property='og:title']").attr('content');
title = title.trim();
$("table[id^='post']").each(function(index){
var entry = $(this);
var id = entry.attr('id').substring(4);
var poster = entry.find('a.bigusername');
poster = poster.text().trim();
var text = entry.find("div[id^='post_message_']");
//remove quotes of other posts
text.find(".quote").remove();
text.find("div[style='margin:20px; margin-top:5px; ']").remove();
text.find(".bbcode_container").remove();
text = text.text().trim();
var postDate = entry.find("td.thead");
postDate = postDate.first().text().trim();
var postUrl = entry.find("a[id^='postcount']");
if (postUrl){
postUrl = postUrl.attr('href');
postUrl = URL.resolve(url, postUrl);
}
else{
postUrl = url;
}
if (postDate.indexOf('Yesterday') >= 0){
postDate = Date.create(postDate).format('{yyyy}-{MM}-{dd} {HH}:{mm}');
}
else if (postDate.indexOf('Today') >= 0){
postDate = Date.create(postDate).format('{yyyy}-{MM}-{dd} {HH}:{mm}');
}
else{
var d = moment(postDate, 'DD-MM-YYYY, hh:mm A');
postDate = d.format('YYYY-MM-DD HH:mm');
}
var obj = {'id': id, 'title': title, 'poster': poster, 'text': text, 'url': postUrl, 'post_date' : postDate, 'location': 'Singapore', 'country': 'SG'};
arr.push(obj);
});
return arr;
}, url);
console.log("##START##");
console.log(JSON.stringify(allResults, undefined, 4));
console.log("##END##");
console.log("##URL=" + url);
fs.write("../cache/" + encodeURIComponent(url), page.content, "w");
phantom.exit();
}
}
});

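Is this a loop? $("table[id^='post']").each(function(index)?
Yes. jQuery's each() calls the function once for every element matched by the selector, which here is every table whose id starts with "post".
What does this line of code mean? var entry = $(this);
Inside the callback, this is the current DOM element (the table being visited). $(this) wraps it in a jQuery object and assigns it to the variable entry, so that jQuery methods such as attr() and find() can be called on it.
How is the id captured? var id = entry.attr('id').substring(4);
jQuery's attr('id') returns the element's id attribute as a string, and substring(4) drops the first four characters (the "post" prefix), leaving the rest of the id.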
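
The same selection and slicing logic can be mimicked in plain JavaScript, which may make the two steps easier to see. The id values below are made up for illustration; the filter stands in for the [id^='post'] attribute selector and the map for the substring(4) call.

```javascript
// Made-up id attributes as they might appear on a forum page.
var ids = ["post1001", "post1002", "sidebar"];

var postIds = ids
  .filter(function (id) { return id.indexOf("post") === 0; }) // like [id^='post']
  .map(function (id) { return id.substring(4); });            // drop the "post" prefix

console.log(postIds);
```

Only the two ids that begin with "post" survive the filter, and each is reduced to the part after the prefix.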

create object from processor output to append/replaceChild

Attempting to add parameters to an xsl template, for use in a navigation menu.
Trying to figure out how to use the output that IXSLProcessor leaves me with.
I have the following code that works perfectly for Firefox
var xslStylesheet;
var xsltProcessor = new XSLTProcessor();
var myDOM;
var xmlDoc;
var myXMLHTTPRequest = new XMLHttpRequest();
myXMLHTTPRequest.open("GET", "client.xsl", false);
myXMLHTTPRequest.send(null);
xslStylesheet = myXMLHTTPRequest.responseXML;
xsltProcessor.importStylesheet(xslStylesheet);
// load the xml file
myXMLHTTPRequest = new XMLHttpRequest();
myXMLHTTPRequest.open("GET", "client.xml", false);
myXMLHTTPRequest.send(null);
xmlDoc = myXMLHTTPRequest.responseXML;
// set the parameter using the parameter passed to the outputgroup function
xsltProcessor.setParameter(null, "cid", client);
xsltProcessor.setParameter(null, "browser", "other");
var fragment = xsltProcessor.transformToFragment(xmlDoc,document);
document.getElementById("scriptHook").innerHTML = "";
document.getElementById("maincontent").replaceChild(fragment, document.getElementById("scriptHook"));
scroll(0,0);
This is the code I have (mostly pilfered from msdn)
var xslt = new ActiveXObject("Msxml2.XSLTemplate.3.0");
var xsldoc = new ActiveXObject("Msxml2.FreeThreadedDOMDocument.3.0");
var xslproc;
xsldoc.async = false;
xsldoc.load("client.xsl");
if (xsldoc.parseError.errorCode != 0) {
var myErr = xsldoc.parseError;
WScript.Echo("You have error " + myErr.reason);
} else {
xslt.stylesheet = xsldoc;
var xmldoc = new ActiveXObject("Msxml2.DOMDocument.3.0");
xmldoc.async = false;
xmldoc.load("client.xml");
if (xmldoc.parseError.errorCode != 0) {
var myErr = xmldoc.parseError;
WScript.Echo("You have error " + myErr.reason);
} else {
xslproc = xslt.createProcessor();
xslproc.input = xmldoc;
xslproc.addParameter("cid", client);
xslproc.addParameter("browser", "ie");
xslproc.transform();
//somehow convert xslproc.output to object that can be used in replaceChild
document.getElementById("scriptHook").innerHTML = "";
document.getElementById("maincontent").replaceChild(xslproc.output, document.getElementById("scriptHook"));
}
}
Any and all help is appreciated, cheers.

With Mozilla you can exchange nodes between XSLT and DOM but with IE you need to take the XSLT transformation result as a string and feed that to IE's HTML parser; so for your sample I think you want
document.getElementById("scriptHook").outerHTML = xslproc.output;
which will replace the scriptHook element with the result of the transformation.

javascript parser for a string which contains .ini data

If a string contains the data of an .ini file, how can I parse it in JavaScript?
Is there a JavaScript parser that will help with this?
Here, the string typically holds the content read from a configuration file. (The reading itself cannot be done through JavaScript, but I somehow get the .ini contents into a string.)

I wrote a JavaScript function inspired by node-iniparser.js:
function parseINIString(data){
var regex = {
section: /^\s*\[\s*([^\]]*)\s*\]\s*$/,
param: /^\s*([^=]+?)\s*=\s*(.*?)\s*$/,
comment: /^\s*;.*$/
};
var value = {};
var lines = data.split(/[\r\n]+/);
var section = null;
lines.forEach(function(line){
if(regex.comment.test(line)){
return;
}else if(regex.param.test(line)){
var match = line.match(regex.param);
if(section){
value[section][match[1]] = match[2];
}else{
value[match[1]] = match[2];
}
}else if(regex.section.test(line)){
var match = line.match(regex.section);
value[match[1]] = {};
section = match[1];
}else if(line.length == 0 && section){
section = null;
};
});
return value;
}
Updated 2017-05-10: fixed a bug with keys that contain spaces.
EDIT:
Sample of ini file read and parse

You could try config-ini-parser. It is similar to Python's ConfigParser, but without the I/O operations.
It can be installed with npm or bower. Here is an example:
var ConfigIniParser = require("config-ini-parser").ConfigIniParser;
var delimiter = "\r\n"; // or "\n" for *nix
var parser = new ConfigIniParser(delimiter); // if no delimiter is passed, the default "\n" is used
parser.parse(iniContent);
var value = parser.get("section", "option");
parser.stringify('\n'); // get all the ini file content as a string
For more detail, check the project's main page or its npm package page.

Here's a function that can parse ini data from a string into an object (on the client side):
function parseINIString(data){
var regex = {
section: /^\s*\[\s*([^\]]*)\s*\]\s*$/,
param: /^\s*([\w\.\-\_]+)\s*=\s*(.*?)\s*$/,
comment: /^\s*;.*$/
};
var value = {};
var lines = data.split(/\r\n|\r|\n/);
var section = null;
for(var x=0;x<lines.length;x++)
{
if(regex.comment.test(lines[x])){
continue; // skip comment lines; a return here would abort the whole parse
}else if(regex.param.test(lines[x])){
var match = lines[x].match(regex.param);
if(section){
value[section][match[1]] = match[2];
}else{
value[match[1]] = match[2];
}
}else if(regex.section.test(lines[x])){
var match = lines[x].match(regex.section);
value[match[1]] = {};
section = match[1];
}else if(lines[x].length == 0 && section){ // a blank line ends the current section
section = null;
}
}
return value;
}

Based on the other responses, I've modified it (in TypeScript) so that you can have nested sections, written as [parent.child]:
function parseINI(data: string) {
let rgx = {
section: /^\s*\[\s*([^\]]*)\s*\]\s*$/,
param: /^\s*([^=]+?)\s*=\s*(.*?)\s*$/,
comment: /^\s*;.*$/
};
let result = {};
let lines = data.split(/[\r\n]+/);
let section = result;
lines.forEach(function (line) {
//comments
if (rgx.comment.test(line)) return;
//params
if (rgx.param.test(line)) {
let match = line.match(rgx.param);
section[match[1]] = match[2];
return;
}
//sections
if (rgx.section.test(line)) {
section = result
let match = line.match(rgx.section);
for (let subSection of match[1].split(".")) {
!section[subSection] && (section[subSection] = {});
section = section[subSection];
}
return;
}
});
return result;
}
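
For readers without a TypeScript toolchain, the nested-section variant above can be transcribed to plain JavaScript roughly as follows. This is a sketch of the same logic, not a tested library; the sample data is made up for illustration, and all values come back as strings.

```javascript
function parseINI(data) {
  var rgx = {
    section: /^\s*\[\s*([^\]]*)\s*\]\s*$/,
    param: /^\s*([^=]+?)\s*=\s*(.*?)\s*$/,
    comment: /^\s*;.*$/
  };
  var result = {};
  var section = result; // parameters before any [section] go to the top level
  data.split(/[\r\n]+/).forEach(function (line) {
    if (rgx.comment.test(line)) return; // skip comment lines
    var match = line.match(rgx.param);
    if (match) {
      section[match[1]] = match[2];
      return;
    }
    match = line.match(rgx.section);
    if (match) {
      // Walk (and create) one nested object per dot-separated name.
      section = result;
      match[1].split(".").forEach(function (name) {
        if (!section[name]) section[name] = {};
        section = section[name];
      });
    }
  });
  return result;
}

// Made-up sample input.
var sample = [
  "; sample configuration",
  "name = demo",
  "[server]",
  "host = example.org",
  "[server.tls]",
  "enabled = true"
].join("\n");

var config = parseINI(sample);
console.log(JSON.stringify(config, null, 2));
```

With this input, config.server.tls.enabled holds the string "true", and config.name stays at the top level because it appears before any section header.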
