Case insensitive XPath contains() possible? - javascript

I'm running over all textnodes of my DOM and check if the nodeValue contains a certain string.
/html/body//text()[contains(.,'test')]
This is case sensitive. However, I also want to catch Test, TEST or TesT. Is that possible with XPath (in JavaScript)?

This is for XPath 1.0. If your environment supports XPath 2.0, see here.
Yes. Possible, but not beautiful.
/html/body//text()[
contains(
translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),
'test'
)
]
This would work for search strings where the alphabet is known beforehand. Add any accented characters you expect to see.
If you can, mark the text that interests you with some other means, like enclosing it in a <span> that has a certain class while building the HTML. Such things are much easier to locate with XPath than substrings in the element text.
If that's not an option, you can let JavaScript (or any other host language that you are using to execute XPath) help you build a dynamic XPath expression:
function xpathPrepare(xpath, searchString) {
  // Each placeholder occurs once, so a plain string replace is enough.
  return xpath.replace("$u", searchString.toUpperCase())
              .replace("$l", searchString.toLowerCase())
              .replace("$s", searchString.toLowerCase());
}
var xp = xpathPrepare("//text()[contains(translate(., '$u', '$l'), '$s')]", "Test");
// -> "//text()[contains(translate(., 'TEST', 'test'), 'test')]"
(Hat tip to @KirillPolishchuk's answer: of course you only need to translate those characters you're actually searching for.)
This approach would work for any search string whatsoever, without requiring prior knowledge of the alphabet, which is a big plus.
Both of the methods above fail when search strings can contain single quotes, in which case things get more complicated.
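One possible way to cope with quotes is to let the host language build the XPath string literal. The following is only a sketch: xpathStringLiteral and caseInsensitiveContains are hypothetical helper names, not part of any library. It relies on the XPath 1.0 concat() function for strings that contain both quote types, and it assumes the upper-cased and lower-cased search strings have the same length (which does not hold for characters such as ß).

```javascript
// Hypothetical helper: wrap a JavaScript string as an XPath 1.0 string literal.
function xpathStringLiteral(value) {
  if (!value.includes("'")) return "'" + value + "'";
  if (!value.includes('"')) return '"' + value + '"';
  // Both quote types present: split on single quotes and glue the parts
  // back together with concat(), passing each single quote as "'".
  const parts = value.split("'").map((part) => "'" + part + "'");
  return "concat(" + parts.join(", \"'\", ") + ")";
}

// Hypothetical helper: build the translate()-based expression for one search string.
function caseInsensitiveContains(searchString) {
  const upper = xpathStringLiteral(searchString.toUpperCase());
  const lower = xpathStringLiteral(searchString.toLowerCase());
  return "//text()[contains(translate(., " + upper + ", " + lower + "), " + lower + ")]";
}

console.log(caseInsensitiveContains("Test"));
// -> //text()[contains(translate(., 'TEST', 'test'), 'test')]
console.log(caseInsensitiveContains("it's"));
// -> //text()[contains(translate(., "IT'S", "it's"), "it's")]
```

In a browser the resulting string could then be passed to document.evaluate() as usual.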

XPath 2.0 Solutions
Use lower-case():
/html/body//text()[contains(lower-case(.),'test')]
Use matches() regex matching with its case-insensitive flag:
/html/body//text()[matches(.,'test', 'i')]

Case-insensitive contains
/html/body//text()[contains(translate(., 'EST', 'est'), 'test')]

Yes. You can use translate to convert the text you want to match to lower case as follows:
/html/body//text()[contains(translate(.,
'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
'abcdefghijklmnopqrstuvwxyz'),
'test')]

The way I always did this was by using the "translate" function in XPath. I won't say it's very pretty, but it works correctly.
/html/body//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz',
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),'TEST')]
hope this helps,

If you're using XPath 2.0 then you can specify a collation as the third argument to contains(). However, collation URIs are not standardized so the details depend on the product that you are using.
Note that the solutions given earlier using translate() all assume that you are only using the 26-letter English alphabet.
UPDATE: XPath 3.1 defines a standard collation URI for case-blind matching.

Related

ES6 / JS: Regex for replacing delve with conditional chaining

How to replace delve with conditional chaining in a vs code project?
e.g.
delve(seo,'meta')
delve(item, "image.data.attributes.alternativeText")
desired result
seo?.meta
item?.image.data.attributes.alternativeText
Is it possible using find/replace in Visual Studio Code?
I propose the following RegEx:
delve\(\s*([^,]+?)\s*,\s*['"]([^.]+?)['"]\s*\)
and the following replacement format string:
$1?.$2
Explanation: Match delve(, a first argument up until the first comma (lazy match), and then a second string argument (no care is taken to ensure that the brackets match as this is rather quick'n'dirty anyways), then the closing bracket of the call ). Spacing at reasonable places is accounted for.
This will work for simple cases like delve(someVar, "key") but might fail for pathological cases; always review the replacements manually.
Note that this is explicitly made incapable of dealing with delve(var, "a.b.c") because, as far as I know, VSC format strings don't support "joining" a variable number of captures by a given string. As a workaround, you could explicitly create versions for paths with two, three, four, five... segments and write the corresponding replacements. The version for a two-segment path (one dot in the path, two ?. in the result), for example, looks as follows:
delve\(([^,]+?)\s*,\s*['"]([^.]+?)\.([^.]+?)['"]\s*\)
and the format string is $1?.$2?.$3.
You write:
e.g.
delve(seo,'meta')
delve(item, "image.data.attributes.alternativeText")
desired result
seo?.meta
item?.image.data.attributes.alternativeText
but I highly doubt that this is intended, because delve(item, "image.data.attributes.alternativeText") is in fact equivalent to item?.image?.data?.attributes?.alternativeText rather than the desired result you describe. If you really do want the output exactly as you describe it, simply replace [^.] with . in the first regex so that the second capture accepts strings containing any characters (including dots).
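If running a small script over the files is an option, the "joining" limitation disappears, because a replacement callback can split the path itself. A rough sketch under that assumption (delveToOptionalChaining is a hypothetical name; like the regex above it is quick and dirty, and it will not cope with nested calls or commas in the first argument):

```javascript
// Hypothetical helper: rewrite delve(obj, "a.b.c") as obj?.a?.b?.c
function delveToOptionalChaining(source) {
  // Group 1: first argument, group 2: the opening quote, group 3: the path.
  const pattern = /delve\(\s*([^,]+?)\s*,\s*(['"])([^'"]+)\2\s*\)/g;
  return source.replace(pattern, (match, target, quote, path) =>
    target + path.split(".").map((key) => "?." + key).join("")
  );
}

console.log(delveToOptionalChaining("delve(seo,'meta')"));
// -> seo?.meta
console.log(delveToOptionalChaining('delve(item, "image.data.attributes.alternativeText")'));
// -> item?.image?.data?.attributes?.alternativeText
```

As with the find/replace version, the results should be reviewed manually.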

Match a word unless it is preceded by an equals sign?

I have the following string
class=use><em>use</em>
that when searched using us I want to transform into
class=use><em><b>us</b>e</em>
I've tried looking at relating answers but I can't quite get it working the way I want it to. I'm especially interested in this answer's callback approach.
Help appreciated
This is a good exercise for writing regular expressions, and here's a possible solution.
"useclass=use><em>use</em>".replace(/([^=]|^)(us)/g, "$1<b>$2</b>");
// returns "<b>us</b>eclass=use><em><b>us</b>e</em>"
([^=]|^) ensures that the prefix of any matched us is either not an equal sign, or it's the start of the string.
As @jamiec pointed out in the comments, if you are using this to parse or modify HTML, just stop right now. It's mathematically impossible to parse a context-free grammar with a regular grammar (even with enhanced JS regexps you will have a bad time trying to achieve that).
If you can make any assumptions about the structure of your document, you may be better off using an approach that operates on DOM elements directly rather than parsing the whole document with a regex.
Parsing HTML with a regex has certain problems that can be painful to deal with.
var element = document.querySelector('em');
element.innerHTML = element.innerHTML.replace('us', '<b>us</b>');
<div class=use><em>use</em>
</div>
I would first look for any character other than the equals sign [^=] and separate it by parentheses so that I can use it again in my replacement. Then another set of parentheses around the two characters us ought to do it:
var re = /([^=]|^)(us)/
That will give you two capture groups to work with (inside the parentheses), which you can represent with $1 and $2 in your replacement string.
str.replace( /([^=]|^)(us)/, '$1<b>$2</b>' );
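Since the question asks about a callback approach, here is a minimal sketch of the same idea with a replacement function. The name highlight is hypothetical, and the caveats about running regexes over HTML still apply; the search term is escaped so that characters like . or ( are matched literally.

```javascript
// Hypothetical helper: wrap every occurrence of `term` in <b>...</b>,
// unless it is directly preceded by an equals sign.
function highlight(html, term) {
  // Escape regex metacharacters in the search term.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp("(^|[^=])(" + escaped + ")", "g");
  // prefix is the start of the string or the one character before the hit.
  return html.replace(pattern, (match, prefix, hit) => prefix + "<b>" + hit + "</b>");
}

console.log(highlight("class=use><em>use</em>", "us"));
// -> class=use><em><b>us</b>e</em>
```

Because the prefix group consumes one character, two hits that sit directly next to each other (as in "usus") would only have the first one wrapped.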

Convert non-ASCII characters (umlauts, accents...) to their closest ASCII equivalent (for slug creation)

I am looking for way in JavaScript to convert non-ASCII characters in a string to their closest equivalent, similarly to what the PHP iconv function does. For instance if the input string is Rånades på Skyttis i Ö-vik, it should be converted to Ranades pa skyttis i o-vik. I had a look at phpjs but iconv isn't included.
Is it possible to perform such conversion in JavaScript, if so how?
Notes:
more generally this process of conversion is called transliteration
my use-case is the creation of URL slugs
The easiest way I've found:
var str = "Rånades på Skyttis i Ö-vik";
var combining = /[\u0300-\u036F]/g;
console.log(str.normalize('NFKD').replace(combining, ''));
For reference see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
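For the slug use-case from the question, the normalize() trick could be extended roughly as follows. This is a sketch, not a complete transliteration: slugify is a hypothetical name, and letters without a canonical decomposition (for example ø or ß) are not mapped to ASCII here, they end up treated as separators.

```javascript
// Hypothetical helper: turn arbitrary text into a lowercase ASCII slug.
function slugify(text) {
  return text
    .normalize("NFKD")                 // split letters from their diacritics
    .replace(/[\u0300-\u036f]/g, "")   // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")       // collapse everything else into dashes
    .replace(/^-+|-+$/g, "");          // trim leading and trailing dashes
}

console.log(slugify("Rånades på Skyttis i Ö-vik"));
// -> ranades-pa-skyttis-i-o-vik
```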
I would recommend the unidecode package; it will also map Greek and Cyrillic letters to their closest ASCII symbol:
unidecode('Lillı Celiné Никита Ödipus');
'Lilli Celine Nikita Odipus'
That's because iconv is a natively compiled UNIX utility that sits behind most i18n character-map conversion functions.
You won't find it in JavaScript unless you access some browser component.
Encoding is a property of the document, so most JavaScript implementations simply disregard it.
You'll need a pure JS library for removing accents from strings. It would be best to have one for the specific language you need.
The simplest way is via translation tables or even regex replacements,
like here: http://lehelk.com/2011/05/06/script-to-remove-diacritics/
Check this thread too: Replacing diacritics in Javascript

Preparing a regular expression for javascript

I have made this regular expression which does exactly what I want when I test it in e.g. RegExr:
^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?
However when I test it in javascript it says that the expression is invalid. After hours of debugging I found out that this expression works in javascript:
^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?![a-z0-9]+\.)?(localhost|yahoo\.com)(.*)?
However this doesn't do what I want (again testing in RegExr).
Why can't I use the first expression in JavaScript? And how do I fix it?
UPDATE JULY 25
Sorry for the lack of info. The way I am using the Regexp is through a jQuery extension which lets me select using regexp. The script can be seen here: http://james.padolsey.com/javascript/regex-selector-for-jquery/
The specific code I am trying to get to work is:
$('a:regex(href, ^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?)').live('click', function(e) {
After including the linked jQuery plugin. The text strings I am testing are:
http://yahoo.com
http://google.dk
http://subdomain.yahoo.com
http://test.yahoo.com
http://localhost.dk
http://sub.yahoo.com/lalala
Where it is supposed to match "http://google.dk", "http://test.yahoo.com" and "http://sub.yahoo.com/lalala" - which it does when using RegExr but failing (invalid expression) using the jQuery plugin.
The first regular expression is not invalid:
var regexp = /^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?/;
works fine.
If you want to instantiate the expression from a string, you have to double all the backslashes:
var regexp = new RegExp("^https?:\\/\\/(www\\.)?(test\\.yahoo\\.com|sub\\.yahoo\\.com)?(?!([a-z0-9]+\\.)?(localhost|yahoo\\.com))(.*)?");
When you start from a string, you have to account for the fact that the string constant itself uses backslashes as a quoting mechanism, so there will be two evaluations made: one as a string, and one as a regular expression.
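The effect of the two evaluations can be seen with a shortened pattern. This is only an illustration using a made-up host pattern, not the full expression from the question:

```javascript
// Single backslash: the string literal turns "\." into ".", so the regex
// receives an unescaped dot that matches any character.
const unescaped = new RegExp("^yahoo\.com$");
// Double backslash: the string contains "\.", so the regex sees an escaped dot.
const escaped = new RegExp("^yahoo\\.com$");

console.log(unescaped.test("yahooXcom")); // true
console.log(escaped.test("yahooXcom"));   // false
console.log(escaped.test("yahoo.com"));   // true
```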
edit — OK I think I see the problem. That plugin you're trying to use is simply attempting to do something that's just not going to work, given the way that Sizzle parses selectors. In other words, the problem is not with your regular expression, it's with the overall selector. It is not even getting far enough to parse the regular expression.
Specifically it seems to be nested parentheses inside the regular expression. Something as simple as
$('a:regex(href, ((abc)))')
causes an error. You can instead do something like this:
$('a').filter(function() {
return /^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?/.test(this.href);
}).whatever( ... );
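To check the expression outside of jQuery, it can be run against the test strings from the question. A small sketch (the variable names externalLink, urls and matched are made up for this example):

```javascript
// The first expression from the question, written as a regex literal.
const externalLink = /^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?/;

const urls = [
  "http://yahoo.com",
  "http://google.dk",
  "http://subdomain.yahoo.com",
  "http://test.yahoo.com",
  "http://localhost.dk",
  "http://sub.yahoo.com/lalala"
];

// Keep only the URLs the expression accepts.
const matched = urls.filter((url) => externalLink.test(url));
console.log(matched);
// -> [ 'http://google.dk', 'http://test.yahoo.com', 'http://sub.yahoo.com/lalala' ]
```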

Javascript and CSS, using dashes

I'm starting to learn some javascript and understand that dashes are not permitted when naming identifiers. However, in CSS it's common to use a dash for IDs and classes.
Does using a dash in CSS interfere with JavaScript interaction somehow? For instance, if I were to use getElementById("css-dash-name"). I've tried a few examples using getElementById with dashes in the name of a div ID and it worked, but I'm not sure if that's the case in all other contexts.
Having dashes and underscores in the ID (or class name, if you select by that) won't have any negative effect; it's safe to use them. You just can't do something like:
var some-element = document.getElementById('css-dash-name');
The above example is going to error out because there is a dash in the variable you're assigning the element to.
The following would be fine though since the variable doesn't contain a dash:
var someElement = document.getElementById('css-dash-name');
That naming limitation only exists for the javascript variables themselves.
It's only in the cases where you can access the elements as properties that it makes a difference. For example form fields:
<form>
<input type="text" name="go-figure" />
<input type="button" value="Eat me!" onclick="...">
</form>
In the onclick event you can't access the text box as a property, as the dash is interpreted as minus in Javascript:
onclick="this.form.go-figure.value='Ouch!';"
But you can still access it using a string:
onclick="this.form['go-figure'].value='Ouch!';"
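The same parsing rule can be seen without any HTML. In this sketch a plain object stands in for the form (the object and the variable figure are made up for the illustration):

```javascript
// A plain object standing in for the form; one key contains a dash.
const form = { go: 10, "go-figure": { value: "" } };
const figure = 3;

// Dot notation stops at the dash: this is parsed as (form.go) - figure.
const subtraction = form.go-figure;
console.log(subtraction); // 7

// Bracket notation takes the whole name as a string key.
form["go-figure"].value = "Ouch!";
console.log(form["go-figure"].value); // Ouch!
```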
Whenever you have to address a CSS property as a JavaScript property name, camelCase is the official way to go.
element.style.backgroundColor = "#FFFFFF";
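The mapping from dashed CSS names to their camelCase counterparts is mechanical for ordinary properties, so it can be sketched as a small helper (cssToCamelCase is a hypothetical name, and special cases such as float are not handled here):

```javascript
// Hypothetical helper: convert a dashed CSS property name to camelCase.
function cssToCamelCase(name) {
  // Replace each "-x" with "X".
  return name.replace(/-([a-z])/g, (match, letter) => letter.toUpperCase());
}

console.log(cssToCamelCase("background-color")); // backgroundColor
console.log(cssToCamelCase("font-size"));        // fontSize
```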
You will never be in a situation where you have to address an element's ID as a variable name. It will always be in a string, so
document.getElementById("my-id");
will always work.
Using Hyphen (or dash) is OK
I am also currently studying JavaScript, and as far as I have read in David Flanagan's book (JavaScript: The Definitive Guide, 5th Edition), which I suggest you read, it doesn't warn against using a hyphen or dash (-) in IDs and classes (or even the name attribute) in an HTML document.
Just as Parrots already said, hyphens are not allowed in variable names, because the JavaScript interpreter will treat them as a minus and/or a negative sign; but using them in strings is perfectly OK.
Like what Parrots and Guffa said, you can use the following ...
[ ] (square brackets)
'' (single quotation marks or single quotes)
"" (double quotation marks or double quotes)
to tell the JavaScript interpreter that you are passing the name as a string (the id/class/name of your elements, for instance); square brackets then let you use that string as a property key.
Use Hyphen (or dash) for 'Consistency'
@KP, that would be OK if he is using HTML 4.01 or earlier, but if he is using any version of XHTML (e.g., XHTML 1.0), then that is not possible, because XHTML syntax prohibits uppercase (except in the !DOCTYPE, which is the only thing that needs to be declared in uppercase).
@Choy, if you're using HTML 4.01 or earlier, going with either camelCase or PascalCase will not be a problem. However, for consistency with how CSS uses separators (it uses a hyphen or dash), I suggest following its rule. It will be much more convenient to code your HTML and CSS alike. Moreover, you don't even have to worry about whether you're using XHTML or HTML.
IDs are allowed to contain hyphens:
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
And there is no restriction when using IDs in JavaScript except if you want to refer to elements in the global scope. There you need to use:
window['css-dash-name']
Other answers are correct about where you can and can't use hyphens. However, at the root of the question, you should consider not using dashes/hyphens in your variable/class/ID names at all. It's not standard practice, even if it does work, and it requires careful coding to make use of it.
Consider using either PascalCase (every word begins with a capital letter) or camelCase (the first word begins in lowercase; each following word begins with a capital letter). These are the two most common, accepted naming conventions.
Different resources will recommend different choices between the two (with the exception of JavaScript, for which camelCase is pretty much always recommended). In the end, being consistent in your approach is the most important part. Using camel or Pascal case will ensure you don't have to worry about special accessors or brackets in your code.
For JavaScript conventions, try this question/discussion:
javascript naming conventions
Here's another great discussion of conventions for CSS, Html elements, etc:
What's the best way to name IDs and classes in CSS and HTML?
It would cause an error in this case:
const fontSize = element.style.font-size;
This is because a hyphen prevents the property from being accessed via the dot operator: the JavaScript parser sees the hyphen as a subtraction operator. The correct way would be:
const fontSize = element.style['font-size'];
No, this won't cause an issue. You're accessing the ID as a string (it's enclosed in quotes), so the dash poses no problem. However, I would suggest not using document.getElementById("css-dash-name"), and instead using jQuery, so you can do:
$("#css-dash-name");
Which is much clearer. The jQuery documentation is also quite good. It's a web developer's best friend.
