Unable to match the entire Cyrillic alphabet using RegExp

I'm trying to return all Cyrillic words from this sentence:
"I like to eat healthy food with a little bit of pepper. Cанкт-Петербу́рг э́то оди́н из са́мых краси́вых городо́в Росси́и.Он был осно́ван импера́тором Петро́м I(пе́рвым).Импера́тор реши́л постро́ить го́род здесь, что́бы откры́ть для Росси́и «окно́ в Евро́пу. La Navidad dura dos semanas y las fiestas más importantes son Nochebuena, Navidad, Nochevieja y Reyes. En las casas se pone el tradicional belén, una maqueta con figuras que representa el nacimiento de Jesús, y un gran árbol donde se colocan los regalos";
I tried to use /\p{sc=Cyrillic}\w+/giu to return all Cyrillic words, but it returns null. Then I tried /(?<=[\u0400- \u4FF]+\w+)/giu, because this range is the Cyrillic alphabet. I've used 7 different RegExp websites, but none of them seem to support the \p class.
What's wrong?

Your regex does not need Unicode mode if you use \u0400, so remove the u modifier.
A character class listing all allowed characters works fine; see https://regex101.com/r/PE4fQT/1
Also note that the sentence contains two different look-alike characters: Latin C and Cyrillic с.
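For comparison, here is a minimal sketch of the property-escape route from the question, assuming an engine that supports \p{...} with the u flag (current browsers and Node.js do). In JavaScript, \w is essentially ASCII-only, so it cannot continue a Cyrillic word, and the stress marks in the text are combining characters (U+0301) that belong to no single script, so the sketch allows \p{M} inside a word. The sample sentence and the variable names are illustrative assumptions, not part of the original question.

```javascript
// Illustrative sentence (an assumption, shortened from the question).
// Stress marks are written as the combining acute accent U+0301.
const sentence =
  "I like pepper. Санкт-Петербу\u0301рг э\u0301то го\u0301род. La Navidad dura dos semanas.";

// One Cyrillic letter, then any run of Cyrillic letters or combining marks.
const cyrillicWord = /\p{Script=Cyrillic}[\p{Script=Cyrillic}\p{M}]*/gu;

// match() returns null when nothing matches, so fall back to an empty array.
const cyrillicWords = sentence.match(cyrillicWord) ?? [];
console.log(cyrillicWords);
```

The hyphen is not part of the pattern, so a hyphenated name comes back as two separate words; add the hyphen to the character class if that is not wanted.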

Related

Why does JavaScript's string.split() not work correctly in certain cases?

I need to split a string of text into its component words, so I'm using a Regex to split it on the empty spaces (in a Typescript file, btw).
splitIntoWords(text: string) : Array<string> {
const separator = ' ';
const words = text.split(new RegExp(separator, 'g'));
return words;
}
This mostly works, but I've noticed that I regularly get words in the array that still contain spaces. If I copy the text into the Chrome console and split(' ') it, I get the correct number of words, but when I use the variable (even in the console) it invariably fails in some cases. I can't work out what the difference is. This is an example of my text:
"Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux."
The regex never manages to split the substring "économique au" into two components, for instance. Does anyone know why this is happening?

It sounds like the whitespace is occasionally not just a plain space. You can split on all whitespace by using \s for the separator instead, which will match any whitespace, including space characters and tab characters.
const text = "Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux.";
const words = text.split(/\s/);
console.log(words);
Another option would be to use match instead of split, and match non-whitespace characters.
const text = "Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux.";
const words = text.match(/\S+/g);
console.log(words);
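To see which whitespace characters a string actually contains, something like the following sketch may help. The sample string is a hypothetical stand-in that places a no-break space (U+00A0) between the first two words; whether the real text contains that exact character is an assumption.

```javascript
// Hypothetical sample: U+00A0 (no-break space) sits between the first two words.
const sample = "économique\u00A0au cœur";

console.log(sample.split(" "));    // U+00A0 is not U+0020, so the first two words stay joined
console.log(sample.split(/\s+/));  // \s also covers U+00A0

// List the code point of every whitespace character found in the string.
const whitespaceCodes = [...sample]
  .filter((ch) => /\s/.test(ch))
  .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
console.log(whitespaceCodes); // [ 'U+00A0', 'U+0020' ]
```

Running the last part against the real variable shows whether anything other than U+0020 is present.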

Regex: get numbers after a matching pattern with multi lang support

Expected Input/Output
Input: Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083,
Desired Output: 5770083
I only want the digits. From this I will build: {"Movement Number": 5770083}
I believe I will need to run multiple regexes against each string as I need to know the following:
Which title belongs to which string ie movement no.= 5770083 etc
Multiple different languages will be used for the same title, for example:
Movement number variations:
Movement no.
mouvement signés.Numérotée
no
MVT
jewels #
Werk-Nr.
Current regex: /movement no. ([^\s]+)/
The above regex also picks up the trailing comma.
The match also needs to be case-insensitive.
Test String
Longines. A very fine and rare stainless steel water-resistant
chronograph wristwatch with black dial and original box\nSigned
Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083,
case no. 46, circa 1941\nCal. 13 ZN nickel-finished lever movement, 17
jewels, the black dial with Arabic numerals, outer railway five minute
divisions and tachymetre scale, two subsidiary dials indicating
constant seconds and 30 minutes register, in large circular
water-resistant-type case with flat bezel, downturned lugs, screw
back, two round chronograph buttons in the band, case and movement
signed by maker, dial signed by maker and retailer\n37 mm. diam.
Test String French
MONTRE BRACELET D'HOMME CHRONOGRAPHE EN OR, PAR LONGINES\n\nDe forme
ronde, le cadran noir à chiffres arabes, cadran auxiliaire pour les
secondes à neuf heures et totalisateur de minutes à trois heures,
mouvement mécanique 13 Z N, vers 1960, poids brut: 44.49 gr., monture
en or jaune 18K (750)\n\nCadran Longines, mouvement no. 3872616, fond
de boîte no. 5872616\nVeuillez noter que les bracelets de montre
pouvant être en cuirs exotiques provenant d'espèces protégées, tels le
crocodile, ils ne sont pas vendus avec les montre même s'ils sont
exposés avec celles-ci. Christie's devra retirer et conserver ces
bracelets avant leur collecte par les acheteur

You can use
\b((?:Movement|mouvement) no\.|mouvement signés\.Numérotée|no|MVT|jewels #|Werk-Nr\.) (\d+)
https://regex101.com/r/thL0wt/1
Start at a word boundary, then inside a capturing group, alternate between all the different possible phrases you want before a number - then, match a space, and capture numeric characters in another group. Your desired result will be in the first and second capturing groups.
const input = `Longines. A very fine and rare stainless steel water-resistant chronograph wristwatch with black dial and original box\nSigned Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083, case no. 46, circa 1941\nCal. 13 ZN nickel-finished lever movement, 17 jewels, the black dial with Arabic numerals, outer railway five minute divisions and tachymetre scale, two subsidiary dials indicating constant seconds and 30 minutes register, in large circular water-resistant-type case with flat bezel, downturned lugs, screw back, two round chronograph buttons in the band, case and movement signed by maker, dial signed by maker and retailer\n37 mm. diam.
MONTRE BRACELET D'HOMME CHRONOGRAPHE EN OR, PAR LONGINES\n\nDe forme ronde, le cadran noir à chiffres arabes, cadran auxiliaire pour les secondes à neuf heures et totalisateur de minutes à trois heures, mouvement mécanique 13 Z N, vers 1960, poids brut: 44.49 gr., monture en or jaune 18K (750)\n\nCadran Longines, mouvement no. 3872616, fond de boîte no. 5872616\nVeuillez noter que les bracelets de montre pouvant être en cuirs exotiques provenant d'espèces protégées, tels le crocodile, ils ne sont pas vendus avec les montre même s'ils sont exposés avec celles-ci. Christie's devra retirer et conserver ces bracelets avant leur collecte par les acheteur`;
const matches = {};
let match;
const pattern = /\b((?:Movement|mouvement) no\.|mouvement signés\.Numérotée|no|MVT|jewels #|Werk-Nr\.) (\d+)/gmi;
while (match = pattern.exec(input)) {
matches[match[1]] = match[2];
// or, if you only want a single object:
const obj = {
[match[1]]: match[2]
};
}
console.log(matches);

For movement no. specifically you'll want this regex to get rid of the comma:
movement no. ([^\s\W]+)
In regards to the languages, a set of if statements performing the appropriate term that you want to test against is the only way I can think of unless the RegExp object allows for string substitution. Sorry for not being more help in that area.
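On the string-substitution point: the RegExp constructor accepts a pattern string, so the label variants can be kept in a list and joined into one alternation. The sketch below assumes a small, hypothetical list of labels and a hypothetical helper name, extractMovementNumber; it is not a complete list of the variants from the question.

```javascript
// Hypothetical list of label variants; each entry is regex source, so dots are escaped.
const movementLabels = ["movement no\\.", "mouvement no\\.", "Werk-Nr\\.", "MVT"];

// Build one case-insensitive pattern: word boundary, any label, optional spaces, digits.
const movementPattern = new RegExp(`\\b(?:${movementLabels.join("|")})\\s*(\\d+)`, "i");

// Returns an object keyed by the canonical title, or null when no label is found.
function extractMovementNumber(description) {
  const found = movementPattern.exec(description);
  return found ? { "Movement Number": Number(found[1]) } : null;
}

console.log(extractMovementNumber("ref. 22127, movement no. 5770083, case no. 46"));
console.log(extractMovementNumber("Cadran Longines, mouvement no. 3872616, fond de boîte no. 5872616"));
```

Because only digits are captured, the trailing comma never ends up in the result. Other titles, such as the case number, could get their own label list and pattern in the same way.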

You are using a negated character class, [^\s]+, which matches everything except whitespace. So, if there's another character you don't want to match, such as the comma, add it to the class: [^\s,].
You can follow the same logic for any other character you don't want to match.

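var input = "Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083";
// The lookbehind keeps the label out of the match, so only the digits are returned.
var output = input.match(/(?<=movement no\. )\d+/i); // output[0] is "5770083"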

Comparing apparently equal strings JQuery

<a id="jobDescription" class="is_html" tabindex="-1" href="/es/backoffice/jobs?job_id=1"><p>Este puesto implica una gran capacidad de organización: - Recibir pedidos - Bla bla.... Y muchas cosas m</p><p> </p><p>pruebassssssssssssssssss</p><p> </p><p>adsfasdfsafad adfasdf asdfas s gfasdpepeluis don chencho </p></a>
<input type="hidden" value="<p>Este puesto implica una gran capacidad de organización: - Recibir pedidos - Bla bla.... Y muchas cosas m</p><p> </p><p>pruebassssssssssssssssss</p><p> </p><p>adsfasdfsafad adfasdf asdfas s gfasdpepeluis don chencho </p>">
I have these two HTML elements. I need to compare the value of one with the text of the other, but whatever I do, the comparison returns false even though they look identical. Here is my jQuery code:
oldValue = $(elem).find("input[type='hidden']").val();
newValue = $(elem).children()[0].innerHTML.replace(/&nbsp;/g, ' ');
As you can see, the hidden element is keeping oldValue, and the a link is keeping newValue. Using Google Chrome's developer tool, I can use the console to print out the values right before comparing them, getting this result:
newValue
"<p>Este puesto implica una gran capacidad de organización: - Recibir pedidos - Bla bla.... Y muchas cosas m</p><p> </p><p>pruebassssssssssssssssss</p><p> </p><p>adsfasdfsafad adfasdf asdfas s gfasdpepeluis don chencho </p>"
oldValue
"<p>Este puesto implica una gran capacidad de organización: - Recibir pedidos - Bla bla.... Y muchas cosas m</p><p> </p><p>pruebassssssssssssssssss</p><p> </p><p>adsfasdfsafad adfasdf asdfas s gfasdpepeluis  don chencho </p>"
newValue == oldValue
false
What could possibly be wrong here??

I see you have replaced &nbsp; in newValue but not in oldValue; that may be the cause.
Can you post the complete code to give a clearer idea of what you are doing?
EDIT:
If it can help: What is the correct way to check for string equality in JavaScript?

Apparently, using replace(/&nbsp;/g,' ') is not enough to compare a string held in a variable with a string read from an element.
Following this example, I managed to create a fiddle that resolves your problem:
link to fiddle
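One way to make the two strings comparable is to normalize both sides the same way before comparing. The sketch below assumes the difference is a non-breaking space that appears as the &nbsp; entity on one side and as the character U+00A0 on the other; normalizeSpaces and the two sample values are hypothetical stand-ins, not the code from the fiddle.

```javascript
// Hypothetical helper: turn &nbsp; entities and U+00A0 characters into plain spaces,
// then collapse runs of whitespace so both sides use the same representation.
function normalizeSpaces(html) {
  return html.replace(/&nbsp;|\u00A0/g, " ").replace(/\s+/g, " ").trim();
}

// Shortened stand-ins for the two values in the question.
const storedValue = "<p>gfasdpepeluis\u00A0don chencho </p>";
const renderedValue = "<p>gfasdpepeluis&nbsp;don chencho </p>";

console.log(storedValue === renderedValue);                                   // false
console.log(normalizeSpaces(storedValue) === normalizeSpaces(renderedValue)); // true
```

Applying the same function to both oldValue and newValue avoids having to guess which side holds the entity and which holds the character.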

Replace character and words

How can I replace characters to get a text only with words?
Here's the code:
text.replace('/', '');
ley orgánica 4/2013 28 junio reforma consejo general poder judicial modifica ley orgánica 6/1985 1 julio poder judicial
From this text I would like to remove 4/2013 and 6/1985, as well as the numbers 28 and 1.
Thanks!

I'd suggest, in this limited case:
var text = 'ley orgánica 4/2013 28 junio reforma consejo general poder judicial modifica ley orgánica 6/1985 1 julio poder judicial',
newText = text.replace(/([\/0-9])/g, '');
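Removing the characters one by one leaves extra spaces where the numbers used to be. An alternative sketch, assuming that every space-separated token containing a digit should be dropped, filters whole tokens instead; the variable name wordsOnly is illustrative.

```javascript
const text = 'ley orgánica 4/2013 28 junio reforma consejo general poder judicial modifica ley orgánica 6/1985 1 julio poder judicial';

// Split on whitespace, drop every token that contains a digit, and rejoin with single spaces.
const wordsOnly = text
  .split(/\s+/)
  .filter((token) => !/\d/.test(token))
  .join(' ');

console.log(wordsOnly);
```

Tokens such as 4/2013 disappear as a whole, so no stray slash or double space remains.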

Regex - Exclude brackets and brackets with special key

I got this string:
[[Fil:Hoganas_hamn.jpg|miniatyr|Höganäs Hamn.]] [[Fil:Hoganas_hamn_kvickbadet.jpg|miniatyr|Höganäs Hamn - Kvickbadet.]] [[Fil:Höganäs Jefast ny redigerad-1.jpg|miniatyr|Jefasthuset sett från väster med en del av den nya bryggan vid Kvickbadet.]] '''Höganäs''' är en [[tätort]] och [[centralort]] i [[Höganäs kommun]] i [[Skåne län]]. Höganäs blev stad 1936. Ursprungligen är Höganäs ett [[fiskeläge]] kring vilket en [[gruvindustri]] utvecklades för brytning av [[kol (bränsle)|kol]] och [[lera|leror]] för tillverkning av [[eldfast]] [[keramik]] ([[Höganäskrus]]). Gruvindustrin är numera nedlagd.
I want to remove every instance of [[Fil: + dynamic word]] and every [[ and ]], but keep the word itself when it's only [[word]] without the "Fil:" in it.
I've begun doing a regex for it but I'm stuck.
\[\[\Fil:|\]\]
The output I'm after should look like this:
'''Höganäs''' är en tätort och centralort i Höganäs kommun i Skåne län. Höganäs blev stad 1936. Ursprungligen är Höganäs ett fiskeläge kring vilket en gruvindustri utvecklades för brytning av kol (bränsle)|kol och lera|leror för tillverkning av eldfast keramik (Höganäskrus). Gruvindustrin är numera nedlagd.
I have JQuery but think .replace should do the trick?

Try replacing all matches for this Regex with an empty string:
\[\[Fil:[^\]]*\]\]|\[\[|\]\]
To break this down:
\[\[Fil:[^\]]*\]\] matches [[Fil:...]]
\[\[ matches remaining [[
\]\] matches remaining ]]
| combines with OR
To get your exact output, you may need to strip some whitespace as well:
\[\[Fil:[^\]]*\]\]\s+|\[\[|\]\]
So, in JavaScript, you could write:
x.replace(/\[\[Fil:[^\]]*\]\]\s+|\[\[|\]\]/g, '');

Try this; you may also want to adjust the spaces:
var string = "[[Fil:Hoganas_hamn.jpg|miniatyr|Höganäs Hamn.]] [[Fil:Hoganas_hamn_kvickbadet.jpg|miniatyr|Höganäs Hamn - Kvickbadet.]] [[Fil:Höganäs Jefast ny redigerad-1.jpg|miniatyr|Jefasthuset sett från väster med en del av den nya bryggan vid Kvickbadet.]] '''Höganäs''' är en [[tätort]] och [[centralort]] i [[Höganäs kommun]] i [[Skåne län]]. Höganäs blev stad 1936. Ursprungligen är Höganäs ett [[fiskeläge]] kring vilket en [[gruvindustri]] utvecklades för brytning av [[kol (bränsle)|kol]] och [[lera|leror]] för tillverkning av [[eldfast]] [[keramik]] ([[Höganäskrus]]). Gruvindustrin är numera nedlagd.";
var result = string.replace(/\[\[Fil:.*?\]\]/g, '').replace(/\[\[(.*?)\]\]/g, '$1');
console.log(result);

You can use a regex like this
\[\[.*?\]\]
Then use the callback-function version of replace to check whether the match starts with Fil:, and decide whether to return an empty string to erase it or just the word itself.
Alternatively, use 2 regexes. Replace the Fil: ones with an empty string first, and then the rest with just the word. You can use
\[\[(\w+)\]\]
Or something similar to catch the [[word]] ones and then replace each with a reference to the captured word; in the replacement string of JavaScript's replace, $1 refers to what's in the parentheses.
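The callback approach could look roughly like this. The input is a shortened, illustrative version of the string from the question, and the capture group around .*? is an addition so that the callback receives the text between the brackets.

```javascript
// Shortened, illustrative input based on the question.
const wikiText =
  "[[Fil:Hoganas_hamn.jpg|miniatyr|Höganäs Hamn.]] '''Höganäs''' är en [[tätort]] i [[Höganäs kommun]].";

const cleaned = wikiText
  // For each [[...]]: drop it entirely if it starts with "Fil:", otherwise keep the inner text.
  .replace(/\[\[(.*?)\]\]/g, (whole, inner) => (inner.startsWith("Fil:") ? "" : inner))
  .trim(); // remove the space left behind by the deleted leading link

console.log(cleaned); // '''Höganäs''' är en tätort i Höganäs kommun.
```

A single pass handles both cases, which avoids running two separate replacements over the same text.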
