Regex - Exclude brackets and brackets with special key

I have this string:
[[Fil:Hoganas_hamn.jpg|miniatyr|Höganäs Hamn.]] [[Fil:Hoganas_hamn_kvickbadet.jpg|miniatyr|Höganäs Hamn - Kvickbadet.]] [[Fil:Höganäs Jefast ny redigerad-1.jpg|miniatyr|Jefasthuset sett från väster med en del av den nya bryggan vid Kvickbadet.]] '''Höganäs''' är en [[tätort]] och [[centralort]] i [[Höganäs kommun]] i [[Skåne län]]. Höganäs blev stad 1936. Ursprungligen är Höganäs ett [[fiskeläge]] kring vilket en [[gruvindustri]] utvecklades för brytning av [[kol (bränsle)|kol]] och [[lera|leror]] för tillverkning av [[eldfast]] [[keramik]] ([[Höganäskrus]]). Gruvindustrin är numera nedlagd.
I want to remove every instance of [[Fil: + dynamic text]] and every [[ and ]], but keep the word itself when it is just [[word]] without "Fil:" in it.
I've started writing a regex for it, but I'm stuck.
\[\[\Fil:|\]\]
The output I'm after should look like this:
'''Höganäs''' är en tätort och centralort i Höganäs kommun i Skåne län. Höganäs blev stad 1936. Ursprungligen är Höganäs ett fiskeläge kring vilket en gruvindustri utvecklades för brytning av kol (bränsle)|kol och lera|leror för tillverkning av eldfast keramik (Höganäskrus). Gruvindustrin är numera nedlagd.
I have jQuery available, but I think plain .replace should do the trick?

Try replacing all matches for this Regex with an empty string:
\[\[Fil:[^\]]*\]\]|\[\[|\]\]
To break this down:
\[\[Fil:[^\]]*\]\] matches [[Fil:...]]
\[\[ matches remaining [[
\]\] matches remaining ]]
| combines with OR
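To get your exact output, you may need to strip the whitespace after each file link as well. Using \s* rather than \s+ means a file link with no whitespace after it is still removed as a whole:
\[\[Fil:[^\]]*\]\]\s*|\[\[|\]\]
So, in JavaScript, you could write:
x.replace(/\[\[Fil:[^\]]*\]\]\s*|\[\[|\]\]/g, '');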
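As a quick check, here is a minimal sketch that wraps the single-pass replacement in a function and runs it on a shortened version of the string from the question. The helper name stripWikiMarkup is hypothetical, and the pattern uses \s* (zero or more whitespace characters) so that a file link with nothing after it is still removed as a whole.

```javascript
// Hypothetical helper: remove [[Fil:...]] links entirely (plus any
// whitespace right after them) and strip the remaining [[ and ]] markers.
function stripWikiMarkup(text) {
  return text.replace(/\[\[Fil:[^\]]*\]\]\s*|\[\[|\]\]/g, '');
}

// Shortened sample based on the string in the question.
const sample =
  "[[Fil:Hoganas_hamn.jpg|miniatyr|Höganäs Hamn.]] '''Höganäs''' är en [[tätort]] i [[Skåne län]].";

console.log(stripWikiMarkup(sample));
// → '''Höganäs''' är en tätort i Skåne län.
```

Note that piped links such as [[kol (bränsle)|kol]] keep both halves ("kol (bränsle)|kol"), which matches the output requested in the question.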

Try this; you may also want to adjust the spaces:
var string = "[[Fil:Hoganas_hamn.jpg|miniatyr|Höganäs Hamn.]] [[Fil:Hoganas_hamn_kvickbadet.jpg|miniatyr|Höganäs Hamn - Kvickbadet.]] [[Fil:Höganäs Jefast ny redigerad-1.jpg|miniatyr|Jefasthuset sett från väster med en del av den nya bryggan vid Kvickbadet.]] '''Höganäs''' är en [[tätort]] och [[centralort]] i [[Höganäs kommun]] i [[Skåne län]]. Höganäs blev stad 1936. Ursprungligen är Höganäs ett [[fiskeläge]] kring vilket en [[gruvindustri]] utvecklades för brytning av [[kol (bränsle)|kol]] och [[lera|leror]] för tillverkning av [[eldfast]] [[keramik]] ([[Höganäskrus]]). Gruvindustrin är numera nedlagd.";
var result = string.replace(/\[\[Fil:.*?\]\]/g, '').replace(/\[\[(.*?)\]\]/g, '$1');
console.log(result);

You can use a regex like this
\[\[.*?\]\]
Then use the callback-function version of replace to check whether the match starts with Fil: and decide whether to return an empty string (to erase it) or just the word itself.
Alternatively, use two regexes. Replace the Fil: ones with an empty string first, and then the rest with just the word. You can use
\[\[(\w+)\]\]
or something similar to catch the [[word]] ones and replace each with a backreference to the word. In JavaScript's replace the backreference is written $1 and refers to what's in the parentheses. Keep in mind that \w does not match spaces or letters such as ä, so for this text a class like [^\]]+ is closer to what you need.
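The callback variant described above could look roughly like this. It is only a sketch: the function name cleanLinks is made up for illustration, and it assumes that a link never contains ]] inside its text.

```javascript
// Hypothetical helper: one regex, with a callback that decides per match.
function cleanLinks(text) {
  return text.replace(/\[\[(.*?)\]\]/g, (wholeMatch, inner) =>
    // Erase file links entirely; keep the inner text of every other link.
    inner.startsWith('Fil:') ? '' : inner
  );
}

const cleaned = cleanLinks(
  "[[Fil:a.jpg|miniatyr|Hamn.]] en [[tätort]] av [[kol (bränsle)|kol]]"
);
console.log(cleaned);
// → " en tätort av kol (bränsle)|kol" (the leading space is left in place)
```

The callback receives the whole match first and then each capture group, so the check on the inner text needs no second regex.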

Related

Picking multiple random values from array [duplicate]

This question already has answers here:
How to randomize (shuffle) a JavaScript array?
Does anyone know how you can pick more than one random element from an array? The code below is a simplified version of my code, but I think it should be enough.
My code picks a random quote from the array, then fills a div with the random string. Here is my question: how do I get a new quote when userinput === quoteRandom? The code should be able to do this multiple times.
let quote_array = [
  'Gresset er grønnere på andre siden av gjerdet',
  'Å være sliten og nedfor og er ikke et tegn på svakhet, mest sannsynlig har du vært sterk for lenge',
  'Jeg skulle ikke spise den, jeg skulle bare smake på den',
  'Nøtter er ikke noe for en hel rev'
];
// Pick random quote
let quoteRandom = quote_array[Math.floor(Math.random() * quote_array.length)];
// Fill a div with quoteRandom (div refers to the target element, defined elsewhere)
function fillQuote() {
  div.innerText = quoteRandom;
}
// If userinput === quoteRandom
function newQuote() {
  fillQuote();
}

If you mutate the original array with sort and pop, you get a random element from the rest of the array each time. A quote will not come up again until all the elements have been popped, at which point the array is reinitialized.
let quote_array = [];
function getaquote() {
  // Refill the array once every quote has been used
  if (quote_array.length === 0) {
    quote_array = [
      'Gresset er grønnere på andre siden av gjerdet',
      'Å være sliten og nedfor og er ikke et tegn på svakhet, mest sannsynlig har du vært sterk for lenge',
      'Jeg skulle ikke spise den, jeg skulle bare smake på den',
      'Nøtter er ikke noe for en hel rev'
    ];
  }
  // Shuffle roughly at random, then take the last element
  const quoteRandom1 = quote_array.sort(() => (Math.random() > .5) ? 1 : -1).pop();
  console.log(quoteRandom1);
}
<button onclick="getaquote()">get a quote</button>
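Sorting with a random comparator does not give an evenly distributed shuffle, so here is an alternative sketch of the same "no repeats until exhausted" idea. It swaps in a different technique from the answer above: it removes one randomly indexed element per call instead of sorting. The names makeQuotePicker and nextQuote are made up for illustration.

```javascript
// Hypothetical factory: returns a function that hands out quotes in random
// order and only starts over once every quote has been used.
function makeQuotePicker(quotes) {
  let remaining = [];
  return function nextQuote() {
    if (remaining.length === 0) {
      remaining = quotes.slice(); // refill from a copy, leaving the input untouched
    }
    const index = Math.floor(Math.random() * remaining.length);
    // splice removes the chosen element so it cannot be picked again this round
    const [picked] = remaining.splice(index, 1);
    return picked;
  };
}

const nextQuote = makeQuotePicker([
  'Gresset er grønnere på andre siden av gjerdet',
  'Jeg skulle ikke spise den, jeg skulle bare smake på den',
  'Nøtter er ikke noe for en hel rev'
]);
console.log(nextQuote());
console.log(nextQuote());
```

In the question's setup, newQuote() could call nextQuote() and write the result into the div whenever userinput matches the current quote.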

Why does JavaScript's string.split() not work correctly in certain cases?

I need to split a string of text into its component words, so I'm using a regex to split it on the spaces (in a TypeScript file, by the way).
splitIntoWords(text: string): Array<string> {
  const separator = ' ';
  const words = text.split(new RegExp(separator, 'g'));
  return words;
}
This mostly works, but I've noticed that I regularly get words in the array that still contain spaces. If I copy the text into the Chrome console and split(' ') it I get the correct amount of words, but when I use the variable (even in the console) it invariably fails in some cases. I can't work out what the difference is. This is an example of my text:
"Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux."
The regex never manages to split the substring "économique au" into two components, for instance. Does anyone know why this is happening?

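It sounds like the whitespace is occasionally not just a plain space. It could be, for example, a non-breaking space (U+00A0), which is common in French text around punctuation such as : and ». You can split on all whitespace by using \s for the separator instead, which matches any whitespace character, including spaces, tabs, line breaks, and non-breaking spaces.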
const text = "Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux.";
const words = text.split(/\s+/); // + treats a run of whitespace as one separator, so no empty strings appear
console.log(words);
Another option would be to use match instead of split, and match non-whitespace characters.
const text = "Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux.";
const words = text.match(/\S+/g);
console.log(words);
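To see why a plain ' ' separator can miss some gaps, here is a small sketch. It assumes that the problematic character is a non-breaking space (U+00A0); the question does not confirm this, so the sample string below puts one in deliberately.

```javascript
// Assumption: the gap between "économique" and "au" is U+00A0, not U+0020.
const sample = 'la reprise économique\u00A0au cœur';

const bySpace = sample.split(' ');         // splits on U+0020 only
const byWhitespace = sample.split(/\s+/);  // splits on any whitespace

console.log(bySpace.length);      // 4: "économique\u00A0au" stays in one piece
console.log(byWhitespace.length); // 5

// List the code points of whitespace characters that are not a plain space,
// which helps to identify what the text really contains.
const oddWhitespace = [...sample]
  .filter(ch => /\s/.test(ch) && ch !== ' ')
  .map(ch => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(oddWhitespace); // [ 'U+00A0' ]
```

Running the last part on the real text shows which whitespace characters are actually present, which would also explain why text retyped or pasted by hand can behave differently from the original variable.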

Unable to match the entire Cyrillic alphabet using RegExp

I'm trying to return all Cyrillic words from this sentence:
"I like to eat healthy food with a little bit of pepper. Cанкт-Петербу́рг э́то оди́н из са́мых краси́вых городо́в Росси́и.Он был осно́ван импера́тором Петро́м I(пе́рвым).Импера́тор реши́л постро́ить го́род здесь, что́бы откры́ть для Росси́и «окно́ в Евро́пу. La Navidad dura dos semanas y las fiestas más importantes son Nochebuena, Navidad, Nochevieja y Reyes. En las casas se pone el tradicional belén, una maqueta con figuras que representa el nacimiento de Jesús, y un gran árbol donde se colocan los regalos";
I tried to use /\p{sc=Cyrillic}\w+/giu to return all Cyrillic words, but it returns null instead. Then I tried /(?<=[\u0400- \u4FF]+\w+)/giu, because this range is the Cyrillic alphabet. I've used 7 different RegExp websites, but none of them seem to support the \p class.
What's wrong?

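You don't need Unicode mode (the u modifier) to use a \uXXXX range, so you can remove it. Note, though, that a \u escape takes exactly four hex digits, so \u4FF has to be written \u04FF, and the space inside the range has to go: [\u0400-\u04FF].
A class with all the allowed characters works fine, see https://regex101.com/r/PE4fQT/1
Also note that there are two different letters that look alike: Latin C and Cyrillic с.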
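For reference, here is a sketch of the property-escape route in an engine that supports it (ES2018 and later). In JavaScript \w only covers ASCII word characters, so \p{sc=Cyrillic}\w+ requires a Cyrillic letter followed by Latin letters or digits. The class below instead repeats Cyrillic letters together with combining marks (\p{M}), because stress accents like the one in "Петербу́рг" are usually separate combining characters. The sample string is written with escapes so that no look-alike Latin letters sneak in; it is a made-up shortened sentence, not the text from the question.

```javascript
// Cyrillic letters plus combining marks (for example the stress accent U+0301).
const cyrillicWord = /[\p{Script=Cyrillic}\p{M}]+/gu;

// "I like pepper. Санкт-Петербу́рг is a city." with the Cyrillic part escaped.
const sample =
  'I like pepper. \u0421\u0430\u043D\u043A\u0442-' +
  '\u041F\u0435\u0442\u0435\u0440\u0431\u0443\u0301\u0440\u0433 is a city.';

const words = sample.match(cyrillicWord) || [];
console.log(words); // two matches: the parts before and after the hyphen

// The same idea without Unicode mode, using an explicit four-digit range.
const wordsByRange = sample.match(/[\u0400-\u04FF\u0301]+/g) || [];
console.log(wordsByRange.length); // 2
```

A word that starts with a Latin look-alike (such as a Latin C in place of Cyrillic С) would lose its first letter with either pattern, so it is worth checking the input for that.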

How to only get a certain portion of string?

I have this string that I receive from a get request:
Rekryteringstest för anställning
Det här är rekryteringstestet och samtidigt den sida som är data till programmet som ska skrivas
Uppgiften är ganska generellt skriven för att passa både för de som löser den i t ex Java och de som löser den som t ex en webbsida.
Skriv en lösning som:
1. Öppnar ett fönster (om inte resultatet visas i t ex webbläsare)
2. Laddar webbadresser till bilder med tillhörande kommentar (längst ner på den här sidan, nya bilder varje gång sidan laddas!)
3. Laddar och visar bilderna med tillhörande kommentar
4. Laddar om data (från den här sidan!) automatiskt var 30:e sekund, vid omladdning kan gamla bilder tas bort
5. Har en knapp för att manuellt trigga omladdning
6. Visar någon form av status när data laddas
7. Har en knapp för att avsluta applikationen
8. Har en 'Om'-dialog som visar kontaktinformation till dig
9. Lösningen ska vara enkel att testköra och om applicerbart EN körbar fil
A. Skicka in lösningen inklusive all kod till Bouvet
Hur applikationen ser ut är inte lika viktigt som hur applikationen
med tillhörande unit-test är skriven och fungerar.
Data:
https://images.unsplash.com/photo-1514125067037-8e669dd37638?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=1e2adb26fb5dc49fc14efd7f6aeca128&auto=format&fit=crop&w=1650&q=80 Mer publik
---------- END OF THE RESPONSE STRING ------------
Every time I make a request, the https link and the text after it change.
How can I easily get only these values out of this big string?
I have tried this:
let splittedArray = response.data.split( "Data:" );
And then I get this
<URL kommentar>
http://3.bp.blogspot.com/-_gbAWeYsKP4/T899GpY3CSI/AAAAAAAAACw/du8qLqu4xEo/s1600/empty.jpg Lådan
https://images.unsplash.com/photo-1514125067037-8e669dd37638?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=1e2adb26fb5dc49fc14efd7f6aeca128&auto=format&fit=crop&w=1650&q=80 Mer publik
for example.
From here I would like to separate the https links from the text that follows them, so I can easily use them. At the moment I cannot call split on the result, because it is an array (the links are in the last part).

As per the clarifications in the comments, let's start with some example data:
let splittedArray = [
"part to be discarded",
"<URL kommentar> http://3.bp.blogspot.com/-_gbAWeYsKP4/T899GpY3CSI/AAAAAAAAACw/du8qLqu4xEo/s1600/empty.jpg Lådan \
https://images.unsplash.com/photo-1514125067037-8e669dd37638?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=1e2adb26fb5dc49fc14efd7f6aeca128&auto=format&fit=crop&w=1650&q=80 Mer publik"
];
You can't call split on the variable splittedArray itself, because it is an array.
If you want to do further manipulation on the second part (the string that actually contains the links), you need to get that part by referring to it as splittedArray[1].
Then you can split it on whitespace and keep the pieces that start with 'http'.
splittedArray[1].split(/\s+/)
let splittedArray = [
"part to be discarded",
"<URL kommentar> http://3.bp.blogspot.com/-_gbAWeYsKP4/T899GpY3CSI/AAAAAAAAACw/du8qLqu4xEo/s1600/empty.jpg Lådan \
https://images.unsplash.com/photo-1514125067037-8e669dd37638?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=1e2adb26fb5dc49fc14efd7f6aeca128&auto=format&fit=crop&w=1650&q=80 Mer publik"
];
let splittedSecondPart = splittedArray[1].split(/\s+/);
let filteredByHttp = splittedSecondPart.filter(x => x.startsWith('http'));
console.log(filteredByHttp);
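If the comment after each link is needed as well, a line-based variant may be easier to work with. This is only a sketch under two assumptions that the question suggests but does not guarantee: every entry sits on its own line after "Data:", and the URL is everything up to the first whitespace on that line. The function name parseImageEntries and the example.com URLs are made up for illustration.

```javascript
// Hypothetical helper: turn the part after "Data:" into { url, comment } objects.
function parseImageEntries(responseText) {
  const dataPart = responseText.split('Data:')[1] || '';
  return dataPart
    .split(/\r?\n/)                           // one entry per line (assumption)
    .map(line => line.trim())
    .filter(line => line.startsWith('http'))  // skips the "<URL kommentar>" header
    .map(line => {
      const firstGap = line.search(/\s/);
      if (firstGap === -1) {
        return { url: line, comment: '' };
      }
      return {
        url: line.slice(0, firstGap),
        comment: line.slice(firstGap).trim()
      };
    });
}

const exampleResponse =
  'Some introduction\nData:\n<URL kommentar>\n' +
  'http://example.com/a.jpg Lådan\n' +
  'https://example.com/b.jpg Mer publik\n';

console.log(parseImageEntries(exampleResponse));
```

Each entry can then be used directly, for instance as entry.url for an image source and entry.comment for its caption.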

Regex: get numbers after a matching pattern with multi lang support

Expected Input/Output
Input: Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083,
Desired Output: 5770083
Only the digits. From this I will build: {"Movement Number": 5770083}
I believe I will need to run multiple regexes against each string, as I need to know the following:
Which title belongs to which value, i.e. movement no. = 5770083, etc.
Multiple different languages will be used for the same title, for example:
Movement number variations:
Movement no.
mouvement signés.Numérotée
no
MVT
jewels #
Werk-Nr.
Current regex: /movement no. ([^\s]+)/
With the above regex, the match also picks up the trailing comma.
It also needs to be case-insensitive.
Test String
Longines. A very fine and rare stainless steel water-resistant
chronograph wristwatch with black dial and original box\nSigned
Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083,
case no. 46, circa 1941\nCal. 13 ZN nickel-finished lever movement, 17
jewels, the black dial with Arabic numerals, outer railway five minute
divisions and tachymetre scale, two subsidiary dials indicating
constant seconds and 30 minutes register, in large circular
water-resistant-type case with flat bezel, downturned lugs, screw
back, two round chronograph buttons in the band, case and movement
signed by maker, dial signed by maker and retailer\n37 mm. diam.
Test String French
MONTRE BRACELET D'HOMME CHRONOGRAPHE EN OR, PAR LONGINES\n\nDe forme
ronde, le cadran noir à chiffres arabes, cadran auxiliaire pour les
secondes à neuf heures et totalisateur de minutes à trois heures,
mouvement mécanique 13 Z N, vers 1960, poids brut: 44.49 gr., monture
en or jaune 18K (750)\n\nCadran Longines, mouvement no. 3872616, fond
de boîte no. 5872616\nVeuillez noter que les bracelets de montre
pouvant être en cuirs exotiques provenant d'espèces protégées, tels le
crocodile, ils ne sont pas vendus avec les montre même s'ils sont
exposés avec celles-ci. Christie's devra retirer et conserver ces
bracelets avant leur collecte par les acheteur

You can use
\b((?:Movement|mouvement) no\.|mouvement signés\.Numérotée|no|MVT|jewels #|Werk-Nr\.) (\d+)
https://regex101.com/r/thL0wt/1
Start at a word boundary, then inside a capturing group, alternate between all the different possible phrases you want before a number - then, match a space, and capture numeric characters in another group. Your desired result will be in the first and second capturing groups.
const input = `Longines. A very fine and rare stainless steel water-resistant chronograph wristwatch with black dial and original box\nSigned Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083, case no. 46, circa 1941\nCal. 13 ZN nickel-finished lever movement, 17 jewels, the black dial with Arabic numerals, outer railway five minute divisions and tachymetre scale, two subsidiary dials indicating constant seconds and 30 minutes register, in large circular water-resistant-type case with flat bezel, downturned lugs, screw back, two round chronograph buttons in the band, case and movement signed by maker, dial signed by maker and retailer\n37 mm. diam.
MONTRE BRACELET D'HOMME CHRONOGRAPHE EN OR, PAR LONGINES\n\nDe forme ronde, le cadran noir à chiffres arabes, cadran auxiliaire pour les secondes à neuf heures et totalisateur de minutes à trois heures, mouvement mécanique 13 Z N, vers 1960, poids brut: 44.49 gr., monture en or jaune 18K (750)\n\nCadran Longines, mouvement no. 3872616, fond de boîte no. 5872616\nVeuillez noter que les bracelets de montre pouvant être en cuirs exotiques provenant d'espèces protégées, tels le crocodile, ils ne sont pas vendus avec les montre même s'ils sont exposés avec celles-ci. Christie's devra retirer et conserver ces bracelets avant leur collecte par les acheteur`;
const matches = {};
let match;
const pattern = /\b((?:Movement|mouvement) no\.|mouvement signés\.Numérotée|no|MVT|jewels #|Werk-Nr\.) (\d+)/gmi;
while (match = pattern.exec(input)) {
  matches[match[1]] = match[2];
  // or, if you only want a single object:
  const obj = {
    [match[1]]: match[2]
  };
}
console.log(matches);
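Since the question asks for a fixed key such as "Movement Number" rather than the matched label, one option is to keep the label alternatives in a non-capturing group and map every hit to the same key. The sketch below assumes a shortened label list and a hypothetical function name, extractMovementNumber; it returns the first hit only.

```javascript
// Shortened, assumed list of label variants; extend it as needed.
const movementPattern = /\b(?:(?:movement|mouvement) no\.|MVT|Werk-Nr\.)\s*(\d+)/i;

// Hypothetical helper: returns { "Movement Number": <number> } or null.
function extractMovementNumber(text) {
  const found = movementPattern.exec(text);
  return found ? { 'Movement Number': Number(found[1]) } : null;
}

console.log(extractMovementNumber(
  'Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083,'
));
// → { 'Movement Number': 5770083 }

console.log(extractMovementNumber(
  'Cadran Longines, mouvement no. 3872616, fond de boîte no. 5872616'
));
// → { 'Movement Number': 3872616 }
```

Because only digits are captured, the trailing comma never ends up in the result, and the i flag covers both "Movement" and "movement".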

For movement no. specifically, you'll want this regex to get rid of the comma:
movement no. ([^\s\W]+)
Here [^\s\W] excludes both whitespace and non-word characters, so it is effectively \w and stops before the comma.
As for the languages, a set of if statements that each test for the appropriate term is the only way I can think of, unless you build the pattern from a string (new RegExp(string) allows this). Sorry for not being more help in that area.
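On the point about substituting strings into a pattern: the RegExp constructor accepts a string, so the label variants can live in a plain array. The sketch below is one way to do that, with made-up helper names (escapeRegExp, buildLabelPattern) and an assumed label list; the labels are escaped first so that characters such as . and # are matched literally.

```javascript
// Escape characters that have a special meaning in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build /(?:label1|label2|...)\s*(\d+)/i from a list.
function buildLabelPattern(labels) {
  const alternation = labels.map(escapeRegExp).join('|');
  return new RegExp('(?:' + alternation + ')\\s*(\\d+)', 'i');
}

// Assumed label list, taken from the variations named in the question.
const movementLabels = ['movement no.', 'mouvement no.', 'MVT', 'jewels #', 'Werk-Nr.'];
const labelPattern = buildLabelPattern(movementLabels);

const hit = labelPattern.exec('ref. 22127, movement no. 5770083,');
console.log(hit ? hit[1] : null); // → "5770083"
```

Keeping one array per field (movement number, case number, and so on) would let the same helper produce a pattern for each title.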

You are using the negated character class [^\s]+, which matches everything except whitespace. So, if there's another character you don't want to match, such as the comma, add it to the class: [^\s,]+.
You can follow the same logic for any other character you don't want to match.

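var input = "Longines, retailed by Barth, Zurich, ref. 22127, movement no. 5770083";
// The lookbehind (?<=...) requires the label before the digits without including it in the match
var found = input.match(/(?<=movement no\. )\d+/i);
var output = found ? found[0] : null; // "5770083"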
