Split string with various delimiters while keeping delimiters - javascript

I have the following string:
"dogs#cats^horses^fish!birds"
How can I get the following array back?
['dogs','#cats','^horses','^fish','!birds']
Essentially, I am trying to split the string while keeping the delimiters. I've tried string.match to no avail.

Assuming those are your only separators then you can do this:
var string = "dogs#cats^horses^fish!birds";
string.replace(/(#|\^|!)/g, '|$1').split('|');
We basically add our own separator, in this case |, and then split on that.
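If | might itself appear in the input, a lookahead split avoids the placeholder altogether. This is only a minimal sketch, assuming the same three delimiters; the names input and parts are made up for illustration:

```javascript
// Split *before* each delimiter without consuming it.
// (?=...) is a zero-width lookahead, so the delimiter stays at the start of the next chunk.
const input = "dogs#cats^horses^fish!birds";
const parts = input.split(/(?=[#^!])/);

console.log(parts); // ["dogs", "#cats", "^horses", "^fish", "!birds"]
```

Inside the character class, ^ is literal because it is not the first character.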

This does what you want:
str.match(/((^|[^\w])\w+)/g)
Without more test cases though, it's hard to say how reliable it would be.
This also assumes a large set of possible delimiters. If it's a small, fixed set, Samer's solution would be a good way to go.
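To see where the match approach holds up and where it does not, here is a small check. The helper name splitKeep and the second input are made up for illustration:

```javascript
// Each match is an optional single non-word character followed by a run of word characters.
function splitKeep(str) {
  return str.match(/((^|[^\w])\w+)/g) || [];
}

console.log(splitKeep("dogs#cats^horses^fish!birds"));
// ["dogs", "#cats", "^horses", "^fish", "!birds"]

// Consecutive delimiters are not preserved: only the one directly before a word is kept.
console.log(splitKeep("dogs##cats"));
// ["dogs", "#cats"]
```

So the pattern seems fine for single delimiters between words, but it silently drops characters when delimiters repeat.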

Related

JavaScript split string by specific character string

I have a text box with a bunch of comments, all separated by a specific character string as a means of splitting them to display each comment individually.
The string in question is | but I can change this to accommodate whatever will work. My only requirement is that it is not likely to be a string of characters someone will type in an everyday sentence.
I believe I need to use the split method and possibly some regex but all the other questions I've seen only seem to mention splitting by one character or a number of different characters, not a specific set of characters in a row.
Can anyone point me in the right direction?
.split() should work for that purpose:
var comments = "this is a comment|and here is another comment|and yet another one";
var parsedComments = comments.split('|');
This will give you all comments in an array which you can then loop over or do whatever you have to do.
Keep in mind you could also change | to something like <--NEWCOMMENT--> and it will still work fine inside the split('<--NEWCOMMENT-->') method.
Remember that split() removes the character it's splitting on, so your resulting array won't contain any instances of <--NEWCOMMENT-->
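As a rough sketch of the multi-character case, assuming the <--NEWCOMMENT--> separator suggested above (the sample text and the names SEPARATOR, raw and comments are hypothetical), you might also trim each piece and drop empty ones:

```javascript
const SEPARATOR = "<--NEWCOMMENT-->";
const raw = "first comment<--NEWCOMMENT-->second comment<--NEWCOMMENT--><--NEWCOMMENT-->third comment";

// split() accepts a whole string as the separator, not just a single character.
const comments = raw
  .split(SEPARATOR)
  .map(function (c) { return c.trim(); })            // tidy surrounding whitespace
  .filter(function (c) { return c.length > 0; });    // a doubled separator yields "", so drop it

console.log(comments); // ["first comment", "second comment", "third comment"]
```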

How to split a huge string by date using JavaScript?

I've got a PDF file turned into a huge string of over 1,000,000 characters. There are dates in the string in the format dd/mm/yyyy. I want to split the string by dates into smaller ones. I tried the following:
var sectioned = hugeString.split(/^(0?[1-9]|[12][0-9]|3[01])[\/](0?[1-9]|1[012])[\/\-]\d{4}$/g);
But it's not working. I also tried hugeString.match(), but no good result there.
Is it even possible to accomplish this by string functions or should I think of a different approach?
String snippet:
....Section: 2 Interpretation E.R. 2 of 2012 02/08/2012 .....
You can remove the anchors and the g modifier (it is redundant with split), and use non-capturing groups so that the dates themselves are not output in the results. Wrap the pattern in (?=PATTERN HERE) if you need to split while keeping the dates in the chunks. If you take that approach, though, make sure the pattern has no optional 0s at the beginning, or you might get redundant elements in the result.
var s = "....Section: 2 Interpretation E.R. 2 of 2012 02/08/2012 ..... ";
var res = s.split(/(?:0?[1-9]|[12][0-9]|3[01])[\/-](?:0?[1-9]|1[012])[\/-]\d{4}/);
console.log(res);
res = s.split(/(?=(?:0[1-9]|[12][0-9]|3[01])[\/-](?:0[1-9]|1[012])[\/-]\d{4})/);
console.log(res);
Note you also had a [\/] subpattern without - in the pattern while the other separator character class contained both chars. I suggest using [\/-] in both cases.
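To illustrate the warning about optional zeros, here is a small comparison on a made-up string (text, strictDate, looseDate, chunks and extra are names chosen for this sketch only):

```javascript
const text = "intro 02/08/2012 first part 15/09/2013 second part";

// Two-digit day and month are mandatory, so the lookahead matches only at the start of a date.
const strictDate = /(?=(?:0[1-9]|[12][0-9]|3[01])[\/-](?:0[1-9]|1[012])[\/-]\d{4})/;
const chunks = text.split(strictDate);
console.log(chunks);
// ["intro ", "02/08/2012 first part ", "15/09/2013 second part"]

// With optional zeros the lookahead also matches one character into each date
// ("2/08/2012", "5/09/2013"), which produces stray one-character elements.
const looseDate = /(?=(?:0?[1-9]|[12][0-9]|3[01])[\/-](?:0?[1-9]|1[012])[\/-]\d{4})/;
const extra = text.split(looseDate);
console.log(extra);
// ["intro ", "0", "2/08/2012 first part ", "1", "5/09/2013 second part"]
```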

Efficiently remove common patterns from a string

I am trying to write a function to calculate how likely two strings are to mean the same thing. In order to do this I am converting to lower case and removing special characters from the strings before I compare them. Currently I am removing the strings '.com' and 'the' using String.replace(substring, '') and special characters using String.replace(regex, '')
str = str.toLowerCase()
.replace('.com', '')
.replace('the', '')
.replace(/[&\/\\#,+()$~%.'":*?<>{}]/g, '');
Is there a better regex that I can use to remove the common patterns like '.com' and 'the' as well as the special characters? Or some other way to make this more efficient?
As my dataset grows I may find other common meaningless patterns that need to be removed before trying to match strings and would like to avoid the performance hit of chaining more replace functions.
Examples:
Fish & Chips? => fish chips
stackoverflow.com => stackoverflow
The Lord of the Rings => lord of rings
You can combine the replace calls into a single one with a regexp like this:
str = str.toLowerCase().replace(/\.com|the|[&\/\\#,+()$~%.'":*?<>{}]/g, '');
The different patterns to remove are alternatives separated by pipes |
This makes it easy enough to add more strings to the regexp.
If you are storing the words to remove in an array, you can generate the regex using the RegExp constructor, e.g.:
var words = ["\\.com", "the"];
var rex = new RegExp(words.join("|") + "|[&\\/\\\\#,+()$~%.'\":*?<>{}]", "g");
Then reuse rex for each string:
str = str.toLowerCase().replace(rex, "");
Note the additional escaping required because instead of a regular expression literal, we're using a string, so the backslashes (in the words array and in the final bit) need to be escaped, as does the " (because I used " for the string quotes).
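If you would rather keep plain strings in the array, one option is to escape them programmatically. The following is a sketch under a few assumptions: escapeRegExp and normalize are hypothetical helper names, and the whitespace collapsing at the end is an addition so that the output matches the examples in the question.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const words = [".com", "the"];                 // plain strings, no manual escaping needed
const specials = /[&\/\\#,+()$~%.'":*?<>{}]/;  // character class taken from the question
const rex = new RegExp(words.map(escapeRegExp).join("|") + "|" + specials.source, "g");

function normalize(str) {
  return str
    .toLowerCase()
    .replace(rex, "")       // remove the listed words and special characters
    .replace(/\s+/g, " ")   // collapse the gaps left behind (assumption, see above)
    .trim();
}

console.log(normalize("Fish & Chips?"));         // "fish chips"
console.log(normalize("stackoverflow.com"));     // "stackoverflow"
console.log(normalize("The Lord of the Rings")); // "lord of rings"
```

Note that "the" is matched anywhere, including inside longer words such as "other"; adding word boundaries would be a separate design decision.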
The problem with this question is that I'm sure you have a very concrete idea in mind of what you want to do, but the solution you have arrived at (removing uninformative letters before making an is-identical comparison) may not be the best for the comparison you want to do.
I think perhaps a better idea would be to use a different comparison method and a different data structure than a string. A very simple example would be to condense your strings to sets with set('string') and then compare set similarity/difference. Another method might be to create a directed acyclic graph or a substring trie. The main point is that it's probably OK to reduce the information from the original string and store/compare that. However, don't underestimate the value of storing the original string, as it will help you down the road if you want to change the way you compare.
Finally, if your strings are really really really long, you might want to use a perceptual hash - which is like an MD5 hash except similar strings have similar hashes. However, you will most likely have to roll your own for short strings, and define what you think is important data, and what is superfluous.
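As one possible reading of the set idea, here is a sketch that compares sets of words using Jaccard similarity (shared words divided by total distinct words). The choice of Jaccard, the word-level tokenization and the names tokenSet and jaccard are assumptions for illustration, not something the answer prescribes:

```javascript
// Turn a string into a set of lower-case alphanumeric tokens.
function tokenSet(str) {
  return new Set(str.toLowerCase().match(/[a-z0-9]+/g) || []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0 and 1.
function jaccard(a, b) {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty inputs count as identical
  let shared = 0;
  for (const token of setA) {
    if (setB.has(token)) shared++;
  }
  return shared / (setA.size + setB.size - shared);
}

console.log(jaccard("Fish & Chips?", "fish chips"));  // 1
console.log(jaccard("fish chips", "fish and chips")); // 2 of 3 distinct words shared
console.log(jaccard("dogs", "cats"));                 // 0
```

The original strings are left untouched, so the comparison can be changed later without losing information.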

How to extract substring between specific characters in javascript

How can I extract "51.50431" and "-0.1133" from LatLng(51.50431, -0.1133) using jQuery?
I tried substring(), but it doesn't help because the numbers in LatLng(51.50431, -0.1133) keep changing in length. For example, sometimes it comes as LatLng(51.50, -0.1).
Any help?
Regular expressions to the rescue:
'LatLng(51.50, -0.1)'.match(/LatLng\(([^,]+),\s*([^)]+)\)/)
// ["LatLng(51.50, -0.1)", "51.50", "-0.1"]
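Wrapped into a small function, that might look like the sketch below. The name parseLatLng and the decision to return null on a failed match are assumptions made here; no jQuery is needed for this part.

```javascript
// Extract both numbers from a string such as "LatLng(51.50431, -0.1133)".
function parseLatLng(text) {
  const m = String(text).match(/LatLng\(([^,]+),\s*([^)]+)\)/);
  if (!m) return null; // input did not look like LatLng(a, b)
  // m[1] and m[2] are the captured strings; parseFloat converts them to numbers.
  return { lat: parseFloat(m[1]), lng: parseFloat(m[2]) };
}

console.log(parseLatLng("LatLng(51.50431, -0.1133)")); // { lat: 51.50431, lng: -0.1133 }
console.log(parseLatLng("LatLng(51.50, -0.1)"));       // { lat: 51.5, lng: -0.1 }
console.log(parseLatLng("not a coordinate"));          // null
```

If you need the strings exactly as written (for example "51.50" rather than 51.5), use m[1] and m[2] directly instead of parseFloat.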

How do I convert domain.com/foo/bar/baz/ into a string like 'foo bar baz'?

Basically, I want to be able to grab the ending of a URL and convert it into a string to be used somewhere.
Currently I'm doing this (which is less than optimal):
// grab the path, replace all the forward slashes with spaces
local_path = location.pathname.toString().replace(/\//g,' ');
// strip empty spaces from beginning / end of string
local_path = local_path.replace(/^\s+|\s+$/g, "");
But I think there is probably a better way. Help?
Edit: Could I confidently get rid of the .toString method there?
You could do something like this if you want to avoid regular expressions:
location.pathname.substring(1).split('/').join(' ')
That will get rid of the initial slash, but won't take care of a trailing slash. If you need to deal with those, you can omit substring and use trim for modern implementations or a regex:
location.pathname.split('/').join(' ').replace(/^\s+|\s+$/g, '')
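Another way to handle both the leading and the trailing slash without a regex is to drop the empty segments before joining. A minimal sketch; pathToWords is a hypothetical name, and in a browser you would pass location.pathname to it:

```javascript
// Split on "/", discard empty segments (from leading, trailing or doubled slashes), join with spaces.
function pathToWords(pathname) {
  return pathname.split("/").filter(Boolean).join(" ");
}

console.log(pathToWords("/foo/bar/baz/")); // "foo bar baz"
console.log(pathToWords("/foo/bar/baz"));  // "foo bar baz"
console.log(pathToWords("/"));             // ""
```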
What's wrong with what you have? Looks fine to me. That is the easiest way to handle what you want to do.
You could use the regex provided by Douglas Crockford on http://www.coderholic.com/javascript-the-good-parts/ and then split the path at the forward-slash.
