Regular expression subgroups inside groups and the way to reference them - javascript

I'm trying to parse a document structure like this:
Headline
c=myClass1 myClass2 myClass3
Some text plus a number3gr
More text plus another number2cm
More text plus another number2.2m
I have a regular expression that is capturing the important parts into groups:
/(.*)[\r\n]c=(.*)[\r\n]*([a-zA-Z\s]*)(\d*\.?\d*)(\w*)[\r\n]/g
Later I'm using the groups to build a html-string:
'<xmp><!--begin recipe--\><h2>$1</h2><div class="$2"><div class="serves">Serves: <input type="text" class="servesinput" value="2" size="3"></div><span class="oldMulti">2</span></br><table class="ingredients"><tr><th>Amount:</th><th>Ingredient:</th></tr><tr><td class="amount $5 ">$4</td><td>$3</td></tr></div></xmp>'
This is where I am stuck: after the empty line, there can be any number of lines like these:
Some text plus a number3gr
Is there a way to re-use this part of my reg exp as many times as necessary (as many times as there are those type of rows):
([a-zA-Z\s]*)(\d*\.?\d*)(\w*)[\r\n]
Maybe I can make use of subgroups? But then I have no idea how to repeat the results inside the html-string.

For information on capturing a repeated group: http://www.regular-expressions.info/captureall.html
For a more efficient way, I'd try parsing the file line by line manually, since regular expressions can be quite inefficient.
Once you have the text (see, for example, "How can you read a file line by line in JavaScript?"), I would split it into lines (an array) as in that example and iterate through them in a for loop.
var headline = lines[0];
// cut off the leading "c=" and split the class names
var classes = lines[1].substring(2).split(" ");
var i;
// lines[2] is the blank line, so the ingredient rows start at index 3
for (i = 3; i < lines.length; i++) {
    // lines[i] is a row after the blank line
    // do something
}
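To make the line-by-line idea concrete, here is a minimal sketch that applies a row pattern to each line and builds one table row per match. The names parseRecipe, buildRows and rowPattern are hypothetical, and the pattern is a slightly tightened variant of the one in the question (it requires at least one digit), so treat it as a starting point rather than a drop-in replacement.

```javascript
// Hypothetical helper: parses the headline, the class list and the ingredient rows.
function parseRecipe(text) {
    var lines = text.split(/\r?\n/);
    // name, then amount (integer or decimal), then unit
    var rowPattern = /^([a-zA-Z\s]*?)\s*(\d+(?:\.\d+)?)(\w*)$/;
    var rows = [];
    // Start after the headline and the "c=" line; blank lines simply do not match.
    for (var i = 2; i < lines.length; i++) {
        var m = rowPattern.exec(lines[i]);
        if (m) {
            rows.push({ name: m[1], amount: m[2], unit: m[3] });
        }
    }
    return {
        headline: lines[0],
        classes: lines[1].replace(/^c=/, ""),
        rows: rows
    };
}

// Hypothetical helper: one <tr> per parsed row, mirroring the markup in the question.
function buildRows(rows) {
    var html = "";
    for (var i = 0; i < rows.length; i++) {
        html += '<tr><td class="amount ' + rows[i].unit + '">' + rows[i].amount +
            '</td><td>' + rows[i].name + '</td></tr>';
    }
    return html;
}

var recipe = parseRecipe(
    "Headline\nc=myClass1 myClass2 myClass3\n\n" +
    "Some text plus a number3gr\n" +
    "More text plus another number2cm\n" +
    "More text plus another number2.2m\n"
);
console.log(recipe.headline, recipe.classes);
console.log(buildRows(recipe.rows));
```

Because each row is matched on its own, the number of rows no longer has to be known when the pattern is written.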

Related

How to define a line break in extendscript for Adobe Indesign

I am using extendscript to build some invoices from downloaded plaintext emails (.txt)
At points in the file there are lines of text that look like "Order Number: 123456" and then the line ends. I have a script made from parts I found on this site that finds the end of "Order Number:" in order to get a starting position of a substring. I want to use where the return key was hit to go to the next line as the second index number to finish the substring. To do this, I have another piece of script from the helpful people of this site that makes an array out of the indexes of every instance of a character. I will then use whichever array object is a higher number than the first number for the substring.
It's a bit convoluted, but I'm not great with Javascript yet, and if there is an easier way, I don't know it.
What is the character I need to use to emulate a return key in a txt file in javascript for extendscript for indesign?
Thank you.
I have tried things like \n and \r\n and ^p both with and without quotes around them but none of those seem to show up in the array when I try them.
//Load Email as String
var b = new File("~/Desktop/Test/email.txt");
b.open('r');
var str = "";
while (!b.eof)
str += b.readln();
b.close();
var orderNumberLocation = str.search("Order Number: ") + 14;
var orderNumber = str.substring(orderNumberLocation, ARRAY NUMBER GOES HERE)
var loc = orderNumberLocation.lineNumber
function indexes(source, find) {
var result = [];
for (i = 0; i < source.length; ++i) {
// If you want to search case insensitive use
// if (source.substring(i, i + find.length).toLowerCase() == find) {
if (source.substring(i, i + find.length) == find) {
result.push(i);
}
}
alert(result)
}
indexes(str, NEW PARAGRAPH CHARACTER GOES HERE)
I want all my line breaks to show up as an array of indexes in the variable "result".
Edit: My method of importing stripped all line breaks from the document. Using the code below instead works better. Now \n works.
var file = File("~/Desktop/Test/email.txt", "utf-8");
file.open("r");
var str = file.read();
file.close();
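Once the line breaks survive the import, the substring approach described above can be sketched as follows. The sample string is a hypothetical stand-in for the contents of email.txt so that the snippet runs outside InDesign; indexes is rewritten to return its result, and the label variable is an assumption for illustration.

```javascript
// Hypothetical sample standing in for the text read from email.txt.
var str = "Hello\nOrder Number: 123456\nAddress: 1 Main St\n";

// Returns the index of every occurrence of `find` in `source`.
function indexes(source, find) {
    var result = [];
    var i = source.indexOf(find);
    while (i !== -1) {
        result.push(i);
        i = source.indexOf(find, i + find.length);
    }
    return result;
}

var lineBreaks = indexes(str, "\n");

// For a single field, the first line break after the label is enough.
var label = "Order Number: ";
var start = str.indexOf(label) + label.length;
var end = str.indexOf("\n", start);
var orderNumber = str.substring(start, end === -1 ? str.length : end);

console.log(lineBreaks);
console.log(orderNumber);
```

Passing a start position to indexOf avoids searching the array of indexes for the first entry that is larger than the start of the value.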
You need to use regular expressions. Depending on the fields you need to search for, you'll need to tweak the regular expression, but I can give you a starting point. If the fields in the email are separated by new lines, something like this will work:
var str; // your string
var fields = {};
var lookFor = /(Order Number:|Adress:).*?\n/g;
str.replace(lookFor, function (match) {
    var parts = match.split(':');
    var field = parts[0].replace(/\s/g, ''); // remove all spaces
    // rejoin in case the value itself contains a colon, then trim whitespace
    var value = parts.slice(1).join(':').replace(/^\s+|\s+$/g, '');
    fields[field] = value;
});
With (Order Number:|Adress:) you look for the fields; you can add more fields, separated by the "or" character |, inside the parentheses. The .*?\n part matches any characters up to the first line break. The g flag indicates that you want to find all matches. str.replace is used because it lets you run a function on each match. If the field and the value are separated by a colon ':', you split the match into the field name and the value, for example ['Order Number', ' 12345\n'], strip the surrounding whitespace from the value, and store the pair in an object. That code will produce:
fields = {
    OrderNumber: "12345",
    Adress: "my fake adress 000"
}
Please try \n and \r
Example: indexes(str, "\r");
If I've understood correctly, what you need is str.split():
function indexes(source, find) {
    var order;
    var result = [];
    var orders = source.split('\n'); // returns an array of strings: ["order: 12345", "order:54321", ...]
    for (var i = 0, l = orders.length; i < l; i++) {
        order = orders[i];
        // /find/ would match the literal text "find", so compare against the variable instead
        if (order.indexOf(find) !== -1) {
            result.push(i); // index of the line that contains `find`
        }
    }
    return result;
}

Javascript get all text in between string

I have string content that gets delivered to me via TCP. This info is only relevant because it means that I do not consistently retrieve the same string. I have a <start> and <stop> separator to ensure that any time I get the data via TCP, I am outputting the full content.
My incoming content looks like so:
<start>Apple Bandana Cadillac<stop>
I want to get everything in between <start> and <stop>. So just Apple Bandana Cadillac.
My script to do this looks like so:
servercsv.on("connection", function(socket){
let d_basic = "";
socket.on('data', function(data){
d_basic += data.toString();
let d_csvindex = d_basic.indexOf('<stop>');
while (d_csvindex > -1){
try {
let strang = d_basic.substring(0, d_csvindex);
let dyson = strang.replace(/<start>/g, '');
let dson = papaparse.parse(dyson);
myfunction(dson);
}
catch(e){ console.log(e); }
d_basic = d_basic.substring(d_csvindex+1);
d_csvindex = d_basic.indexOf('<stop>');
}
});
});
What this means is that I am getting everything before the <stop> string and outputting it. I have also included the line let dyson = strang.replace(/<start>/g, ''); because I want to remove the <start> text.
However, because this is TCP, I am not guaranteed to get all parts of this string. As a result, I frequently get back stop>Apple Bandana Cadillac<stop> or some variation of this (such as start>Apple Bandana Cadillac<stop>). It is not consistent enough that I can just do strang.replace("start>", "").
Ideally, I would like my separator to select content that is in between <start> and <stop>. Not just <stop>. However, I am unsure how to do so.
Alternatively, I can also settle for a regex that retrieves all combination of <start><stop> strings during my while loop, and just delete them. So check for <, s, t, a, r, t individually and so forth. But unsure how to implement regex to delete portions of a whole string.
Assuming you get full response:
var test = "<start>Apple Bandana Cadillac<stop>";
var testRE = test.match("<start>(.*)<stop>");
testRE[1] //"Apple Bandana Cadillac"
If there are new lines between <start> and <stop>
var test = "<start>Apple Bandana Cadillac<stop>";
var testRE = test.match("<start>([\\S\\s]*)<stop>");
testRE[1] //"Apple Bandana Cadillac"
This uses a regular expression capturing group.
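When the response arrives in pieces, one option is to keep a buffer and only cut a message out once its <stop> has arrived. The sketch below is one way to do that under stated assumptions: createExtractor, feed and onMessage are hypothetical names, and the chunks are plain strings rather than socket data. Note that the loop in the question advances with d_csvindex+1, which appears to skip only the "<" and leave "stop>" at the front of the buffer; advancing by the full delimiter length avoids that.

```javascript
var START = "<start>";
var STOP = "<stop>";

// Hypothetical helper: returns a function that accepts chunks and
// calls onMessage once for every complete <start>...<stop> pair.
function createExtractor(onMessage) {
    var buffer = "";
    return function feed(chunk) {
        buffer += chunk;
        var stopIdx = buffer.indexOf(STOP);
        while (stopIdx > -1) {
            // Nearest <start> before this <stop>
            var startIdx = buffer.lastIndexOf(START, stopIdx);
            if (startIdx > -1) {
                onMessage(buffer.substring(startIdx + START.length, stopIdx));
            }
            // Drop everything up to and including the whole "<stop>" delimiter
            buffer = buffer.substring(stopIdx + STOP.length);
            stopIdx = buffer.indexOf(STOP);
        }
    };
}

var received = [];
var feed = createExtractor(function (msg) { received.push(msg); });

// Simulated chunks, split at arbitrary points as TCP might deliver them
feed("<start>Apple Ban");
feed("dana Cadillac<stop><sta");
feed("rt>Grapes Trampoline Ham<stop>");
console.log(received);
```

In the socket handler from the question, the body of the data callback would reduce to a call such as feed(data.toString()), with the parsing moved into the onMessage callback.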
Try this regex with replace() method:
/<st.*?>(.*?)(?!<st)/g
Literal.................................................: <st
Any char zero or more times lazily...: .*?
Literal..................................................: >
Begin capture group..........................: (
Any char zero or more times lazily...: .*?
End capture group.............................: )
Begin negative lookahead.................: (?!
Literal...................................................: <st
End negative lookahead....................: )
In the demo below, notice that the test example consists of multiple lines and variants of <start> and <stop> (basically anything beginning with <st).
Demo 1
var rgx = /<st.*?>(.*?)(?!<st)/g;
var str = `<start>Apple Bandana Cadillac<stop>
<stop>Grapes Trampoline Ham<stop>
<start>Kebab Matador Pencil<start>`;
var res = str.replace(rgx, `$1`);
console.log(res);
Update
"say I have op>Grapes Trampoline Ham<stop>...still trying to remove all parts of the string <stop>"
/^(.*?>)(.*?)(<.*?)$/gm;
A simple explanation will have to do since a step-by-step such as Demo 1 would take too much time.
This RegEx is multiline. /m
^..........Begin line.
(.*?>)..Lazily capture everything until literal >........[Return as $1]
(.*?)...Then lazily capture everything until................[Return as $2]
(<.*?)..Literal < and lazily capture everything until..[Return as $3]
$...........End line.
The trick is to replace each matched line with the second capture, $2, which discards $1 and $3.
Demo 2
var rgx = /^(.*?>)(.*?)(<.*?)$/gm;
var str = `<start>Apple Bandana Cadillac<stop>
<stop>Grapes Trampoline Ham<stop>
<start>Kebab Matador Pencil<start>
op>Score False Razor<stop>
`;
var res = str.replace(rgx, `$2`);
console.log(res);

Create a text area and analyze button

I am working on my college homework. I am having a lot of difficulty with it and getting stuck. My classmates are not helping me and the instructor hasn't responded. I am hoping I might get some help/understanding here. The current assignment, which is due today, is:
Create a page containing a textarea and an “analyze” button. The results area will display the frequency of words of x characters. For example, the text “one two three” contains 2 3-character words and 1 5-character word. An improvement to the original design would be to strip out any extraneous characters that may skew the count.
I am just starting it now, so I will add the code here as I update. I know I won't have a problem with the HTML part, the JavaScript will be my problem. From what I get, I will need to have a function that counts the words and the characters in each word. But it needs to exclude spaces and characters like: ,.';/. I have not run across this code before, so any input on how I should frame the javascript will be helpful. Also it seems he wants me to list how many words have the same characters? am I reading this right?
My code thus far:
<!DOCTYPE html>
<html>
<body>
<textarea id="txtarea">
</textarea>
<input type="button" id="analyze" value="Analyze" onclick="myFunction()" />
<p id="demo"></p>
<p id="wcnt"></p>
<script>
function myFunction() {
var str = document.getElementById("txtarea").value;
var res = str.split(/[\s\/\.;,\-0-9]/);
var n = str.length;
document.getElementById("demo").innerHTML = "There are " + n + " characters in the text area.";
for (var i = 0; i < res.length; i++) {
s = document.getElementById("txtarea").value;
s = s.replace(/(^\s*)|(\s*$)/gi, "");
s = s.replace(/[ ]{2,}/gi, " ");
s = s.replace(/\n /, "\n");
document.getElementById("wcnt").innerHTML = "There are " + s.split(' ').length + " words in the text area.";
}
}
</script>
</body>
</html>
Now I need to figure out how to make it count the characters of each word then output how many words have x amount of characters. Such as 5 words have 4 characters and so on. Any suggestions?
var textarea = document.getElementById("textarea"),
result = {}; // object Literal to hold "word":NumberOfOccurrences
function analyzeFrequency() {
// Match/extract words (accounting for apostrophes)
var words = textarea.value.match(/[\w']+/g); // Array of words
// Loop words Array
for(var i=0; i<words.length; i++) {
var word = words[i];
// Increment if exists OR assign value of 1
result[word] = ++result[word] || 1;
}
console.log( result );
}
analyzeFrequency(); // TODO: Do this on some button click
<textarea id="textarea">
I am working on my college-homework.
Homework I am having a lot of difficulty with it and getting stuck.
My class mates are not helping me and the instructor hasn't responded.
I am hoping I might get some help/understanding here.
</textarea>
Notice how Homework and homework (lowercase) are registered as two different words, I'll leave it to you to fix that - if necessary and implement the analyzeFrequency() trigger on some button click.
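A minimal sketch of the case-insensitive variant, written as a pure function so it can be called from any click handler. The name wordFrequency and the sample sentence are assumptions for illustration; the word pattern is the same one used above.

```javascript
// Hypothetical helper: counts how often each word occurs, ignoring case.
function wordFrequency(text) {
    // match() returns null when there are no words, so fall back to an empty array
    var words = text.toLowerCase().match(/[\w']+/g) || [];
    // No prototype, so words such as "constructor" cannot collide with inherited keys
    var result = Object.create(null);
    for (var i = 0; i < words.length; i++) {
        var word = words[i];
        result[word] = (result[word] || 0) + 1;
    }
    return result;
}

var freq = wordFrequency("Homework is hard. My homework isn't done.");
console.log(freq);
```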
Most likely you will have to use JavaScript's split function with regex to define all the characters you do not want to include. Then loop through the resulting array and count the characters in each word.
var words = document.getElementById("words");
var analyze = document.getElementById("analyze");
analyze.addEventListener("click", function(e) {
var str = words.value;
var res = str.split(/[\s\/\.;,\-0-9]/);
for(var i = 0; i < res.length; i++) {
alert(res[i].length);
}
});
<textarea id="words">This is a test of this word counter thing.</textarea>
<br/>
<button id="analyze">
Analyze
</button>
Your instructor does NOT want you to list how many words have the same characters, but rather how many words have the same number of characters. The basic algorithm:
Assign the value of the text area to a variable.
Convert that string value into an array. In javascript this could be accomplished with the String split method using a regular expression containing a character class.
Iterate over that array examining each element for its length. For each element, increment a counting object's property whose property name is the length of the element.
Iterate over the counting object's property list. Output to the result area each property name and its value.
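The four steps above can be sketched as two small functions. The names countWordLengths and formatCounts are hypothetical, and the character class is one possible choice for stripping extraneous characters (it keeps only letters and apostrophes).

```javascript
// Steps 1-3: split the text into words and count the words of each length.
function countWordLengths(text) {
    var words = text.split(/[^A-Za-z']+/);
    var counts = {};
    for (var i = 0; i < words.length; i++) {
        var len = words[i].length;
        if (len === 0) {
            continue; // split() leaves empty strings at the edges
        }
        counts[len] = (counts[len] || 0) + 1;
    }
    return counts;
}

// Step 4: turn the counting object into lines of output.
function formatCounts(counts) {
    var lines = [];
    for (var len in counts) {
        if (Object.prototype.hasOwnProperty.call(counts, len)) {
            lines.push(counts[len] + " word(s) with " + len + " character(s)");
        }
    }
    return lines;
}

var counts = countWordLengths("one two three");
console.log(formatCounts(counts).join("\n"));
```

In the page itself, the text would come from the textarea's value and the joined lines would be written to the result element.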

Using String.substring for Words

I have the following example where I am putting a limit on the characters entered in the Textarea:
var tlength = $(this).val().length;
$(this).val($(this).val().substring(0, maxchars));
var tlength = $(this).val().length;
remain = maxchars - parseInt(tlength);
$('#remain').text(remain);
where maxchars is the number of characters. How can I change this example to work with words, so instead of restricting chars, I restrict to a number of words.
http://jsfiddle.net/PzESw/106/
I think you need to change one line of your code to something like this:
$(this).val($(this).val().split(' ').slice(0, maxchars).join(' '));
This code splits the text into an array of words (you may need a different approach for other separators), removes the extra words, and joins the rest back together. Here maxchars is used as the maximum number of words.
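As a sketch of the same idea without jQuery, the function below limits a string to a number of words and reports how many words remain. limitWords and its return shape are hypothetical; it leaves the text untouched while the limit has not been exceeded, so spaces typed between words are kept.

```javascript
// Hypothetical helper: trims `text` to at most `maxWords` words.
function limitWords(text, maxWords) {
    var words = text.match(/\S+/g) || [];
    if (words.length <= maxWords) {
        return { text: text, remaining: maxWords - words.length };
    }
    return { text: words.slice(0, maxWords).join(" "), remaining: 0 };
}

var limited = limitWords("one two three four", 3);
var partial = limitWords("one two ", 3);
console.log(limited.text, limited.remaining);
console.log(partial.text, partial.remaining);
```

In the handler from the question, the returned text would go back into the textarea and the remaining count into #remain.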
A simple way would be to convert the full string into an array of words.
For example, say you have a string such as:
var words = "Hi, My name is Afzaal Ahmad Zeeshan.";
var arrayOfWords = words.split(" "); // split at a white space
Now you have an array of words. Loop over it using
for (var i = 0; i < arrayOfWords.length; i++) {
    /* write them all, or limit them in the for loop! */
}
This way, you can limit the number of words in the text instead of the number of characters.

Extract strings in a .txt file with javascript

I have a .txt file with this structure:
chair 102
file 38
green 304
... ...
It has 140.000 elements.
Before introducing the numbers I used javascript and jQuery:
$(function () {
    $.get('/words.txt', function (data) {
        words = data.split('\n');
    });
});
But now that the file also contains numbers, how can I treat the strings and the numbers separately?
Since this helped, I'll post as an answer:
Your format is <word><space><num>\n
You split on new line, so now you have an array of <word><space><num> which you should be able to split on space.
Then you can get the word part as myarray[0] and the number part as myarray[1].
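A minimal sketch of that two-step split, assuming exactly one space between the word and the number on each line. parseWordCounts and the sample data are hypothetical; in the page, data would be the text returned by $.get.

```javascript
// Hypothetical helper: turns "word number" lines into { word, count } objects.
function parseWordCounts(data) {
    var lines = data.split(/\r?\n/);
    var entries = [];
    for (var i = 0; i < lines.length; i++) {
        if (lines[i] === "") {
            continue; // e.g. the empty string after a trailing line break
        }
        var parts = lines[i].split(" ");
        entries.push({ word: parts[0], count: Number(parts[1]) });
    }
    return entries;
}

var entries = parseWordCounts("chair 102\nfile 38\ngreen 304\n");
console.log(entries);
```

Keeping each word next to its number avoids having to work out from an array position which of the two a value is.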
You could split at each new line and then split each element at the space, but this will give you an array of arrays.
Alternatively, you could replace the line breaks with spaces and then split at the spaces,
i.e.:
words = data.replace(/\n/g,' ').split(' ');
An efficient way of handling this problem is to replace all the line breaks with spaces, then split the resulting string by the spaces. Then use what you know about the position of the elements to determine whether you're dealing with a number or a string:
var resultArr = data.replace(/\n/g, " ").split(" ");
for (var i = 0; i < resultArr.length; i++) {
    if (i % 2 === 0) {
        // even indexes (0, 2, 4, ...) hold the words
        console.info("word = " + resultArr[i]);
    } else {
        // odd indexes (1, 3, 5, ...) hold the numbers
        console.info("number = " + resultArr[i]);
    }
}
Depending on whether or not there's a line break at the end of the set, you may need to handle that case by looking for an empty string.
