Build array from text file (weird format)

I have a text file that looks like this:
name
birthday
text
other
name2
birthday2
text2
other2
that goes over 10000 lines.
I want to turn that into a javascript array that looks like this:
[[name,birthday,text,other],[name2,birthday2,text2,other2], ...]
There are 4 lines in between each 2 groups (between "other" and "name2"). It would take me hours to do it manually.
The readfile functions I found for javascript while searching all deal with line by line formats and none have group formatting functions like that.

You can read the text file and parse it depending on the number of lines you expect between groups:
const fs = require('fs')
fs.readFile('test.txt', 'utf-8', (err, data) => {
  // stop early if the file could not be read
  if (err) throw err
  let rows = data.split('\n\n\n\n').map(row => row.split('\n'))
  console.log(rows)
})
The first split, on '\n\n\n\n', handles the four-line separation between groups.
This will print:
[ ['name', 'birthday', 'text', 'other'], ['name2', 'birthday2', 'text2', 'other2'] ]

@dandavis's answer works perfectly.
Load the entire file into one string then do that:
var myarray = str.split(/\n{3,}/).map(x=>x.trim().split(/\n/));
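If every record is known to have exactly four non-empty lines, another option is to ignore the separators entirely and chunk the remaining lines. The following is only a sketch under that assumption; `chunkLines` and `sample` are hypothetical names, not something from the answers above.

```javascript
// Hypothetical helper: group the non-empty lines of `text` into records of `groupSize` lines.
// Assumes no field inside a record is ever an empty line.
function chunkLines(text, groupSize = 4) {
  const lines = text
    .split(/\r?\n/)               // handle both Unix and Windows line endings
    .map(line => line.trim())
    .filter(line => line !== ''); // drop the blank separator lines
  const groups = [];
  for (let i = 0; i < lines.length; i += groupSize) {
    groups.push(lines.slice(i, i + groupSize));
  }
  return groups;
}

// Made-up sample with four blank lines between the two groups.
const sample = 'name\nbirthday\ntext\nother\n\n\n\n\nname2\nbirthday2\ntext2\nother2\n';
const rows = chunkLines(sample);
console.log(rows);
```

Because the blank lines are filtered out first, this does not care how many of them separate two groups, but it will misalign records if a field can legitimately be empty.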

Related

Scraping javascript website and script tags using python

I am trying to scrape a javascript web page. Having read some of the posts I managed to write the following:
from bs4 import BeautifulSoup
import requests
website_url = requests.get('https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm').text
soup= BeautifulSoup(website_url,'lxml')
print(soup.prettify())
and recover the following scripts as follows:
soup.find_all('script')[3]
which gives:
<script type="text/javascript">
// Initialize script parameters.
var exportTitle ="Centralised medicinal products for human use by ATC code";
// Initialise the dataset.
var dataSet = [
{"id":"A","parent":"#","text":"A - Alimentary tract and metabolism"},
{"id":"A02","parent":"A","text":"A02 - Drugs for acid related disorders"},
{"id":"A02B","parent":"A02","text":"A02B - Drugs for treatment of peptic ulcer"},
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
...
{"id":"h154","parent":"V09IA05","text":"NeoSpect (withdrawn)","type":"pl"},
{"id":"V09IA09","parent":"V09IA","text":"V09IA09 - technetium (<sup>99m</sup>Tc) tilmanocept"},
{"id":"h955","parent":"V09IA09","text":"Lymphoseek (active)","type":"pl"},
{"id":"V09IB","parent":"V09I","text":"V09IB - Indium (<sup>111</sup>In) compounds"},
{"id":"V09IB03","parent":"V09IB","text":"V09IB03 - indium (<sup>111</sup>In) antiovariumcarcinoma antibody"},{"id":"h025","parent":"V09IB03","text":"Indimacis 125 (withdrawn)","type":"pl"},
...
]; </script>
Now the problem that I am facing is to apply .text() to soup.find_all('script')[3] and recover a json file from that. When I try to apply .text(), the result is an empty string: ''.
So my question is: why is that? Ideally I would like to end up with:
A02BC01 Losec and associated names (referral)
...
V09IA05 NeoSpect (withdrawn)
V09IA09 Lymphoseek
V09IB03 Indimacis 125 (withdrawn)
...

First get the script's text, then do some string processing: take everything after 'dataSet = ' and cut it at the first ';' so that only the JSON array remains. Finally, parse the array and print the fields you need from each element.
import json
data = soup.find_all("script")[3].string
# keep only the array literal; assumes no ';' appears inside the data itself
dataJson = data.split('dataSet = ')[1].split(';')[0]
jsonArray = json.loads(dataJson)
for jsonElement in jsonArray:
    print(jsonElement['parent'], end=' ')
    print(jsonElement['text'])

Combining values from two arrays

I am working on an Express app and have an issue trying to match up the values of two arrays.
I have a user-entered string which comes through to me from a form (e.g. let analyseStory = req.body.storyText). This string contains line breaks as \r\n.
An example of the string is:
In the mens reserve race, Cambridges Goldie were beaten by Oxfords
Isis, their seventh consecutive defeat. \r\n\r\nIn the womens reserve
race, Cambridges Blondie defeated Oxfords Osiris
However before I print this to the browser the string is run through a text analysis library called pos e.g.
const tagger = new pos.Tagger();
res.locals.taggedWords = tagger.tag(analyseStory);
This returns to me an array of words in the string and their grammatical type
[ [ 'In', 'Noun, sing. or mass' ],
[ 'the', 'Determiner' ],
[ 'mens', 'Noun, plural' ],
[ 'reserve', 'Noun, sing. or mass' ],
[ 'race', 'Noun, sing. or mass' ],
[ ',', 'Comma' ],
[ 'Cambridges', 'Noun, plural' ],
[ 'Goldie', 'Proper noun, sing.' ],
[ 'were', 'verb, past tense' ],
[ 'beaten', 'verb, past part' ],
[ 'by', 'Preposition' ],
[ 'Oxfords', 'Noun, plural' ],
....
]
Currently when I print this user-entered text to the screen I loop through the array and print out the key and then wrap that in a class containing the value. This gives a result like:
<span class="noun-sing-or-mass">In</span>
<span class="determiner">the</span>
<span class="noun-plural">mens</span>
so that I can style them.
This all works fine but the problem is that I lose my line breaks in the process. I'm really not sure how to solve this problem but I was thinking that perhaps I could do this on the client side if I break the initial string I get (analyseStory) into an array (where commas, full stops are array items as they are in the above) and then apply the grammatical type supplied in res.locals.taggedWords to the array generated from analyseStory string. However I'm not sure how to do this or even if it is the right solution to the problem.
FWIW if I print analyseStory to the screen without pushing it through text analysis I handle line breaks by wrapping the string in <span style="white-space: pre-line">User entered string</span> rather than converting to <br />.
Any help much appreciated.

This solution uses ES6 Map, and String.replace() with a RegExp to find all words in the analysis, and replace them with a span that has the relevant class name.
You can see in the demo that it preserves the line breaks. Inspect the elements to see the spans with the classes.
const str = 'In the mens reserve race, Cambridges Goldie were beaten by Oxfords Isis, their seventh consecutive defeat. \r\n\r\nIn the womens reserve race, Cambridges Blondie defeated Oxfords Osiris';
const analyzed = [["In","Noun, sing. or mass"],["the","Determiner"],["mens","Noun, plural"],["reserve","Noun, sing. or mass"],["race","Noun, sing. or mass"],[",","Comma"],["Cambridges","Noun, plural"],["Goldie","Proper noun, sing."],["were","verb, past tense"],["beaten","verb, past part"],["by","Preposition"],["Oxfords","Noun, plural"]];
// create Map from the analyzed array. Use Array.map() to change all keys to lower case, and prepare the class name
const analyzedMap = new Map(analyzed.map(([k, v]) =>
[k.toLowerCase(), v.trim().toLowerCase().replace(/\W+/g, '-')]));
// search for a sequence of word characters, or special characters such as comma and period
const result = str.replace(/(?:\w+|,|\.)/g, (m) => {
// get the class name from the Map
const className = analyzedMap.get(m.toLowerCase());
// if there is a class name return the word/character wrapped with a span
if(className) return `<span class="${className}">${m}</span>`;
// return the word
return m;
});
demo.innerHTML = result;
#demo {
white-space: pre-line;
}
<div id="demo"></div>

<span> is not a block level element. By default it will not line break. You need to either make it block level with css or wrap your text in something that is block level like a <p> tag.
CSS To Make Block
span { display: block; }

You can pre-process text before analyzing text and replace line breaks with some special characters. Something like the following:
const story_with_br = analyseStory.replace(/\r?\n/g, "__br__");
const tagger = new pos.Tagger();
res.locals.taggedWords = tagger.tag(story_with_br);
Hopefully, taggedWords array will contain "__br__" and if it does then while rendering you can add line breaks instead of "__br__"

What you can do is :
Option 1
Edit the library you're using so that it doesn't ignore your \r\n
Option 2
Define a distinctive key which will stand in for the newlines:
const newlinesKey = 'yourkeyvalue';
Then replace all newlines with your newlinesKey (replace() returns a new string, so keep the result):
const keyedStory = analyseStory.replace(/\r\n/g, newlinesKey);
After that you can call the text analysis library on the new string:
const tagger = new pos.Tagger();
res.locals.taggedWords = tagger.tag(keyedStory);
This way you can detect where to put a new line, provided the tagger doesn't ignore the key.
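A different approach that does not rely on how the tagger treats a placeholder is to split the story on its line breaks, tag each line separately, and join the rendered lines with '\n' again. The sketch below is only an illustration: `renderStory`, `toClassName`, `escapeHtml` and `fakeTag` are hypothetical names, and `tagFn` stands in for whatever wraps the real tagger call, assumed to return [word, description] pairs as in the question.

```javascript
// Turn a tag description such as 'Noun, sing. or mass' into a CSS class name.
function toClassName(tag) {
  return tag.trim().toLowerCase().replace(/\W+/g, '-').replace(/^-|-$/g, '');
}

// Escape characters that are special in HTML so user input cannot inject markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Tag each line on its own, then put the line breaks back between the lines.
// tagFn is an assumption: a function that returns [[word, description], ...].
function renderStory(story, tagFn) {
  return story
    .split(/\r?\n/)
    .map(line => tagFn(line)
      .map(([word, tag]) => `<span class="${toClassName(tag)}">${escapeHtml(word)}</span>`)
      .join(' '))
    .join('\n');
}

// Stand-in tagger for the demo only: labels every word the same way.
const fakeTag = line => line.split(/\s+/).filter(Boolean).map(word => [word, 'Noun, plural']);

const html = renderStory('Oxfords Isis\r\n\r\nCambridges Blondie', fakeTag);
console.log(html);
```

The output keeps its '\n' characters, so it still needs a container styled with white-space: pre-line to show the breaks. Joining tokens with a space also puts a space before punctuation tokens, which a real implementation would probably want to handle.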

In NodeJS, how do you print lines to a file without manually adding the new line character?

I'm trying to figure out how to simply write a string representing a line to a file, where whatever function I call automatically appends a newline character.
I've tried using the default NodeJS file system library for this but I can't get this to work in any way without manually appending '\n' to the string.
Here's the code I tried:
const fs = require('fs');
const writer = fs.createWriteStream('test.out.txt', { flags: 'w' })
writer.write('line 1')
writer.write('line 2');
writer.write('line 3');
writer.end('end');
However, the output file test.out.txt contains the following line with no newline characters:
line 1line 2line 3end
I would like it to look like this:
line 1
line 2
line 3
end
Note that I'm not trying to log messages, and I'm not trying to redirect standard output.
Is there any way to print it this way with the new line characters automatically added?

As mentioned in the comments, you can write a function that writes a text as a line:
const writeLine = (writerObject, text) => {
writerObject.write(`${text}\n`)
}
writeLine(writer, 'line 1')
writeLine(writer, 'line 2')
writeLine(writer, 'line 3')
Or you can use a closure to create a wrapper that keeps the 'writer' instead of passing it every time:
const customWriter = writerObject => {
return text => writerObject.write(`${text}\n`)
}
const yourWriterWithBreakLine = customWriter(writer)
yourWriterWithBreakLine('line 1')
yourWriterWithBreakLine('line 2')
yourWriterWithBreakLine('line 3')

Manually appending the \n isn't so bad.
You could write a wrapper function to avoid having to put the + '\n' everywhere:
const os = require('os');
let writeln = function (writeStream, str) {
writeStream.write(str + os.EOL);
}
writeln(writer, 'line 1');
writeln(writer, 'line 2');
writeln(writer, 'line 3');
writeln(writer, 'end');
From what I can tell by a cursory look over the fs.WriteStream docs, there's no native writeln function, or anything similar.
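The wrapper ideas from this thread can be combined into a small object that owns the stream and the line ending. This is a sketch rather than a built-in API: `createLineWriter` is a hypothetical helper, and the in-memory `sink` below is only a stand-in for the result of fs.createWriteStream('test.out.txt') so the example runs without touching the disk.

```javascript
const os = require('os');

// Hypothetical helper: wraps anything that has write() (and optionally end()).
function createLineWriter(stream, eol = os.EOL) {
  return {
    // Write one line, appending the line ending automatically.
    writeLine(text = '') {
      return stream.write(`${text}${eol}`);
    },
    // Optionally write a last line, then close the underlying stream.
    end(lastLine) {
      if (lastLine !== undefined) stream.write(`${lastLine}${eol}`);
      if (typeof stream.end === 'function') stream.end();
    },
  };
}

// In-memory stand-in for a real write stream (assumption for the demo).
const chunks = [];
const sink = {
  write: chunk => { chunks.push(chunk); return true; },
  end: () => {},
};

const out = createLineWriter(sink, '\n');
out.writeLine('line 1');
out.writeLine('line 2');
out.writeLine('line 3');
out.end('end');
console.log(chunks.join(''));
```

With a real stream, passing the writer from the question as the first argument should behave the same way, since only write() and end() are used.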

Is there any generic function for subscripting?

I have a web page whose contents are loaded dynamically from JSON. After the page loads, I need to find texts like so2, co2 and h2o and apply subscripts to them. Is this possible? If so, please let me know the most efficient way of achieving it.
For example:
var json = { chemA: "value of CO2 is", chemB: "value of H2O is" , chemC: "value in CTUe is"};
In the JSON above I need to subscript the digits in CO2 and H2O and the e in CTUe. How can I achieve this?

Take a look at this JSfiddle which shows two approaches:
HTML-based using the <sub> tag
Pure Javascript-based by replacing the matched number with the subscript equivalent in unicode:
http://jsfiddle.net/7gzbjxz3/
var json = { chemA: "CO2", chemB: "H2O" };
// replace one digit at a time so that multi-digit numbers are converted correctly
var jsonTxt = JSON.stringify(json).replace(/\d/g, function (x){
return String.fromCharCode(8320 + parseInt(x, 10));
});
Option 2 has the advantage of being more portable since you're actually replacing the character. I.e., you can copy and paste the text into say notepad and still see the subscripts there.
The JSFiddle shows both approaches. The magic number 8320 is the decimal form of hexadecimal 0x2080, the code point of the subscript zero character (U+2080).
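A slightly more targeted variant of the Unicode approach only converts digits that directly follow a letter, so ordinary numbers in the text are left alone. This is a sketch under that assumption; `toUnicodeSubscript` and `SUBSCRIPT_ZERO` are hypothetical names.

```javascript
// U+2080 is SUBSCRIPT ZERO; the subscript digits 0-9 follow it consecutively.
const SUBSCRIPT_ZERO = 0x2080; // 8320 in decimal

// Replace digits that immediately follow a letter with their subscript forms.
function toUnicodeSubscript(text) {
  return text.replace(/([A-Za-z])(\d+)/g, (match, letter, digits) =>
    letter + digits.replace(/\d/g, d => String.fromCharCode(SUBSCRIPT_ZERO + Number(d))));
}

console.log(toUnicodeSubscript('value of CO2 is'));
console.log(toUnicodeSubscript('C6H12O6'));
```

This still cannot handle the e in CTUe, because Unicode has no general way to tell which letters should be subscripted; that case needs an explicit list or the <sub> tag.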

You are generating DOM elements from the JSON data you receive, so before adding the text to the DOM you can check whether it contains so2, co2 or h2o and, if it does, wrap the digits in a <sub> tag.
For ex:
var text = 'CO2';
text.replace(/(\d+)/g, "<sub>" + "$1" + "</sub>") ;
This returns "CO<sub>2</sub>", which the browser renders as CO with a subscript 2.
As per JSON provided by you:
// Only working for integer right now
var json = { chemA: "value of CO2 is", chemB: "value of H2O is" , chemC: "value in CTUe is"};
$.each(json, function(index, value) {
json[index] = value.replace(/(\d+)/g, "<sub>" + "$1" + "</sub>");
});
console.log(json);
Hope this helps!

To do this, I would create a prototype function extending String and name it .toSub(). Then, when you create your html from your json, call .toSub() on any value that might contain text that should be in subscript:
// here is the main function
String.prototype.toSub = function() {
var str=this;
var subs = [
['CO2','CO<sub>2</sub>'],
['H2O','H<sub>2</sub>O'],
['CTUe','CTU<sub>e</sub>'] // add more here as needed.
];
for(var i=0;i<subs.length;i++){
var chk = subs[i][0];
var rep = subs[i][1];
var pattern = new RegExp('^'+chk+'([ .?!])|( )'+chk+'([ .?!])|( )'+chk+'[ .?!]?$','ig'); // makes a regex like this: /^CO2([ .?!])|( )CO2([ .?!])|( )CO2[ .?!]?$/gi using the current sub
// the "empty" capture groups above may seem pointless but they are not
// they allow you to capture the spaces easily so you dont have to deal with them some other way
rep = '$2$4'+rep+'$1$3'; // the $1 etc here are accessing the capture groups from the regex above
str = str.replace(pattern,rep);
}
return str;
};
// below is just for the demo
var json = { chemA: "value of CO2 is", chemB: "value of H2O is" , chemC: "value in CTUe is", chemD: "CO2 is awesome", chemE: "I like H2O!", chemF: "what is H2O?", chemG: "I have H2O. Do you?"};
$.each(json, function(k, v) {
$('#result').append('Key '+k+' = '+v.toSub()+'<br>');
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="result"></div>
Note:
Anytime you do something like this with regex, you run the chance of unintentionally matching and converting some unwanted bit of text. However, this approach will have far fewer edge cases than searching and replacing text in your whole document as it is much more targeted.
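The same lookup-table idea can be written without extending String.prototype, using word boundaries instead of explicit space and punctuation groups. This is a sketch, not the answer's method: `SUBSTITUTIONS`, `escapeRegExp` and `toSub` are hypothetical names, and matching here is case-sensitive by assumption.

```javascript
// Known terms and their HTML replacements; extend as needed.
const SUBSTITUTIONS = new Map([
  ['CO2', 'CO<sub>2</sub>'],
  ['H2O', 'H<sub>2</sub>O'],
  ['CTUe', 'CTU<sub>e</sub>'],
]);

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// One pattern that matches any known term as a whole word.
const termPattern = new RegExp(
  '\\b(?:' + [...SUBSTITUTIONS.keys()].map(escapeRegExp).join('|') + ')\\b',
  'g'
);

// Replace every known term with its subscripted HTML form.
function toSub(text) {
  return text.replace(termPattern, term => SUBSTITUTIONS.get(term));
}

console.log(toSub('value of CO2 is'));
console.log(toSub('I like H2O!'));
console.log(toSub('value in CTUe is'));
```

Because \b requires a word boundary on both sides, a term embedded in a longer token such as H2O2 is left untouched, which keeps accidental matches down at the cost of needing an entry for every formula.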

write file with "\n" with node.js

I'm working with a lot of data which I have turned into arrays. For simplicity, let's assume I have an array that looks like this:
["dataone:dataone","datatwo:datatwo","datathree:datathree"]
I'm writing the output to a file using fs.writeFile,
but the output always ends up on a single line, e.g. dataone:dataone","datatwo:datatwo","datathree:datathree
I would like the output to be separated by "\n", e.g.
dataone:dataone
datatwo:datatwo
datathree:datathree
Is it possible to make the output in the file look like this? I'm writing it to a .txt file.

Join the data with line breaks before writing to the file:
var fs = require('fs');
var os = require('os');
// use Windows line endings on Windows and Unix line endings elsewhere
var brk = os.platform().substring(0,3).toLowerCase() === 'win'
  ? '\r\n' : '\n';
var data = ["dataone:dataone","datatwo:datatwo","datathree:datathree"]
fs.writeFile(filename, data.join(brk), {encoding : 'utf8'}, function (e) {
  // etc
});

You can join your array with \n before writing it to the file:
var arr = ["dataone:dataone","datatwo:datatwo","datathree:datathree"]
var arr2 = arr.join('\n');
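Putting the join and the write together, a minimal end-to-end sketch could look like the following. It uses os.EOL for the platform's line ending and the synchronous fs calls so the result can be read back immediately; the temporary directory and the output.txt name are assumptions made for the demo.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const data = ['dataone:dataone', 'datatwo:datatwo', 'datathree:datathree'];

// Hypothetical output location: a fresh temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'lines-'));
const filename = path.join(dir, 'output.txt');

// One array element per line, with a trailing line ending after the last one.
fs.writeFileSync(filename, data.join(os.EOL) + os.EOL, 'utf8');

// Read the file back to check what was written.
const written = fs.readFileSync(filename, 'utf8');
console.log(written);

// Clean up the demo file and directory.
fs.unlinkSync(filename);
fs.rmdirSync(dir);
```

For very large arrays, a write stream that emits one line at a time may be a better fit than building the whole string in memory.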
