Replacing items skips values - python/pandas

I have a file that includes Database Name and Label columns - the labels correspond to sectors. My script goes as follows:
I read an Excel file that has sector names on it, then use it to build an allocation of calcrt_field to sector:
es_score_fields['esg_es_score_peers'] = es_score_fields.iloc[:, 10:100].apply(lambda x: '|'.join(x.dropna().astype(str)), axis=1)
Once I have each calcrt_field aligned to its relevant peers (sectors), I read another file that has two columns: Database Name and Label. The end goal is to map the score peer sectors to each of these Database Names, for example:
Database Name1: Chemicals (123456)
Label1: Chemicals
Database Name 2: Cement (654321)
Label2: Cement
Once I read the file, I use the following (one such line per symbol) to remove any symbol, space or comma:
score_peers_mapping.Label = score_peers_mapping.Label.str.replace('&', '')
This gives me a table with both Database Name (unchanged) and Label (all words combined into a single string).
I then order the mapping by string length as follows:
score_peers_mapping['length'] = score_peers_mapping.Label.str.len()
score_peers_mapping = score_peers_mapping.sort_values('length', ascending=False)
score_peers_mapping
peers_d = score_peers_mapping.to_dict('split')
peers_d = peers_d['data']
peers_d
Finally, I do the following:
for item in peers_d:
    # item[0] is the Database Name, item[1] is the cleaned Label
    es_score_fields[['esg_es_score_peers']] = es_score_fields[['esg_es_score_peers']].replace(item[1], item[0], regex=True)
I exported to CSV at this stage to see whether the mapping was being done correctly, but I can see that only some of the fields are mapped correctly. I think the problem is this replace step.
Things I have checked (they might be useless, but I thought they were a good start):
All Labels are already the same as the esg_es_score_peers values - no need to substitute labels like I did to remove "&" and so on
Multiple rows have the same string length, but the error does not necessarily apply to those ones (my initial thought was that the sort by string length might be going wrong whenever several rows share the same length)
Any help will be welcome
thanks so much

Related

Using Google sheets to tally data sets

I have tried many formulas but I am still not able to get what I want, and I need help writing an Apps Script for it. The problem is that I have to match two data sets and return the value of the adjacent cell. I want the sheet to pick a value from the first cell of the first row of one sheet, match it against all the cells of a row in another sheet (in the same workbook), and then paste the matched value in front of the cell that matches it. The problem is that my data sets are not equal, so I cannot use VLOOKUP; I want to match and see what percentage each candidate matches, and the highest percentage should be considered the match. Kindly visit this link for an example in Google Sheets: https://docs.google.com/spreadsheets/d/1u_-64UvpirL2JHpgA--GDa263wVb2idIhIYZlFnX2xQ/edit?usp=sharing
There are a variety of ways to do this sort of partial matching, depending on the real data and how sophisticated you need the matching logic to be.
Let's start with the simplest solution first. Did you know you can use wildcards in VLOOKUP? See Vlookup in Google Sheets using wildcards for partial matches.
So for your example data, add a column C to "Set 1" with the formula:
=VLOOKUP("*" & A2 & "*",'Set 2'!A1:A5,1,FALSE)
Obviously, this method fails if "Baseball bat" was supposed to be the result for "Ball" instead of "Ballroom". VLOOKUP will simply return the first result that matches. This method is also case-insensitive. Finally, it only works for appending data to set 1 from set 2, not the other way around. Without knowing more about the actual dataset, it's hard to give a solid solution.
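If you do want the Apps Script route with percentage-style matching, a rough sketch could look like the following; the sheet names, ranges, and the crude similarity measure are all illustrative placeholders, not a definitive implementation:

function matchSets() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var set1 = ss.getSheetByName('Set 1');
  var set2Values = ss.getSheetByName('Set 2').getRange('A1:A5').getValues()
      .map(function (r) { return String(r[0]); });
  var data = set1.getRange(2, 1, set1.getLastRow() - 1, 1).getValues();

  var out = data.map(function (r) {
    var a = String(r[0]).toLowerCase();
    var best = '';
    var bestScore = 0;
    set2Values.forEach(function (candidate) {
      var b = candidate.toLowerCase();
      // Crude similarity: containment scaled by relative length; swap in a
      // proper string distance (e.g. Levenshtein) for real percentages.
      var score = (a.indexOf(b) >= 0 || b.indexOf(a) >= 0)
          ? Math.min(a.length, b.length) / Math.max(a.length, b.length)
          : 0;
      if (score > bestScore) { best = candidate; bestScore = score; }
    });
    return [best];
  });

  set1.getRange(2, 3, out.length, 1).setValues(out); // write matches into column C
}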

How to search for closest tag set match in JavaScript?

I have a set of documents, each annotated with a set of tags, which may contain spaces. The user supplies a set of possibly misspelled tags, and I want to find the documents with the highest number of matching tags (optionally weighted).
There are several thousand documents and tags but at most 100 tags per document.
I am looking for a lightweight and performant solution where the search runs fully on the client side in JavaScript, though some preprocessing of the index with node.js is possible.
My idea is to create an inverse index from tags to documents using a multiset, plus a fuzzy index that finds the correct spelling of a misspelled tag. Both are created in a preprocessing step in node.js and serialized as JSON files. In the search step, for each item of the query set I want to consult the fuzzy index first to get the most likely correct tag and, if one exists, consult the inverse index and add the result set to a bag (multiset). After doing this for all input tags, the contents of the bag, sorted in descending order of count, should give the best matching documents.
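For illustration only, a bare-bones sketch of that plan might look like the following; documents is assumed to be an array of { id, tags } objects, and a plain Levenshtein scan stands in for a real fuzzy index (a BK-tree or n-gram index would scale better):

function levenshtein(a, b) {
  var dp = [];
  for (var i = 0; i <= a.length; i++) dp[i] = [i];
  for (var j = 1; j <= b.length; j++) dp[0][j] = j;
  for (i = 1; i <= a.length; i++) {
    for (j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
    }
  }
  return dp[a.length][b.length];
}

// Preprocessing (node.js): inverted index, tag -> [docId, ...]
var invertedIndex = {};
documents.forEach(function (doc) {
  doc.tags.forEach(function (tag) {
    (invertedIndex[tag] = invertedIndex[tag] || []).push(doc.id);
  });
});
var knownTags = Object.keys(invertedIndex);

// Search (client side): resolve each query tag to its closest known tag,
// then count how often each document id appears across the resolved tags.
function search(queryTags, maxDistance) {
  var counts = {};
  queryTags.forEach(function (q) {
    var best = null, bestDist = maxDistance + 1;
    knownTags.forEach(function (t) {
      var d = levenshtein(q, t);
      if (d < bestDist) { best = t; bestDist = d; }
    });
    if (best === null) return;
    invertedIndex[best].forEach(function (id) {
      counts[id] = (counts[id] || 0) + 1;
    });
  });
  return Object.keys(counts).sort(function (x, y) { return counts[y] - counts[x]; });
}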
My Questions
This seems like a common problem, is there already an implementation for it that I can reuse? I looked at lunr.js and fuse.js but they seem to have a different focus.
Is this a sensible approach to the problem? Do you see any obvious improvements?
Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?
You should be able to achieve what you want using Lunr; here is a simplified example (and a jsfiddle):
var documents = [{
  id: 1, tags: ["foo", "bar"]
}, {
  id: 2, tags: ["hurp", "durp"]
}]

var idx = lunr(function (builder) {
  builder.ref('id')
  builder.field('tags')

  documents.forEach(function (doc) {
    builder.add(doc)
  })
})

console.log(idx.search("fob~1"))
console.log(idx.search("hurd~2"))
This takes advantage of a couple of features in Lunr:
If a document field is an array, then Lunr assumes the elements are already tokenised. This would allow you to index tags that include spaces as-is, i.e. "foo bar" would be treated as a single tag (if that is what you wanted; it wasn't clear from the question).
Fuzzy search is supported, here using the query string format. The number after the tilde is the maximum edit distance, there is some more documentation that goes into the details.
The results will be sorted by which document best matches the query; in simple terms, documents that contain more matching tags will rank higher.
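To map Lunr's results (which are { ref, score } objects) back to the original documents, something along these lines should work; the lookup table is just a sketch:

var byId = {}
documents.forEach(function (doc) { byId[doc.id] = doc })

var matches = idx.search("fob~1").map(function (result) {
  return byId[result.ref] // ref is the stringified 'id' of the document
})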
Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?
As ever, it depends. Lunr maintains two data structures, an inverted index and a graph. The graph is used for doing the wildcard and fuzzy matching. It keeps separate data structures to facilitate storing extra information about a term in the inverted index that is unrelated to matching.
Depending on your use case, it would be possible to combine the two; an interesting approach would be finite state transducers, so long as the data you want to store is simple, e.g. an integer (think document id). There is an excellent article about this data structure, which is similar to what is used in Lunr: http://blog.burntsushi.net/transducers/

How to lowercase field name in pdi (pentaho)?

I'm actually new to PDI and I need to do some extraction from CSV files; however, sometimes the field names are in lowercase and sometimes in uppercase.
I know how to modify values in rows but don't know how to do it for field names.
Does a step exist to do this?
I tried ${fieldName}.lower() and lower(${fieldName}) in the Select values step and in a JavaScript step, but without success.
Thanks in advance
The quick fix is to right-click the list of columns provided by the CSV file input step and copy/paste it back and forth through Excel (or whatever) to fix the case.
If you also have 150 input files, the step which dynamically changes the column names (and other metadata, like types) is called Metadata Injection (Kettle doc; the official doc gives details and examples).
Your specific case is covered in the BizCubed sample. Download the sample near the end of that web page, unzip it, and load the .ktr in PDI. You'll need to adapt the Fields step in the MetaDataInjection transformation: it is currently a Data Grid that you may replace with a JavaScript lowercase (or, better, a String operations step), after keeping only the first line of your CSV (read it with header NOT present, include the row number, and filter on rownumber = 1).
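For the JavaScript lowercase mentioned above, the body of a Modified Java Script Value step could be as small as this; header_value is an illustrative name for whichever incoming field holds the column name (the ${fieldName}.lower() attempt fails because the JavaScript method is toLowerCase()):

// header_value is the incoming field read from the CSV's first line;
// declare 'lowered' in the step's output fields grid.
var lowered = header_value.toLowerCase();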
If you want to change a column name you can use the 'Select values' step.
There is a 'Rename to' option in the 'Select & Alter' tab as well as the 'Meta-data' tab that you can use to change a column name to whatever you want.

Extract Multiple Values from Dynamic Multi-line String

I'm working on a small Node.js app to parse a running log file in order to extract key values and generate custom alerts based on the results. However, I've now run into an issue for which I can't seem to find a solution. If it's relevant at all, the specific log being parsed is an MS SourceSafe 2005 journal file.
For clarity, here are three examples of possible journal entries (some details changed for privacy reasons, structure kept intact):
$/path/to/a/project/folder
Version: 84
User: User1 Date: 14/01/27 Time: 12:15p
testBanner.rb added
Comment: Style and content changes based on corporate branding
Remove detector column on sc600 page
Styling tweaks and bug fixes
$/path/to/a/project/file.java
Version: 22
User: User2 Date: 14/01/29 Time: 12:34p
Checked in
Comment: Added fw updates to help fix (xxx) as seen in (yyy):
Changes include:
1) Peak tuning (minimum peak distance, and percentage crosstalk peak)
2) Dynamic pulses adjusted in run time by the sensor for low temperature climate
s
3) Startup noise automatic resets
4) More faults
$/path/to/a/project/folder
Version: 29
User: User3 Date: 14/01/30 Time: 11:54a
Labeled v2.036
Comment: Added many changes at this point, see aaVersion.java for a more comple
te listing.
So far, the following points are known:
First entry line is always the relevant VSS database project or file path.
Second entry line is always the relevant version of the above project or file.
Third entry line always contains three values: User:, Date: and Time:.
Fourth entry line is always the associated action, which can be any one of the following:
Checked in: {file}
{file} added
{folder} created
{file or folder} deleted
{file or folder} destroyed
Labeled: {label}
Fifth entry line is an optional comment block, starting with Comment:. It may contain any type of string input, including new lines, file names, brackets, etc. Basically VSS does not restrict the comment contents at all.
I've found regex patterns to match everything except the "Comment:" section; not knowing how many newline characters may be included in the comment makes this really difficult for someone like me who doesn't speak regex very fluently.
So far, I've managed to get my app to watch the journal file for changes and catch only fresh data in a stream. My initial plan was to use .split('\n\n') on the stream output to catch each individual entry, but since comments may also contain any number of new lines at any position, this is not exactly a safe approach.
I found a module called regex-stream, which makes me think I don't need to collect the results in an array of strings before extracting details, but I don't really understand the given usage example. Alternatively, I have no problem with splitting and parsing individual strings, as long as I can find a reliable way to break the stream down into the individual entries.
In the end, I'm looking for an array of objects with the following entry structure for each update of the journal:
{
path: "",
version: "",
user: "",
date: "",
time: "",
action: "",
comment: ""
}
Please note: if 100 files are checked in in one action, VSS will still log an entry for each file. In order to prevent notification spamming, I still need to perform additional validation and grouping before generating any notifications.
The current state of my app can be seen in this GitHub repo. Could someone please help point me in the right direction here?
There is no 100% fool-proof way to parse this when the Comment section can contain anything. The next best choice is to use some heuristics and hope that there are no crazy comments.
If we can assume that two newlines followed by a path signify the start of an entry, then we can split on this regex (after you replace all variants of line separators with \n):
/\n\n(?=\$\/[^\n]*\n)/
The look-ahead (?=pattern) checks that there is a path ahead, \$\/[^\n]*\n, without consuming it.
To be extra sure, you can make it check that the version line follows the path:
/\n\n(?=\$\/[^\n]*\nVersion: \d+\n)/
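Putting that together with the layout described in the question, a parsing sketch might look like this; journalText stands for the accumulated journal text, and the per-field regexes simply follow the formats shown above (they would still need hardening against comments that mimic an entry header):

var entries = journalText
  .replace(/\r\n?/g, '\n') // normalise line separators first
  .split(/\n\n(?=\$\/[^\n]*\nVersion: \d+\n)/);

var parsed = entries.map(function (entry) {
  var lines = entry.split('\n');
  var meta = /^User: (\S+)\s+Date: (\S+)\s+Time: (\S+)/.exec(lines[2]) || [];
  var commentAt = entry.indexOf('\nComment: ');
  return {
    path: lines[0],
    version: (lines[1] || '').replace('Version: ', ''),
    user: meta[1] || '',
    date: meta[2] || '',
    time: meta[3] || '',
    action: lines[3] || '',
    comment: commentAt >= 0 ? entry.slice(commentAt + '\nComment: '.length) : ''
  };
});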

Creating a better JSON format

I am returning a query object from Coldfusion as a JSON string which I then parse into JSON in Javascript. It has a bit of a strange format when I finally log it though.
I am faced with two problems. First, I do not know how to access the lowest element (i.e. Arthur Weasley), as I cannot use a number in my selector (response.DATA[0].0 doesn't work because the lowest field name is a number). Second, is there any way to assign the values in the columns section to the fields that are numbered 1, 2 and 3?
What I'm really asking is how do I select my lowest level of data? If that can't be done because of the numbers for field names, how do I change the names to something more fitting?
My data logged:
The first entry of the first entry of DATA is response.DATA[0][0].
So:
name = response.DATA[0][0];
trainsThing = response.DATA[0][1];
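Since ColdFusion's query JSON carries the column names in a COLUMNS array next to DATA, you can zip the two together to get named fields; this is just a sketch, and the exact column names depend on your query:

var rows = response.DATA.map(function (row) {
  var obj = {};
  response.COLUMNS.forEach(function (col, i) {
    obj[col] = row[i]; // e.g. obj.NAME = "Arthur Weasley"
  });
  return obj;
});

// rows[0].NAME instead of response.DATA[0][0]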
