Extract Multiple Values from Dynamic Multi-line String


I'm working on a small Node.js app to parse a running log file in order to extract key values and generate custom alerts based on the results. However, I've now run into an issue for which I can't seem to find a solution. If it's relevant at all, the specific log being parsed is an MS SourceSafe 2005 journal file.
For clarity, here are three examples of possible journal entries (some details changed for privacy reasons, structure kept intact):
$/path/to/a/project/folder
Version: 84
User: User1 Date: 14/01/27 Time: 12:15p
testBanner.rb added
Comment: Style and content changes based on corporate branding
Remove detector column on sc600 page
Styling tweaks and bug fixes

$/path/to/a/project/file.java
Version: 22
User: User2 Date: 14/01/29 Time: 12:34p
Checked in
Comment: Added fw updates to help fix (xxx) as seen in (yyy):
Changes include:
1) Peak tuning (minimum peak distance, and percentage crosstalk peak)
2) Dynamic pulses adjusted in run time by the sensor for low temperature climate
s
3) Startup noise automatic resets
4) More faults

$/path/to/a/project/folder
Version: 29
User: User3 Date: 14/01/30 Time: 11:54a
Labeled v2.036
Comment: Added many changes at this point, see aaVersion.java for a more comple
te listing.
So far, the following points are known:
First entry line is always the relevant VSS database project or file path.
Second entry line is always the relevant version of the above project or file.
Third entry line always contains three values: User:, Date: and Time:.
Fourth entry line is always the associated action, which can be any one of the following:
Checked in: {file}
{file} added
{folder} created
{file or folder} deleted
{file or folder} destroyed
Labeled: {label}
Fifth entry line is an optional comment block, starting with Comment:. It may contain any type of string input, including new lines, file names, brackets, etc. Basically VSS does not restrict the comment contents at all.
I've found regex patterns to match everything except the "Comment:" section. Not knowing how many newline characters the comment may contain makes this really difficult for someone like me who doesn't speak regex very fluently.
So far, I've managed to get my app to watch the journal file for changes and catch only fresh data in a stream. My initial plan was to use .split('\n\n') on the stream output to catch each individual entry, but since comments may also contain any number of new lines at any position, this is not exactly a safe approach.
I found a module called regex-stream, which makes me think I don't need to collect the results in an array of strings before extracting details, but I don't really understand the given usage example. Alternatively, I have no problem with splitting and parsing individual strings, as long as I can find a reliable way to break the stream down into the individual entries.
In the end, I'm looking for an array of objects with the following entry structure for each update of the journal:
{
path: "",
version: "",
user: "",
date: "",
time: "",
action: "",
comment: ""
}
Please note: If 100 files are checked in in one action, VSS will still log an entry for each file. In order to prevent notification spamming, I still need to perform additional validation and grouping before generating any notifications.
The current state of my app can be seen in this Github repo. Could someone please help point me in the right direction here...?

There is no 100% foolproof way to parse this when the Comment section can contain anything. The next best option is to rely on a heuristic and hope that no comment is crazy enough to break it.
If we can assume that two newlines followed by a path signify the start of an entry, then we can split on this regex (after you normalize all line-separator variants to \n):
/\n\n(?=\$\/[^\n]*\n)/
The look-ahead (?=pattern) checks that a path, \$\/[^\n]*\n, lies ahead without consuming it.
To be extra sure, you can make it check that the version line follows the path:
/\n\n(?=\$\/[^\n]*\nVersion: \d+\n)/
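As a rough sketch of how the split could feed the object structure asked for in the question: the function names parseJournal and parseEntry are made up for illustration, and the code assumes entries are separated by a blank line and that the first four lines of each entry follow the layout described in the question. A comment that itself contains a blank line followed by a path and a version line would still fool it.

```javascript
// Hypothetical helpers (names are illustrative, not from any library).
function parseEntry(entry) {
  const [path, versionLine, userLine, action, ...rest] = entry.trim().split('\n');
  // Third line carries three values: User:, Date: and Time:.
  const m = /^User: (.*?)\s+Date: (.*?)\s+Time: (.*)$/.exec(userLine) || [];
  return {
    path: path,
    version: versionLine.replace(/^Version: /, ''),
    user: m[1] || '',
    date: m[2] || '',
    time: m[3] || '',
    action: action || '',
    // Everything after the action line is the optional comment block.
    comment: rest.join('\n').replace(/^Comment: ?/, '')
  };
}

function parseJournal(text) {
  // Normalize \r\n and lone \r to \n before splitting.
  const normalized = text.replace(/\r\n?/g, '\n');
  // Split on a blank line only when a path line and a Version line follow.
  return normalized
    .split(/\n\n(?=\$\/[^\n]*\nVersion: \d+\n)/)
    .filter(function (entry) { return entry.trim() !== ''; })
    .map(parseEntry);
}
```

Because the split only happens in front of a path/version pair, blank lines inside a comment stay attached to the entry they belong to.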

Related

Replacing items skips values

I have a file that includes Database Name and Label - these labels correspond to sectors. My script goes as follows:
I read an Excel file that contains sector names, then use it to get an allocation of calcrt_field and sector:
es_score_fields['esg_es_score_peers']= es_score_fields.iloc[:,10:100].apply(lambda x: '|'.join(x.dropna().astype(str)), axis=1)
Once I have each calcrt_field aligned to the relevant peers (sectors), I read another file that has 2 columns: Database Name and Label. The end goal is to map the score peer sectors to each of these Database Names. Examples:
Database Name1: Chemicals (123456)
Label1: Chemicals
Database Name 2: Cement (654321)
Label2: Cement
Once I read the file, I use the following (repeated over multiple rows) to remove any symbol, space, or comma:
score_peers_mapping.Label= BECS_mapping.Label.str.replace('&', '')
This gives me a list with both Database Name (unchanged) and Label (all words combined into a single string).
I then sort these by string length (longest first) as follows:
score_peers_mapping['length'] = score_peers_mapping.Label.str.len()
score_peers_mapping= score_peers_mapping.sort_values('length', ascending=False)
score_peers_mapping
peers_d = score_peers_mapping.to_dict('split')
peers_d = becs_d['data']
peers_d
Finally, I do the following:
for item in peers_d:
    esg_es_score_peers[['esg_es_score_peers']] = esg_es_score_peers[['esg_es_score_peers']].replace(item[1], item[0], regex=True)
I exported to CSV at this stage to see if the mapping was being done correctly, but I can see that only some of the fields are being mapped correctly. I think the problem is this replace step.
Things I have checked (that might be useless but thought were a good start):
All Labels are already the same as the esg_es_score_peers - no need to substitute labels like I did to remove "&" and so on
Multiple rows have the same string length but the error does not necessarily apply to those ones (my initial thought was that maybe when sorting them by string length something was going wrong whenever there were multiple outputs with the same string length)
Any help will be welcome
thanks so much

How to search for closest tag set match in JavaScript?

I have a set of documents, each annotated with a set of tags, which may contain spaces. The user supplies a set of possibly misspelled tags and I want to find the documents with the highest number of matching tags (optionally weighted).
There are several thousand documents and tags but at most 100 tags per document.
I am looking for a lightweight and performant solution where the search runs fully on the client side using JavaScript, but some preprocessing of the index with node.js is possible.
My idea is to create an inverted index of tags to documents using a multiset, and a fuzzy index that finds the correct spelling of a misspelled tag. Both are created in a preprocessing step in node.js and serialized as JSON files. In the search step, for each item of the query set I want to first consult the fuzzy index to get the most likely correct tag and, if one exists, consult the inverted index and add the result set to a bag (counted set). After doing this for all input tags, the contents of the bag, sorted in descending order, should provide the best matching documents.
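To make the idea concrete, here is a minimal sketch of the two-step lookup. Everything in it is an assumption for illustration: the function names, the { id, tags } document shape, and the use of a brute-force Levenshtein scan over all known tags in place of a real fuzzy index (fine for a few thousand tags, but not a tuned data structure).

```javascript
// Levenshtein edit distance, single-row dynamic programming.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, function (_, j) { return j; });
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      row[j] = Math.min(
        above + 1,                               // deletion
        row[j - 1] + 1,                          // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
      diag = above;
    }
  }
  return row[b.length];
}

// Inverted index: tag -> Set of document ids.
function buildInvertedIndex(documents) {
  const index = new Map();
  for (const doc of documents) {
    for (const tag of doc.tags) {
      if (!index.has(tag)) index.set(tag, new Set());
      index.get(tag).add(doc.id);
    }
  }
  return index;
}

// "Fuzzy index" stand-in: nearest known tag within maxDistance, or null.
function closestTag(query, index, maxDistance) {
  let best = null;
  let bestDistance = maxDistance + 1;
  for (const tag of index.keys()) {
    const d = editDistance(query, tag);
    if (d < bestDistance) { best = tag; bestDistance = d; }
  }
  return best;
}

// Count matching tags per document (the "bag") and sort descending.
function search(queryTags, index, maxDistance) {
  const bag = new Map();
  for (const q of queryTags) {
    const tag = closestTag(q, index, maxDistance);
    if (tag === null) continue;
    for (const id of index.get(tag)) bag.set(id, (bag.get(id) || 0) + 1);
  }
  return Array.from(bag.entries())
    .sort(function (x, y) { return y[1] - x[1]; })
    .map(function (e) { return { id: e[0], count: e[1] }; });
}
```

Weighting could be added by incrementing the bag by a per-tag weight instead of 1.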
My Questions
This seems like a common problem, is there already an implementation for it that I can reuse? I looked at lunr.js and fuse.js but they seem to have a different focus.
Is this a sensible approach to the problem? Do you see any obvious improvements?
Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?

You should be able to achieve what you want using Lunr, here is a simplified example (and a jsfiddle):
var documents = [{
  id: 1, tags: ["foo", "bar"],
},{
  id: 2, tags: ["hurp", "durp"]
}]
var idx = lunr(function (builder) {
  builder.ref('id')
  builder.field('tags')
  documents.forEach(function (doc) {
    builder.add(doc)
  })
})
console.log(idx.search("fob~1"))
console.log(idx.search("hurd~2"))
This takes advantage of a couple of features in Lunr:
If a document field is an array, Lunr assumes the elements are already tokenised. This lets you index tags that include spaces as-is, i.e. "foo bar" would be treated as a single tag (if this is what you wanted; it wasn't clear from the question).
Fuzzy search is supported, here using the query string format. The number after the tilde is the maximum edit distance; there is some more documentation that goes into the details.
The results will be sorted by how well each document matches the query. In simple terms, documents that contain more matching tags will rank higher.
Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?
As ever, it depends. Lunr maintains two data structures, an inverted index and a graph. The graph is used for doing the wildcard and fuzzy matching. It keeps separate data structures to facilitate storing extra information about a term in the inverted index that is unrelated to matching.
Depending on your use case, it would be possible to combine the two. An interesting approach would be a finite state transducer, so long as the data you want to store is simple, e.g. an integer (think document id). There is an excellent article about this data structure, which is similar to what is used in Lunr: http://blog.burntsushi.net/transducers/

Is there any documentation about how Revision.Description is populated and under what condition?

I'm writing a Custom Application for Rally so that I can view changes made to Task and HierarchicalRequirement objects via a table with a rolling 7 day period.
The attributes that I'm interested in are:
HierarchicalRequirement
  PlanEstimate
  TaskEstimateTotal
  TaskActualTotal
  TaskRemainingTotal
Task
  Estimate
  ToDo
  Actuals
I'm traversing Revisions to get snapshot views of tasks and stories:
It's easy to retrieve these attributes for the current day. However, I need to traverse RevisionHistory -> Revisions and then parse the Revision.Description to apply the differences for Task and HierarchicalRequirement objects. This may provide a daily snapshot of each object.
For example, the following were appended to Revision.Description after a change took place:
TASK REMAINING TOTAL changed from [7.0] to [4.0]
TASK ESTIMATE TOTAL changed from [7.0] to [4.0]
The "rolling 7 day" period is just an example. My intention is to create a table with a breakdown of Team -> Story -> Task -> Estimate -> ToDo along the y-axis and Iteration -> daily-date along the x-axis.
Tim.

The Revision.description field on many of the Rally object types was not originally intended as a way for developers to get change information, but rather for display purposes in our Rally ALM SaaS tool. That's why changes are put in a Revision attribute called 'description', which is just a text field. There is therefore no developer documentation on the format of this data: it is not intended to be parsed, and the format could change in the future. (There will eventually be a better way to get object change information; more on this later in this post.)
However, there is a pattern in this data. It is:
ATTRIBUTE_NAME action VALUE_CLAUSE
The actions are 'added' or 'changed'.
The value clause format is based on the action type. For the 'added' action the value clause is [value]. For the 'changed' action the value clause is 'from [old value] to [new value]'.
For example, when an existing User Story has its owner set to 'Newt' from 'No Entry', a new revision instance is created whose description contains:
OWNER added [Newt]
If then later the user changed the owner to 'John', then a new revision will be created that looks like this:
OWNER changed from [Newt] to [John]
If there is more than one attribute change then the changes are separated by commas and there is no guaranteed sorting order of the changes.
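For illustration only, the pattern described above could be picked apart along these lines. The function name parseRevisionDescription and the result shape are made up, the format is undocumented and may change, and the sketch assumes attribute names are upper-case words and that values never contain a closing square bracket. It scans with a global regex instead of splitting on commas, since a value might itself contain a comma.

```javascript
// Hypothetical parser for the undocumented Revision.Description text.
function parseRevisionDescription(description) {
  // ATTRIBUTE NAME, then either "added [value]" or "changed from [old] to [new]".
  const pattern = /([A-Z][A-Z0-9_ ]*?) (?:added \[(.*?)\]|changed from \[(.*?)\] to \[(.*?)\])/g;
  const changes = [];
  for (const m of description.matchAll(pattern)) {
    if (m[2] !== undefined) {
      changes.push({ attribute: m[1], action: 'added', oldValue: null, newValue: m[2] });
    } else {
      changes.push({ attribute: m[1], action: 'changed', oldValue: m[3], newValue: m[4] });
    }
  }
  return changes;
}
```

Text that does not match the pattern is silently skipped, so a revision whose description follows some other wording simply yields an empty list.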
Now for the better way to do this in the future. Since you are not the only developer who wants to get at object changes, we have a new product under development that will expose WSAPI endpoints where you can get changes for an object in a programmatic way, which should spare you from parsing this data. But since this product is still under development, you'll have to keep doing what you are doing now, and hopefully my explanation of the format of the description data will help you in the meantime.
Hope this helps.

The data you are seeking may also exist in the IterationCumulativeFlowData or ReleaseCumulativeFlowData objects in Rally's WSAPI:
https://rally1.rallydev.com/slm/doc/webservice/
That should be easier (and perform better) than grepping through all the revision history entries.

The gmail label chooser conundrum - is there a better way to do it?

We are in the midst of implementing a labeling functionality exactly like gmail for our webapp - you can select the posts (checkboxes) and select which labels to apply/delete from a drop down list of 'labels' (which themselves are a set of checkboxes). The problem is "how to go about doing it?" I have a solution and before I tackle it that way I want to get an opinion on whether it is the right way and if it can be simplified using certain jquery/javascript constructs that I might not be aware of. I am not a JavaScript/jQuery pro by any means, yet. :)
Let:
M = {Set of posts}
N = {Set of Labels}
M_N = many to many relation between M and N i.e., the set of posts that have at least one label from N
Output: Given a set of 'selected' posts and the set of 'selected' labels, get an array of items for the JSON with the following values:
Post_id, Label_id, action{add, delete}
Here's the approach that I have come up with (naive or optimal, I don't know):
Get current number of selected posts: var selectionCount = 5 (say i.e., 5 posts selected)
Capture the following data set for each item in the selection:
| Label_id | numberOfLabelsInSelection | currentStateToShow | newState             |
| 4        | 3                         | partialTick        | ticked (add)         |
| 10       | 5                         | ticked             | none (delete)        |
| 12       | 1                         | partialTick        | partialTick (ignore) |
| 14       | 0                         | none               | ticked (add)         |
Basically, the above data structure just captures the display conditions: if 5 posts are selected overall and only two have label "x", then the label list should show a 'partial tick mark' in the checkbox; if all posts have label "y", then the drop-down shows a "full tick". Labels not on the selected set are just unselected and can only toggle between a tick mark and 'none', not to a partial state (i.e., on/off only). A partialTick label has three states, so to speak: on/off/partial.
The 'newState' column is basically what has been selected. The output action is based on what the previous state was (i.e., currentStateToShow):
partial to tick implies add label to all posts that didn't have that label
ticked to none implies delete that label from all posts
partial to none implies delete only labels from those selected posts
none to ticked implies add new label to all posts
partial to partial implies ignore, i.e., no change.
Then I can iterate over this set and decide to send the following data to the server:
| Post_id | Label_id | Action |
| 99 | 4 | add |
| 23 | 10 | delete |
...
and so on.
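As a rough sketch of the steps above (not a full UI implementation): modern JavaScript does provide Map and Set, which keep the bookkeeping short. The function names and the data shapes here are assumptions for illustration; posts are taken to be objects of the form { id, labels: Set of label ids }, and newStates maps each label id to the state the user left its checkbox in.

```javascript
// Derive the state to show for one label from the selected posts.
function currentState(labelId, selectedPosts) {
  const count = selectedPosts.filter(function (p) { return p.labels.has(labelId); }).length;
  if (count === 0) return 'none';
  return count === selectedPosts.length ? 'ticked' : 'partialTick';
}

// Turn the chosen new states into { postId, labelId, action } rows.
function computeActions(selectedPosts, newStates) {
  const actions = [];
  for (const [labelId, newState] of newStates) {
    const oldState = currentState(labelId, selectedPosts);
    // Unchanged, or left partial: nothing to send.
    if (newState === oldState || newState === 'partialTick') continue;
    for (const post of selectedPosts) {
      const has = post.labels.has(labelId);
      if (newState === 'ticked' && !has) {
        actions.push({ postId: post.id, labelId: labelId, action: 'add' });
      } else if (newState === 'none' && has) {
        actions.push({ postId: post.id, labelId: labelId, action: 'delete' });
      }
    }
  }
  return actions;
}
```

Checking each post individually covers both the "partial to tick" and "none to ticked" cases with the same branch, so the transition table collapses to two conditions.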
So what's the issue? Well this is QUITE COMPLICATED!!! Javascript doesn't really have the map data structure (does it?) and it would entail too many sequential iterations and check each and every thing and then have a lot of if-else's to ascertain the value of the newState.
I'm not looking for "how to code it" but what can I do to make my life easier? Is there something out there that I can already use? Is the logic correct or is it a bit too convoluted? Any suggestions as to how to attack the problem or some built in data structures (or an external lib) that could make things less rough? Code samples :P ?
I'm working with javascript/jquery + AJAX and restlet/java/mysql and will be sending a JSON data structure for this but I'm quiteeeeeeeeeee confounded by this problem. It doesn't look as easy as I initially thought it to be (I mean I thought it was "easier" than what I'm facing now :)
I initially thought of sending all the data to the server and performing all this on the backend. But after an acknowledgment is received I still need to update the front end in a similar fashion so I was 'back to square one' so to speak since I'd have to repeat the same thing on the front end to decide which labels to hide and which to show. Hence, I thought it'd just be better to just do the whole thing on the client side.
I'm guessing this to be an easy 100-150+ lines of javascript/jquery code as per my 'expertise' so to speak, maybe off...but that's why I'm here :D
PS: I've looked at this post and the demo, How can I implement a gmail-style label chooser?, but that demo handles only one post at a time, which is easy to do. My problem is aggravated by the selection set with these partial selections, etc.

Algorithm
I think the algorithm makes sense.
That said, is there a need for a lot of if-elses to compute the output action? Why not just add a ticked label to ALL selected posts? Surely you can't add one label to the same post twice anyway. I doubt it would hurt performance, especially if you fit the JSON data for all changed posts into one request anyway (that depends on whether your back-end supports PUTting multiple objects at once).
Beat complexity with MVC
Regarding how it could be made less complex: I think code organization is a big deal here.
There is something out there that you can use: I suggest you check out libraries that implement some kind of MVC approach in JavaScript (for example, Backbone.js). You'll end up having a few classes, and your logic will fit into small methods on these classes. Your data storage logic will be handled by "model" classes, and display logic by "views". This is more maintainable and testable.
(Please check these two awesome presentations on topic, if you haven't already: Building large jQuery applications, Functionality focused code organization.)
The problem is that the refactoring of existing code may take some time, and it's hard to get it right from the first time. Also, it kinda affects your whole client-side architecture, so that maybe isn't what you wanted.
Example
If I had a similar task, I'd take Backbone.js and do something like this (pseudocode / CoffeeScript; this example is neither good nor complete, the goal is to give a basic idea of a class-based approach in general):
apply_handler: ->
  # When user clicks Apply button
  selectedPosts = PostManager.get_selected()
  changedLabels = LabelManager.get_changed()
  for label in changedLabels
    for post in selectedPosts
      # Send your data to the server:
      # | post.id | label.id | label.get_action() |
      # Or use functionality provided by Backbone for that. It can handle
      # AJAX requests, if your server-side is RESTful.
class PostModel
  # Post data: title, body, etc.
  labels: <list of labels that this post already contains>
  checked: <true | false>
  view: <PostView instance>
class PostView
  model: <PostModel instance>
  el: <corresponding li element>
  handle_checkbox_click: ->
    # Get new status from checkbox value.
    this.model.checked = $(this.el).find('.checkbox').val()
    # Update labels representation.
    LabelManager.update_all_initial_states()
class PostManager
  # All post instances:
  posts: <list>
  # Filter posts, returning list containing only checked ones:
  get_selected: -> this.posts.filter (post) -> post.get('checked') == true
class LabelModel
  # Label data: name, color, etc.
  initialState: <ticked | partialTick | none>
  newState: <ticked | partialTick | none>
  view: <LabelView instance>
  # Compute output action (a switch returns the matching branch's value;
  # "new" is a reserved word, so it is not used as a variable name):
  get_action: ->
    switch this.newState
      when 'none' then 'DELETE'
      when 'partialTick' then 'NO_CHANGE'
      when 'ticked' then 'ADD'
class LabelView
  model: <LabelModel instance>
  el: <corresponding li element>
  # Get new status from checkbox value.
  handle_checkbox_click: ->
    # (Your custom implementation depends on what solution are you using for
    # 3-state checkboxes.)
    this.model.newState = $(this.el).find('.checkbox').val()
  # This method updates checked status depending on how many selected posts
  # are tagged with this label.
  update_initial_state: ->
    label = this.model
    checkbox = $(this.el).find('.checkbox')
    selectedPosts = PostManager.get_selected()
    postCount = selectedPosts.length
    # How many selected posts are tagged with this label:
    labelCount = 0
    for post in selectedPosts
      if label in post.labels
        labelCount += 1
    # Update checkbox value
    if labelCount == 0
      # No posts are tagged with this label
      checkbox.val('none')
    else if labelCount == postCount
      # All posts are tagged with this label
      checkbox.val('ticked')
    else
      # Some posts are tagged with this label
      checkbox.val('partialTick')
    # Update object status from checkbox value
    label.initialState = checkbox.val()
class LabelManager
  # All labels:
  labels: <list>
  # Get labels with changed state:
  get_changed: ->
    this.labels.filter (label) ->
      label.get('initialState') != label.get('newState')
  # Self-explanatory, I guess:
  update_all_initial_states: ->
    for label in this.labels
      label.view.update_initial_state()
Oops, seems like too much code. If the example is unclear, feel free to ask questions.
(Update just to clarify: you can do exactly the same in JavaScript. You create classes by calling extend() methods of objects provided by Backbone. It just was faster to type it this way.)
You'd probably say that's even more complex than the initial solution. I'd argue: these classes usually live in separate files [1], and when you're working on some piece (say, the representation of a label in the DOM), you usually only deal with one of them (LabelView). Also, check out the presentations mentioned above.
[1] About code organization, see about "brunch" project below.
How the above example would work:
User selects some posts:
Click handler on post view:
toggles post's checked status.
makes LabelManager update the states of all labels.
User selects a label:
Click handler on label view toggles label's status.
User clicks "Apply":
apply_handler(): For each of the changed labels, issue appropriate action for each selected post.
Backbone.js
Update in response to a comment
Well, Backbone actually isn't a lot more than a couple of base classes and objects (see annotated source).
But I like it nevertheless.
It offers well-thought-out conventions for code organization.
It's a lot like a framework: you can basically take it and concentrate on your information structure, representation, and business logic, instead of "where do I put this or that so that I won't end up with a maintenance nightmare". However, it's not a framework, which means that you still have a lot of freedom to do what you want (including shooting yourself in the foot), but also have to make some design decisions by yourself.
It saves a good amount of boilerplate code.
For example, if you have a RESTful API provided by the back-end, then you can just map it to Backbone models and it will do all synchronization work for you: e.g. if you save a new Model instance -> it issues a POST request to the Collection url, if you update existing object -> it issues a PUT request to this particular object's url. (Request payload is the JSON of model attributes that you've set using set() method.) So all you have to do is basically set up urls and call save() method on the model when you need it to be saved, and fetch() when you need to get its state from the server. It uses jQuery.ajax() behind the scenes to perform actual AJAX requests.
Some references
Introduction to Backbone.js (unofficial but cool) (broken)
The ToDos example
Don't take it as an "official" Backbone.js example, although it's referenced by the docs. For one, it doesn't use routers, which were introduced later. In general, I'd say it's a good example of a small application built on Backbone, but if you're working on something more complex (which you do), you're likely to end up with something a bit different.
While you're at it, be sure to check out brunch. It basically provides a project template, employing CoffeeScript, Backbone.js, Underscore.js, Stitch, Eco, and Stylus.
Thanks to its strict project structure and use of require(), it enforces higher-level code organization conventions than Backbone.js does alone. (You basically need to think neither about which class to put your code in, nor about which file to put that class in and where to put that file in the filesystem.) However, if you're not a "conventional" type of person, then you'll probably hate it. I like it.
What's great is that it also provides a way to easily build all this stuff. You just run brunch watch, start working on the code, and each time you save changes it compiles and builds the whole project (takes less than a second) into a build directory, concatenating (and probably even minifying) all resulting JavaScript into one file. It also runs a mini Express.js server on localhost:8080 which immediately reflects changes.
Related questions
What is the purpose of backbone.js?
https://stackoverflow.com/questions/5112899/knockout-js-vs-backbone-js-vs

Autocompletion results formatting with JQuery

I am currently using this autocomplete plugin. It's pretty straightforward. It accepts a URL, and then uses that data to perform an auto-complete.
This is my code to auto-complete it.
autocompleteurl = '/misc/autocomplete/?q='+$("#q").val()
$("#q").autocomplete(autocompleteurl, {multiple:true});
If someone types "apple", that autocompleteurl page will return this result:
apple store,applebees,apple.com,apple trailers,apple store locator,apple vacations,applebees menu,apple iphone,apple tablet,apple tv
However, for some reason, when I actually use this auto-complete, everything is jumbled together. The plugin treats the entire page as one big string, instead of splitting on the commas and treating the pieces as individual items.
Can someone tell me what options I need to put in order to treat them as individual items? I've tried many options but none work.

From the manual (http://docs.jquery.com/Plugins/Autocomplete/autocomplete#url_or_dataoptions)
A value of "foo" would result in this request url: my_autocomplete_backend.php?q=foo&limit=10
The result must return with one value on each line. The result is presented in the order the backend sends it.
From what you have posted it seems like you have it comma separated.
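The fix belongs in whatever generates the /misc/autocomplete/ response. Purely as an illustration of the required output shape, here is the transformation written in JavaScript; the helper name toLineSeparated is made up, and your backend may well be in another language.

```javascript
// Hypothetical helper: turn a comma-separated result string into the
// one-value-per-line body that the manual describes.
function toLineSeparated(commaSeparated) {
  return commaSeparated
    .split(',')
    .map(function (item) { return item.trim(); })
    .filter(function (item) { return item.length > 0; })
    .join('\n');
}
```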

The plugin automatically adds the q to the querystring and uses the current value of the text box as the value.
This should be sufficient as long as you're returning the data in the correct format:
$("#q").autocomplete('/misc/autocomplete/', {multiple:true});

@alex I'm getting quirky behavior too, for inputs of 2, 3, or 4 letters.
See http://docs.jquery.com/Plugins/Autocomplete/autocomplete#toptions.
If you set the minChars option to 2 or 3 it makes things more sane.
There's funny behavior when you have 5 results for "ab" and the same 5 results for "abc" - it does nothing, giving the impression that it is not working!
But it is working and I suspect it has to do with caching options.
