Querying MarkLogic merged collection - javascript

I'm trying to write a query to get the unique values of an attribute from the final merged collection (sm-Survey-merged). Something like:
select distinct(participantID) from sm-Survey-merged;
I get a tree-cache error with the below equivalent JS query. Can someone help me with a better query?
[...new Set (fn.collection("sm-Survey-merged").toArray().map(doc => doc.root.participantID.valueOf()).sort(), "unfiltered")]

If there are a lot of documents, and you attempt to read them all in a single query, then you run the risk of blowing out the Expanded Tree Cache. You can try bumping up that limit, but with a large database with a lot of documents you are still likely to hit that limit.
The fastest and most efficient way to produce a list of the unique values is to create a range index, and select the values from that lexicon with cts.values().
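A minimal sketch of that approach, assuming a JSON-property range index of scalar type "string" has been configured on participantID for this database (the index name and options here are illustrative, not from the question):
cts.values(
  cts.jsonPropertyReference("participantID"), // lexicon backed by the range index
  null,                                       // no starting value
  ["item-order"],                             // return the values in order
  cts.collectionQuery("sm-Survey-merged")     // constrain to the merged collection
);
Because this reads straight from the lexicon, it never loads the documents themselves and avoids the Expanded Tree Cache entirely.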
Without an index, you could attempt to perform iterative queries that search and retrieve a set of random values, and then perform additional searches excluding those previously seen values. This still runs the risk of blowing out the Expanded Tree Cache, hitting timeouts, etc., so it may not be ideal - but it would allow you to get some information now without reindexing the data.
You could experiment with the number of iterations and the search page size to see whether it stays within limits and produces consistent results. Maybe add some logging or flags so you know whether you hit the iteration limit while there were still more values to collect, i.e. whether or not the list is complete. You could also try running without an iteration limit, but you run the risk of out-of-memory or Expanded Tree Cache errors.
function distinctParticipantIDs(iterations, values) {
  const participantIDs = new Set([]);
  const docs = fn.subsequence(
    cts.search(
      cts.andNotQuery(
        cts.collectionQuery("sm-Survey-merged"),
        cts.jsonPropertyValueQuery("participantID", Array.from(values))
      ),
      ["unfiltered", "score-random"]),
    1, 1000);

  for (const doc of docs) {
    const participantID = doc.root.participantID.valueOf();
    participantIDs.add(participantID);
  }

  const uniqueParticipantIDs = new Set([...values, ...participantIDs]);

  if (iterations > 0 && participantIDs.size > 0) {
    // there are still unique values, and we haven't hit our iteration limit, so keep searching
    return distinctParticipantIDs(iterations - 1, uniqueParticipantIDs);
  } else {
    return uniqueParticipantIDs;
  }
}
[...distinctParticipantIDs(100, new Set()) ];
Another option would be to run a CoRB job against the database, and apply the EXPORT-FILE-SORT option with ascending|distinct or descending|distinct, to dedup the values produced in an output file.

Related

What's the best way to filter an array of objects to only show those objects which were added since the last time it was filtered?

My first function scrapes my employer's site for a list of users who have completed a task and outputs a JSON file containing the results. The JSON file is organized as follows:
{"Completed":[{"task":"TitleOfTaskAnd01/01/2019", "name":"UsersFullName"},{"task":"TitleOfTaskAnd01/01/2019", "name":"UsersFullName"}...]}
My second function uses the aforementioned json file to automatically generate receipts. On calling these two functions again I would like to leave out all of the previously utilized data, and only generate receipts for the tasks that were not in the results of any previous calls, therefore avoiding the generation of duplicates.
I tried to filter the first array by the elements of the second array; however, as far as I can tell, you cannot directly compare objects, or even arrays for that matter. Here is the function I tried to adjust to my needs:
let myArray = myArray.filter( ( el ) => !toRemove.includes( el ) );
I expect that my use case is not too uncommon and there is already a body of experience regarding best practices in this situation. I prefer solutions that use just javascript, so that I can understand how to navigate the situation better in the future. If however you have a library/module solution that is welcomed as well. Thanks in advance.
The problem is that two objects are never equal (unless they are references to the same object). To check for structural equality, you have to compare their properties manually:
myArray.filter(el => !toRemove.some(el2 => el.task === el2.task && el.name === el2.name));
While that works, it will be quite slow for a lot of elements as you compare each object of myArray against all objects of toRemove. To improve that, you could generate a unique hash out of the properties and add that hash into a Set:
const hash = obj => JSON.stringify([obj.name, obj.task]);
const remove = new Set(toRemove.map(hash));
const result = myArray.filter(el => !remove.has(hash(el)));
This will be O(n + m), whereas the previous solution was O(n * m).
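A quick usage sketch with the "Completed" structure from the question (the concrete task and name values below are made up for illustration):
const hash = obj => JSON.stringify([obj.name, obj.task]);
const toRemove = [
  { task: "TitleOfTaskAnd01/01/2019", name: "UsersFullName" } // already receipted
];
const myArray = [
  { task: "TitleOfTaskAnd01/01/2019", name: "UsersFullName" }, // duplicate, filtered out
  { task: "OtherTaskAnd02/02/2019", name: "AnotherUsersName" }  // new, kept
];
const remove = new Set(toRemove.map(hash));
const result = myArray.filter(el => !remove.has(hash(el)));
// result contains only the OtherTaskAnd02/02/2019 entry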

Promises and upserting to database in bulk

I am currently parsing a list of js objects that are upserted to the db one by one, roughly like this with Node.js:
return promise.map(list, item =>
  parseItem(item).then(upsertSingleItemToDB)
).then(() => console.log('all finished!'));
The problem is that when the list size grew very big (~3000 items), parsing all the items in parallel was too memory heavy. It was really easy to add a concurrency limit with the promise library (when/guard) and not run out of memory that way.
But I'd like to optimize the db upserts as well, since mongodb offers a bulkWrite function. Since parsing and bulk writing all the items at once is not possible, I would need to split the original object list into smaller sets that are parsed with promises in parallel, and then the result array of that set would be passed to the promisified bulkWrite. This would be repeated for the remaining sets of list items.
I'm having a hard time wrapping my head around how I can structure the smaller sets of promises so that I only do one set of parseSomeItems-BulkUpsertThem at a time (something like Promise.all([set1Bulk, set2Bulk]), where set1Bulk is another array of parallel parser promises?). Any pseudo code help would be appreciated (but I'm using when if that makes a difference).
It can look something like this, if using mongoose and the underlying nodejs-mongodb-driver:
const saveParsedItems = items => ItemCollection.collection.bulkWrite( // accessing underlying driver
  items.map(item => ({
    updateOne: {
      filter: { id: item.id }, // or any compound key that makes your items unique for upsertion
      upsert: true,
      update: { $set: item } // should be a key:value formatted object
    }
  }))
);

const parseAndSaveItems = (items, offset = 0, limit = 3000) => { // the algorithm for retrieving items in batches can be anything you want, basically
  const itemSet = items.slice(offset, offset + limit);

  return Promise.all(
    itemSet.map(parseItem) // parsing all your items first
  )
    .then(saveParsedItems)
    .then(() => {
      const newOffset = offset + limit;
      if (items.length > newOffset) {
        return parseAndSaveItems(items, newOffset, limit);
      }
      return true;
    });
};

return parseAndSaveItems(yourItems);
The first answer looks complete. However, here are some other thoughts that came to mind.
As a hack-around, you could call a timeout function in the callback of your write operation before the next write operation runs. This can give your CPU and memory a break in between calls. Even if you add one millisecond between calls, that only adds 3 seconds in total if you have 3000 write objects.
Or you can segment your array of insertObjects, and send them to their own bulk writer.
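A rough sketch of that segmenting idea, reusing the parseItem and ItemCollection names from the first answer; the chunk size of 500 is an arbitrary assumption:
async function upsertInChunks(items, chunkSize = 500) {
  for (let offset = 0; offset < items.length; offset += chunkSize) {
    const chunk = items.slice(offset, offset + chunkSize);   // one segment of the list
    const parsed = await Promise.all(chunk.map(parseItem));  // parse this segment in parallel
    await ItemCollection.collection.bulkWrite(               // one bulk upsert per segment
      parsed.map(item => ({
        updateOne: {
          filter: { id: item.id },
          update: { $set: item },
          upsert: true
        }
      }))
    );
  }
}
Because each iteration awaits its bulkWrite before slicing the next chunk, memory usage stays bounded by the chunk size rather than the full list.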

Complex challenge about complexity and intersection

Preface
Notice: This question is about complexity. I use a complex design pattern here, which you don't need to understand in order to understand the question. I could have simplified it more, but I chose to keep it relatively untouched for the sake of preventing mistakes. The code is written in TypeScript, which is a superset of JavaScript.
The code
Regard the following class:
export class ConcreteFilter implements Filter {
  interpret() {
    // rows is a very large array
    return (rows: ReportRow[], filterColumn: string) => {
      return rows.filter(row => {
        // I've hidden the implementation for simplicity,
        // but it usually returns either an empty array or a very short one.
      }).map(row => <string>row[filterColumn]);
    };
  }
}
It receives an array of report rows, then filters the array by some logic that I've hidden. Finally, it does not return the whole row, but only the single string column named by filterColumn.
Now, take a look at the following function:
function interpretAnd(filters: Filter[]) {
  return (rows: ReportRow[], filterColumn: string) => {
    var runFilter = filters[0].interpret();
    var intersectionResults = runFilter(rows, filterColumn);
    for (var i = 1; i < filters.length; i++) {
      runFilter = filters[i].interpret();
      var results = runFilter(rows, filterColumn);
      intersectionResults = _.intersection(intersectionResults, results);
    }
    return intersectionResults;
  };
}
It receives an array of filters, and returns a distinct array of all the "filterColumn"s that the filters returned.
In the for loop, I get the results (string array) from every filter, and then make an intersection operation.
The problem
The report row array is large, so every runFilter operation is expensive (while the filter array, on the other hand, is pretty short). I want to iterate the report row array as few times as possible. Additionally, the runFilter operation is very likely to return zero results or very few.
Explanation
Let's say that I have 3 filters and 1 billion report rows. The internal iteration, i.e. the iteration in ConcreteFilter, will happen 3 billion times, even if the first execution of runFilter returned 0 results, so I have 2 billion redundant iterations.
So I could, for example, check whether intersectionResults is empty at the beginning of every iteration, and if so, break the loop. But I'm sure that there are better solutions mathematically.
Also, if the first runFilter execution returned, say, 15 results, I would expect the next execution to receive an array of only 15 report rows, meaning I want the intersection operation to influence the input of the next call to runFilter.
I can modify the report row array after each iteration, but I don't see how to do it in an efficient way that won't be even more expensive than now.
A good solution would be to remove the map operation and then pass the already filtered array in each operation instead of the entire array, but I'm not allowed to do that because I must not change the result format of the Filter interface.
My question
I'd like to get the best solution you could think of as well as an explanation.
Thanks a lot in advance to everyone who spends their time trying to help me.
Not sure how effective this will be, but here's one possible approach you can take. If you preprocess the rows by the filter column, you'll have a way to retrieve the matched rows. If you typically have more than 2 filters then this approach may be more beneficial, though it will be more memory intensive. You could branch the approach depending on the number of filters. There may be some TS constructs that are more useful; I'm not very familiar with it. There are some comments in the code below:
const unique = arr => [...new Set(arr)]; // simple dedupe helper

var map = {};
// Loop over every row, keeping a map of rows for each filter value.
allRows.forEach(row => {
  const v = row[filterColumn];
  const items = map[v] = map[v] || [];
  items.push(row);
});

let rows = allRows;
filters.forEach(f => {
  // Run the filter (i.e. f.interpret()(rows, filterColumn) in the question's terms)
  // and return the unique set of matched strings.
  const matches = unique(f.execute(rows, filterColumn));
  // For each of the matched strings, look up the corresponding rows and concat them for the next filter.
  rows = [].concat(...matches.map(m => map[m] || []));
});
// Loop over the rows that made it all the way through, extract the value, then unique() the collection.
return unique(rows.map(row => row[filterColumn]));
Thinking about it some more, you could use a similar approach but just do it on a per filter basis:
let rows = allRows;
filters.forEach(f => {
  const matches = f.execute(rows, filterColumn);
  const map = {};
  matches.forEach(m => {
    map[m] = true;
  });
  rows = rows.filter(row => !!map[row[filterColumn]]);
});
return unique(rows.map(row => row[filterColumn]));

JavaScript - Can't seem to 'reset' array passed to a function

I am somewhat new to JavaScript, but I am reasonably experienced with programming in general. I suspect that my problem may have something to do with scoping, or the specifics of passing arrays as parameters, but I am uncertain.
The high-level goal is to have live plotting with several 'nodes', each of which generates 50 points/sec. I got this working by feeding the data straight into an array rendered by dygraphs and C3.js, and quickly realized that this is too much data to continually live-render. Dygraphs seems to start impacting the user experience after about 30s and C3.js seems to choke at around 10s.
The next attempt is to decimate the plotted data based on zoom level.
I have data saved into an 'object' which I am using somewhat like a dictionary in other languages. The idea is to build a large data buffer with AJAX requests, using each unit's serial number as the key for the data it generates. This is working well and the object is being populated as expected. I feel that it is informative to know the 'structure' of this object before I get to my question. It is as follows:
{
1: [[x10,y10], [x11,y11], [...], [x1n, y1n]],
2: [[x20,y20], [x21,y21], [...], [x2n, y2n]],
... : [ ... ]
a: [[xa0,ya0], [xa1,ya1], [...], [xan, yan]]
}
Periodically, a subset of that data will be used to generate a dygraphs plot. I am decimating the stored data and creating a 'plot buffer' to hold a subset of the actual data.
The dygraphs library takes data in several ways, but I would like to structure it 'natively', which is just an array of arrays. Each array within the array is a 'row' of data. All rows must have the same number of elements in order to line up into columns. The data generated may or may not be at the same time. If the data x values perfectly match, then the resulting data would look like the following for only two nodes since x10 = x20 = xn0:
[
[x10, y10, y20],
[x11, y11, y21],
[ ... ],
[xan, yan, yan]
]
Note that this is just x and y in rows. In reality, the times for each serial number may not line up, so it may be much closer to:
[
[x10, y10, null],
[x20, null, y20],
[x11, y11, y21],
[ ... ],
[xan, yan, yan]
]
Sorry for all of the background. Now we can get to the code that I'm having trouble with. I'm periodically attempting to create the plot buffer using the following code:
window.intervalId = setInterval(
  function(){
    var plotData = formatData(nodeData, 45000, 49000, 200);
    /* dygraphs stuff here */
  },
  500
);

function formatData(dataObject, start, end, stride){
  var smallBuffer = [];
  var keys = Object.keys(dataObject);
  keys.forEach(
    function(key){
      console.log('key: ', key);
      mergeArrays(dataObject[key], smallBuffer, start, end, stride);
    }
  );
  return smallBuffer;
}

function mergeArrays(sourceData2D, destDataXD, startInMs, endInMs, strideInMs){
  /* ensure that the source data isn't undefined */
  if(sourceData2D){
    /* if the destDataXD is empty, then basically copy the
     * sourceData2D into it as-is taking the stride into account */
    if(destDataXD.length == 0){
      /* does sourceData2D have a starting point in the time range? */
      var startIndexSource = indexNear2D(sourceData2D, startInMs);
      var lastTimeInMs = sourceData2D[startIndexSource][0];
      for(var i = startIndexSource; i < sourceData2D.length; i++){
        /* start to populate the destDataXD based on the stride */
        if(sourceData2D[i][0] >= (lastTimeInMs + strideInMs)){
          destDataXD.push(sourceData2D[i]);
          lastTimeInMs = sourceData2D[i][0];
        }
        /* when the source data is beyond the time, then break the loop */
        if(sourceData2D[i][0] > endInMs){
          break;
        }
      }
    }else{
      /* the destDataXD already has data in it, so this needs to use that data
       * as a starting point to merge the new data into the destination array */
      var finalColumnCount = destDataXD[0].length + 1;
      console.log('final column count: ', finalColumnCount);
      /* add the next column to each existing row as 'null' */
      destDataXD.forEach(
        function(element){
          element.push(null);
        }
      );
      /* TODO: move data into destDataXD from sourceData2D */
    }
  }
}
To add some information since it probably isn't self-explanatory without some effort: I created two functions, 'formatData' and 'mergeArrays'. These could have been done in a single function, but it was easier for me to separate the 'object' domain and the 'array' domain conceptually. The 'formatData' function simply iterates through all of the data stored under each key, calling the 'mergeArrays' routine each time through. The 'mergeArrays' routine is not yet complete and is where I'm having my issue.
The first time through, formatData should be creating an empty array - smallBuffer - into which data is merged using mergeArrays. The first time executing 'mergeArrays' I see that the smallBuffer is indeed being created and is an empty array. This empty array is supplied as a parameter to 'mergeArrays' and - the first time through - this works perfectly. The next time through, the 'smallBuffer' array is no longer empty, so the second case in 'mergeArrays' gets executed. The first step for me was to calculate the number of columns so that I could pad each row appropriately. This worked fine, but helped point out the problem. The next step was to simply append an empty column of 'null' values to each row. This is where things got weird. After the first time through 'mergeArrays', destDataXD still contained 'null' data from the previous executions. In essence, it appears that 'var smallBuffer = [];' doesn't actually clear the buffer and retains something. That something is not apparent until near the end. I can't explain exactly what is going on because I don't fully understand it, but destDataXD continually grows 'nulls' at the end without ever being properly reset as expected.
Thank you for the time and I look forward to hearing your thoughts, j
Quickly reading through the code, the danger point I see is where you first add an element to destDataXD.
destDataXD.push(sourceData2D[i]);
Note that you are not pushing a copy of the array. You are adding a reference to that array. destDataXD and sourceData2D are now sharing the same data.
So, of course, when you push any null values onto an array in destDataXD, you are also modifying sourceData2D.
You should use the JavaScript array-copying method slice:
destDataXD.push(sourceData2D[i].slice());
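A small illustration of the difference (the variable names here are generic, not taken from the question):
const source = [1, 2];
const byReference = [];
const byCopy = [];

byReference.push(source);        // same underlying array as source
byCopy.push(source.slice());     // independent copy

byReference[0].push(null);       // mutates source as well
console.log(source);             // [1, 2, null]
console.log(byCopy[0]);          // [1, 2] -- unaffected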

How to change result position based off parameter in a mongodb / mongoose query?

So I am using mongoose and node.js to access a mongodb database. I want to bump up each result based on a number (they are ordered by date created if none are bumped up). For example:
{ name: 'A',
bump: 0 },
{ name: 'B',
bump: 0 },
{ name: 'C',
bump: 2 },
{ name: 'D',
bump: 1 }
would be retrieved in the order: C, A, D, B. How can this be accomplished (without iterating through every entry in the database)?
Try something like this. Store a counter tracking the total # of threads, let's call it thread_count, initially set to 0, so have a document somewhere that looks like {thread_count:0}.
Every time a new thread is created, first call findAndModify() using {$inc : {thread_count:1}} as the modifier - i.e., increment the counter by 1 and return its new value.
Then when you insert the new thread, use the new value for the counter as the value for a field in its document, let's call it post_order.
So each document you insert has a value 1 greater each time. For example, the first 3 documents you insert would look like this:
{name:'foo', post_order:1, created_at:... } // value of thread_count is at 1
{name:'bar', post_order:2, created_at:... } // value of thread_count is at 2
{name:'baz', post_order:3, created_at:... } // value of thread_count is at 3
etc.
So effectively, you can query and order by post_order as ASCENDING, and it will return them in the order of oldest to newest (or DESCENDING for newest to oldest).
Then to "bump" a thread in its sorting order when it gets upvoted, you can call update() on the document with {$inc:{post_order:1}}. This will advance it by 1 in the order of result sorting. If two threads have the same value for post_order, created_at will differentiate which one comes first. So you will sort by post_order, created_at.
You will want to have an index on post_order and created_at.
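A rough sketch of that approach using mongoose; the Counter/Thread model names and field names are assumptions for illustration, not from the question:
async function createThread(name) {
  // atomically increment the global counter and read its new value
  const counter = await Counter.findOneAndUpdate(
    {},
    { $inc: { thread_count: 1 } },
    { new: true, upsert: true }
  );
  return Thread.create({ name, post_order: counter.thread_count, created_at: new Date() });
}

// "bump" a thread one position forward in the sort order
const bumpThread = id => Thread.updateOne({ _id: id }, { $inc: { post_order: 1 } });

// newest / most-bumped first
const listThreads = () => Thread.find().sort({ post_order: -1, created_at: -1 });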
Assuming your result is in the variable response (which is an array), I would do:
response.sort(function(obj1, obj2){
  return obj2.bump - obj1.bump;
});
or, if you also want to take name order into account:
response.sort(function(obj1, obj2){
  var diff = obj2.bump - obj1.bump;
  var nameDiff = (obj2.name > obj1.name) ? -1 : ((obj2.name < obj1.name) ? 1 : 0);
  return (diff == 0) ? nameDiff : diff;
});
Not a pleasant answer, but the solution you request is unrealistic. Here's my suggestion:
Add an OrderPosition property to your object instead of Bump.
Think of "bumping" as an event. It is best represented as an event-handler function. When an item gets "bumped" by whatever trigger in your business logic, the collection of items needs to be adjusted.
var currentOrder = this.OrderPosition;
this.OrderPosition = currentOrder - bump; // moves your object up the list
// write a foreach loop here, iterating every item AFTER the item's unadjusted
// order, +1 to move them all down the list one notch.
This does require iterating through many items, and I know you are trying to prevent that, but I do not think there is any other way to safely ensure the integrity of your item ordering - especially relative to other collections pulled later down the road.
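A hedged sketch of that event-handler idea, assuming the items are held in an in-memory array and OrderPosition is a zero-based integer (names are illustrative):
function onBump(items, bumpedItem, bump) {
  const oldPosition = bumpedItem.OrderPosition;
  const newPosition = Math.max(0, oldPosition - bump); // move the bumped item up the list
  bumpedItem.OrderPosition = newPosition;
  items.forEach(item => {
    // every item the bumped one jumped over moves down one notch
    if (item !== bumpedItem &&
        item.OrderPosition >= newPosition &&
        item.OrderPosition < oldPosition) {
      item.OrderPosition += 1;
    }
  });
}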
I don't think a purely query-based solution is possible with your document schema (I assume you have createdDate and bump fields). Instead, I suggest a single field called sortorder to keep track of your desired retrieval order:
sortorder is initially the creation timestamp. If there are no "bumps", sorting by this field gives the correct order.
If there is a "bump," the sortorder is invalidated. So simply correct the sortorder values: each time a "bump" occurs swap the sortorder fields of the bumped document and the document directly ahead of it. This literally "bumps" the document up in the sort order.
When querying, sort by sortorder.
You can remove fields bump and createdDate if they are not used elsewhere.
As an aside, most social sites don't directly manipulate a post's display position based on its number of votes (or "bumps"). Instead, the number of votes is used to calculate a score. Then the posts are sorted and displayed by this score. In your case, you should combine createdDate and bumps into a single score that can be sorted in a query.
This site (StackOverflow.com) had a related meta discussion about how to determine "hot" questions. I think there was even a competition to come up with a new formula. The meta question also shared the formulas used by two other popular social news sites: Y Combinator Hacker News and Reddit.
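If you go the score route, here is a hedged sketch using the aggregation pipeline; the field names (createdDate, bump) and the weighting of one hour per bump are assumptions, and $toLong requires MongoDB 4.0+:
Thread.aggregate([
  {
    $addFields: {
      score: {
        $add: [
          { $toLong: "$createdDate" },              // milliseconds since epoch; newer scores higher
          { $multiply: ["$bump", 60 * 60 * 1000] }  // each bump is "worth" one hour
        ]
      }
    }
  },
  { $sort: { score: -1 } }
]);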
