Promises and upserting to database in bulk - JavaScript

I am currently parsing a list of JS objects that are upserted to the DB one by one, roughly like this with Node.js:
return promise.map(list, item =>
  parseItem(item)
    .then(upsertSingleItemToDB)
).then(() => console.log('all finished!'));
The problem is that when the list size grows very big (~3000 items), parsing all the items in parallel is too memory heavy. It was really easy to add a concurrency limit with the promise library (when/guard) and avoid running out of memory that way, as sketched below.
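(For reference, a minimal sketch of what such a concurrency limit looks like, here using Bluebird's Promise.map as one option; when/guard achieves the same with the when library. parseItem and upsertSingleItemToDB are the same functions as above.)
const Promise = require('bluebird');

// Parse and upsert one by one, but never keep more than 10 items in flight at a time.
const processAll = list =>
  Promise.map(
    list,
    item => parseItem(item).then(upsertSingleItemToDB),
    { concurrency: 10 } // tune this to whatever your memory budget allows
  ).then(() => console.log('all finished!'));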
But I'd like to optimize the DB upserts as well, since MongoDB offers a bulkWrite function. Since parsing and bulk writing all the items at once is not possible, I would need to split the original object list into smaller sets that are parsed with promises in parallel, and then the result array of each set would be passed to the promisified bulkWrite. And this would be repeated for the remaining sets of list items.
I'm having a hard time wrapping my head around how I can structure the smaller sets of promises so that I only do one set of parseSomeItems-BulkUpsertThem at a time (something like Promise.all([set1Bulk, set2Bulk]), where set1Bulk is another array of parallel parser promises?). Any pseudo-code help would be appreciated (but I'm using when, if that makes a difference).

It can look something like this, if using mongoose and the underlying nodejs-mongodb-driver:
const saveParsedItems = items => ItemCollection.collection.bulkWrite( // accessing the underlying driver
  items.map(item => ({
    updateOne: {
      filter: { id: item.id }, // or any compound key that makes your items unique for upsertion
      upsert: true,
      update: { $set: item } // should be a key:value formatted object
    }
  }))
);
const parseAndSaveItems = (items, offset = 0, limit = 3000) => { // the algorithm for retrieving items in batches can be anything you want, basically
  const itemSet = items.slice(offset, offset + limit); // take only the current batch
  return Promise.all(
    itemSet.map(parseItem) // parsing all items of this batch first
  )
    .then(saveParsedItems)
    .then(() => {
      const newOffset = offset + limit;
      if (newOffset < items.length) {
        return parseAndSaveItems(items, newOffset, limit); // recurse into the next batch
      }
      return true;
    });
};
return parseAndSaveItems(yourItems);

The first answer looks complete. However, here are some other thoughts that came to mind.
As a workaround, you could call a timeout function in the callback of your write operation before the next write operation runs. This can give your CPU and memory a break in between calls. Even if you add one millisecond between calls, that only adds 3 seconds in total for 3000 write objects.
Or you can segment your array of insert objects into chunks and send each chunk to its own bulk writer, as sketched below.
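A rough sketch of both ideas combined, reusing the promisified saveParsedItems bulk write from the first answer; the chunk size, the pause length, and the delay helper are arbitrary choices for illustration, not anything MongoDB requires:
// Split an array into chunks of a given size.
const chunk = (arr, size) =>
  Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
    arr.slice(i * size, i * size + size));

// Small promised pause to give CPU and memory a break between bulk writes.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Write the chunks one after another, pausing briefly between bulk writes.
const writeInChunks = (insertObjects, size = 500, pauseMs = 1) =>
  chunk(insertObjects, size).reduce(
    (previous, batch) =>
      previous
        .then(() => saveParsedItems(batch)) // each chunk gets its own bulk write
        .then(() => delay(pauseMs)),
    Promise.resolve()
  );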

Querying MarkLogic merged collection

I'm trying to write a query to get the unique values of an attribute from the final merged collection (sm-Survey-merged). Something like:
select distinct(participantID) from sm-Survey-merged;
I get a tree-cache error with the equivalent JS query below. Can someone help me with a better query?
[...new Set(fn.collection("sm-Survey-merged").toArray().map(doc => doc.root.participantID.valueOf()).sort())]
If there are a lot of documents, and you attempt to read them all in a single query, then you run the risk of blowing out the Expanded Tree Cache. You can try bumping up that limit, but with a large database with a lot of documents you are still likely to hit that limit.
The fastest and most efficient way to produce a list of the unique values is to create a range index, and select the values from that lexicon with cts.values().
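For example, assuming you have created a range index on participantID (scalar type string), the lexicon lookup would be a sketch along these lines:
// Server-side JavaScript: read the distinct participantID values straight from the
// range index, scoped to the merged collection, without loading any documents.
const participantIDs = cts.values(
  cts.jsonPropertyReference("participantID"),
  null,
  [],
  cts.collectionQuery("sm-Survey-merged")
).toArray();

participantIDs;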
Without an index, you could attempt to perform iterative queries that search and retrieve a set of random values, and then perform additional searches excluding those previously seen values. This still runs the risk of blowing out the Expanded Tree Cache, hitting timeouts, etc., so it may not be ideal, but it would allow you to get some info now without reindexing the data.
You could experiment with the number of iterations and the search page size and see whether that stays within limits and provides consistent results. Maybe add some logging or flags to know whether you have hit the iteration limit while more values remain, so you know whether the list is complete or not. You could also try running without an iteration limit, but then you run the risk of OOM or Expanded Tree Cache errors.
function distinctParticipantIDs(iterations, values) {
  const participantIDs = new Set([]);
  const docs = fn.subsequence(
    cts.search(
      cts.andNotQuery(
        cts.collectionQuery("sm-Survey-merged"),
        cts.jsonPropertyValueQuery("participantID", Array.from(values))
      ),
      ["unfiltered", "score-random"]),
    1, 1000);
  for (const doc of docs) {
    const participantID = doc.root.participantID.valueOf();
    participantIDs.add(participantID);
  }
  const uniqueParticipantIDs = new Set([...values, ...participantIDs]);
  if (iterations > 0 && participantIDs.size > 0) {
    // there are still unique values, and we haven't hit our iteration limit, so keep searching
    return distinctParticipantIDs(iterations - 1, uniqueParticipantIDs);
  } else {
    return uniqueParticipantIDs;
  }
}
[...distinctParticipantIDs(100, new Set())];
Another option would be to run a CoRB job against the database, and apply the EXPORT-FILE-SORT option with ascending|distinct or descending|distinct, to dedup the values produced in an output file.

Is there a more elegant way to push and return an array?

I am still figuring out Promises, but while working with them, I've realized it would be nice to reduce an array of fetch calls and put some throttles next to them. While creating my slow-querying function, I realized I couldn't think of a more elegant way to push onto an array and return that array than this.
So, my question is: is there a more elegant way of pushing onto an array and returning that array in one step in JavaScript than this?
const mQry = q => fetch(q).then(r => r.json()); // fetches and returns JSON
const throttle = t => new Promise(r => setTimeout(r, t)); // adds a promised timeout
const slowQrys = (q, t) => // pass in an array of links, and a number of milliseconds
  Promise.all(q.reduce((r, o) => // reduce the queries
    // Here's the big issue. Is there any more elegant way
    // to push two elements onto an array and return an array?
    [...r, mQry(...o), throttle(t)]
  , []));
And before anyone says it, I am super aware that spreading out an array like this could be inefficient, but I'm probably never using more than 10 items, so it's not a super big deal.
A cleaner and more efficient equivalent of the general operation
q.reduce((r, o) =>
[...r, f(...o), g(t)])
uses flatMap:
q.flatMap(o =>
[f(...o), g(t)])
However, in the context of your question, creating a throttle(t) next to each fetch operation in a Promise.all is completely and unambiguously wrong. All of the setTimeout timers will be running in parallel and resolve at the same time, so there’s no point in creating more than one. They don’t interact with the fetch operations, either, just delay the overall fulfilment of the promise slowQrys returns and muddle the array it resolves to.
I would guess that your intent is to chain your fetches, such that two consecutive fetches are spaced by at least t ms.
The chaining is thus:
Promise.all([fetch, wait]), Promise.all([fetch, wait]), ...
The way to write that would thus be:
const wait = t => new Promise(r => setTimeout(r, t)); // same as the question's throttle helper

const slowQrys = (links, t) => links.reduce((p, link) => {
  return p.then(_ => {
    return Promise.all([
      fetch(link),
      wait(t)
    ]);
  });
}, Promise.resolve());
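Note that the chain above throws the responses away. A sketch that also collects the parsed results in order, still using the mQry helper from the question and the wait helper above:
const slowQrysWithResults = (links, t) =>
  links.reduce(
    (p, link) =>
      p.then(results =>
        Promise.all([mQry(link), wait(t)])
          .then(([json]) => [...results, json]) // keep the response, ignore the timer's value
      ),
    Promise.resolve([])
  );

// Usage: resolves with an array of parsed responses, consecutive requests at least t ms apart.
// slowQrysWithResults(['https://example.com/a.json', 'https://example.com/b.json'], 500).then(console.log);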

RxJS Run Promise, then combineLatest

I apologize for so many observable questions lately, but I'm still having a really tough time grasping how to chain everything together.
I have a user, who is using promise-based storage to store the names of feeds they do not want to see. On the Social Feeds widget, they get to see the latest article from each feed that they have not filtered out.
I'd like to take a union of the hard-coded list of feeds and the feeds they want to hide. To work with the API I've been given, I need to make multiple calls to the service to retrieve each feed individually.
After I make that union, I'm looking to combine, sequentially, the observables that the utility getFeed method produces.
Here's what I'm looking to do, with some pseudocode.
/**
 * This gets the top items from all available social media sources.
 * @param limit {number} The number of items to get per source.
 * @returns {Observable<SocialItem[]>} Returns a stream of SocialItem arrays.
 */
public getTopStories(limit: number = 1): Observable<SocialItem[]> {
  // Merge the list of available feeds with the ones the user wants to hide.
  const feedsToGet = this.storage.get('hiddenFeeds')
    .then(hiddenFeeds => _.union(FeedList, hiddenFeeds));
  // Let's use our function that retrieves the feeds and maps them into an Observable<SocialItem[]>.
  // We need to slice the list because only 'limit' articles should come back from each feed, and the API cannot accommodate sending anything other than 25 items at a time.
  // We need to do mergeMap in order to return a single array of SocialItem, instead of a 2D array.
  const feeds$ = feedsToGet.map(feed => this.getFeed(feed).map(res => res ? res.slice(0, limit) : []).mergeMap(val => val));
  // Let's combine them and return.
  return Observable.combineLatest(feeds$);
}
Edit: Again, sorry for sparse code before.
The only issue with your example is that you are doing your manipulation in the wrong time frame. combineLatest needs an Observable array, not a Future of an Observable array, a hint that you need to combineLatest in the promise handler. The other half is the last step to coerce your Promise<Observable<SocialItem[]>> to Observable<SocialItem[]>, which is just another mergeMap away. All in all:
public getTopStories(limit: number = 1): Observable<SocialItem[]> {
  // Merge the list of available feeds with the ones the user wants to hide.
  const feeds_future = this.storage.get('hiddenFeeds')
    .then(hiddenFeeds => Observable.combineLatest(_.map(
      _.union(FeedList, hiddenFeeds),
      feed => this.getFeed(feed).mergeMap(res => res ? res.slice(0, limit) : [])
    ))); // Promise<Observable<SocialItem[]>>
  return Observable.fromPromise(feeds_future) // Observable<Observable<SocialItem[]>>
    .mergeMap(v => v); // finally, Observable<SocialItem[]>
}
P.S. the projection function of mergeMap means you can map your values to Observables in the same step as they are merged, rather than mapping them and merging them separately.
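For instance, these two forms are equivalent (source$ and toInner are placeholder names, not anything from your code):
// Mapping to inner Observables and merging them in separate steps:
source$.map(value => toInner(value)).mergeAll();

// The same thing, using mergeMap's projection function in a single step:
source$.mergeMap(value => toInner(value));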

JavaScript RxJS: sorting chunks of data by date

[{"creationDate":"2011-03-13T00:17:25.000Z","fileName":"IMG_0001.JPG"},
{"creationDate":"2009-10-09T21:09:20.000Z","fileName":"IMG_0002.JPG"}]
[{"creationDate":"2012-10-08T21:29:49.800Z","fileName":"IMG_0004.JPG",
{"creationDate":"2010-08-08T18:52:11.900Z","fileName":"IMG_0003.JPG"}]
I use an HTTP GET method to receive data. Unfortunately, while I do receive this data in chunks, it is not sorted by creationDate descending.
I need to sort these objects by creationDate; my expected result would be:
[{"creationDate":"2012-10-08T21:29:49.800Z","fileName":"IMG_0004.JPG"},
{"creationDate":"2011-03-13T00:17:25.000Z","fileName":"IMG_0001.JPG"}]
[{"creationDate":"2010-08-08T18:52:11.900Z","fileName":"IMG_0003.JPG"},
{"creationDate":"2009-10-09T21:09:20.000Z","fileName":"IMG_0002.JPG"}]
Here's what I tried:
dataInChunks.map(data => {
  return data.sort((a, b) => {
    return new Date(b.creationDate).getTime() - new Date(a.creationDate).getTime();
  });
})
.subscribe(data => {
  console.log(data);
});
This works, but only on one chunk at a time, so each chunk is only sorted within itself and I only get the top result of each chunk. I need some way to join these chunks together, sort them, and then break the whole collection into chunks of two again.
Are there any RxJS operators I can use for this?
If you know the call definitely completes (which it should), then you can just use toArray, which, as the name suggests, collects everything into a single array you can then sort. The point of toArray is that it won't produce a stream of data but will wait until the observable completes and emit all the values at once:
var allData$ = dataInChunks.toArray().map(all => all.sort(/* sorting logic */));
However, if you are required to show the data in the browser as it arrives (if the toArray() approach makes the UI feel unresponsive), then you will have to re-sort the growing dataset as it arrives:
var allData = [];
dataInChunks
  .bufferWithCount(4)
  .subscribe(vals => {
    allData = allData.concat(vals);
    allData.sort(/* sort logic */);
  });
This is slightly hacky as it relies on a variable outside the stream, but you get the idea. It uses bufferWithCount to buffer emissions, which lets you limit the number of re-sorts you do.
TBH, I would just go with the toArray approach, which begs the question why it's an observable in the first place! Good luck.
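For completeness, a sketch of that toArray route which also flattens the incoming chunks, sorts everything once, and re-groups the result into chunks of two. This uses RxJS 5 operator names (bufferCount is the RxJS 5 equivalent of bufferWithCount above), and relies on mergeMap accepting plain arrays as inner values:
const sorted$ = dataInChunks
  .mergeMap(chunk => chunk)  // flatten each incoming array into individual items
  .toArray()                 // wait for completion and collect everything
  .map(all => all.sort((a, b) =>
    new Date(b.creationDate) - new Date(a.creationDate))) // newest first
  .mergeMap(all => all)      // re-emit the sorted items one by one
  .bufferCount(2);           // group them back into chunks of two

sorted$.subscribe(chunk => console.log(chunk));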

Complex challenge about complexity and intersection

Preface
Notice: This question is about complexity. I use a complex design pattern here, which you don't need to understand in order to understand the question. I could have simplified it more, but I chose to keep it relatively untouched for the sake of preventing mistakes. The code is written in TypeScript, which is a superset of JavaScript.
The code
Regard the following class:
export class ConcreteFilter implements Filter {
  interpret() {
    // rows is a very large array
    return (rows: ReportRow[], filterColumn: string) => {
      return rows.filter(row => {
        // I've hidden the implementation for simplicity,
        // but it usually returns either an empty array or a very short one.
      }).map(row => <string>row[filterColumn]);
    };
  }
}
It receives an array of report rows, then filters the array by some logic that I've hidden. Finally, it does not return whole rows, but only the one string column named by filterColumn.
Now, take a look at the following function:
function interpretAnd(filters: Filter[]) {
  return (rows: ReportRow[], filterColumn: string) => {
    var runFilter = filters[0].interpret();
    var intersectionResults = runFilter(rows, filterColumn);
    for (var i = 1; i < filters.length; i++) {
      runFilter = filters[i].interpret();
      var results = runFilter(rows, filterColumn);
      intersectionResults = _.intersection(intersectionResults, results);
    }
    return intersectionResults;
  };
}
It receives an array of filters, and returns a distinct array of all the "filterColumn"s that the filters returned.
In the for loop, I get the results (string array) from every filter, and then make an intersection operation.
The problem
The report row array is large, so every runFilter operation is expensive (while, on the other hand, the filter array is pretty short). I want to iterate the report row array as few times as possible. Additionally, the runFilter operation is very likely to return zero results or very few.
Explanation
Let's say that I have 3 filters and 1 billion report rows. The internal iteration, i.e. the iteration in ConcreteFilter, will happen 3 billion times, even if the first execution of runFilter returned 0 results, so I have 2 billion redundant iterations.
So I could, for example, check whether intersectionResults is empty at the beginning of every iteration, and if so, break the loop. But I'm sure that there are better solutions mathematically.
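(For illustration, that early exit inside interpretAnd's loop would just be something like:)
for (var i = 1; i < filters.length; i++) {
  if (intersectionResults.length === 0) {
    break; // nothing can survive further intersections, so skip the remaining filters
  }
  runFilter = filters[i].interpret();
  var results = runFilter(rows, filterColumn);
  intersectionResults = _.intersection(intersectionResults, results);
}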
Also, if the first runFilter execution returned, say, 15 results, I would expect the next execution to receive an array of only 15 report rows, meaning I want the intersection operation to influence the input of the next call to runFilter.
I can modify the report row array after each iteration, but I don't see how to do it in an efficient way that won't be even more expensive than it is now.
A good solution would be to remove the map operation and then pass the already-filtered array to each operation instead of the entire array, but I'm not allowed to do that because I must not change the result format of the Filter interface.
My question
I'd like to get the best solution you could think of as well as an explanation.
Thanks a lot in advance to everyone who spends their time trying to help me.
Not sure how effective this will be, but here's one possible approach you can take. If you preprocess the rows by the filter column, you'll have a way to retrieve the matched rows. If you typically have more than 2 filters, then this approach may be more beneficial; however, it will be more memory intensive. You could branch the approach depending on the number of filters. There may be some TS constructs that are more useful; I'm not very familiar with it. There are some comments in the code below:
var map = {};
// Loop over every row, keeping a map from each filter-column value to the rows that carry it.
allRows.forEach(row => {
  const v = row[filterColumn];
  const items = map[v] = map[v] || [];
  items.push(row);
});
let rows = allRows;
filters.forEach(f => {
  // Run the filter (interpret() returns the runnable filter function) and keep the unique set of matched strings.
  const matches = unique(f.interpret()(rows, filterColumn));
  // For each of the matched strings, go and look up their rows in the map and concat them as the input for the next filter.
  rows = [].concat(...matches.map(m => map[m] || []));
});
// Loop over the rows that made it all the way through, extract the value and then unique() the collection.
return unique(rows.map(row => row[filterColumn]));
Thinking about it some more, you could use a similar approach but just do it on a per-filter basis:
let rows = allRows;
filters.forEach(f => {
  // Run the filter against the rows that have survived so far.
  const matches = f.interpret()(rows, filterColumn);
  let map = {};
  matches.forEach(m => {
    map[m] = true;
  });
  // Keep only the rows whose filter-column value was matched by this filter.
  rows = rows.filter(row => !!map[row[filterColumn]]);
});
return unique(rows.map(row => row[filterColumn]));
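Both snippets assume a small dedup helper (unique), which isn't defined above; something like this would do:
// Hypothetical helper: return a new array with duplicate values removed.
const unique = arr => [...new Set(arr)];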
