Our MongoDB database contains all of our user accounts; each account document has a 'created_at' field holding the date and time the registration was created.
We wanted to find out how many new registrations there were for each day, so we put together a MapReduce query to work this out for us.
db.accounts.mapReduce(
    function() {
        var date = this.created_at.toLocaleDateString();
        emit(date, 1);
    },
    function(key, values) {
        return values.length;
    },
    { out: "output" })
Our first attempt is shown above. For each registration it emits a value of 1 keyed by the date, and the length of each resulting array is then used to determine how many registrations there were on that day.
However, while the results were mostly correct, there were notable inaccuracies. For example, the first day gave us a value in double figures when we know the actual figure was far higher, and some values changed after running the MapReduce a second time, despite it operating on the same data.
We changed the reduce function to instead sum up the values of the array (which, remember, should only consist of 1's and therefore give a result identical to array.length).
db.accounts.mapReduce(
    function() {
        var date = this.created_at.toLocaleDateString();
        emit(date, 1);
    },
    function(key, values) {
        var sum = 0;
        for (var i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    },
    { out: "output" })
To our surprise, this gave the correct result for every date that was wrong before.
Does anyone know why the first MapReduce did not operate as intended?
Reduce may be called multiple times for the values emitted under a single key, with later calls being passed the output of earlier calls to reduce. When you only look at the length of the array, you miss the fact that you may be looking at partially aggregated data. Summing the values lets the earlier aggregations accumulate correctly, which is what you want.
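To make that concrete, here is an illustrative sketch (the date and batch sizes are invented, not real engine output) of how the emitted 1's for a single day could be handed to reduce in two passes:

var reduceFn = function(key, values) { return values.length; }; // the first attempt's reduce

// Suppose 100 registrations were emitted as 1's for one day, and the engine
// happens to reduce them in two batches instead of one.
var partial = reduceFn("2015-11-01", new Array(60).fill(1)); // -> 60

// Re-reduce: the earlier result is passed back in alongside the remaining 1's.
var final = reduceFn("2015-11-01", [partial].concat(new Array(40).fill(1)));
// length-based reduce -> 41 (wrong); a sum-based reduce would give 60 + 40 = 100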
Related
I'm dealing with an array of "events" where the key of the array is the Unix Timestamp of the event. In other words, let's assume we have the following array of event objects in JS:
var MyEventsArray=[];
MyEventsArray[1513957775]={lat:40.671978333333, lng:14.778661666667, eventcode:46};
MyEventsArray[1513957845]={lat:40.674568332333, lng:14.568661645667, eventcode:23};
MyEventsArray[1513957932]={lat:41.674568332333, lng:13.568661645667, eventcode:133};
and so on for thousands of rows...
The data are sent via an Ajax call, encoded as JSON, and processed with JS. When the data set is received, I have another Unix Timestamp, let's say 1513957845, coming from another source, and I want to find the event that happened at that time. That's quite easy: I just take the element of the array having the given index (the second in the list above).
Now the question: imagine that the given index is not found (say we are looking for UXTimestamp=1513957855) because it does not exist in the array, but I want to take the closest index (in the example above I would take the element MyEventsArray[1513957845], as its index 1513957845 is the closest to 1513957855). What can I do to obtain this result?
My difficulty is in handling the array index: when I receive the array, I don't know where the indexes begin.
How will the machine handle situations like that?
Will the machine allocate (and waste) memory for dummy/empty elements placed between the rows, or does the engine have some ability to build its own index and optimize the space? In other words: is it safe to play with indexes as we're doing, or is it better to allocate the array as:
var MyEventsArray=[];
MyEventsArray['1513957775']={lat:40.671978333333, lng:14.778661666667, eventcode:46};
MyEventsArray['1513957845']={lat:40.674568332333, lng:14.568661645667, eventcode:23};
MyEventsArray['1513957932']={lat:41.674568332333, lng:13.568661645667, eventcode:133};
and so on for thousands of rows...
In this case the key and the index are clearly different, so here it's possible to get the first element with MyArray[0] even though we don't know the key value. Is this approach more expensive in terms of memory (here we must save both index and key), or are the effects the same for the engine?
There is no difference between MyEventsArray[1513957775] and MyEventsArray['1513957775']. Deep down, array indexes are just property names, and property names are strings.
Regarding the question of whether these sparse indices will lead to millions of empty cells being allocated, no, that won't happen. Sparse arrays only store what you put in them, not empty space.
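You can verify that with a quick console test; the index here is just an arbitrary large number:

var a = [];
a[1000000] = "only entry";          // a single element at a huge index
console.log(Object.keys(a).length); // 1 - only the assigned property is stored
console.log(a.length);              // 1000001 - length is just the highest index + 1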
If you want to find the closest key, you can obtain an array of the keys, compute each key's distance from the target, and take the smallest:
var MyEventsArray=[];
MyEventsArray[1513957775]={lat:40.671978333333, lng:14.778661666667, eventcode:46};
MyEventsArray[1513957845]={lat:40.674568332333, lng:14.568661645667, eventcode:23};
MyEventsArray[1513957932]={lat:41.674568332333, lng:13.568661645667, eventcode:133};
var target = 1513957855;
var closest = Object.keys(MyEventsArray)
    .map(k => ({ k, delta: Math.abs(target - k) }))
    .sort((a, b) => a.delta - b.delta)[0].k;
console.log(closest);
You could use Array#some, which allows you to exit the iteration as soon as the delta starts getting greater than the previous delta.
var array = [];
array[1513957775] = { lat: 40.671978333333, lng: 14.778661666667, eventcode: 46 };
array[1513957845] = { lat: 40.674568332333, lng: 14.568661645667, eventcode: 23 };
array[1513957932] = { lat: 41.674568332333, lng: 13.568661645667, eventcode: 133 };
var key = 0,
    search = 1513957855;

Object.keys(array).some(function (k) {
    if (Math.abs(k - search) > Math.abs(key - search)) {
        return true;
    }
    key = k;
});
console.log(key);
You can use Object.keys(MyEventsArray) to get an array of the keys (which are expressed as strings, since property names are always strings); you could then iterate through that and find the closest match, as sketched after the example below.
var MyEventsArray=[];
MyEventsArray[1513957775]={lat:40.671978333333, lng:14.778661666667, eventcode:46};
MyEventsArray[1513957845]={lat:40.674568332333, lng:14.568661645667, eventcode:23};
MyEventsArray[1513957932]={lat:41.674568332333, lng:13.568661645667, eventcode:133};
console.log(Object.keys(MyEventsArray));
// ["1513957775", "1513957845", "1513957932"]
Reference: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array
Preface
Notice: This question is about complexity. I use a complex design pattern here, which you don't need to understand in order to understand the question. I could have simplified it further, but I chose to keep it relatively untouched for the sake of preventing mistakes. The code is written in TypeScript, which is a superset of JavaScript.
The code
Regard the following class:
export class ConcreteFilter implements Filter {
    interpret() {
        // rows is a very large array
        return (rows: ReportRow[], filterColumn: string) => {
            return rows.filter(row => {
                // I've hidden the implementation for simplicity,
                // but it usually returns either an empty array or a very short one.
            }).map(row => <string>row[filterColumn]);
        };
    }
}
It receives an array of report rows, then it filters the array by some logic that I've hidden. Finally it does not return the whole rows, but only the one string column named by filterColumn.
Now, take a look at the following function:
function interpretAnd(filters: Filter[]) {
    return (rows: ReportRow[], filterColumn: string) => {
        var runFilter = filters[0].interpret();
        var intersectionResults = runFilter(rows, filterColumn);
        for (var i = 1; i < filters.length; i++) {
            runFilter = filters[i].interpret();
            var results = runFilter(rows, filterColumn);
            intersectionResults = _.intersection(intersectionResults, results);
        }
        return intersectionResults;
    }
}
It receives an array of filters, and returns a distinct array of all the "filterColumn"s that the filters returned.
In the for loop, I get the results (string array) from every filter, and then make an intersection operation.
The problem
The report row array is large, so every runFilter operation is expensive (while on the other hand the filter array is pretty short). I want to iterate the report row array as few times as possible. Additionally, the runFilter operation is very likely to return zero results or very few.
Explanation
Let's say that I have 3 filters and 1 billion report rows. The internal iteration, i.e. the iteration inside ConcreteFilter, will happen 3 billion times, even if the first execution of runFilter returned 0 results, so I have 2 billion redundant iterations.
So, I could, for example, check whether intersectionResults is empty at the beginning of every iteration and, if so, break the loop (a sketch of that check is shown below). But I'm sure there are better solutions mathematically.
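For reference, that early-exit check is only a small change to interpretAnd from above; this is just a sketch of the check itself, not a full solution:

function interpretAnd(filters: Filter[]) {
    return (rows: ReportRow[], filterColumn: string) => {
        var runFilter = filters[0].interpret();
        var intersectionResults = runFilter(rows, filterColumn);
        for (var i = 1; i < filters.length; i++) {
            // Nothing can survive an intersection with an empty set, so stop early.
            if (intersectionResults.length === 0) {
                break;
            }
            runFilter = filters[i].interpret();
            var results = runFilter(rows, filterColumn);
            intersectionResults = _.intersection(intersectionResults, results);
        }
        return intersectionResults;
    }
}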
Also, if the first runFilter execution returned, say, 15 results, I would expect the next execution to receive an array of only 15 report rows, meaning I want the intersection operation to influence the input of the next call to runFilter.
I can modify the report row array after each iteration, but I don't see how to do that in an efficient way that won't be even more expensive than it is now.
A good solution would be to remove the map operation, and then passing the already filtered array in each operation instead of the entire array, but I'm not allowed to do it because I must not change the results format of Filter interface.
My question
I'd like to get the best solution you could think of as well as an explanation.
Thanks a lot in advance to everyone who spends their time trying to help me.
Not sure how effective this will be, but here's one possible approach you can take. If you preprocess the rows by the filter column, you'll have a way to retrieve the matched rows. If you typically have more than 2 filters then this approach may be more beneficial; however, it will be more memory intensive. You could branch the approach depending on the number of filters. There may be some TS constructs that are more useful; I'm not very familiar with it. There are some comments in the code below:
var map = {};
// Loop over every row, keeping a map from each filter-column value to the rows that have it.
allRows.forEach(row => {
    const v = row[filterColumn];
    const items = map[v] = map[v] || [];
    items.push(row);
});

let rows = allRows;
filters.forEach(f => {
    // Run the filter and keep the unique set of matched strings.
    const matches = unique(f.execute(rows, filterColumn));
    // For each of the matched strings, look up the corresponding rows and concat them for the next filter.
    rows = [].concat(...matches.map(m => map[m] || []));
});

// Loop over the rows that made it all the way through, extract the value and then unique() the collection.
return unique(rows.map(row => row[filterColumn]));
Thinking about it some more, you could use a similar approach but just do it on a per filter basis:
let rows = allRows;
filters.forEach(f => {
    const matches = f.execute(rows, filterColumn);
    let map = {};
    matches.forEach(m => {
        map[m] = true;
    });
    rows = rows.filter(row => !!map[row[filterColumn]]);
});
return distinctify(rows.map(row => row[filterColumn]));
Problem:
I have a DB containing math exercises, split by difficulty levels and date taken.
I want to generate a diagram of the performance over time.
To achieve this, I loop through the query results and increment a counter for the level and the day the exercise was taken.
Example: a level 2 exercise was taken on 01.11.2015.
this.levels[2].daysAgo[1].amountTaken++;
With this, I can build a diagram where day 0 is always today and the performance over the preceding days is shown.
Now, levels[] has a predefined number of levels, so there is no problem with that.
But daysAgo[] is very dynamic (it even changes daily with the same data), so if only one exercise was taken, it would wander on a daily basis (from daysAgo[0] to daysAgo[1] and so on).
The daysAgo[] slots in between would be empty (because there are no entries for those days).
But for evaluating the diagram, I need them to have an initialized state with amountTaken: 0, and so on.
The problem being: I can't know when the oldest exercise was taken.
Idea 1:
First gather all entries in a kind of proxy object, where I have a var maxDaysAgo that holds the value for the oldest exercise, then initialize an array of maxDaysAgo elements that gets filled with 0-entries before inserting the actual entries.
That seems very clumsy and overly complicated.
Idea 2:
Just add the entries with this.levels[level].daysAgo[daysAgo].amountTaken++;, possibly leaving the daysAgo array with a lot of undefined keys.
Then, after all entries are added, I would loop over the daysAgo keys with
for (var i = 1; i < this.maxLevel; i++) { // for every level
    for (var j = 0; j < this.levels[i].daysAgo.length; j++) {
But daysAgo.length will not count undefined fields, will it?
So if I have one single entry at [24], length will still be 1 :/
Question:
How can I find out the highest key in an array and loop up to it, when there are undefined keys in between?
How can I address all undefined keys up to the highest one (and no more)?
Or: what would be a different, more elegant way to solve this whole problem altogether?
Thanks :)
array.length is one higher than the highest numerical index, so it can be used to loop through even undefined values.
as a test:
var a = [];
a[24] = 1;
console.log(a.length);
This outputs 25 for me (in Chrome and Firefox).
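Applying that to the daysAgo case, you could loop up to daysAgo.length and give every undefined slot a zeroed entry; this is just a sketch reusing the field names from the question:

for (var i = 1; i < this.maxLevel; i++) {          // for every level
    var daysAgo = this.levels[i].daysAgo;
    for (var j = 0; j < daysAgo.length; j++) {
        // Undefined slots are days with no exercises; initialize them.
        if (daysAgo[j] === undefined) {
            daysAgo[j] = { amountTaken: 0 };
        }
    }
}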
I have one array of ids and one array of JavaScript objects. I need to filter/search the objects array using the values in the id array, in Node.js.
For example
var id = [1, 2, 3];
var fullData = [
    {id: 1, name: "test1"},
    {id: 2, name: "test2"},
    {id: 3, name: "test3"},
    {id: 4, name: "test4"},
    {id: 5, name: "test5"}
];
Using the above data, as a result I need to have:
var result = [
    {id: 1, name: "test1"},
    {id: 2, name: "test2"},
    {id: 3, name: "test3"}
];
I know I can loop through both and check for matching ids, but is this the only way to do it, or is there a simpler and more resource-friendly solution?
The amount of data which will be compared is about 30-40k rows.
This will do the trick, using Array.prototype.filter:
var result = fullData.filter(function(item){ // Filter fulldata on...
return id.indexOf(item.id) !== -1; // Whether or not the current item's `id`
}); // is found in the `id` array.
Please note that this filter function is not available on IE 8 or lower, but the MDN has a polyfill available.
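If the id list itself grows large, the indexOf inside the filter is a linear scan per row; one alternative sketch (assuming a runtime with Set support, which any recent Node.js has) is to build a Set once and do constant-time lookups:

var idSet = new Set(id);                       // build once from the id array
var result = fullData.filter(function (item) {
    return idSet.has(item.id);                 // O(1) membership check per row
});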
As long as you're starting with an unsorted Array of all possible Objects, there's no way around iterating through it. #Cerbrus' answer is one good way of doing this, with Array.prototype.filter, but you could also use loops.
But do you really need to start with an unsorted Array of all possible Objects?
For example, is it possible to filter these objects out before they ever get into the Array? Maybe you could apply your test when you're first building the Array, so that objects which fail the test never even become part of it. That would be more resource-friendly, and if it makes sense for your particular app, then it might even be simpler.
function insertItemIfPass(theArray, theItem, theTest) {
    if (theTest(theItem)) {
        theArray.push(theItem);
    }
}

// Insert your items by using insertItemIfPass

var i;
for (i = 0; i < theArray.length; i += 1) {
    doSomething(theArray[i]);
}
Alternatively, could you use a data structure that keeps track of whether an object passes the test? The simplest way to do this, if you absolutely must use an Array, would be to also keep an index to it. When you add your objects to the Array, you apply the test: if an object passes, then its position in the Array gets put into the index. Then, when you need to get objects out of the Array, you can consult the index: that way, you don't waste time going through the Array when you don't need to touch most of the objects in the first place. If you have several different tests, then you could keep several different indexes, one for each test. This takes a little more memory, but it can save a lot of time.
function insertItem(theArray, theItem, theTest, theIndex) {
    theArray.push(theItem);
    if (theTest(theItem)) {
        theIndex.push(theArray.length - 1);
    }
}

// Insert your items using insertItem, which also builds the index

var i;
for (i = 0; i < theIndex.length; i += 1) {
    doSomething(theArray[theIndex[i]]);
}
Could you sort the Array so that the test can short-circuit? Imagine a setup where you've got your array set up so that everything which passes the test comes first. That way, as soon as you hit your first item that fails, you know that all of the remaining items will fail. Then you can stop your loop right away, since you know there aren't any more "good" items.
// Insert your items, keeping items which pass theTest before items which don't

var i = 0;
while (i < theArray.length) {
    if (!theTest(theArray[i])) {
        break;
    }
    doSomething(theArray[i]);
    i += 1;
}
The bottom line is that this isn't so much a language question as an algorithms question. It doesn't sound like your current data structure (an unsorted Array of all possible items) is well-suited for your particular problem. Depending on what else the application needs to do, it might make more sense to use another data structure entirely, or to augment the existing structure with indexes. Either way, planned carefully, it will save you some time.
I am writing my second MapReduce to get the top ten songs played by every user over the last week, from a collection whose documents contain a nested "activities" document holding an array of song_id, counter, and date. The counter is the number of times the song was played.
I was able to accomplish this task and output the needed results using only "map", without needing to reduce the emitted values. Is this a wrong approach? What is the best way of doing this?
Here is the map function:
var map = function() {
    var user_top_songs = [];
    var user_songs = [];
    var limit = 10;
    if (this.activities !== undefined) {
        var key = {user_id: this.id};
        for (var i = 0; i < this.activities.songs.length; i++) {
            if (this.activities.songs !== undefined && this.activities.songs[i].date.getDate() > (new Date().getDate() - 7))
                user_songs.push([this.activities.songs[i].song_id, this.activities.songs[i].counter]);
        }
        if (user_songs.length !== 0) {
            user_songs.sort(function(a, b) { return b[1] - a[1]; });
            if (user_songs.length < 10)
                limit = user_songs.length;
            for (var j = 0; j < limit; j++)
                user_top_songs.push(user_songs[j]);
        }
        var value = {songs: user_top_songs};
        emit(key, value);
    }
}
Here is the empty reduce method:
var reduce = function(key, values) {};
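For reference, the pair would then be run with something along these lines; the collection name and output collection here are assumptions, not taken from the question:

db.users.mapReduce(map, reduce, { out: "user_top_songs_last_week" });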
You shouldn't need a reduce function. Based on the input data it won't be necessary, and I'll explain why.
To recall in a simplified manner: in MapReduce, the mapper function takes the input, splits it up by key, and then passes the (key, value) pairs to the reducer. The reducer then aggregates the (key, [list of values]) pairs into some useful output.
In your case, the key is the user ID and the value is the top 10 songs they listened to. Just by the way the data is laid out, it is already organized into (key, [list of values]) pairs: each key already comes with the complete list of values associated with it, since every user document carries every song that user listened to, so there is nothing left to reduce.
Basically, the reduce step would be combining each (user ID, song) pair into a list of the user's songs. But that has already been done; it's inherent in the data. So, in this specific case, the mapper is the only function necessary to accomplish what you need.