I want to perform a MapReduce job on data in Riak DB using JavaScript, but I am stuck at the very beginning: I cannot understand how it returns values.
import riak

client = riak.RiakClient()
query = client.add('user')
query.map("""
function(v){
  var i = 0;
  i++;
  return [i];
}
""")
for result in query.run():
    print "%s" % (result)
For simplicity I have tested the above example.
Here the query runs over the user bucket, which contains five records in RiakDB.
I thought map() would return a single value, but it returns an array with five values, which I think corresponds to the five records in RiakDB:
1
1
1
1
1
And why can I only return an array here? It treats each record independently and returns a result for each one, which I think is why I get five 1's. Because of this, when I process the fetched data inside map(), the return value is unexpected for me.
So please give me some suggestions. I think this is a basic thing, but I could not figure it out. I highly appreciate your help.
When you run a MapReduce job, the map phase code is sent out to the vnodes where the data is stored and executed for each value in the data. The resulting arrays are collected and passed to a single reduce phase, which also returns an array. If there are sufficiently many results, the reduce phase may be run multiple times, with the previous reduce result and a batch of map results as input.
The fact that you are getting 5 results implies that 5 keys were seen in your bucket. There is no global state shared between instances of the map phase function, so each will have an independent i, which is why each result is 1.
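For instance, to get a single count of the objects instead of five separate 1's, you can keep the map phase returning [1] for each key and add a reduce phase that sums everything collected. A minimal sketch of such a reduce function in JavaScript (wired up with the client's reduce call; summing is safe even if the reduce phase is re-run over its own previous output):

function(values) {
  // values is the concatenation of the map results (and possibly a
  // previous reduce result), e.g. [1, 1, 1, 1, 1]
  var total = 0;
  for (var j = 0; j < values.length; j++) total += values[j];
  return [total]; // a single-element array: [5] for five keys
}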
You might try returning [v.key] so that you have something unique for each one, or if the values are expected to be small, you could return [JSON.stringify(v)] so you can see the entire structure that is passed to the map.
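For instance, a map phase following that suggestion might look like this (a minimal sketch; v is the object each map invocation receives):

function(v) {
  // v.key is unique per object; use [JSON.stringify(v)] instead
  // to see the entire structure passed to the map phase
  return [v.key];
}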
You should note that, according to the docs site, JavaScript MapReduce has been officially deprecated, so you may want to use Erlang functions for new development.
Related
So yesterday I started messing around with MongoDB in Node, and when it came to retrieving data I encountered a practice that seemed weird.
You retrieve data from a collection that is nested within a database by calling:
data = client.db([dbname]).collection([collectionname]).find([searchcriteria])
and this returns what seems to be an object, at least in the eyes of typeof.
The sample code then uses the following lines to log it to the console:
function iterate(x){
  console.log(x)
}
data.forEach(iterate)
The output is as expected: in this case, two objects with two pairs each. Everything is fine so far.
I thought it was a bit unnecessary to have the iterate function, so I changed that to just
console.log(data)
expecting the two objects in an array or nested in another object, but what I get is this huge object with all kinds of different things in it EXCEPT the two objects we saw before.
So now to my question and what I need a deeper explanation of:
Why can I actually use .forEach() on this object? I cannot recreate this on other objects.
And the second thing: why is console.log(data) giving me all this output that is hidden if I go through .forEach()?
And is there any other way to quickly, within one or two lines of code, retrieve data from Mongo? This seems like a very cumbersome way of doing things.
And how does this .forEach() thing on objects work? I found an article here on Stack Overflow, but it was not very detailed and not very easy to understand.
The find function returns a cursor; this is the huge object that you are seeing. Check out the documentation for more details here: https://docs.mongodb.com/manual/reference/method/db.collection.find/#db.collection.find
The reason why you can call forEach on the returned object (=cursor), is because it is one of its methods. See https://docs.mongodb.com/manual/reference/method/cursor.forEach/#cursor.forEach
Overview of all cursor methods is here: https://docs.mongodb.com/manual/reference/method/js-cursor/
To get the array of data that you are looking for, you need to use the toArray method (note that in the Node driver toArray returns a Promise, so it needs to be awaited), like so:
const data = await client.db([dbname]).collection([collectionname]).find([searchcriteria]).toArray()
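A minimal sketch of the whole pattern (assuming client is an already-connected MongoClient; the database name, collection name, and filter are placeholders):

async function fetchData() {
  // find() returns a cursor; toArray() drains it into a plain array of documents
  const data = await client.db('mydb').collection('users').find({}).toArray()
  console.log(data) // now logs the documents themselves, not the cursor internals
  return data
}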
I'm a bit sorry about the tags; I probably misunderstood my problem and used them wrong, but...
The problem I'm facing in my project is new to me; I have never experienced it before. In my case I have a huge dataset response from the DB (Mongo, 100,000+ docs), and I need to make an HTTP request for a specific field of every doc.
An example array from the dataset will look like:
[
  {
    _id: 1,
    http: http.request.me
  },
  {
    // each of 99k more docs
  }
]
So I guess you already understood that I cannot use a default for loop, because:
if it runs async, I'll make a huge number of requests to the API and will be banned/restricted/whatever;
if I make the requests one by one, it will take about 12-23 hours of waiting before my loop completes itself. (Actually, this is what I'm doing right now.)
There is also another way, and that's why I'm here: I could split my huge array into chunks, for example of 5/10/100...N items each, and request them chunk by chunk:
│→await[request_map 0,1,2,3,4]→filled
│→await[request_map 5..10]→filled
│→await[request_map n..n+5]→filled
↓
According to Split array into chunks, I could easily do that. But then I would need two for loops: the first to split the default array into chunks, and the second to async-request each new chunk (of length 5/10/100...N), as sketched below.
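For comparison, here is a minimal sketch of that two-loop chunked approach with plain async/await, no RxJS involved (makeRequest is a hypothetical stand-in for whatever HTTP call you issue per document):

async function runInChunks(docs, size) {
  const results = []
  for (let i = 0; i < docs.length; i += size) {
    const chunk = docs.slice(i, i + size)
    // fire one chunk of requests concurrently, then wait before starting the next
    results.push(...await Promise.all(chunk.map(doc => makeRequest(doc.http))))
  }
  return results
}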
But I have recently heard about the reactive paradigm and RxJS, which (probably) could solve this. Is that right? Which operator should I use? What keywords should I search for to find related problems? (If I google "reactive programming" I get a lot of useless results about react.js, which is not what I want.)
So should I stop worrying and just write the unoptimized code, or is there an npm module for this, or another, better pattern/solution?
I probably found an answer here: RxJS 1 array item into sequence of single items - operator. I'm checking it now, but I also appreciate any relevant contribution to this question.
RxJS has truly been helpful in this case and is worth looking into. It's an elegant solution for this kind of problem.
Make use of bufferCount and concatMap:
import { range, forkJoin } from 'rxjs';
import { map, bufferCount, concatMap } from 'rxjs/operators';

range(0, 100).pipe(
  // wrap each http call in an observable without executing it yet
  map(res => http(...)),
  // 5 at a time
  bufferCount(5),
  // execute the calls concurrently, in a queue of 5 calls each time
  concatMap(res => forkJoin(res))
).subscribe(console.log)
There's actually an even easier way to do what you want, using the mergeMap operator and its second optional argument, which sets the number of concurrent inner Observables:
from([obj1, obj2, obj3, ...]).pipe(
mergeMap(obj => /* make a request out of `obj` */, 5), // keep only 5 concurrent requests
).subscribe(result => ...)
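Note the practical difference between the two suggestions: bufferCount plus concatMap(forkJoin) waits for an entire batch of 5 to finish before starting the next batch, while mergeMap with a concurrency of 5 starts a new request as soon as any in-flight one completes, so all 5 slots stay busy.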
I'm trying to iterate through an array and create a record for each element. This is what I am doing, as mentioned in another question:
async.each(data, (datum, callback) => {
  console.log('Iterated')
  Datum.create({
    row: datum,
  }).exec((error) => {
    if (error) return res.serverError(error)
    console.log('Created')
    callback()
  })
})
Unfortunately, it results in this:
Iterated
Iterated
Iterated
Created
Created
Created
Not this, which is what I want:
Iterated
Created
Iterated
Created
Iterated
Created
What am I doing wrong?
async.eachSeries() will run one iteration at a time and wait for each iteration to finish before moving on to the next.
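A minimal sketch of the same loop rewritten with eachSeries (same signature as async.each; the optional final callback is a convenient single place to handle errors):

async.eachSeries(data, (datum, callback) => {
  console.log('Iterated')
  Datum.create({
    row: datum,
  }).exec((error) => {
    if (error) return callback(error) // propagate instead of responding mid-loop
    console.log('Created')
    callback()
  })
}, (error) => {
  if (error) return res.serverError(error)
})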
I create a unique, user-friendly identifier before each creation (like 1, 2, 3, and so on). For that, I have to query the database to find the latest identifier and increment it, which does not work here because the records are created at nearly the same time.
This sounds like it is the bottleneck. I don't like running async code in series, because that usually slows processes down. How about this approach:
Thanks to data you already know how many identifiers you'll need.
Implement a function in the backend that creates not just a single identifier but n of them at a time (including the necessary incrementing, etc.) and returns that array to the frontend. Now you can run your regular requests in parallel, mapping that array of precomputed IDs onto the data array.
This should reduce the runtime from (createAnId + request) * data.length pretty much down to the runtime of a single iteration, since all these requests can run in parallel and therefore mostly overlap, as sketched below.
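A sketch of that idea, where reserveIds is a hypothetical backend call that atomically reserves n sequential identifiers and returns them as an array (the identifier field name is also just a placeholder):

const ids = await reserveIds(data.length) // e.g. [17, 18, 19, ...]
// every create now carries its own precomputed ID, so they can all run in parallel
await Promise.all(data.map((datum, i) =>
  Datum.create({ row: datum, identifier: ids[i] })
))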
It looks like Datum.create is an asynchronous function.
The iterator whips through each of the three elements of the array, logging them in turn, and since JavaScript won't block waiting for the asynchronous results to come back, you get each of the console.logs in turn.
Then, after some amount of time, the results come in and "Created" is logged to the console.
You seem to be using an asynchronous data processing library. For the result you intend to get, you need to process the data synchronously. Here's how you could do it:
data.forEach(function(datum) {
  console.log('Iterated')
  Datum.create({
    row: datum,
  }).exec((error) => {
    if (error) return res.serverError(error)
    console.log('Created')
  })
})
The callback has also been dropped here, since the data is no longer processed through the async library and plain forEach does not provide one.
It is my understanding that, from a performance perspective, direct assignment is more desirable than .push() when populating an array.
My code is currently as follows:
for each (var e in Collection) {
  do {
    DB_Query().forEach(function(e) { data.push([e.title, e.id]); });
  } while (pageToken);
}
The DB_Query() method runs a Google Drive query and returns a list.
My issue arises because DB_Query() can return a list of variable length. As such, if I construct data = new Array(100), direct assignment has the potential to go out of bounds.
Is there a method by which I could try and catch an out-of-bounds exception so that values are directly assigned for the 100 pre-allocated indices, with .push() used for any overflow? The expectation here is that an OOB exception will not occur often.
Also, I'm not sure if it matters, but I am clearing the array after a counter variable is >=100 using the following method:
while(data.length > 0) {data.pop()}
In JavaScript, if you set a value at an index bigger than the array length, the array will automatically "stretch" to fit it, so there's no need to bother with this. If you can make a good guess about your array size, go for it.
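For example (plain JavaScript; the same behavior applies in Apps Script):

var data = new Array(100)
data[150] = 'overflow'     // no out-of-bounds exception; the array just grows
console.log(data.length)   // 151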
About your clearing loop: that's correct, and it seems that pop is indeed the fastest way.
My original suggestion was to set the array length back to zero: data.length = 0;
Now a tip that I think really makes a performance difference here: you're worrying about the wrong part!
In Apps Script, what takes long is not resizing arrays dynamically or processing your data; that part is fast. The issue is always with the API calls, that is, using UrlFetch or Spreadsheet.Range.getValue and so on.
You should take care to make the minimum number of API calls possible, and in your case (I'm guessing now, since I haven't seen your whole code) you seem to be doing it wrong. If DB_Query is costly (in API-call terms), you should not have it nested inside two loops. The best solution usually involves figuring out everything you'll need beforehand (doing as many loops as you need, as long as they don't call any API), then passing all the parameters to one bulk operation and gathering everything at once (in one API call), even if that means getting more data than you needed. Then, with the whole data set at hand, loop through it and transform it as required (that's the fast part); see the sketch below.
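As an illustration of that bulk pattern with the Spreadsheet service (sheet is assumed to be a Sheet object you already hold; the per-row transformation is a placeholder):

// one API call fetches the entire range at once...
var rows = sheet.getDataRange().getValues()
// ...and all of the per-row work happens in memory, which is the fast part
var data = rows.map(function(row) {
  return [row[0], row[1]]
})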
I have a JavaScript application that calls an API, and the API returns JSON. From the JSON, I select a specific object and loop through it.
My code flow is something like this:
Service call -> GetResults
Loop through Results and build Page
The problem, though, is that sometimes the API returns only one result, which means it returns an object instead of an array, so I can't loop through the results. What would be the best way to get around this?
Should I convert my object, or single result, to an array? Put/push it inside an array? Or should I do a typeof check to see whether the element is an array, and only then do the looping?
Thanks for the help.
// this is what is returned when there is more than one result
var results = {
  pages: [
    {"pageNumber": 204},
    {"pageNumber": 1024},
    {"pageNumber": 3012}
  ]
}
// this is what is returned when there is only one result
var results = {
  pages: {"pageNumber": 105}
}
My code loops through results using a plain for loop, but it throws errors, since sometimes results.pages is not an array. So again: do I check whether it's an array? Push results into a new array? What would be better? Thanks.
If you have no control over the server side, you could do a simple check to make sure it's an array:
if (!(results.pages instanceof Array)) {
  results.pages = [results.pages];
}
// Do your loop here.
Otherwise, this should ideally happen on the server side; it should be part of the contract that the results can always be accessed in the same fashion.
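A minor variation on the check above: Array.isArray performs the same test and, unlike instanceof, also works for arrays created in another frame or context:

if (!Array.isArray(results.pages)) {
  results.pages = [results.pages];
}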
Arrange whatever you do to your objects inside the loop into a separate procedure, and if you discover that the object is not an array, apply the procedure to it directly; otherwise, apply it to each element of that object:
function processPage(page) { /* do something to your page */ }
if (pages instanceof Array) pages.forEach(processPage);
else processPage(pages);
Obvious benefits of this approach, compared to the one where you create a redundant array, are that, well, you don't create a redundant array and you don't modify the data you received. While at this stage it may not be important that the data stays intact, in general modifying it might cause you more trouble when running integration and regression tests.