CouchDB MapReduce query for relational data - javascript

Before I start, please don't tell me to use SQL. This is a small problem in a bigger context and I can't and don't want to use a relational database here. I know the problem is pretty easy to solve in SQL.
There are three types of documents:
Players
Teams
Sponsors
Teams belong to one sponsor but have many players. A player can be in multiple teams and a sponsor can have multiple teams.
Players 1 --- N Teams N --- 1 Sponsors
I put the player document ids into the Teams document in an array:
players: ["payer1","player2",...]
Now I want all (distinct, only named once) sponsors for a specific player:
Player1: Sponsor1, Sponsor2, Sponsor3,...
It's a bit like the n-n example in the CouchDB Wiki, but since there are multiple keys in the team-players, that doesn't really work.
I created a gist with example data.
Any idea how to write the MapReduce function to get to this result?
The closest I got (queried with group level 1), but it shows sponsors multiple times:
function(doc) {
  if (doc.type == "team")
    emit(doc.players, doc.sponsor);
}
function(keys, values) {
  return values;
}

You are close with your view; here it is, modified slightly:
"playerSponsers": {
"map": "function(doc) {
if(doc.type == "team" && doc.players) {
for (i in doc.players) {
emit([doc.players[i], doc.sponsor],1);
}
}
}",
"reduce": "_sum"
}
And here is the query:
http://localhost:5984/test/_design/sports/_view/playerSponsors?group=true&group_level=2
You will get results like this:
{"rows":[
{"key":["Player1","Sponsor1"],"value":1},
{"key":["Player1","Sponsor2"],"value":1},
{"key":["Player1","Sponsor3"],"value":1},
{"key":["Player2","Sponsor1"],"value":1},
{"key":["Player2","Sponsor2"],"value":2},
{"key":["Player2","Sponsor3"],"value":1}
]}
If you want the sponsors for just one player, you can query like this:
http://localhost:5984/test/_design/sports/_view/playerSponsors?group=true&group_level=2&startkey=[%22Player1%22]&endkey=[%22Player1%22,{}]
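If you then need the distinct sponsor names as a plain list, the grouped rows are easy to post-process client-side. A minimal sketch, assuming a standard fetch API and the view above:

// Query the grouped view for one player and collect the sponsor names.
// Grouping on [player, sponsor] has already de-duplicated the pairs.
const url = 'http://localhost:5984/test/_design/sports/_view/playerSponsors'
  + '?group=true&group_level=2'
  + '&startkey=' + encodeURIComponent(JSON.stringify(['Player1']))
  + '&endkey=' + encodeURIComponent(JSON.stringify(['Player1', {}]));

fetch(url)
  .then(res => res.json())
  .then(body => {
    const sponsors = body.rows.map(row => row.key[1]);
    console.log(sponsors); // e.g. ["Sponsor1", "Sponsor2", "Sponsor3"]
  });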


Firebase Query for Products in Categories (database setup and best practice)

TL;DR
Products have multiple categories.
The view should show all subcategories and the products that have the selected category assigned.
How to set up the DB and queries?
I'm learning Vue and Firebase at the moment, coming from a C# and SQL background, and I need some help and advice on the NoSQL side of things.
EDIT: categories and products are two separate collections (at the moment).
I have products, which can have multiple categories (see product 5)
products: {
  prod1-id: {
    name: 'apple-type-A',
    price: 2,
    cats: cat1-id
  },
  prod2-id: {
    name: 'apple-type-B',
    price: 2.5,
    cats: cat1-id
  },
  prod3-id: {
    name: 'banana-type-A',
    price: 1.6,
    cats: cat2-id
  },
  prod4-id: {
    name: 'banana-type-B',
    price: 1.9,
    cats: cat2-id
  },
  prod5-id: {
    name: 'smoothie',
    price: 5,
    cats: [cat2-id, subCat1-id, subCat2-id]
  }
}
Those categories are a tree.
categories: {
  cat1-id: {
    name: 'fruits',
    subCat1-id: {
      name: 'apple'
    },
    subCat2-id: {
      name: 'banana'
    }
  },
  cat2-id: {
    name: 'MySmooth'
  }
}
The customer should see the first tree.
The landing page should only show the first depth of the category tree and every product without a category (maybe add a category called 'no category').
When you click on a category, it should show all the subcategories and products that have this category.
This goes on until the deepest branch.
I tried to sketch my idea:
VIEW
For programming I use Vue with vuex and vuexfire, and Vuetify as the framework.
I have the complete product management setup but I don't know how to query for this view.
My idea was to reuse a <v-card v-for="p of products"> that I already have and that works fine.
But this shows only products, not categories. How do I get the categories into the mix?
QUERY
Working with vuexfire this is quite simple.
bindSoldProducts: firestoreAction(({ bindFirestoreRef }) => bindFirestoreRef('products', productsColl.where('isSold', '==', true)))
But how can I get the categories and products to show side by side?
I've started to get the products together by querying, but have no idea how to put the categories into the mix:
firebase.firestore().collection('products').orderBy('name', 'asc').onSnapshot(snap => {
  const productsArray: any = []
  snap.forEach(doc => {
    const product = doc.data()
    product.id = doc.id
    productsArray.push(product)
  })
  store.commit('setAllProducts', productsArray)
})
Do I need to structure my database differently?
Do I just query the products and use "some js logic/magic" to show the categories? But how would I then get the views?
Are the collection group queries of firebase the right way to go? If yes, how?
Please advise.
Try a for-in over all categories and fetch the matching products for each:
const categories = {
  "cat1-id": []
}
Object.keys(categories).forEach(cat => {
  const res = db.collection('products').where('cats', 'array-contains', cat).get()
  // res is a Promise; resolve it and push the matching products into categories[cat]
})
See: https://firebase.google.com/docs/firestore/query-data/queries
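To make that idea concrete, here is a minimal async sketch; it assumes db is a firebase.firestore() handle and that every product stores its categories in an array field cats:

// Build a { categoryId: [product, ...] } map, one query per category.
async function productsByCategory(categoryIds) {
  const result = {}
  for (const cat of categoryIds) {
    const snap = await db.collection('products').where('cats', 'array-contains', cat).get()
    result[cat] = snap.docs.map(doc => ({ id: doc.id, ...doc.data() }))
  }
  return result
}

// Usage:
productsByCategory(['cat1-id']).then(map => console.log(map['cat1-id']))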
I'm writing my response as an answer as it was too long for a comment.
If I understood your use case correctly, you want to have a DB structure where you have a list of products and their respective categories which both will be displayed simultaneously side by side on your Vue application view.
While I'm not a Vue expert, I can suggest using the following structure for your Cloud Firestore database: Products/{product}/categories/{category}, which translates to collection/{document}/collection/{document}.
In the Cloud Firestore data model you can have a collection within another collection, and this is called a subcollection. For example, you could have Products/fruits/Categories/banana and inside the banana document you could have type, price, or id fields among many others.
Then, you could use Collection group queries to retrieve documents from a collection group instead of from a single collection. A collection group consists of all collections with the same ID. By default, queries retrieve results from a single collection in your database.
For example, you could create a categories collection group by adding a categories subcollection to each product, and then retrieve results from every product's categories subcollection at once; let's say bananas:
const querySnapshot = await db.collectionGroup('categories').where('productName', '==', 'banana').get();
querySnapshot.forEach((doc) => {
  console.log(doc.id, ' => ', doc.data());
  // Banana from Mexico, Banana from Honduras, etc.
});
Before using a collection group query, do not forget to create an index that supports your collection group query. You can create an index through an error message, the console, or the Firebase CLI.
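For completeness, writing a product document together with a document in its categories subcollection might look like the sketch below (document IDs and field names are assumptions, not taken from the question):

// Create a product and one document in its `categories` subcollection.
const productRef = db.collection('Products').doc('fruits')
productRef.set({ name: 'banana-type-A', price: 1.6 })
  .then(() => productRef.collection('categories').doc('banana').set({
    productName: 'banana', // the field the collection group query above filters on
    type: 'fruit'
  }))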

How to do a simple join in GraphQL?

I am very new to GraphQL and trying to do a simple join query. My sample tables look like this:
{
  phones: [
    {
      id: 1,
      brand: 'b1',
      model: 'Galaxy S9 Plus',
      price: 1000,
    },
    {
      id: 2,
      brand: 'b2',
      model: 'OnePlus 6',
      price: 900,
    },
  ],
  brands: [
    {
      id: 'b1',
      name: 'Samsung'
    },
    {
      id: 'b2',
      name: 'OnePlus'
    }
  ]
}
I would like to have a query to return a phone object with its brand name in it instead of the brand code.
E.g. If queried for the phone with id = 2, it should return:
{id: 2, brand: 'OnePlus', model: 'OnePlus 6', price: 900}
TL;DR
Yes, GraphQL does support a sort of pseudo-join. You can see the books and authors example below running in my demo project.
Example
Consider a simple database design for storing info about books:
create table Book ( id string, name string, pageCount string, authorId string );
create table Author ( id string, firstName string, lastName string );
Because we know that an Author can write many Books, the database model puts them in separate tables. Here is the GraphQL schema:
type Query {
  bookById(id: ID): Book
}

type Book {
  id: ID
  title: String
  pageCount: Int
  author: Author
}

type Author {
  id: ID
  firstName: String
  lastName: String
}
Notice there is no authorId field on the Book type, but there is an author field of type Author. The database authorId column on the book table is not exposed to the outside world. It is an internal detail.
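How that mapping happens is up to the server's field resolvers. A minimal graphql-js style sketch (the db helpers are hypothetical, not part of the demo project):

// The Book.author resolver turns the internal authorId into an Author object.
const resolvers = {
  Query: {
    bookById: (root, args, context) => context.db.findBookById(args.id),
  },
  Book: {
    // `book` is the record fetched above; its authorId never leaves the server.
    author: (book, args, context) => context.db.findAuthorById(book.authorId),
  },
};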
We can pull back a book and its author using this GraphQL query:
{
  bookById(id: "book-1") {
    id
    title
    pageCount
    author {
      firstName
      lastName
    }
  }
}
Running it in my demo project, the result nests the Author details:
{
  "data": {
    "bookById": {
      "id": "book-1",
      "title": "Harry Potter and the Philosopher's Stone",
      "pageCount": 223,
      "author": {
        "firstName": "Joanne",
        "lastName": "Rowling"
      }
    }
  }
}
The single GQL query resulted in two separate fetch-by-id calls into the database. When a single logical query turns into multiple physical queries we can quickly run into the infamous N+1 problem.
The N+1 Problem
In our case above a book can only have one author. If we only query one book by ID we only get a "read amplification" against our database of 2x. Imagine if you can query books with a title that starts with a prefix:
type Query {
  booksByTitleStartsWith(titlePrefix: String): [Book]
}
Then we call it asking it to fetch the books with a title starting with "Harry":
{
  booksByTitleStartsWith(titlePrefix: "Harry") {
    id
    title
    pageCount
    author {
      firstName
      lastName
    }
  }
}
In this GQL query we will fetch the books by a database query of title like 'Harry%' to get many books including the authorId of each book. It will then make an individual fetch by ID for every author of every book. This is a total of N+1 queries where the 1 query pulls back N records and we then make N separate fetches to build up the full picture.
The easy fix for that example is to not expose a field author on Book, and to force the person using your API to fetch all the authors in a separate query authorsByIds, so we give them two queries:
type Query {
  booksByTitleStartsWith(titlePrefix: String): [Book]   # <- single database call
  authorsByIds(authorIds: [ID]): [Author]               # <- single database call
}

type Book {
  id: ID
  title: String
  pageCount: Int
}

type Author {
  id: ID
  firstName: String
  lastName: String
}
The key thing to note about that last example is that there is no way in that model to walk from one entity type to another. If the person using your API wants to load the books' authors at the same time, they simply call both queries in a single post:
query {
  booksByTitleStartsWith(titlePrefix: "Harry") {
    id
    title
  }
  authorsByIds(authorIds: ["author-1", "author-2", "author-3"]) {
    id
    firstName
    lastName
  }
}
Here the person writing the query (perhaps using JavaScript in a web browser) sends a single GraphQL post to the server asking for both booksByTitleStartsWith and authorsByIds to be passed back at once. The server can now make two efficient database calls.
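Client-side, stitching the two result lists back together is then a simple in-memory join. A sketch, assuming this split model also exposes an authorId scalar on Book so the client can correlate the lists:

// `data` is the response body of the combined query above.
const authorsById = new Map(data.authorsByIds.map(a => [a.id, a]));
const booksWithAuthors = data.booksByTitleStartsWith.map(book => ({
  ...book,
  author: authorsById.get(book.authorId), // assumes authorId was also queried
}));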
This approach shows that there is "no magic bullet" for how to map the "logical model" to the "physical model" when it comes to performance. This is known as the Object–relational impedance mismatch problem. More on that below.
Is Fetch-By-ID So Bad?
Note that the default behaviour of GraphQL is still very helpful. You can map GraphQL onto anything: onto internal REST APIs, some types onto a relational database and other types onto a NoSQL database. These can be in the same schema and the same GraphQL end-point. There is no reason why you cannot have Author stored in Postgres and Book stored in MongoDB. This is because GraphQL doesn't "join in the datastore" by default; it fetches each type independently and builds the response in memory to send back to the client. It may be the case that your model only joins to small datasets that get very good cache hits, so you can add caching into your system, not have a problem, and benefit from all the advantages of GraphQL.
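One common way to keep the convenience of fetch-by-id while collapsing the physical reads is a batching loader. The answer above does not use it, but as an option, here is a sketch using the dataloader pattern (db.findAuthorsByIds is a hypothetical helper):

const DataLoader = require('dataloader');

// Batches all author ids requested in the same tick into one database call.
const authorLoader = new DataLoader(async (authorIds) => {
  const rows = await db.findAuthorsByIds(authorIds); // one physical query
  const byId = new Map(rows.map(a => [a.id, a]));
  // DataLoader requires results in the same order as the requested keys.
  return authorIds.map(id => byId.get(id));
});

// In the Book.author resolver: author: (book) => authorLoader.load(book.authorId)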
What About ORM?
There is a project called Join Monster which does look at your database schema, looks at the runtime GraphQL query, and tries to generate efficient database joins on-the-fly. That is a form of Object Relational Mapping which sometimes gets a lot of "OrmHate". This is mainly due to Object–relational impedance mismatch problem.
In my experience, any ORM works if you write the database model to exactly support your object API. In my experience, any ORM tends to fail when you have an existing database model that you try to map with an ORM framework.
IMHO, if the data model was optimised without thinking about ORM or queries (for example, to conserve space in classical third normal form), avoid ORM. My recommendation there is to avoid querying the main data model directly and to use the CQRS pattern. See below for an example.
What Is Practical?
If you do want to use pseudo-joins in GraphQL but you hit an N+1 problem, you can write code to map specific "field fetches" onto hand-written database queries. Carefully performance test using realistic data whenever any field returns an array.
Even with hand-written queries you may hit scenarios where those joins don't run fast enough. In that case, consider the CQRS pattern and denormalise some of the data model to allow for fast lookups.
Update: GraphQL Java "Look-Ahead"
In our case we use graphql-java and pure configuration files to map DataFetchers to database queries. There is some generic logic that looks at the graph query being run and calls parameterized SQL queries that are in a custom configuration file. We saw this article Building efficient data fetchers by looking ahead which explains that you can inspect at runtime what the person who wrote the query selected to be returned. We can use that to "look ahead" at what other entities we would be asked to fetch to satisfy the entire query. At that point we can join the data in the database and pull it all back efficiently in a single database call. The graphql-java engine will still make N in-memory fetches to our code, but the N requests to get the author of each book are satisfied by simple lookups in a hashmap that we loaded out of the single database call that joined the author table to the books table, returning N complete rows efficiently.
Our approach might sound a little like ORM, yet we made no attempt to make it intelligent. The developer creating the API and our custom configuration files has to decide which graphql queries will be mapped to which database queries. Our generic logic just "looks ahead" at what the runtime graphql query actually selects in total, to understand all the database columns that it needs to load out of each row returned by the SQL to build the hashmap. Our approach can only handle parent-child-grandchild style trees of data, yet this is a very common use case for us. The developer making the API still needs to keep a careful eye on performance, and to adapt both the API and the custom mapping files to avoid poor performance.
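A rough graphql-js analogue of that look-ahead, for flavour: the engine passes the parsed selection set to every resolver as the info argument, so a resolver can decide up front whether its SQL should join the author table (the db helpers are hypothetical, and fragments are ignored for brevity):

const resolvers = {
  Query: {
    booksByTitleStartsWith(root, args, context, info) {
      // Field names the caller selected on each Book.
      const selected = info.fieldNodes[0].selectionSet.selections
        .map(sel => sel.name.value);
      // If `author` was requested, run the SQL that joins the author table.
      return selected.includes('author')
        ? context.db.findBooksWithAuthors(args.titlePrefix)
        : context.db.findBooks(args.titlePrefix);
    },
  },
};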
GraphQL as a query language on the front-end does not support 'joins' in the classic SQL sense.
Rather, it allows you to pick and choose which fields in a particular model you want to fetch for your component.
To query all phones in your dataset, your query would look like this:
query myComponentQuery {
  phone {
    id
    brand
    model
    price
  }
}
The GraphQL server that your front-end is querying would then have individual field resolvers - telling GraphQL where to fetch id, brand, model etc.
The server-side resolver would look something like this:
Phone: {
  id(root, args, context) {
    // Either query your own database (resolvers may return a Promise)...
    return pg.query('SELECT * FROM phones WHERE name = $1', ['blah']).then(d => { /* doStuff */ })
    // ...OR call an upstream service:
    // return fetch(context.upstream_url + '/thing/' + args.id).then(d => { /* doStuff */ })
  },
  price(root, args, context) {
    return 9001
  },
},

Join in Meteor with publishComposite

I have removed autopublish from my Meteor app. Now I'm publishing my collections manually. I have some related collections. I want to increase performance as much as possible.
If I'm, for instance, looking at a post and want to see all the comments related to this post, I have to query the database with both post: Posts.findOne(postId) AND comments: Comments.find({postId: postId}). I am querying the two collections in the data field with iron-router so they are present in my template, but I'm also subscribing to the publications in waitOn.
Now I have found https://github.com/englue/meteor-publish-composite which lets me publish multiple collections at the same time, but I don't quite understand it. If I'm using Meteor.publishComposite('postAndComments', ...) in server/publish.js, subscribing to postAndComments in waitOn, and setting both post and comments in data as I normally do, will I then have saved a demand on the database? To me it looks like I still query the database the same number of times. Or are the queries done when publishing the only queries actually made, while the queries done in data are only a way to retrieve what has already been queried from the database?
Besides, in example 1, it is shown how to publish top posts with the belonging comments and post/comment authors, but in the template, only the posts are outputted. How can I also output the comments and authors? Have I misunderstood the potential of publishComposite? I understand it as a kind of join.
I used publishComposite successfully. In my example below I'm subscribing to Units matching a filter, as well as the Properties those units belong to.
Meteor.publishComposite('unitsWithProperties', function (filter) {
  filter = filter || {};
  console.log(filter);
  return {
    find: function () {
      // Top-level cursor: all units matching the filter.
      return Units.find(filter);
    },
    children: [
      {
        collectionName: 'properties',
        find: function (unit) {
          // For each unit, also publish the property it belongs to.
          return Properties.find({ _id: unit.propertyId });
        }
      }
    ]
  };
});
In your case I believe you can:
subscribe to Posts matching your TopPosts criteria
subscribe to the Comments and Authors of those posts in the children: array of cursors (see the sketch below)
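A sketch for the posts case, with collection and field names assumed (adjust to your schema):

Meteor.publishComposite('topPostsWithCommentsAndAuthors', function (limit) {
  return {
    find: function () {
      // Top-level cursor: the posts themselves.
      return Posts.find({}, { sort: { score: -1 }, limit: limit });
    },
    children: [
      {
        find: function (post) {
          // All comments belonging to each published post.
          return Comments.find({ postId: post._id });
        },
        children: [
          {
            find: function (comment) {
              // The author of each comment.
              return Meteor.users.find({ _id: comment.authorId }, { fields: { profile: 1 } });
            }
          }
        ]
      },
      {
        find: function (post) {
          // The author of the post itself.
          return Meteor.users.find({ _id: post.authorId }, { fields: { profile: 1 } });
        }
      }
    ]
  };
});

Note that the publication only controls what is pushed into the client's local collections; in your templates you still read the data with ordinary Posts.find(), Comments.find() and Meteor.users.findOne() calls against minimongo, and those local reads cost no extra server-side database queries.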
Hope this helps.
Alex

mongodb: insert embedded documents from other massive collections using map reduce

These files that I will be getting have at least a million rows each, 1.5 billion max. The data is normalized when I get it. I need a way to store it in one document. For the most part I am not 100% sure how the data will be given to me. It could be CSV, fixed-width text file, TSV, or something else.
currently i have some collections that i imported from some sample csv's.
Below is a small representation of my data, with some fields missing.
In my beneficiaries.csv the data is repeated.
beneficiaries.csv, over 6 million records
record # 1
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20000101,
"end_date":20080227}
record # 2
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20080201,
"end_date":00000000}
Same user, different start and end dates.
claims.csv over 200 million records
{"userid":"a9dk4kJkj",
"date":20080514,
"code":"d4rd3",
"blah":"data"}
lab.csv over 10 million records
{"userid":"a9dk4kJkj",
"date":20080514,
"lab":"mri",
"blah":"data"}
From my limited knowledge I have three options:
Sort the files, read x amount into our C++ Member objects from the data files, stop at y, insert the members into mongodb, move on starting at y for the next x members until we are done. This is tested and working, but sorting such massive files will kill our machine for hours.
Load the data into SQL, read one by one into C++ Member objects, bulk load the data into mongo. Tested and works, but I would very much like to avoid this.
Load the documents into mongo in separate collections and perform a map-reduce with an out parameter to write to a collection. I have the documents loaded (as shown above) in their own collections for each file. Unfortunately I am new to mongo and on a deadline. The map-reduce concept is difficult for me to wrap my head around and implement. I have read the docs and tried using this answer on Stack Overflow: MongoDB: Combine data from multiple collections into one..how?
The output member collection should look like this.
{"userid":"aaa4444",
"gender":"female",
"dob":19901225,
"beneficiaries":[{"start_date":20000101,
"end_date":20080227},
{"start_date":20008101,
"end_date":00000000}],
"claims":[{"date":20080514,
"code":"d4rd3",
"blah":"data"},
{"date":20080514,
"code":"d4rd3",
"blah":"data"}],
"labs":[{"date":20080514,
"lab":"mri",
"blah":"data"}]}
Would the performance of loading the data into SQL, reading it in C++, and inserting into mongodb beat the map-reduce? If so, I will stick with that method.
IMHO, your data are good candidates for map-reduce, hence it would be better to go for option 3: load the documents into 3 separate mongo collections: beneficiaries, claims, labs, and perform map-reduce on the userid key on each collection. Finally, integrate the data from the 3 collections into a single collection using find and insert on the userid key.
Assuming you load beneficiaries.csv into the beneficiaries collection, this is sample code for map-reduce on beneficiaries:
mapBeneficiaries = function() {
  var values = {
    start_date: this.start_date,
    end_date: this.end_date,
    userid: this.userid,
    gender: this.gender,
    dob: this.dob
  };
  emit(this.userid, values);
};

reduce = function(k, values) {
  var list = { beneficiaries: [], gender: '', dob: '' };
  values.forEach(function(v) {
    if (v.beneficiaries) {
      // re-reduce step: v is an already-reduced partial result
      list.beneficiaries = list.beneficiaries.concat(v.beneficiaries);
    } else {
      list.beneficiaries.push({ start_date: v.start_date, end_date: v.end_date });
    }
    list.gender = v.gender;
    list.dob = v.dob;
  });
  // note: reduce is never called for a key with a single document,
  // so such users keep the mapped shape; handle that when integrating
  return list;
};

db.beneficiaries.mapReduce(mapBeneficiaries, reduce, { "out": { "reduce": "mr_beneficiaries" } });
The output in mr_beneficiaries will be like this:
{
  "_id" : "a9dk4kJkj",
  "value" : {
    "beneficiaries" : [
      {
        "start_date" : 20000101,
        "end_date" : 20080227
      },
      {
        "start_date" : 20080201,
        "end_date" : 0
      }
    ],
    "gender" : "male",
    "dob" : 20080514
  }
}
Do the same thing to obtain mr_claims and mr_labs. Then integrate into singledocuments:
db.mr_beneficiaries.find().forEach(function(doc) {
  var id = doc._id;
  var claims = db.mr_claims.findOne({ "_id": id });
  var labs = db.mr_labs.findOne({ "_id": id });
  db.singledocuments.insert({
    "userid": id,
    "gender": doc.value.gender,
    "dob": doc.value.dob,
    "beneficiaries": doc.value.beneficiaries,
    "claims": claims.value.claims,
    "labs": labs.value.labs
  });
});
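For reference, the analogous pair for claims could look like the sketch below (field names taken from the sample records). Emitting the same shape that reduce returns keeps the reduce function safe for re-reduce:

mapClaims = function() {
  emit(this.userid, { claims: [{ date: this.date, code: this.code, blah: this.blah }] });
};

reduceClaims = function(k, values) {
  var list = { claims: [] };
  values.forEach(function(v) {
    list.claims = list.claims.concat(v.claims);
  });
  return list;
};

db.claims.mapReduce(mapClaims, reduceClaims, { "out": { "reduce": "mr_claims" } });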

Modelling blogs and ratings in mongodb and nodejs

I have a blogs collection that contains title, body, and the aggregate rating that users have given them. Another collection, 'Ratings', whose schema has references to the blog and the user who rated it (if at all he rates it) in the form of their ObjectIds, plus the rating they have given, i.e. +1 or -1.
When a particular user browses through blogs in 'latest first' order (say 40 of them per page; call them an array of blogs[0] to blogs[39]), I have to retrieve the rating documents related to this particular user and those 40 blogs, if at all the user rated them, and notify him of what ratings he has given those blogs.
I tried to extract all rating documents of a particular user in which the blog reference ObjectIds lie between blogs[0]._id and blogs[39]._id, which returns an empty list in my case. Maybe ObjectIds can't be compared using $lt and $gt queries. In that case how should I go about it? Should I redesign my schemas to fit this scenario?
I am using the mongoosejs driver for this case. Here are the relevant parts of the code, which differ a bit in execution, but you get the idea.
Schemas:
var ObjectId = mongoose.Schema.Types.ObjectId; // the schemas below reference ObjectId

Client = new mongoose.Schema({
  ip: String
})

Rates = new mongoose.Schema({
  client: ObjectId,
  newsid: ObjectId,
  rate: Number
})

News = new mongoose.Schema({
  title: String,
  body: String,
  likes: { type: Number, default: 0 },
  dislikes: { type: Number, default: 0 },
  created: Date,
  // tag: String,
  client: ObjectId,
  tag: String,
  ff: { type: Number, default: 20 }
});
models:
var newsm=mongoose.model('News', News);
var clientm=mongoose.model('Client', Client);
var ratesm=mongoose.model('Rates', Rates);
Logic:
newsm.find({ tag: tag[req.params.tag_id] }, [], { sort: { created: -1 }, limit: buffer + 1 }, function (err, news) {
  ratesm.find({ client: client._id, newsid: { $lte: news[0]._id, $gte: news.slice(-1)[0]._id } }, function (err, ratings) {
  })
})
Edit:
While implementing the schema suggested below, I had to do this query in mongoose.js:
> db.blogposts.findOne()
{
  title: "My First Post", author: "Jane",
  comments: [
    { by: "Abe", text: "First" },
    { by: "Ada", text: "Good post" }
  ]
}
> db.blogposts.find( { "comments.by" : "Ada" } )
How do I do this query in mongoose?
A good practice with MongoDB (and other non-relational data stores) is to model your data so it is easy to use/query in your application. In your case, you might consider denormalizing the structure a bit and store the rating right in the blog collection, so a blog might look something like this:
{
  title: "My New Post",
  body: "Here's my new post. It is great. ...",
  likes: 20,
  dislikes: 5,
  ...
  rates: [
    { client_id: (id of client), rate: 5 },
    { client_id: (id of another client), rate: 3 },
    { client_id: (id of a third client), rate: 10 }
  ]
}
The idea being that the objects in the rates array contains all the data you'll need to display the blog entry, complete with ratings, right in the single document. If you also need to query the rates in another way (e.g. find all the ratings made by user X), and the site is read-heavy, you may consider also storing the data in a Rates collection as you're doing now. Sure, the data is in two places, and it's harder to update, but it may be an overall win after you analyze your app and how it accesses your data.
Note that you can apply indexes deep into a document's structure, so for example you can index News.rates.client_id, and then you can quickly find any documents in the News collection that a particular user has rated.
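For example, a minimal mongoose sketch against the denormalized structure above (schema and variable names assumed):

// Denormalized schema: ratings embedded in the News documents.
var NewsSchema = new mongoose.Schema({
  title: String,
  body: String,
  likes: { type: Number, default: 0 },
  dislikes: { type: Number, default: 0 },
  rates: [{ client_id: mongoose.Schema.Types.ObjectId, rate: Number }]
});

// Index deep into the embedded array so the lookup below stays fast.
NewsSchema.index({ 'rates.client_id': 1 });

var News = mongoose.model('News', NewsSchema);

// All news documents a particular client has rated:
News.find({ 'rates.client_id': clientId }, function (err, docs) { /* ... */ });

// The "comments.by" query from the edit translates directly:
// BlogPost.find({ 'comments.by': 'Ada' }, callback);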
