Join in Meteor with publishComposite - javascript

I have removed autopublish from my Meteor app. Now I'm publishing my collections manually. I have some related collections. I want to increase performance as much as possible.
If I'm, for instance, looking at a post and want to see all the comments related to this post, I have to query the database with both post: Posts.findOne(postId) AND comments: Comments.find({postId: postId}). I am querying the two collections in the data field with iron-router so they are present in my template, but I'm also subscribing to the publications in waitOn. Now I have found https://github.com/englue/meteor-publish-composite which lets me publish multiple collections at the same time. But I don't quite understand it. If I'm using Meteor.publishComposite('postAndComments', ...) in server/publish.js, subscribing to postAndComments in waitOn, and setting both post and comments in data as I normally do, will I then have reduced the load on the database? To me it looks like I still query the database the same number of times. Or are the queries done when publishing the only queries made against the database, while the queries done in data are only a way to retrieve what has already been fetched?
Besides, in example 1, it is shown how to publish top posts with the belonging comments and post/comment authors, but in the template, only the posts are output. How can I also output the comments and authors? Have I misunderstood the potential of publishComposite? I understand it as a kind of join.

I used publishComposite successfully. In my example below I'm subscribing to the Units matching a filter as well as the Properties those units belong to.
Meteor.publishComposite('unitsWithProperties', function (filter) {
  filter = filter || {};
  console.log(filter);
  return {
    find: function () {
      return Units.find(filter);
    },
    children: [
      {
        collectionName: 'properties',
        find: function (unit) {
          return Properties.find({ _id: unit.propertyId });
        }
      }
    ]
  };
});
In your case I believe you can:
subscribe to the Posts matching the TopPosts criteria in the top-level find
subscribe to the Comments and Authors of those posts through the children: array, where each child's find returns a cursor
Hope this helps.
Alex
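To make the parent/children idea concrete, here is a toy model in plain JavaScript of how a composite description is walked. It is only a sketch under assumptions: the arrays stand in for Mongo collections, collectPublished is a made-up helper, and none of this is the package's actual implementation. The point it illustrates is that the server runs one find per node of the tree, while Posts.findOne() or Comments.find() in the iron-router data function run on the client against the local cache and do not hit the database again.

```javascript
// Toy in-memory "collections"; in Meteor these would be Mongo.Collection instances.
const posts = [{ _id: 'p1', title: 'Hello', authorId: 'u1' }];
const comments = [
  { _id: 'c1', postId: 'p1', authorId: 'u2', text: 'Nice' },
  { _id: 'c2', postId: 'p2', authorId: 'u1', text: 'Unrelated' },
];
const users = [{ _id: 'u1', name: 'Ann' }, { _id: 'u2', name: 'Bob' }];

// A description shaped like the object passed to publishComposite:
// a top-level find plus children whose find receives the parent document.
const postWithComments = {
  collectionName: 'posts',
  find: () => posts.filter(p => p._id === 'p1'),
  children: [
    { collectionName: 'users', find: post => users.filter(u => u._id === post.authorId) },
    {
      collectionName: 'comments',
      find: post => comments.filter(c => c.postId === post._id),
      children: [
        { collectionName: 'users', find: comment => users.filter(u => u._id === comment.authorId) },
      ],
    },
  ],
};

// Hypothetical helper: walk the tree and collect, per collection,
// the documents a subscribed client would end up with.
function collectPublished(node, parent, out = {}) {
  for (const doc of node.find(parent)) {
    const bucket = (out[node.collectionName] = out[node.collectionName] || new Map());
    bucket.set(doc._id, doc);
    for (const child of node.children || []) collectPublished(child, doc, out);
  }
  return out;
}

const published = collectPublished(postWithComments);
console.log([...published.comments.keys()]); // only the comments of post p1
```

In a template you would then read the comments and authors with ordinary client-side queries such as Comments.find({postId: postId}), because the subscription has already delivered them.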

Related

Pass query criteria to mongoDB aggregation

Our current setup is: SPA frontend, Azure Functions with mongoose middleware, MongoDB.
(Maybe read the question marked *** first.)
Since we have a lot of documents in our DB and our customer wants to query them, we are facing the following problem:
The user is assigned to his organization. He wants to search for Doc1s he has not responded to.
Doc1
{
_id
organization -> partitionKey
content
}
He can respond by creating a Doc2 that references the Doc1.
Doc2
{
_id
organization -> partitionKey
Doc1ref
content
}
We have a 1:n relationship.
At the moment we filter only by query criteria on doc1, with limit and skip options.
But the new requirement is to filter in the same way on the referring doc2s.
I was thinking of:
Doing it in my code => Problem: after reading with limit=100 and filtering in my code, the result no longer contains 100 documents.
Extending doc1 with arrays of doc2s => Must be the last option.
Dynamic aggregation, prepared in the code and executed at runtime => I don't want to use dynamic aggregations, and the benefits of mongoose are almost lost.
Create a MongoDB view with a lookup aggregation (populating doc1 with doc1.respondedOrganizations) => The problem I see here is performance: it searches a lot of documents and then joins them on a field that is not the partitionKey.
*** So, I come to my question:
Is it possible to pass a virtual (non-existent) query criterion...
doc1.find({ alreadyResponded : my.organization } )
...and use it as an input variable in an aggregation?
{
  $lookup: {
    from: "Doc2s",
    localField: "_id",
    foreignField: "Doc1ref",
    as: < output array field >,
    pipeline: [{
      $match: {
        $expr: { $eq: ["$organization", "$$alreadyResponded"] }
      }
    }]
  }
}
That would improve query performance considerably.
Thanks

Firebase Realtime Database - Best practice for additional data nodes (lookups and initial load)

I have a Firebase Realtime Database with the below structure.
I wish to fetch all "notes" a user has access to and, initially, only show titles for the notes.
notes: {
  "noteId-1345" : {
    "access" : {
      "author": "1234567890",
      "members": {
        "1234567890": 0, <--- Author
        "0987654321": 1 <--- Member
      }
    },
    "data" : {
      "title": "Hello",
      "content": "Konichiwa!",
      "comment": "123"
    }
  }
}
To be able to fetch all notes a user has access to I have expanded my data model by keeping an additional user_notes node in the root:
Whenever I associate a user (update of members) with a note, I write that relation both in /notes/$noteid and in /user_notes/$uid.
user_notes: {
  "$uid": {
    "noteId-1345": true
  }
}
When fetching initial data I only need the "notes" the user has access to - including titles.
(I only fetch the entire "note" if a user wants to view a complete "note")
I begin by fetching the ids for notes the user has access to and then I have to do another lookup to fetch the titles.
let titles = []
database.ref(`user_notes/${uid}`)
  .on('value', (snaps) => {
    snaps.forEach((snap) => {
      const noteId = snap.key
      database.ref(`notes/${noteId}/data/title`).on('value', (noteSnap) => {
        const title = noteSnap.val()
        titles.push(title)
      })
    })
  })
Is this the most efficient approach? - It seems inefficient to do double lookups.
Should I store title, and other data needed for initial load, in the user_notes node as well to avoid double lookups?
What is considered to be best practice in cases like this when using a NoSQL database?
Kind regards /K
What you're doing is indeed the common approach. It is not nearly as slow as you may initially think, since Firebase pipelines the requests over a single connection.
A few things to consider:
I'd typically move the members for each note under a top-level node note_members. Separating the types of data typically makes it much easier to keep your security rules reasonable, and is documented under "keep your data structure flat".
If you'd like to get rid of the lookup, you can consider storing the title of each note under each user_notes node where you have the ID. You'd essentially replace the true with the name:
user_notes: {
  "$uid": {
    "noteId-1345": "Hello"
  }
}
This simplifies the lookup code (its main advantage) and makes it a bit faster.
This sort of data duplication is quite common when using Firebase and other NoSQL databases: you trade write complexity and extra data storage for read simplicity and scalability.
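The write side of that trade-off can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the two builder functions are hypothetical helpers, the paths follow the structure from the question, and the title stored under user_notes is the duplication described above. Each function only builds a map of path to value, so every copy is written in one multi-location update.

```javascript
// Build one multi-path update that keeps /notes and /user_notes in sync
// when a user is given access to a note. (Hypothetical helper.)
function buildShareNoteUpdate(noteId, uid, role, title) {
  return {
    [`notes/${noteId}/access/members/${uid}`]: role,
    [`user_notes/${uid}/${noteId}`]: title,
  };
}

// Renaming a note must also touch every duplicated title. (Hypothetical helper.)
function buildRenameNoteUpdate(noteId, memberUids, newTitle) {
  const updates = { [`notes/${noteId}/data/title`]: newTitle };
  for (const uid of memberUids) {
    updates[`user_notes/${uid}/${noteId}`] = newTitle;
  }
  return updates;
}

// With the Firebase JS SDK used in the question, the map would be applied with
//   database.ref().update(buildShareNoteUpdate('noteId-1345', '0987654321', 1, 'Hello'))
console.log(buildRenameNoteUpdate('noteId-1345', ['1234567890', '0987654321'], 'Hi'));
```

The rename case shows the cost of the duplication: one logical change becomes one write per member plus the original.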

How to do a simple join in GraphQL?

I am very new to GraphQL and am trying to do a simple join query. My sample tables look like this:
{
  phones: [
    {
      id: 1,
      brand: 'b1',
      model: 'Galaxy S9 Plus',
      price: 1000,
    },
    {
      id: 2,
      brand: 'b2',
      model: 'OnePlus 6',
      price: 900,
    },
  ],
  brands: [
    {
      id: 'b1',
      name: 'Samsung'
    },
    {
      id: 'b2',
      name: 'OnePlus'
    }
  ]
}
I would like to have a query to return a phone object with its brand name in it instead of the brand code.
E.g. If queried for the phone with id = 2, it should return:
{id: 2, brand: 'OnePlus', model: 'OnePlus 6', price: 900}
TL;DR
Yes, GraphQL does support a sort of pseudo-join. You can see the books and authors example below running in my demo project.
Example
Consider a simple database design for storing info about books:
create table Book ( id string, name string, pageCount string, authorId string );
create table Author ( id string, firstName string, lastName string );
Because we know that an Author can write many Books, that database model puts them in separate tables. Here is the GraphQL schema:
type Query {
  bookById(id: ID): Book
}
type Book {
  id: ID
  title: String
  pageCount: Int
  author: Author
}
type Author {
  id: ID
  firstName: String
  lastName: String
}
Notice there is no authorId on the Book type but a type Author. The database authorId column on the book table is not exposed to the outside world. It is an internal detail.
We can pull back a book and its author using this GraphQL query:
{
  bookById(id:"book-1"){
    id
    title
    pageCount
    author {
      firstName
      lastName
    }
  }
}
The result nests the Author details:
{
  "data": {
    "bookById": {
      "id": "book-1",
      "title": "Harry Potter and the Philosopher's Stone",
      "pageCount": 223,
      "author": {
        "firstName": "Joanne",
        "lastName": "Rowling"
      }
    }
  }
}
The single GQL query resulted in two separate fetch-by-id calls into the database. When a single logical query turns into multiple physical queries we can quickly run into the infamous N+1 problem.
The N+1 Problem
In our case above a book can only have one author. If we only query one book by ID we only get a "read amplification" against our database of 2x. Imagine that you can query books with a title that starts with a prefix:
type Query {
  booksByTitleStartsWith(titlePrefix: String): [Book]
}
Then we call it asking it to fetch the books with a title starting with "Harry":
{
  booksByTitleStartsWith(titlePrefix:"Harry"){
    id
    title
    pageCount
    author {
      firstName
      lastName
    }
  }
}
In this GQL query we will fetch the books by a database query of title like 'Harry%' to get many books including the authorId of each book. It will then make an individual fetch by ID for every author of every book. This is a total of N+1 queries where the 1 query pulls back N records and we then make N separate fetches to build up the full picture.
The easy fix for that example is to not expose a field author on Book and force the person using your API to fetch all the authors in a separate query authorsByIds so we give them two queries:
type Query {
  booksByTitleStartsWith(titlePrefix: String): [Book] # <- single database call
  authorsByIds(authorIds: [ID]): [Author] # <- single database call
}
type Book {
  id: ID
  title: String
  pageCount: Int
}
type Author {
  id: ID
  firstName: String
  lastName: String
}
The key thing to note about that last example is that there is no way in that model to walk from one entity type to another. If the person using your API wants to load the books' authors at the same time, they simply call both queries in a single post:
query {
  booksByTitleStartsWith(titlePrefix: "Harry") {
    id
    title
  }
  authorsByIds(authorIds: ["author-1","author-2","author-3"]) {
    id
    firstName
    lastName
  }
}
Here the person writing the query (perhaps using JavaScript in a web browser) sends a single GraphQL post to the server asking for both booksByTitleStartsWith and authorsByIds to be passed back at once. The server can now make two efficient database calls.
This approach shows that there is "no magic bullet" for how to map the "logical model" to the "physical model" when it comes to performance. This is known as the Object–relational impedance mismatch problem. More on that below.
Is Fetch-By-ID So Bad?
Note that the default behaviour of GraphQL is still very helpful. You can map GraphQL onto anything. You can map it onto internal REST APIs. You can map some types into a relational database and other types into a NoSQL database. These can be in the same schema and the same GraphQL end-point. There is no reason why you cannot have Author stored in Postgres and Book stored in MongoDB. This is because GraphQL doesn't by default "join in the datastore"; it fetches each type independently and builds the response in memory to send back to the client. It may be the case that you can use a model that only joins to a small dataset that gets very good cache hits. You can then add caching into your system, avoid the problem, and benefit from all the advantages of GraphQL.
What About ORM?
There is a project called Join Monster which looks at your database schema, looks at the runtime GraphQL query, and tries to generate efficient database joins on the fly. That is a form of Object Relational Mapping, which sometimes gets a lot of "OrmHate". This is mainly due to the Object–relational impedance mismatch problem.
In my experience, any ORM works if you write the database model to exactly support your object API. In my experience, any ORM tends to fail when you have an existing database model that you try to map with an ORM framework.
IMHO, if the data model is optimised without thinking about ORM or queries (for example, to conserve space in classical third normal form), then avoid ORM. My recommendation there is to avoid querying the main data model and use the CQRS pattern. See below for an example.
What Is Practical?
If you do want to use pseudo-joins in GraphQL but you hit an N+1 problem, you can write code to map specific "field fetches" onto hand-written database queries. Carefully performance test using realistic data whenever any fields return an array.
Even when you can put in hand-written queries you may hit scenarios where those joins don't run fast enough. In that case consider the CQRS pattern and denormalise some of the data model to allow for fast lookups.
Update: GraphQL Java "Look-Ahead"
In our case we use graphql-java and use pure configuration files to map DataFetchers to database queries. There is some generic logic that looks at the graph query being run and calls parameterized SQL queries that are in a custom configuration file. We saw the article "Building efficient data fetchers by looking ahead", which explains that you can inspect at runtime what the person who wrote the query selected to be returned. We can use that to "look ahead" at what other entities we would be asked to fetch to satisfy the entire query. At that point we can join the data in the database and pull it all back efficiently in a single database call. The graphql-java engine will still make N in-memory fetches to our code. The N requests to get the author of each book are satisfied by simple lookups in a hashmap that we loaded from the single database call, which joined the author table to the books table and returned N complete rows efficiently.
Our approach might sound a little like ORM yet we did not make any attempt to make it intelligent. The developer creating the API and our custom configuration files has to decide which graphql queries will be mapped to what database queries. Our generic logic just "looks-ahead" at what the runtime graphql query actually selects in total to understand all the database columns that it needs to load out of each row returned by the SQL to build the hashmap. Our approach can only handle parent-child-grandchild style trees of data. Yet this is a very common use case for us. The developer making the API still needs to keep a careful eye on performance. They need to adapt both the API and the custom mapping files to avoid poor performance.
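The difference between the naive fetch-by-id behaviour and the look-ahead approach can be sketched in plain JavaScript. This is an illustration under assumptions, not our graphql-java code: the arrays stand in for the Book and Author tables, the second and third books and the second author are invented sample rows, and dbCalls simply counts simulated round trips.

```javascript
// Hypothetical in-memory tables standing in for the Book and Author tables.
const bookTable = [
  { id: 'book-1', title: "Harry Potter and the Philosopher's Stone", authorId: 'author-1' },
  { id: 'book-2', title: 'Harry Example Two', authorId: 'author-1' },
  { id: 'book-3', title: 'Harry Example Three', authorId: 'author-2' },
];
const authorTable = [
  { id: 'author-1', firstName: 'Joanne', lastName: 'Rowling' },
  { id: 'author-2', firstName: 'Example', lastName: 'Author' },
];

let dbCalls = 0; // counts simulated database round trips

// Naive resolvers: one query for the books, then one fetch-by-id per book.
function naiveBooksWithAuthors(prefix) {
  dbCalls += 1;
  const books = bookTable.filter(b => b.title.startsWith(prefix));
  return books.map(book => {
    dbCalls += 1; // separate author lookup for every single book
    const author = authorTable.find(a => a.id === book.authorId);
    return { title: book.title, author };
  });
}

// Look-ahead: the query is known to select author, so "join" once
// and serve the per-book author fetches from a map.
function lookAheadBooksWithAuthors(prefix, selectsAuthor) {
  dbCalls += 1; // a single joined query returns books together with authors
  const books = bookTable.filter(b => b.title.startsWith(prefix));
  const authorsById = new Map();
  if (selectsAuthor) {
    for (const book of books) {
      authorsById.set(book.authorId, authorTable.find(a => a.id === book.authorId));
    }
  }
  // Each author "fetch" is now an in-memory lookup, not a database call.
  return books.map(book => ({ title: book.title, author: authorsById.get(book.authorId) }));
}

function countCalls(fn) {
  dbCalls = 0;
  fn();
  return dbCalls;
}

console.log(countCalls(() => naiveBooksWithAuthors('Harry')));           // 4 (1 + N, N = 3)
console.log(countCalls(() => lookAheadBooksWithAuthors('Harry', true))); // 1
```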
GraphQL as a query language on the front-end does not support 'joins' in the classic SQL sense.
Rather, it allows you to pick and choose which fields in a particular model you want to fetch for your component.
To query all phones in your dataset, your query would look like this:
query myComponentQuery {
  phone {
    id
    brand
    model
    price
  }
}
The GraphQL server that your front-end is querying would then have individual field resolvers - telling GraphQL where to fetch id, brand, model etc.
The server-side resolver would look something like this:
Phone: {
  id(root, args, context) {
    // Assuming pg is a node-postgres client, which uses $1-style placeholders.
    // Return the promise so that GraphQL waits for the result.
    return pg.query('SELECT * FROM Phones WHERE name = $1', ['blah']).then(d => {/*doStuff*/})
    // OR
    // return fetch(context.upstream_url + '/thing/' + args.id).then(d => {/*doStuff*/})
  },
  price(root, args, context) {
    return 9001
  },
},
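Applied to the phones and brands data from the question, a field resolver for brand could look roughly like this. It is a sketch under assumptions: db is an in-memory stand-in for the real datastore, the resolver map follows the common (parent, args) signature, and resolvePhone is a hypothetical helper that mimics by hand what a GraphQL executor would do for a query selecting id, brand, model and price.

```javascript
// In-memory stand-in for the tables shown in the question.
const db = {
  phones: [
    { id: 1, brand: 'b1', model: 'Galaxy S9 Plus', price: 1000 },
    { id: 2, brand: 'b2', model: 'OnePlus 6', price: 900 },
  ],
  brands: [
    { id: 'b1', name: 'Samsung' },
    { id: 'b2', name: 'OnePlus' },
  ],
};

const resolvers = {
  Query: {
    // Fetch the raw phone row by id.
    phone: (root, args) => db.phones.find(p => p.id === args.id),
  },
  Phone: {
    // parent is the phone row; swap the brand code for the brand's name.
    brand: (parent) => {
      const brand = db.brands.find(b => b.id === parent.brand);
      return brand ? brand.name : null;
    },
  },
};

// Hypothetical helper: resolve the parent first, then each selected field.
function resolvePhone(id) {
  const row = resolvers.Query.phone(null, { id });
  if (!row) return null;
  return {
    id: row.id,
    brand: resolvers.Phone.brand(row),
    model: row.model,
    price: row.price,
  };
}

console.log(resolvePhone(2)); // { id: 2, brand: 'OnePlus', model: 'OnePlus 6', price: 900 }
```

The "join" therefore lives in the brand resolver; fields without a custom resolver are read straight from the parent row.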

Using Meteor publish-with-relations package where each join cannot use the _id field

I am working to solve a problem not dissimilar to the one discussed in the following blog post: I wish to publish two related data sets in Meteor, with a 'reactive join' on the server side.
https://www.discovermeteor.com/blog/reactive-joins-in-meteor/
Unfortunately for me, however, the related collection I wish to join to will not be joined using the "_id" field, but using another field. Normally in mongo and meteor I would create a 'filter' block where I could specify this query. However, as far as I can tell, the PWR package implicitly assumes a join on '_id'.
If you review the example given on the 'publish-with-relations' github page (see below) you can see that both posts and comments are being joined to the Meteor.users '_id' field. But what if we needed to join to the Meteor.users 'address' field?
https://github.com/svasva/meteor-publish-with-relations
In the short term I have specified my query 'upside down' (as luckily I'm able to use the _id field when doing a reverse join), but I suspect this will result in an inefficient query as the datasets grow, so I would rather be able to do the join in the direction planned.
The two collections we are joining can be thought of as a conversation topic/header record and a conversation message collection (i.e. one entry in the collection for each message in the conversation).
The conversation topic in my solution uses the _id field to join; the conversation messages have a "conversationKey" field to join with.
The following call works, but it queries from the messages to the conversation instead of vice versa, which would be more natural.
Meteor.publishWithRelations({
  handle: this,
  collection: conversationMessages,
  filter: { "conversationKey" : requestedKey },
  options : {sort: {msgTime: -1}},
  mappings: [{
    //reverse: true,
    key: 'conversationKey',
    collection: conversationTopics,
    filter: { startTime: { $gt : (new Date().getTime() - aLongTimeAgo ) } },
    options: {
      sort: { createdAt: -1 }
    },
  }]
});
Can you do a join without an _id?
No, not with PWR. Joining with a foreign key which is the id in another table/collection is nearly always how relational data is queried. PWR is making that assumption to reduce the complexity of an already tricky implementation.
How can this publish be improved?
You don't actually need a reactive join here because one query does not depend on the result of another. It would if each conversation topic held an array of conversation message ids. Because both collections can be queried independently, you can return an array of cursors instead:
Meteor.publish('conversations', function(requestedKey) {
  check(requestedKey, String);
  var aLongTimeAgo = 864000000; // 10 days in milliseconds
  return [
    conversationMessages.find({conversationKey: requestedKey}),
    // find() takes a selector, so the time filter goes into the selector itself
    conversationTopics.find({
      _id: requestedKey,
      startTime: {$gt: new Date().getTime() - aLongTimeAgo}
    })
  ];
});
Notes
Sorting in your publish function isn't useful unless you are using a limit.
Be sure to use a forked version of PWR like this one which includes Tom's memory leak fix.
Instead of conversationKey I would call it conversationTopicId to be more clear.
I think this could now be solved much more easily with the reactive-publish package (I am one of the authors). You can now make any query inside an autorun and then use its results to publish the query you want to push to the client. I would write some example code for you, but I do not really understand what exactly you need. For example, you mention you would like to limit topics, but you do not explain why they would be limited if you are providing requestedKey, which is the ID of a document anyway. So only one result is available?

Backbone Sub-Collections & Resources

I'm trying to figure out a Collection/Model system that can handle retrieving
data given the context it's asked from, for example:
Available "root" resources:
/api/accounts
/api/datacenters
/api/networks
/api/servers
/api/volumes
Available "sub" resources:
/api/accounts/:id
/api/accounts/:id/datacenters
/api/accounts/:id/datacenters/:id/networks
/api/accounts/:id/datacenters/:id/networks/:id/servers
/api/accounts/:id/datacenters/:id/networks/:id/servers/:id/volumes
/api/accounts/:id/networks
/api/accounts/:id/networks/:id/servers
/api/accounts/:id/networks/:id/servers/:id/volumes
/api/accounts/:id/servers
/api/accounts/:id/servers/:id/volumes
/api/accounts/:id/volumes
Then, given the Collection/Model system, I would be able to do things like:
// get the first account
var account = AccountCollection.fetch().first()
// get only the datacenters associated to that account
account.get('datacenters')
// get only the servers associated to the first datacenter's first network
account.get('datacenters').first().get('networks').first().get('servers')
Not sure if that makes sense, so let me know if I need to clarify anything.
The biggest kicker as to why I want to be able to do this is that if the
request being made (i.e. account.get('datacenters').first().get('networks'))
hasn't been made yet (the networks of that datacenter aren't loaded on the client),
it is made then (or can be fetch()d, perhaps?)
Any help you can give would be appreciated!
You can pass options to fetch that will be translated to querystring params.
For example:
// get the first account
var account = AccountCollection.fetch({data: {pagesize: 1, sort: "date_desc"}});
Would translate to:
/api/accounts?pagesize=1&sort=date_desc
It is not quite a fluent DSL but it is expressive and efficient since it only transmits the objects requested rather than filtering post fetch.
Edit:
You can lazy load your sub collections and use the same fetch params technique to filter down your list by query string criteria:
var Account = Backbone.Model.extend({
  initialize: function() {
    // Create the sub-collection up front; it stays empty until fetch() is called.
    this.datacenters = new Datacenters;
    this.datacenters.url = "/api/accounts/" + this.id + '/datacenters';
  }
});
Then from an account instance:
account.datacenters.fetch({data: {...}});
Backbone docs on fetching nested models and collections
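The "load it only if it hasn't been loaded yet" behaviour from the question can be sketched independently of Backbone. LazyResource below is a hypothetical helper, not a Backbone class: it builds nested URLs like the ones listed in the question, caches child resources, and calls the injected fetcher only on the first load(). In real use the fetcher would return a promise (for example from fetch or a collection's fetch); here a synchronous stub records the requested URLs.

```javascript
// Hypothetical helper, not part of Backbone: a resource that loads itself on
// first access and hands out cached child resources.
class LazyResource {
  constructor(url, fetcher) {
    this.url = url;
    this.fetcher = fetcher; // e.g. url => fetch(url).then(r => r.json())
    this.children = {};
    this.loaded = false;
    this.result = undefined;
  }

  // Call the fetcher on the first call only; later calls reuse the stored result.
  load() {
    if (!this.loaded) {
      this.result = this.fetcher(this.url);
      this.loaded = true;
    }
    return this.result;
  }

  // accounts.sub(7, 'datacenters') -> resource for /api/accounts/7/datacenters
  sub(id, name) {
    const key = id + '/' + name;
    if (!this.children[key]) {
      this.children[key] = new LazyResource(this.url + '/' + key, this.fetcher);
    }
    return this.children[key];
  }
}

// Stub fetcher that records which URLs were requested.
const requested = [];
const accounts = new LazyResource('/api/accounts', url => {
  requested.push(url);
  return [];
});

const networks = accounts.sub(1, 'datacenters').sub(2, 'networks');
networks.load();
networks.load(); // second call does not hit the fetcher again
console.log(requested); // [ '/api/accounts/1/datacenters/2/networks' ]
```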
