I have a question about a theoretical situation and how the Firestore JS SDK handles it.
The setup is:
We have offline persistence enabled.
We're offline.
We need to get() from collection A immediately after coming online.
I'm gonna exaggerate the numbers to make the situation more easily graspable.
Steps
While offline, we add 1000000 documents to collection A.
We come back online, and the assumption is that Firestore starts synching the local data from Collection A to the server, which will take a while.
We do a get() from collection A, while Firestore might not yet have finished synching.
What happens? The assumption here is that, since Firestore has detected we're online again, it tries to get the documents from collection A that are found in the online DB, and thus might miss some of the documents that are still being synchronized from step 2.
Can a Firebase engineer clarify what would happen in this scenario?
A local client will always see its own changes. So even while you're offline, it will see the changes you've made locally in the collection. When you're back online, it will see the changes it's made locally too, regardless of whether those have been synchronized to the server yet.
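To make that concrete, here is a toy model of the idea. It is not the Firestore SDK and does not use any of its APIs; the class and method names are made up for illustration. The point it shows is that a read is answered from the acknowledged server state with the still-pending local writes layered on top, so nothing written locally goes missing while the sync is in progress.

```javascript
// Toy model only: this is NOT the Firestore SDK, just the idea behind it.
class LocalView {
  constructor() {
    this.serverDocs = new Map();    // documents the server has acknowledged
    this.pendingWrites = new Map(); // local writes not yet acknowledged
  }

  // A local write is queued and becomes visible to reads immediately.
  add(id, data) {
    this.pendingWrites.set(id, data);
  }

  // Simulates the server acknowledging one queued write.
  acknowledge(id) {
    if (!this.pendingWrites.has(id)) return;
    this.serverDocs.set(id, this.pendingWrites.get(id));
    this.pendingWrites.delete(id);
  }

  // A read overlays pending writes on top of the server state.
  get() {
    return new Map([...this.serverDocs, ...this.pendingWrites]);
  }
}

const view = new LocalView();
view.add('doc1', { n: 1 });
view.add('doc2', { n: 2 });
view.acknowledge('doc1'); // only doc1 has reached the server so far
console.log(view.get().size); // 2
```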
Related
Long story short, I have been developing a Discord Bot that requires a query to the database every time a message is sent in a server. It will then perform an action depending on the message etc. The query is asynchronous, therefore it will not block another message from being handled.
However in terms of scalability, I do not believe querying a database every time a message is sent is very speedy and could become a problem. Is there a better solution? I am unaware of a way to store data within a particular discord server, which would likely solve my issue.
My main idea is to keep data in memory (on the heap): the data for the most recently active servers (i.e. those that sent messages recently) is loaded into memory, and removed again once they become inactive. Is this a good solution? Or is it better to just keep querying every time?
You could create a cache and every time you fetch or insert something into your database you can write this into the cache.
Then, if you need some data you can check if it's in the cache and if not, get it from the database and store it in the cache right after.
This prevents unnecessary access to the database because the database is only accessed if your bot does not have the required data stored locally.
Note:
The cache will only be cleared when you restart the bot. But of course, you can also clear it after a certain amount of time or by other triggers.
If you need an example, you can take a look at my guildMemberAdd event and the corresponding config command
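A minimal sketch of the cache described above, assuming a hypothetical loadFromDb function that stands in for your real database query. GuildCache, the method names and the five-minute expiry are all made up for illustration; the expiry is one way to cover the "clear it after a certain amount of time" case.

```javascript
// Hypothetical read-through cache; `loadFromDb` stands in for the real query.
class GuildCache {
  constructor(loadFromDb, ttlMs = 5 * 60 * 1000) {
    this.loadFromDb = loadFromDb;
    this.ttlMs = ttlMs;
    this.entries = new Map(); // guildId -> { value, expiresAt }
  }

  async get(guildId, now = Date.now()) {
    const hit = this.entries.get(guildId);
    if (hit && hit.expiresAt > now) return hit.value; // served from memory
    const value = await this.loadFromDb(guildId);     // cache miss: query once
    this.set(guildId, value, now);
    return value;
  }

  // Call this after every insert or update so the cache holds what was written.
  set(guildId, value, now = Date.now()) {
    this.entries.set(guildId, { value, expiresAt: now + this.ttlMs });
  }
}
```

In a message handler you would then call `await cache.get(guildId)` instead of querying directly, so an active server hits the database only once per expiry window.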
The app
I have a web app that currently uses AppCache for offline functionality since users of the system need to create documents offline. The document is first created offline and when internet access is available, the user can click "sync" which will send the document to the server and save it as a revision. To be more specific, the app does not save the change delta as a revision (the exact field modified) but rather the whole document in its entirety. So in other words, a "snapshot" document is saved.
The problem
Users can log in from different browsers and devices and work on their documents. When they click "sync", if the server's document is newer, the client's entire version will be overwritten by the server's. This leads to one main issue that is depicted in the image below.
The scenario above occurs because of the current implementation which does not rely on deltas (small changes) and rather relies on snapshot revisions.
Some questions
1) My research indicates that I should be upgrading the "sync" mechanism to be expressed in deltas (small changes that can be applied independently). Is this a sound approach?
2) Should each delta be applied independently?
3) According to my research, revision deltas have a numeric value and not a timestamp. What should the value for this be exactly? How would I ensure both the server and the client agree on what the revision number should be?
Stack information
Angular on the frontend
IndexedDB to save documents locally (offline mode)
Postgres DB with JSONB in the backend
What you're describing is a version control issue like in this question. How to resolve it is your choice. Here are a few examples of other products with this problem:
Google docs: A makes edit offline, B makes edit online, A goes online, Sync, Google Docs combines A and B's edits
Apple notes: Same as Google Docs
Git/Subversion: Throw an error, ask user to resolve conflicts
Wunderlist: Last edit overwrites previous
For your case, the simplest solution is to use Wunderlist's approach, but it seems that may cause a usability issue. What do your users expect to happen?
Answering your questions directly:
A custom sync implementation is necessary if you don't want overwrites.
This is a usability decision: what does the user expect?
True, revisions are numeric (e.g. r1, r2). To get the client and server to agree, alter the return value of the sync request. You can return the entire model to the client each time (or just a 200 OK if a normal sync happened). If a model is returned to the client, update the client with the latest model.
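One possible reading of that last point, as a rough sketch: the server owns the revision counter, the client sends the revision it last saw, and a stale client gets the latest model back instead of overwriting it. The in-memory Map, the sync function and the response shape are all assumptions for illustration; a real server would do this against Postgres.

```javascript
// Hypothetical in-memory "server"; a real one would read and write Postgres.
const serverDocs = new Map(); // docId -> { revision, body }

// The client sends the revision it last saw together with its edited body.
function sync(docId, baseRevision, body) {
  const current = serverDocs.get(docId) || { revision: 0, body: null };
  if (baseRevision !== current.revision) {
    // Client is stale: return the latest model instead of overwriting it.
    return { status: 'conflict', revision: current.revision, body: current.body };
  }
  const next = { revision: current.revision + 1, body };
  serverDocs.set(docId, next);
  return { status: 'ok', revision: next.revision };
}

const a = sync('doc1', 0, { title: 'Draft' });        // accepted, becomes r1
const b = sync('doc1', 0, { title: 'Other device' }); // based on r0: rejected
console.log(a.status, a.revision, b.status, b.revision); // ok 1 conflict 1
```

Because only the server ever increments the number, the client never has to guess it: it just stores whatever revision the last response carried.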
In any case, the server should always be the source of truth. This post provides some good advice on server/mobile referential integrity:
To track inserts you need a Created timestamp ... To track updates you need to track a LastUpdate timestamp on your rows ... To track deletes you need a tombstone table.
Note that when you do a sync, you need to check the time offset between the server and the mobile device, and you need to have a method for resolving conflicts. Inserts are no big deal (they shouldn't conflict), but updates could conflict, and a delete could conflict with an update.
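The quoted advice could look roughly like this. The two Maps stand in for a rows table and a tombstone table, and every name here is made up for illustration. The timestamps are passed in explicitly and are meant to be server time, which sidesteps the clock-offset problem for the bookkeeping itself.

```javascript
// Hypothetical in-memory tables; field names mirror the quoted advice.
const rows = new Map();       // id -> { data, created, lastUpdate }
const tombstones = new Map(); // id -> deletedAt

function upsert(id, data, now) {
  const existing = rows.get(id);
  rows.set(id, { data, created: existing ? existing.created : now, lastUpdate: now });
}

function remove(id, now) {
  // Remember the deletion so clients that sync later can learn about it.
  if (rows.delete(id)) tombstones.set(id, now);
}

// Everything a client needs to catch up since its last sync time.
function changesSince(since) {
  const inserted = [], updated = [], deleted = [];
  for (const [id, row] of rows) {
    if (row.created > since) inserted.push(id);
    else if (row.lastUpdate > since) updated.push(id);
  }
  for (const [id, deletedAt] of tombstones) {
    if (deletedAt > since) deleted.push(id);
  }
  return { inserted, updated, deleted };
}

upsert('a', { v: 1 }, 10);
upsert('b', { v: 1 }, 10);
upsert('a', { v: 2 }, 20);
remove('b', 30);
upsert('c', { v: 1 }, 40);
console.log(changesSince(15)); // inserted: c, updated: a, deleted: b
```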
Okay, let me start by saying that I know this is weird. I do.
But here goes:
Let's say I have an SQL database which stores my data. And let's say I don't have a choice in this, it has to be SQL. The application I'm building has somewhere in the region of 100,000 records in its database, and once every single record has been processed by the users of the application, they all go off and get sent to a different application entirely. So for a short period of time, this application will be in use, and then stops being used until the same time next year. While the application is in use, no external sources will be touching the database at all.
When the (Node) server starts up, it loads everything from the database, into an object literal on the server.
The client-side of this application, on a very basic level, makes requests (to an API on the server) for data, and sends updated versions of records back to the server once they've been processed.
So here's where it gets weird: Let's say I don't want to have the client-side application have to directly retrieve records from the database, nor do I want it to be able to write to them. So the data from the entire database already exists in memory on the server. There's a module on the server that can handle changing the representation of that data already (again, because the client application only interacts with APIs on the server, the database module exists to facilitate this).
Multiple users access the system at once, but due to the way the system works, it is not possible for two users to be sent the same record, so two users will never be sending an update back for the same record (records are processed individually, and sequentially).
So, let's say that I decided that, since I was already managing all of this data in memory on the server, I would just send an updated version of the current data, in its entirety, back to the database, every time it changed.
The question is, where does this rank on the crazy scale?
Performance, writing an entire database rather than single records, would obviously suffer. But, in a database that is only read from once (on start-up of the application), is that even a concern? If every operation other than "Write all the stuff when any of the stuff changes" happened in memory on the server, does it matter how long those updates actually take? If a new update to the database comes in whilst it's being updated, surely SQL will take care of this?
It feels like the correct way to do this of course, is to have each user directly getting their info from the database, and directly making updates to the database too (or at least interacting with API endpoints to make this happen), but, is just...not doing that, utter lunacy?
Like I said, I know it's weird, but other than the fact that "it feels kind of wrong", I'm not sure I'm convinced that it is in fact entirely wrong. So I figured that this place would have an opinion.
The way that I think it currently works is:
[SQL DB] is updated whenever a change happens on {in-memory DB}
{in-memory DB} is updated in various ways based on API calls to the server
The application makes requests for data, and sends updates to data, both of which are processed on the in-memory DB
Multiple requests can happen at the same time from the application, but multiple users cannot see the same record, because records are allocated to a given user before they're sent
Multiple updates can come from multiple users, each of which ultimately ends in the entire SQL database being saved to with the contents of the in-memory DB.
(Note: I'm not saying "is this the best way to do this". I'm just asking, is there a significant argument for caring about the performance of a database being written to, if it's not going to be read from again unless the server needs to be restarted)
What I think that I would do, in this situation, is to add an attribute to each cached record to indicate that the record is "dirty." In other words, that something has been done to it, by someone, since it was originally read from the database.
(You could also add an attribute that indicates that someone "has this particular record 'checked-out,'" so that you can be sure that two users are not updating the same record at the same time.)
At some convenient moment, you can then walk through the collection, posting the "dirty" records back to the database. Use an SQL Transaction, not only for efficiency but also to be sure that the final update to the database is atomic.
You will need to be very mindful of the possibility of race-conditions. One possible strategy is to use a Unix timestamp as a "dirty" indicator. A record is selected for posting to the database only if its "dirty-time" is greater-than or equal-to the timestamp when the commit-process was last run.
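Here's a rough sketch of that timestamp strategy. DirtyTracker and writeBatch are made-up names; writeBatch stands in for whatever runs your UPDATE statements inside a single SQL transaction. The last-flush time is advanced to the moment the flush started, so a record modified while the write is in flight still qualifies for the next run.

```javascript
// Hypothetical in-memory store; `writeBatch` stands in for one SQL transaction.
class DirtyTracker {
  constructor(writeBatch) {
    this.records = new Map(); // id -> { data, dirtyAt }
    this.writeBatch = writeBatch;
    this.lastFlushAt = 0;
  }

  // Every change stamps the record with the time it became dirty.
  update(id, data, now = Date.now()) {
    this.records.set(id, { data, dirtyAt: now });
  }

  // Posts only records changed since the previous flush started.
  async flush(now = Date.now()) {
    const batch = [];
    for (const [id, rec] of this.records) {
      if (rec.dirtyAt >= this.lastFlushAt) batch.push({ id, data: rec.data });
    }
    if (batch.length > 0) await this.writeBatch(batch); // all-or-nothing write
    this.lastFlushAt = now; // advance only after the write succeeded
    return batch.length;
  }
}
```

With "greater-than or equal-to", a record stamped at exactly the flush time may be written twice, which is harmless as long as the write is a plain overwrite.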
(And, P.S.: "no, I've seen even 'weirder' things than this, in all my crazy years in this crazy business ...")
I work on a web app which stores project data. The data is saved in a CouchDB database A. The app pulls and pushes data through a local PouchDB database B, which is synced with A.
So the app can also work offline. When the user gets their connection back, changes made to the local database B while offline are sent to A using a classic replication.
I store one document per project in CouchDB; it is a big JSON object with a lot of data (project todos, collaborators, advancements, risks, problems, etc.).
It is working like a charm, but I have some problems, and it seems I'm using PouchDB the wrong way. Example situation:
User A is offline and he adds a todo on project 1.
User B is online and he adds a new collaborator on project 1.
User B's changes are pushed to CouchDB by the automatic sync.
The project 1 _rev has been incremented.
User B pulls their own changes from CouchDB, because the app downloads all documents whenever any CouchDB change is detected. Weird... I don't know how to prevent that. But the app still works fine, so it's not a big problem.
User A gets their connection back.
User A's changes are ignored because of the older _rev. But the user modified a different project property; can CouchDB detect that by itself and merge with the newer _rev?
I clearly see that my problem is that I'm using one document per project. I could use thousands of documents to store each property of each project and my problem wouldn't happen, but that seems quite weird: to retrieve all the data of a project I would have to scan my whole database, check each document's type (collaborator, todos, ...?), and check whether the document is linked to the project via a new _projectId property added to every document.
Currently I just have to request one document, which contains all project data, then I manipulate my JSON easily. It's quite convenient to handle.
How do I manage this? A project may contain anywhere from 10 to 10,000 properties that multiple users can edit, online or offline.
But the user modified a different project property; can CouchDB detect that by itself and merge with the newer _rev?
PouchDB/CouchDB conflict handling is described in the PouchDB guide: http://pouchdb.com/guides/conflicts.html
the app downloads all documents whenever any CouchDB change is detected. Weird... I don't know how to prevent that.
This is standard PouchDB/CouchDB behavior - you asked it to sync the whole database, so it synced the whole database. :) You can prevent it by using filtered-replication: http://pouchdb.com/api.html#filtered-replication.
How do I manage this? A project may contain anywhere from 10 to 10,000 properties that multiple users can edit, online or offline.
It really really depends on your data, how frequently it may change, what the unique identifier of a single "property" is... Storing 10,000 separate documents in PouchDB/CouchDB is not a crazy idea, though, and may help you out when it comes to conflicts, since only those individual documents can ever be in conflict.
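If you do go with separate documents, you don't need a full scan to load a project. One common approach, sketched here under assumptions, is to put the project ID at the front of each document's _id and fetch the whole range with allDocs(). The ID layout and the helper names (docId, projectRange, loadProject) are made up for illustration; db is expected to be a PouchDB instance.

```javascript
// Hypothetical ID scheme: "<projectId>:<type>:<itemId>", e.g. "project1:todo:42".
function docId(projectId, type, itemId) {
  return `${projectId}:${type}:${itemId}`;
}

// allDocs() range options covering every document of one project.
function projectRange(projectId) {
  return {
    include_docs: true,
    startkey: `${projectId}:`,
    endkey: `${projectId}:\ufff0`, // \ufff0 sorts after ordinary characters
  };
}

// `db` is expected to be a PouchDB instance.
async function loadProject(db, projectId) {
  const result = await db.allDocs(projectRange(projectId));
  return result.rows.map((row) => row.doc);
}
```

With this layout a todo added offline by one user and a collaborator added online by another live in different documents, so the two edits never touch the same _rev.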
In general, I'd recommend you read the guide to conflict resolution as described above and review your options. There's also a plugin that may help you with conflict resolution: https://github.com/jo/pouch-resolve-conflicts
My question is a follow-up to this topic. I love the simplicity and performance of Firebase from what I have seen so far.
As I understand, firebase.js syncs data snapshots from the server into an object in Javascript memory. However there is currently no functionality to cache this data to disk.
As a result:
Applications are required to have a connection when they start-up, thus there is no true offline access.
Bandwidth is wasted every time an app starts up by re-transmitting all previous data.
Since the snapshot data is sitting in memory as a Javascript object, it should be quite trivial to serialize it as JSON and save it to localStorage, so the exact application state can be loaded next time the app is started, online or not. But as the firebase.js code is minified and cryptic I have no idea where to look.
PouchDB handles this very well on a CouchDB backend. (But it lacks the quick response time and simplicity of Firebase.)
So my questions are:
1. What data would I need to serialize to save a snapshot to localStorage? How can I then load this back into Firebase when the app starts?
2. Where can I download the original non-minified dev source code for firebase.js?
(By the way, two features that would help Firebase blow the competition out of the water: offline caching and map reduce.)
Offline caching and map reduce-like functionality are both in development. The firebase.js source is available here for dev and debugging.
You can serialize a snapshot locally using exportVal to preserve all priority data. If you aren't using priorities, a simple value will do:
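var fb = new Firebase(URL);
// Read the data once and log it in both serialized forms.
fb.once('value', function(snapshot) {
  // exportVal() keeps priority data; val() returns plain values only.
  console.log('values with priorities', snapshot.exportVal());
  console.log('values without priorities', snapshot.val());
});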
Later, if Firebase is offline (use .info/connected to help determine this) when your app is loaded, you can call .set() to put that data back into the local Firebase. When/if Firebase comes online, it will be synced.
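Put together, the save and restore steps might look something like the sketch below. The helper names and the storage key are made up for illustration; ref is a Firebase reference as in the snippet above, and storage is window.localStorage or anything else with getItem and setItem. Checking .info/connected before restoring is left out to keep it short.

```javascript
// Sketch only: `ref` is a Firebase reference, `storage` is window.localStorage
// (or anything with getItem/setItem), and CACHE_KEY is a hypothetical name.
const CACHE_KEY = 'firebase-snapshot';

// While online: keep a serialized copy of the latest data.
function saveSnapshot(ref, storage) {
  ref.once('value', function (snapshot) {
    storage.setItem(CACHE_KEY, JSON.stringify(snapshot.exportVal()));
  });
}

// At startup while offline: write the saved copy back into the local Firebase.
function restoreSnapshot(ref, storage) {
  const saved = storage.getItem(CACHE_KEY);
  if (saved === null) return false; // nothing cached yet
  ref.set(JSON.parse(saved));
  return true;
}
```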
However, this is truly only suitable for static data that only one person will access and change. Consider, for example, the fallout if I download the data, keep it locally for a week, and it's modified by several other users during that time, then I load my app offline, make one minor change, and then come online. My stale changes would blow away all the work done in between.
There are lots of ways to deal with this--conflict resolution, using security rules and update counters/timestamps to detect stale data and prevent regressions--but this isn't a simple affair and needs deep consideration before you head down this route.