firestore loadBundle continues to store deleted documents - javascript

I created a bundle of 3 documents. Then I deleted 2 of them and created the bundle again, so the bundle now contains one document, while the application cache still contains three.
Then I download the bundle again:
const resp = await fetch(downloadUrl);
await loadBundle(db, resp.body); // {totalDocuments: 1} - as expected

const query = await namedQuery(db, `my-bundle-query`);
if (query) {
  const snap = await getDocsFromCache(query); // Nope! There should be one document, but there are three
}
Looks like a bug. I think loadBundle should somehow keep track of deleted documents. What should I do?
UPDATE:
namedQuery queries the entire cache, but the expected behavior is to get only the documents associated with the bundle. So this doesn't work at all: my named query has a limit of 4 documents, but I got more than that from the cache!
So the official example is wrong, because it faces the same problem I described above.

TL;DR:
Unfortunately this is a limitation of the way that bundles are currently coded. It would also be out of scope for bundles as they are designed to be used against collections with large amounts of reused data for the initial load of a page.
If you think that loadBundle should purge items from the local cache that aren't returned in the bundle, you should file a feature request to add this feature to the loadBundle call (e.g. loadBundle(rawBundle, /* forceRefresh = */ true)) or file a bug report.
The Details
For example's sake, let's assume your query is "Get the first 10 posts in collection /Posts".
Upon requesting the bundle for this query, the first bundle returns the following results:
{
  "/Posts/D9p7MbcYCbTNzcXLrzfQ": { /* post data */ },
  "/Posts/xz3eY1Gwsjl4tTxTjXyR": { /* post data */ },
  "/Posts/fIvk5LF2zj2xgpgWIv9h": { /* post data */ }
}
which you then load using loadBundle.
Next, you delete two of these documents from the server using another client (using the same client would delete them from the local cache).
Now you re-request the bundle, which returns:
{
  "/Posts/fIvk5LF2zj2xgpgWIv9h": { /* post data */ }
}
Upon calling loadBundle, the library iterates through the collection of documents in the bundle, updating the local cache for each document:
// this is pseudo-code, not the true implementation
function loadBundle(rawBundle) {
  decodedBundle = parseBundle(rawBundle);

  decodedBundle.docs.forEach((doc) => {
    cachedDocuments.set(doc.id, doc);
  });

  return { // return bundle load progress
    totalDocuments: decodedBundle.docs.length,
    /* ... other stats ... */
  };
}
In the above pseudo-code, you can see that only the documents that are included in the bundle are updated in the local cache. Documents not included in the bundle are not updated, and the stats returned relate to the documents included in the bundle that was just decoded - not to the results of the relevant query.
When you run the named query, the query is decoded, compared and executed against the local cache. As the previously cached documents still match the query according to the cache, they are included in the results.
The documents in the local cache will only be omitted when:
The decoded query returns more than 10 results, and they don't meet the criteria of the query.
The query is executed against the live database.
The downloaded bundle includes metadata indicating that the documents were deleted.
So for the local cache to be purged, the bundle would have to contain:
{
  "/Posts/D9p7MbcYCbTNzcXLrzfQ": _DELETED,
  "/Posts/xz3eY1Gwsjl4tTxTjXyR": _DELETED,
  // ... for every deleted document that ever existed ...
  "/Posts/fIvk5LF2zj2xgpgWIv9h": { /* post data */ }
}
Returning such a bundle would be technically complex and incredibly inefficient.
When requesting the bundle, you could include a list of document IDs, where if a given document ID doesn't exist, a document deletion is included in the bundle's data. However, in doing so, you may as well just make a normal database request to the server using getDocs or onSnapshot for the same result, which would be faster and cheaper.
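For comparison, a plain one-off server query covering the same data might look like the sketch below (modular SDK; the Posts collection and limit of 10 come from the example above, while the created ordering field is an assumption):
import { getFirestore, collection, query, orderBy, limit, getDocs } from 'firebase/firestore';

const db = getFirestore();
// Hypothetical equivalent of the bundled named query ("first 10 posts").
const postsQuery = query(collection(db, 'Posts'), orderBy('created', 'desc'), limit(10));
const snap = await getDocs(postsQuery);
// getDocs() always reflects the live database, so deleted documents simply never show up.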
Bundles are designed to be used against collections with large amounts of reused data, and generally only on the initial load of a page. If a post in the top 50 results is deleted, you would invalidate the cached results and rebuild the bundle. All new users would see the changed results immediately; only users still holding the old local copy would see the stale documents.
If you think that loadBundle should purge items from the local cache that aren't returned in the bundle, you should file a feature request to add this feature to the loadBundle call (e.g. loadBundle(rawBundle, /* forceRefresh = */ true)).

Related

File System Access API: is it possible to store the fileHandle of a saved or loaded file for later use?

Working on an app that uses the new(ish) File System Access API, and I wanted to save the fileHandles of recently loaded files, to display a "Recent Files..." menu option and let a user load one of these files without opening the system file selection window.
This article has a paragraph about storing fileHandles in IndexedDB and it mentions that the handles returned from the API are "serializable," but it doesn't have any example code, and JSON.stringify won't do it.
File handles are serializable, which means that you can save a file handle to IndexedDB, or call postMessage() to send them between the same top-level origin.
Is there a way to serialize the handle other than JSON? I thought maybe IndexedDB would do it automatically but that doesn't seem to work, either.
Here is a minimal example that demonstrates how to store and retrieve a file handle (a FileSystemHandle to be precise) in IndexedDB (the code uses the idb-keyval library for brevity):
import { get, set } from 'https://unpkg.com/idb-keyval@5.0.2/dist/esm/index.js';

const pre = document.querySelector('pre');
const button = document.querySelector('button');

button.addEventListener('click', async () => {
  try {
    const fileHandleOrUndefined = await get('file');
    if (fileHandleOrUndefined) {
      pre.textContent =
        `Retrieved file handle "${fileHandleOrUndefined.name}" from IndexedDB.`;
      return;
    }
    // This always returns an array, but we just need the first entry.
    const [fileHandle] = await window.showOpenFilePicker();
    await set('file', fileHandle);
    pre.textContent =
      `Stored file handle for "${fileHandle.name}" in IndexedDB.`;
  } catch (error) {
    alert(`${error.name}: ${error.message}`);
  }
});
I have created a demo that shows the above code in action.
When a platform interface is [Serializable], it means it has associated internal serialization and deserialization rules that will be used by APIs that perform the “structured clone” algorithm to create “copies” of JS values. Structured cloning is used by the Message API, as mentioned. It’s also used by the History API, so at least in theory you can persist FileHandle objects in association with history entries.
In Chromium at the time of writing, FileHandle objects appear to serialize and deserialize successfully when used with history.state in general, e.g. across reloads and backwards navigation. Curiously, it seems deserialization may silently fail when returning to a forward entry: popStateEvent.state and history.state always return null when traversing forwards to an entry whose associated state includes one or more FileHandles. This appears to be a bug.
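For illustration, a minimal sketch of stashing a handle in history.state might look like this (assuming a button click for the user gesture that showOpenFilePicker() requires):
document.querySelector('button').addEventListener('click', async () => {
  const [fileHandle] = await window.showOpenFilePicker();
  // Structured cloning serializes the handle along with the rest of the state object.
  history.replaceState({ fileHandle }, '');
});

// Later, when traversing back to this entry:
window.addEventListener('popstate', (event) => {
  const handle = event.state && event.state.fileHandle;
  if (handle) {
    console.log(`Restored handle "${handle.name}" from history.state`);
  }
});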
History entries are part of the “session” storage “shelf”. Session here refers to (roughly) “the lifetime of the tab/window”. This can sometimes be exactly what you want for FileHandle (e.g. upon traversing backwards, reopen the file that was open in the earlier state). However it doesn’t help with “origin shelf” lifetime storage that sticks around across multiple sessions. The only API that can serialize and deserialize FileHandle for origin-level storage is, as far as I’m aware, IndexedDB.
For those using Dexie to interface with IndexedDB, you will get an empty object unless you leave the primary key unnamed ('not inbound'):
db.version(1).stores({
  test: '++id'
});

const [fileHandle] = await window.showOpenFilePicker();
db.test.add({ fileHandle });
This results in a record with { fileHandle: {} } (empty object)
However, if you do not name the primary key, it serializes the object properly:
db.version(1).stores({
  test: '++'
});

const [fileHandle] = await window.showOpenFilePicker();
db.test.add({ fileHandle });
Result: { fileHandle: FileSystemFileHandle... }
This may be a bug in Dexie, as reported here: https://github.com/dfahlander/Dexie.js/issues/1236

Why is the foreach loop NOT making a change to the file?

I am reviewing a nodeJS program someone wrote to merge objects from two files and write the data to a mongodb. I am struggling to wrap my head around how this is working - although I ran it and it works perfectly.
It lives here: https://github.com/muhammad-asad-26/Introduction-to-NodeJS-Module3-Lab
To start, there are two JSON files, each containing an array of 1,000 objects which were 'split apart' and are really meant to be combined records. The goal is to merge the 1st object of both files together, and then both 2nd objects ...both 1,000th objects in each file, and insert into a db.
Here are the excerpts that give you context:
const customerData = require('./data/m3-customer-data.json')
const customerAddresses = require('./data/m3-customer-address-data.json')

mongodb.MongoClient.connect(url, (error, client) => {
  customerData.forEach((element, index) => {
    element = Object.assign(element, customerAddresses[index])

    // I removed some logic which decides how many records to push to the DB at once
    var tasks = [] // this array of functions is for use with async, not relevant

    tasks.push((callback) => {
      db.collection('customers').insertMany(customerData.slice(index, recordsToCopy), (error, results) => {
        callback(error)
      });
    });
  })
})
As far as I can tell,
element = Object.assign(element, customerAddresses[index])
is modifying the current element during each iteration, i.e. the JSON object in the source file.
to back this up,
db.collection('customers').insertMany(customerData.slice(index, recordsToCopy)
further seems to confirm that when writing the completed merged data to the database the author is reading out of that original customerData file - which makes sense only if the completed merged data is living there.
Since the source files are unchanged, the two things that are confusing me are, in order of importance:
1) Where does the merged data live before being written to the db? The customerData file is unchanged at the end of runtime.
2) What's it called when you access a JSON file using array syntax? I had no idea you could read files without the functionality of the fs module or similar. The author read files using only require('filename'). I would like to read more about that.
Thank you for your help!
Question 1:
The merged data lives in the customerData variable before it's sent to the database. It exists only in memory at the time insertMany is called, and is passed in as a parameter. There is no reason for anything on the file system to be overwritten; in fact, it would be inefficient to modify that .json file every time you called the database. Storing that information is the job of the database, not of a file inside your application. If you did want to overwrite the file, it would be easy enough: add something like fs.writeFile('./data/m3-customer-data.json', JSON.stringify(customerData), 'utf8', () => console.log('overwritten')); after the insertMany call, and be sure to include const fs = require('fs');. To make it clearer what is happening, try writing the value of customerData.length to the file instead.
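As a minimal illustration of that point (with made-up stand-ins for the two JSON files):
const customerData = [{ name: 'Ada' }];          // stand-in for require('./data/m3-customer-data.json')
const customerAddresses = [{ city: 'London' }];  // stand-in for the address file

customerData.forEach((element, index) => {
  Object.assign(element, customerAddresses[index]); // mutates the in-memory object
});

console.log(customerData); // [ { name: 'Ada', city: 'London' } ] - merged, but only in memory
// The .json files on disk are untouched unless you explicitly write them back out.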
Question 2:
Look at the docs on require() in Node. All it's doing is parsing the data in the JSON file.
There's no magic here. A static JSON file is parsed to an array using require and stored in memory as the customerData variable. Its values are manipulated and sent to another computer (the database server), where they can be stored. As the code was originally written, the only purpose that JSON file serves is to be read.
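Roughly speaking, for a .json file these two approaches are equivalent; require() just parses the file for you (and caches the result, so repeated calls return the same object):
const customerData = require('./data/m3-customer-data.json');

// ...is roughly the same as:
const fs = require('fs');
const customerData2 = JSON.parse(fs.readFileSync('./data/m3-customer-data.json', 'utf8'));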

PouchDB delete document upon successful replication to CouchDB

I am trying to implement a process as described below:
Create a sale_transaction document in the device.
Put the sale_transaction document in Pouch.
Since there's a live replication between Pouch & Couch, let the sale_transaction document flow to Couch.
Upon successful replication of the sale_transaction document to Couch, delete the document in Pouch.
Don't let the deleted sale_transaction document in Pouch flow through to Couch.
Currently, I have implemented a two-way sync from both databases, where I'm filtering each document that is coming from Couch to Pouch, and vice versa.
For the replication from Couch to Pouch, I didn't want to let sale_transaction documents go through, since I could just get these documents from Couch.
PouchDb.replicate(remoteDb, localDb, {
  // Replicate from Couch to Pouch
  live: true,
  retry: true,
  filter: (doc) => {
    return doc.doc_type !== "sale_transaction";
  }
})
While for the replication from Pouch to Couch, I put in a filter not to let deleted sale_transaction documents go through.
PouchDb.replicate(localDb, remoteDb, {
  // Replicate from Pouch to Couch
  live: true,
  retry: true,
  filter: (doc) => {
    if (doc.doc_type === "sale_transaction" && doc._deleted) {
      // These are deleted transactions which I don't want to replicate to Couch
      return false;
    }
    return true;
  }
}).on("change", (change) => {
  // Handle change
  replicateOutChangeHandler(change)
});
I also implemented a change handler to delete the sale_transaction documents in Pouch, after being written to Couch.
function replicateOutChangeHandler(change) {
  for (let doc of change.docs) {
    if (doc.doc_type === "sale_transaction" && !doc._deleted) {
      localDb.upsert(doc._id, function(prevDoc) {
        if (!prevDoc._deleted) {
          prevDoc._deleted = true;
        }
        return prevDoc;
      }).then((res) => {
        console.log("Deleted Document After Replication", res);
      }).catch((err) => {
        console.error("Deleted Document After Replication (ERROR): ", err);
      })
    }
  }
}
The flow of the data seems to work at first, but when I get the sale_transaction document from Couch and do some editing, I then have to repeat the process of writing the document to Pouch, letting it flow to Couch, and deleting it in Pouch. But after some editing of the same document, the document in Couch ends up deleted as well.
I am fairly new with Pouch & Couch, specifically in NoSQL, and was wondering if I'm doing something wrong in the process.
For a situation like the one you've described above, I'd suggest tweaking your approach as follows:
Create a PouchDB database as a replication target from CouchDB, but treat this database as a read-only mirror of the CouchDB database, applying whatever transforms you need in order to strip certain document types from the local store. For the sake of this example, let's call this database mirror. The mirror database only gets updated one-way, from the canonical CouchDB database via transform replication.
Create a separate PouchDB database to store all your sales transactions. For the sake of this example, let's call this database user-data.
When the user creates a new sale transaction, this document is written to user-data. Listen for changes on user-data, and when a document is created, use the change handler to create and write the document directly to CouchDB.
At this point, CouchDB is receiving sales transactions from user-data, but your transform replication is preventing them from polluting mirror. You could leave it at that, in which case user-data will have local copies of all sales transactions. On logout, you can just delete the user-data database. Alternatively, you could add some more complex logic in the change handler to delete the document once CouchDB has received it.
If you really wanted to get fancy, you could do something even more elaborate. Leave the sales transactions in user-data after it's written to CouchDB, and in your transform replication from CouchDB to mirror, look for these newly-created sales transactions documents. Instead of removing them, just strip them of anything but their _id and _rev fields, and use these as 'receipts'. When one of these IDs match an ID in user-data, that document can be safely deleted.
Whichever method you choose, I suggest you think about your local PouchDB's _changes feed as a worker queue, instead of putting all of this elaborate logic in replication filters. The methods above should all survive offline cases without introducing conflicts, and recover nicely when connectivity is restored. I'd recommend the last solution, though it might be a bit more work than the others. Hope this helps.
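As a rough sketch of the "changes feed as a worker queue" idea (database names and the remote URL are placeholders, and error handling/retries are omitted for brevity):
const userData = new PouchDB('user-data');
const remoteDb = new PouchDB('https://example.com/couchdb/sales'); // placeholder URL

userData.changes({ since: 'now', live: true, include_docs: true })
  .on('change', async (change) => {
    if (change.deleted) return;                      // ignore our own clean-up deletions
    const doc = change.doc;
    if (doc.doc_type !== 'sale_transaction') return; // only ship sales transactions
    const { _rev, ...toSend } = doc;                 // strip the local revision before writing remotely
    await remoteDb.put(toSend);                      // write straight to CouchDB
    await userData.remove(doc);                      // then drop the local copy
  });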
Maybe add an additional field for deletion, marking the record as to-be-deleted. Then a periodic routine running on both Pouch and Couch scans for records marked for deletion and deletes them.

Efficient DB design with PouchDB/CouchDB

So I was reading a lot about how to actually store and fetch data in an efficient way. Basically my application is about time management/capturing for projects. I would be very happy for any opinion on which strategy I should use, or even suggestions for other strategies. The main concern is the limited resources for local storage in the different browsers.
This is the main data I have to store:
db_projects: This is a database where the projects itself are stored.
db_timestamps: Here go the timestamps per project whenever a project is running.
I came up with the following strategies:
1: Storing the status of the project in the timestamps
When a project is started, a timestamp is added to db_timestamps like so:
db_timestamps.put({
  _id: String(Date.now()),
  title: projectID,
  status: status // could be: 1=active / 2=inactive / 3=paused
})...
This follows the strategy of only adding data to the db and never modifying any entries. The problem I see here is that if I want to get all active projects, for example, I would need to query the whole db_timestamps database, which can contain thousands of entries. Since I cannot use the ID to search for all active projects, this could result in quite a heavy DB query.
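As a side note: if you stay with strategy 1, the pouchdb-find plugin could at least index the status field so the lookup doesn't scan everything by hand, though with the "only add data" model you would still have to reduce the matches to the latest timestamp per project in application code. A sketch, reusing the fields from the snippet above:
// Requires the plugin: PouchDB.plugin(require('pouchdb-find'));
await db_timestamps.createIndex({ index: { fields: ['status'] } });

const result = await db_timestamps.find({
  selector: { status: 1 } // 1 = active, as in the snippet above
});
// result.docs now holds every timestamp ever written with status 1;
// the newest entry per title still has to be picked out in code.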
2: Storing the status of the project in db_projects
Each time a project changes its status, there is an update to the project itself. So the "get all active projects" query would be much more resource-friendly, since there are far fewer projects than timestamps. But this would also mean that each time a status change happens, the project entry would be revisioned and would therefore produce "a lot" of overhead. I'm also not sure if the compaction feature would do a good job, since not all revision data is deleted (the documents are, but the leaf revisions are not). This means that for a state change we keep at least the _rev information, which is still a string of 34 characters, for changing only the status (1 character). Or can I delete the leaf revisions after conflict resolution?
3: Storing the status in a separate DB like db_status
This leads to the same problem as in #2, since status changes lead to revisions in this DB. Or, if the states were added in "only add data" mode (like in #1), it would just quickly fill up with entries.
The general problem is that you have a limited amount of space that you can put into IndexedDB. On the other hand, the principle of CouchDB is that storage space is cheap (which is indeed true when you store on the server side only). Here is an interesting discussion about that.
So this is the solution that I use for now. I am using a mix between solution 1 and solution 2 from above with the following additions:
Storing only the timestamps in a synced database (db_timestamps) with the "only add data" principle.
Storing the projects and their states in a separate local (not synced) database (db_projects). Therefore I still use PouchDB, since it has a much simpler API than IndexedDB.
Storing the new/changed project status in each timestamp as well (so you could rebuild db_projects out of db_timestamps if needed).
Deleting db_projects every so often and repopulating it, so the revision data (the overhead for this DB in my case) is eliminated and the size stays acceptable.
I use the following code to rebuild my DB:
//--------------------------------------------------------------------
function rebuild_db_project(){
  db_project.allDocs({
    include_docs: true,
    //attachments: true
  }).then(function (result) {
    // do stuff
    console.log('I have read the DB and delete it now...');
    deleteDB('db_project', '_pouch_DB_Projekte');
    return result;
  }).then(function (result) {
    console.log('Creating the new DB...' + result);
    db_project = new PouchDB('DB_Projekte');
    var dbContentArray = [];
    for (var row in result.rows) {
      delete result.rows[row].doc._rev; // delete the revision of the doc, else it would raise an error on the bulkDocs() operation
      dbContentArray.push(result.rows[row].doc);
    }
    return db_project.bulkDocs(dbContentArray);
  }).then(function (response) {
    console.log('I have successfully populated the DB with: ' + JSON.stringify(response));
  }).catch(function (err) {
    console.log(err);
  });
}
//--------------------------------------------------------------------
function deleteDB(PouchDB_Name, IndexedDB_Name){
  console.log('DELETE');
  new PouchDB(PouchDB_Name).destroy().then(function () {
    // database destroyed
    console.log("pouchDB destroyed.");
  }).catch(function (err) {
    // error occurred
  });

  var DBDeleteRequest = window.indexedDB.deleteDatabase(IndexedDB_Name);
  DBDeleteRequest.onerror = function(event) {
    console.log("Error deleting database.");
  };
  DBDeleteRequest.onsuccess = function(event) {
    console.log("IndexedDB deleted successfully");
    console.log(DBDeleteRequest.result); // should be null
  };
}
So I not only use the pouchDB.destroy() command but also the indexedDB.deleteDatabase() command to get the storage freed nearly completely (there are still some 4 kB that are not freed, but this is insignificant to me).
The timings are not really proper, but it works for me. I'd be happy if someone has an idea to make the timing work properly (the problem for me is that IndexedDB does not support promises).
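Regarding the timing: IndexedDB itself has no promise API, but its requests are easy to wrap in one, which would let the rebuild await the deletion before recreating the database. A sketch, keeping the names from above:
function deleteIndexedDB(name) {
  return new Promise(function (resolve, reject) {
    var request = window.indexedDB.deleteDatabase(name);
    request.onsuccess = function () { resolve(); };
    request.onerror = function () { reject(request.error); };
    request.onblocked = function () { console.warn('Deletion blocked - close other tabs using this DB.'); };
  });
}

// Inside deleteDB(), this allows proper sequencing:
// new PouchDB(PouchDB_Name).destroy()
//   .then(function () { return deleteIndexedDB(IndexedDB_Name); })
//   .then(function () { console.log('Both stores cleaned up.'); });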

Progress event for large Firebase queries?

I have the following query:
fire = new Firebase 'ME.firebaseio.com'
users = fire.child 'venues/ID/users'
users.once 'value', (snapshot) ->
  # do things with snapshot.val()
  ...
I am loading 10+ MB of data, and the request takes around 1 second per MB. Is it possible to give the user a progress indicator as content streams in? Ideally I'd like to process the data as it comes in as well (not just notify).
I tried using the on "child_added" event instead, but it doesn't work as expected - instead of children streaming in at a consistent rate, they all come at once after the entire dataset is loaded (which takes 10-15 sec), so in practice it seems to be a less performant version of on "value".
You should be able to optimize your download time from 10-20 seconds to a few milliseconds by starting with some denormalization.
For example, we could move the images and any other peripherals comprising the majority of the payload to their own path, keep only the meta data (name, email, etc) in the user records, and grab the extras separately:
/users/user_id/name, email, etc...
/images/user_id/...
The number of event listeners you attach or paths you connect to does not have any significant overhead locally or for networking bandwidth (just the payload) so you can do something like this to "normalize" after grabbing the meta data:
var firebaseRef = new Firebase(URL);
firebaseRef.child('users').on('child_added', function(snap) {
  console.log('got user ', snap.name());

  // I chose once() here to snag the image, assuming they don't change much
  // but on() would work just as well
  firebaseRef.child('images/' + snap.name()).once('value', function(imageSnap) {
    console.log('got image for user ', imageSnap.name());
  });
});
You'll notice right away that when you move the bulk of the data out and keep only the meta info for users locally, they will be lightning-fast to grab (all of the "got user" logs will appear right away). Then the images will trickle in one at a time after this, allowing you to create progress bars or process them as they show up.
If you aren't willing to denormalize the data, there are a couple ways you could break up the loading process. Here's a simple pagination approach to grab the users in segments:
var firebaseRef = new Firebase(URL);
grabNextTen(firebaseRef, null);

function grabNextTen(ref, startAt) {
  ref.limit(startAt ? 11 : 10).startAt(startAt).once('value', function(snap) {
    var lastEntry;
    snap.forEach(function(userSnap) {
      // skip the startAt() entry, which we've already processed
      if (userSnap.name() === startAt) { return; }
      processUser(userSnap);
      lastEntry = userSnap.name();
    });
    // setTimeout closes the call stack, allowing us to recurse
    // infinitely without a maximum call stack error
    setTimeout(grabNextTen.bind(null, ref, lastEntry));
  });
}
function processUser(snap) {
console.log('got user', snap.name());
}
function didTenUsers(lastEntry) {
console.log('finished up to ', lastEntry);
}
A third popular approach would be to store the images in a static cloud asset store like Amazon S3 and simply keep the URLs in Firebase. For large data sets in the hundreds of thousands this is very economical, since those solutions are a bit cheaper than Firebase storage.
But I'd highly suggest you both read the article on denormalization and invest in that approach first.
