Unable to limit Dynamo Query in NodeJS [duplicate] - javascript

I need to make a scan with a limit and a condition on DynamoDB.
The docs say:
In a response, DynamoDB returns all the matching results within the scope of the Limit value. For example, if you issue a Query or a Scan request with a Limit value of 6 and without a filter expression, DynamoDB returns the first six items in the table that match the specified key conditions in the request (or just the first six items in the case of a Scan with no filter). If you also supply a FilterExpression value, DynamoDB will return the items in the first six that also match the filter requirements (the number of results returned will be less than or equal to 6).
The code (NODEJS):
var params = {
    ExpressionAttributeNames: {"#user": "User"},
    ExpressionAttributeValues: {":user": parseInt(user.id)},
    FilterExpression: "#user = :user and attribute_not_exists(Removed)",
    Limit: 2,
    TableName: "XXXX"
};
DynamoDB.scan(params, function(err, data) {
    if (err) {
        dataToSend.message = "Unable to query. Error: " + err.message;
    } else if (data.Items.length == 0) {
        dataToSend.message = "No results were found.";
    } else {
        dataToSend.data = data.Items;
        console.log(dataToSend);
    }
});
Table XXXX definitions:
Primary partition key: User (Number)
Primary sort key: Identifier (String)
INDEX:
Index Name: RemovedIndex
Type: GSI
Partition key: Removed (Number)
Sort key: -
Attributes: ALL
In the code above, if I remove the Limit parameter, DynamoDB returns the items that match the filter requirements. So, the conditions are OK. But when I scan with the Limit parameter, the result is empty.
The XXXX table has 5 items. Only the first 2 have the Removed attribute. When I scan without the Limit parameter, DynamoDB returns the 3 items without the Removed attribute.
What am I doing wrong?

From the docs that you quoted:
If you also supply a FilterExpression value, DynamoDB will return the
items in the first six that also match the filter requirements
By combining Limit and FilterExpression you have told DynamoDB to only look at the first two items in the table, and evaluate the FilterExpression against those items. Limit in DynamoDB can be confusing because it works differently from LIMIT in a SQL expression in an RDBMS.

I also ran into this issue; I guess you will just have to scan the whole table, up to a maximum of 1 MB per call.
Scan
The result set from a Scan is limited to 1 MB per call. You can use the LastEvaluatedKey from the scan response to retrieve more results.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
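A minimal sketch of what that looping could look like in Node.js, assuming the AWS SDK v2 DocumentClient; scanAll and desiredCount are hypothetical names used only for illustration:
var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

// Keep scanning pages until enough matching items have been collected
// or the table is exhausted.
function scanAll(params, desiredCount, items, callback) {
    docClient.scan(params, function(err, data) {
        if (err) return callback(err);
        items = items.concat(data.Items);
        if (data.LastEvaluatedKey && items.length < desiredCount) {
            params.ExclusiveStartKey = data.LastEvaluatedKey; // resume where the last page stopped
            return scanAll(params, desiredCount, items, callback);
        }
        callback(null, items.slice(0, desiredCount));
    });
}
Called as scanAll(params, 2, [], callback) with the params from the question (without Limit, or with a larger Limit), this keeps paging until two items pass the filter or the table runs out.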

You might be able to get what you need by using a secondary index. Take the classic RDB customer/order example: you have one table for customers and one for orders. The Orders table has a key consisting of Customer (HASH) and Order (RANGE). So if you wanted to get the latest 10 orders, there would be no way to do it without a scan.
But if you create a Global Secondary Index on orders with "Some Constant" as the HASH and Date as the RANGE, and query against that index, the query would do what you want and only charge you for the RCUs of the records returned. No expensive scan needed. Note that writes will be more expensive, but in most cases there are many more reads than writes.
Now you have your original problem again if you want the 10 latest orders for a day that are larger than $1000. The query would return the last 10 orders and then filter out those less than $1000.
In this case, you could create a computed key of Date-OrderAmount, and queries against that index would return what you want.
It's not as simple as SQL, but you need to think about access patterns in SQL too. If you have a lot of data, you need to create indexes in SQL or the DB will happily do table scans on your behalf, which will impair performance and raise your costs.
Note that everything I proposed is normalized in the sense that there is only one source of truth. You are not duplicating data -- you are merely recasting views of it to get what you need from DynamoDB.
Bear in mind that the CONSTANT as a HASH is subject to the 10 GB per partition limit, so you would need to design around it if you had a lot of active data. For example, depending on your expected access pattern, you could use Customer instead of a constant as the HASH. Or use Streams to organize the data (or subsets of it) in other ways.
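A rough sketch of such a query with the DocumentClient; the table name Orders, index name OrdersByDateIndex, and key names GsiHash/OrderDate are illustrative assumptions, not something from the question:
var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

// Hypothetical GSI: partition key "GsiHash" (a constant), sort key "OrderDate".
var params = {
    TableName: "Orders",
    IndexName: "OrdersByDateIndex",
    KeyConditionExpression: "GsiHash = :constant",
    ExpressionAttributeValues: {":constant": "ALL_ORDERS"},
    ScanIndexForward: false, // sort by OrderDate descending, i.e. newest first
    Limit: 10
};
docClient.query(params, function(err, data) {
    if (err) console.error(err);
    else console.log(data.Items); // the 10 latest orders
});
Because this is a Query against a key condition rather than a Scan with a filter, Limit applies to the items you actually want.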

Small hack - Iterate till you get the results
lastEvaluatedKey = null;
do {
    if (lastEvaluatedKey != null) {
        // query or scan data WITH the last evaluated key (pass it as ExclusiveStartKey)
    } else {
        // query or scan data WITHOUT a last evaluated key
    }
    lastEvaluatedKey = key of the last item evaluated (LastEvaluatedKey from the response)
} while (lastEvaluatedKey != null && retrievedResultSize == 0); // == 0, or < yourLimit
If the number of items retrieved is 0 and lastEvaluatedKey is not null, that means DynamoDB has scanned or queried the number of rows matching your Limit but none of them matched the filter expression, so you continue from the last evaluated key.

Related

Elasticsearch: return unique records across multiple indexes

I am trying to return unique records across multiple indexes.
Suppose I have two indexes, indexA and indexB. My Elasticsearch query searches both of these indexes.
If I am filtering by the field name "Type" (this exists in both indexes), how would I only get the unique ones?
Example: indexA has a record with the field "type" with value "alpha" and indexB has a record with the field "type" with value "alpha". My Elasticsearch query should only output one of these records (it does not matter which).
So far I have this:
searchParams = {
    "body": {
        "size": searchService.PAGE_SIZE,
        "from": searchService.currentPage * searchService.PAGE_SIZE,
        "query": {
            "bool": {
                "must": must
            }
        },
        "aggs": {
            "unique_type": {
                "terms": {
                    "field": "type",
                    "size": 1
                }
            }
        }
    }
};
But it's not working.
Thanks!
Your query just needs a little tweak: change the size parameter's value.
How can I return N most frequent values of keyword type across several indexes?
You can use terms aggregation to do that.
In the terms aggregation, the size parameter limits the number of buckets returned. In your case you have set it to 1, so this aggregation will return only 1 bucket.
Set size to 10 or another appropriate amount. This will return the N most frequent values of that field (type in your case).
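For example, the aggs part of your request could become the following (10 here is arbitrary; use whatever covers your data):
"aggs": {
    "unique_type": {
        "terms": {
            "field": "type",
            "size": 10
        }
    }
}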
By the way, all Elasticsearch searches can be done across multiple indexes simultaneously.
What if I also want an example document for each bucket?
Bucket aggregations collect unique values of a given kind, called buckets, and count how many documents fall into each bucket.
Aggregations return some statistics, like AVG() and SUM() do in SQL, on the entire result set. They are single numbers, not documents. In your case, Elasticsearch will first restrict the set of documents to those only matching your must query that you specified, and then compute all aggregations for that set of documents.
Is there a way to ask Elasticsearch to go back from these aggregation results and fetch a "top hit" for each bucket? There is, and it is called the top_hits aggregation. In your case, such a top_hits aggregation would go inside the terms one.
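A rough sketch of how that nesting could look; the sub-aggregation name example_doc is made up for illustration:
"aggs": {
    "unique_type": {
        "terms": {
            "field": "type",
            "size": 10
        },
        "aggs": {
            "example_doc": {
                "top_hits": {
                    "size": 1
                }
            }
        }
    }
}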
Both terms and top_hits aggregations have their limitations; for example, they cannot return all the buckets if there are too many, or all matching documents, since Elasticsearch tries to be as fast as possible. Please check the corresponding documentation pages.
What if I do need the entire list of all unique values of a field?
In this case you can use composite aggregation and paginate over buckets, like you are already doing pagination over search results (with size and from).
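A sketch of what that could look like for the same type field; fetch the next page of buckets by passing the after_key from each response back in the after parameter:
"aggs": {
    "unique_types": {
        "composite": {
            "size": 100,
            "sources": [
                { "type_value": { "terms": { "field": "type" } } }
            ]
        }
    }
}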
Hope this helps!

Querying for object key in Firestore

I currently have a few issues with my Firestore querying technique. As per this Stack Overflow post I made recently, Querying with two array with firestore security rules
The answer proposed adding the "ids" into an object, with the key as the id and the value simply being "true". I have completed this, and now my structure looks like so:
This leaves me with this query:
db.collection('Depots')
.where(`products.${productId}`, '==', true)
.where(`users.${userId}`, '==', true)
.where('created', '>', 1585998560500)
.orderBy('created', 'asc')
.get();
This query leaves me with throwing an error, asking to create an index:
The query requires an index. You can create it here: ...
However, this tries to index the specific object key, i.e. QXooVYGBIFWKo6C, so products.QXooVYGBIFWKo6C. Which is certainly not what I want, as this query changes and can have an infinite number of possibilities, which means I would have to create another index for each key entry in order to query it.
Is there any way to solve this issue? I am assuming it needs to index this query due to the different operators used in the query, so I was wondering if there were any workarounds to this issue.
Thank you very much in advance.
What you have here is a map field, for which indexes should usually be created automatically.
That indeed means that you'll have as many indexes as you have products, which means:
You are limited in how many products you can have, as there is a maximum of 40,000 index entries per document.
You pay more per document, as you pay for the storage of each index.
If these are not what you want, you'll have to switch back to your original model, with the query limitations you had there. There doesn't seem to be a solution that fits both of your requirements.
After our discussion in chat, this is the starting point I would suggest. Who knows what the end architecture would look like, but I think this or very close to this. You say that a user can exist in multiple depots at the same time and multiple depots can contain the same products, also at the same time. You also said that a depot can never have more than 40 users at a given time, so an array of 40 users would certainly not encroach on Firestore's document limit of 1,048,576 bytes.
[collection]
    <documentId>
        - field: value

[depots]
    <UUID>
        - depotId: string "depot456"
        - productCount: num 5,000
    <UUID>
        - depotId: string "depot789"
        - productCount: num 4,500

[products]
    <UUID>
        - productId: string "lotion123"
        - depotId: string "depot456"
        - users: [string] ["user10", "user27", "user33"]
    <UUID>
        - productId: string "lotion123"
        - depotId: string "depot789"
        - users: [string] ["user10", "user17", "user50"]

[users]
    <userId>
        - depots: [string] ["depot456", "depot999"]
    <userId>
        - depots: [string] ["depot333", "depot999"]
In NoSQL, storage is cheap and computation isn't, so denormalize your data as much as you need to make your queries possible and efficient (fast and cheap).
To find all depots in a single query where user10 and lotion123 are both true, query the products collection where productId equals x and users array-contains y and collect the depotId values from those results. If you want to preserve the array-contains operation for something else, you'd have to denormalize your data further (replace the array for a single user). Or you could split this query into two separate queries.
With this model, when a user leaves a depot, get all products where users array-contains that user and remove that userId from the array. And when a user joins a depot, get all products where depotId equals x and append that userId to the array.
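A minimal sketch of that lookup with the Firestore web SDK (v8 syntax); the collection and field names follow the model above, and Firestore may prompt you to create a composite index for this filter combination:
// Find every depot in which "user10" has access to "lotion123".
db.collection('products')
    .where('productId', '==', 'lotion123')
    .where('users', 'array-contains', 'user10')
    .get()
    .then(function(snapshot) {
        var depotIds = snapshot.docs.map(function(doc) {
            return doc.data().depotId;
        });
        console.log(depotIds); // e.g. ["depot456", "depot789"]
    });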
Watch this video, and others by Rick, to get a solid handle on NoSQL: https://www.youtube.com/watch?v=HaEPXoXVf2k
@danwillm If you are not sure about the number of users and products, then your DB structure seems unfit for this situation, because there are size and length limitations on a Firestore document.
You should instead create a separate collection for products and users, i.e. normalize your data, and keep a reference to the user in the product collection.
User:
{
    userId: documentId,
    name: John,
    ...otherInfo
}

Product:
{
    productId: documentId,
    createdBy: userId,
    createdOn: date,
    productName: "exa",
    ...otherInfo
}
This way the size of each document stays limited; i.e., try to avoid using maps/arrays in Firestore if you are not sure about their size.
Also, in this case the number of queries increases, but you don't need many indexes.

How can filter through 100K records in indexedDB?

I have an IndexedDB store that has 100k records containing names.
How do I filter it based on whether the name includes some substring?
I tried to use IndexedDB's getAll(), but it resulted in increased CPU usage.
Using a cursor to iterate took a lot of time.
I tried Dexie.js.
Is there any good implementation for this kind of operation?
If you index the name field, you can do a prefix search, but ordinary indexes are not enough for full substring searches.
const db = new Dexie('dbname');
db.version(1).stores({things: 'id, name'});

function query(prefix) {
    return db.things
        .where('name').startsWith(prefix)
        .toArray();
}
This sample defines an index on the 'name' field, and the query function will do a getAll() on the 'name' index using an IDBKeyRange representing all names that start with the given string.
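If you really need a full substring match rather than a prefix match, one option (just a sketch, and it still visits every record, so expect it to be slow on 100k rows) is Dexie's filter():
function substringQuery(substring) {
    const needle = substring.toLowerCase();
    return db.things
        .filter(function(thing) {
            // case-insensitive substring test on every record
            return thing.name.toLowerCase().includes(needle);
        })
        .toArray();
}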

How do you filter an airtable record that contains a given string, but may also contain others?

Using the Airtable API's filterByFormula method, I can query and select records that contain one specific item (string). However, this query only returns the records that have ONLY that specific string.
Searching the airtable api formula documentation here:
https://support.airtable.com/hc/en-us/articles/203255215-Formula-Field-Reference
yields no answers.
What I have:
base('Table').select({
    view: 'Main View',
    filterByFormula: `Column = "${myFilter}"`
}).firstPage(function(err, records) {
    if (err) { console.error(err); reject(err); }
    records.forEach(function(record) {
        console.log(record.get('Another Column'));
    });
});
Again, this returns records that have ONLY myFilter, and not records that contain any additional values in the given Column.
edit: airtable refers to these columns as 'multiselect' field types.
It appears that Airtable has support for the JS equivalent of indexOf in the form of SEARCH. From the docs:
SEARCH(find_string, within_string, position)
Returns the index of the first occurrence of find_string in within_string, starting at position. Position is 0 by default.
As such, we can use the following as our filterByFormula value:
filterByFormula: `SEARCH("${myFilter}", Column) > 0`
This will search our Column for the presence of myFilter and return any result where myFilter is located at some position within the Column.
NB: It may be worth noting that your myFilter property is probably best kept to a minimum length of at least 2 characters for performance. The docs also do not specify an option for case insensitivity, so you may wish to wrap both Column and "${myFilter}" in a LOWER() function if you want to return results regardless of upper- or lower-case characters. For example:
filterByFormula: `SEARCH(LOWER("${myFilter}"), LOWER(Column)) > 0`
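Putting it together with the original select() call, the request might look like this (table, view, and column names are the same placeholders used in the question):
base('Table').select({
    view: 'Main View',
    filterByFormula: `SEARCH(LOWER("${myFilter}"), LOWER(Column)) > 0`
}).firstPage(function(err, records) {
    if (err) { console.error(err); return; }
    records.forEach(function(record) {
        console.log(record.get('Another Column'));
    });
});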

Range query for MongoDB pagination

I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:
db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })
However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":
The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system generate values, within a single second; ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering even for values, because client drivers generate ObjectId values, not the mongod process.
I then thought about querying with a timestamp:
db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })
However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.
Is there any sort of ranged query that would provide me with more stability?
It is perfectly fine to use ObjectId(), though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id value in descending order, and you want the most recent 50. Your problem is that pagination is tricky when the current result set is changing, so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value) and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether the _id value strictly corresponds to insertion order - it will be 99.999% close enough, and no one actually cares at the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order; it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
Wouldn't a tweet's "actual" timestamp (i.e. the time tweeted, and the criterion you want it sorted by) be different from a tweet's "insertion" timestamp (i.e. the time it was added to the local collection)? This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in correct order), you wouldn't be able to rely just on $natural or ObjectID for sorting logic.
Mongo docs suggest skip and limit for paging:
db.tweets.find({created: {$lt: maxID}}).
          sort({created: -1, username: 1}).
          skip(50).limit(50); //second page
There is, however, a performance concern when using skip:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return result. As offset increases, cursor.skip() will become slower and more CPU intensive.
This happens because skip does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:
db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}}).
          sort({created: -1, username: 1}).
          limit(50); //next page
Performing a find condition like this can be easily parallelized. But how to deal with pages other than the next one? You don't know the begin date for pages number 5, 10, 20, or even the previous page! @SergioTulentsev suggests creative chaining of methods, but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account tie-breaker (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.
    find({pagenum: 3})
> {pagenum: 3, begin: "01-01-2014#BillGates", end: "03-01-2014#big_ben_clock"}

db.tweets.
    find({_sortdate: {$lt: "03-01-2014#big_ben_clock", $gt: "01-01-2014#BillGates"}}).
    sort({_sortdate: -1}).
    limit(50) //third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
    _id: ...,
    created: ...,  //to be used in markup
    user: ...,     //also to be used in markup
    _sortdate: "01-01-2014#BillGates" //sorting only, use date AND time
}
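A sketch of how such a composite sort value could be built at insert time. Note that this sketch uses an ISO-8601 timestamp rather than the dd-mm-yyyy form shown above, since ISO strings sort lexicographically in chronological order; makeSortDate is a made-up helper name:
// Build "_sortdate" from the creation time and the username so that
// ties on the timestamp are broken deterministically.
function makeSortDate(created, user) {
    return created.toISOString() + "#" + user;
}

var now = new Date();
db.tweets.insert({
    created: now,
    user: "BillGates",
    _sortdate: makeSortDate(now, "BillGates")
});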
The following approach will work even if multiple documents are inserted/updated in the same millisecond, even from multiple clients (which generate the ObjectIds). For simplicity, in the following queries I am projecting _id and lastModifiedDate.
For the first page, fetch the results sorted by lastModifiedDate (descending), then _id (ascending):
db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched in this page (loid, lmd).
For the second page, include a query condition to search for (lastModifiedDate = lmd AND _id > loid) OR (lastModifiedDate < lmd):
db.product.find({$or: [{"lastModifiedDate": {$lt: lmd}}, {$and: [{"lastModifiedDate": lmd}, {"_id": {$gt: loid}}]}]}, {"_id": 1, "lastModifiedDate": 1}).sort({"lastModifiedDate": -1, "_id": 1}).limit(2)
Repeat the same for subsequent pages.
ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.
Update:
To query the previous second of ObjectIds you will need to construct an ObjectID manually.
See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/
Try using this expression to do it from a mongos.
{ _id :
{
$lt : ObjectId(Math.floor((new Date).getTime()/1000 - 1).toString(16)+"ffffffffffffffff")
}
}
The 'f's at the end are there to max out the possible random bits that are not associated with the timestamp, since you are doing a less-than query.
I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
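A sketch of doing that on a Node.js application server with the official mongodb driver; maxObjectIdBefore is a hypothetical helper name, and the tweets collection is just the example used throughout this question:
const { ObjectId } = require('mongodb');

// Build an ObjectId whose timestamp is `secondsAgo` seconds in the past and whose
// remaining bytes are maxed out, for use as an exclusive upper bound in $lt queries.
function maxObjectIdBefore(secondsAgo) {
    const ts = Math.floor(Date.now() / 1000 - secondsAgo).toString(16);
    return new ObjectId(ts + 'ffffffffffffffff');
}

// Usage: the 50 most recent tweets created before the last full second.
// db.collection('tweets')
//     .find({ _id: { $lt: maxObjectIdBefore(1) } })
//     .sort({ _id: -1 })
//     .limit(50)
//     .toArray();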
I have built pagination using the MongoDB _id this way:
// import ObjectId from mongodb
let sortOrder = -1;
let query = {};
if (prev) {
    sortOrder = 1;
    query = {title: 'findTitle', _id: {$gt: ObjectId('_idValue')}};
}
if (next) {
    sortOrder = -1;
    query = {title: 'findTitle', _id: {$lt: ObjectId('_idValue')}};
}
db.collection.find(query).limit(10).sort({_id: sortOrder});
