I want to store an array of objects, each containing an event_timestamp property that holds a Date object with the event's timestamp.
I'd like to provide a from Date and a to Date and collect all the objects within that range.
I don't want to iterate over the whole array; I'd like something faster, for example O(log n) or anything else.
Is there something like that built into JavaScript? Is there a third-party library that provides an O(log n) search or some other fast method?
Any information regarding the issue would be greatly appreciated.
Thanks!
Update
Thanks for your responses so far.
My scenario is this: I have a dashboard that contains graphs. A user can ask, for example, for the results of the last 15 minutes. The server then returns rows with the relevant results, and each row contains an event_timestamp property that holds a Date object.
Now, if the user searches for the last 15 minutes again after 5 minutes have passed, the first 10 minutes of that range were already queried before, so I want to cache that part and request only the last 5 minutes from the server.
So whenever the user queries the server, I take the response and parse it using the following steps:
- get the event_time of the first row
- get the event_time of the last row
- append the following object to an array:

{
  rows: rows,
  fromDate: firstRowDate,
  toDate: lastRowDate
}
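In code, those steps might look roughly like this (the cache array and the cacheResponse function name are just illustrative; rows is the parsed server response, sorted by time):

const cache = [];

function cacheResponse(rows) {
  if (rows.length === 0) return;
  cache.push({
    rows: rows,
    fromDate: rows[0].event_timestamp,               // event_time of the first row
    toDate: rows[rows.length - 1].event_timestamp,   // event_time of the last row
  });
}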
Now, in the previous example, the user first queries for 15 minutes and then, 5 minutes later, queries for 15 minutes again, which means I already have 10 valuable minutes of data in that object.
The main issue is how to quickly find the relevant rows I need for my graph.
The simple way is to iterate over all the rows and pick the ones in the range from fromDate to toDate, but if I'm dealing with a few million rows that's going to be a burden for the client.
I use Google FlatBuffers to transfer the data and then build the rows myself, so I can store them in any other format.
Hope this helps explain my needs. Thanks!
With your current way of storing the data there is no faster option than iterating over the whole array, so your complexity will be O(n) just for the traversal.
The information you need is stored inside objects which are themselves stored in the array. How else would you access that information than by looking at each single object? Even in other languages with different array implementations, there wouldn't be a faster way.
The only way you could make this faster is by changing the way you store your data.
For example, you could implement a BST (binary search tree) with the date as the key. Since it is ordered by date, you wouldn't have to traverse the whole tree and would need fewer operations to find the nodes in range. I'm afraid you would have to build it yourself, though.
Update:
This was related to your initial question. Your update goes in a whole new direction.
I don't think your approach is a good idea. You won't be able to handle such a large amount of data, at least not in a performant way. Why not query the server more often, based on what the client really needs? You could even adjust the resolution, as you might not need all the points, depending on how large the chosen range is.
Nonetheless, if we can assume the array you get is already sorted by date, you can find your matching values faster with a binary search.
This is how it basically works: you start in the middle of your array; if the value found there is higher than the one you are looking for, you inspect the left half of the array, and if it is lower, you inspect the right half. Then you inspect the middle of your new section. You continue with this pattern until you've found your value.
(Figure: visualization of the binary search algorithm, with 4 as the target value.)
The average performance of a Binary Search Algorithm is O(log n), so this would definitely help.
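As a rough illustration (not a drop-in solution), a lower-bound variant of that search over an array sorted by event_timestamp could look like this; the function name lowerBound is just illustrative:

// Returns the index of the first row whose event_timestamp is >= target,
// assuming rows is sorted ascending by event_timestamp.
function lowerBound(rows, target) {
  let lo = 0;
  let hi = rows.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;            // middle of the current section
    if (rows[mid].event_timestamp < target) {
      lo = mid + 1;                        // continue in the right half
    } else {
      hi = mid;                            // continue in the left half (mid may match)
    }
  }
  return lo;
}

// lowerBound(rows, fromDate) gives the start of the range; the same search
// with <= instead of < against toDate gives the (exclusive) end of the range.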
This is just a start to give you an idea. You will need to modify it a bit, to work with your range.
Related
Just looking to see if there's an elegant solution to this problem:
Is there a way to loop through the results of a psql query and return a specific result based on the SQL query?
For example, let's say I wanted to SELECT amount_available FROM lenders ORDER BY interest_rate, and I wanted to loop through the column looking for any available amounts, add those available amounts to a variable, and then once that amount reached a certain figure, exit.
More verbose example:
Let's say I have someone who wants to borrow $400. I want to go through my lenders table and look for any lender that has available funds to lend. Additionally, I want to start with the lenders offering the lowest interest rate. How could I query the database, find the results that satisfy the $400 loan at the lowest interest rate, and stop once I've reached my goal instead of searching the whole db? And can I do that inside a JavaScript function, returning the records that meet those criteria?
Maybe I'm trying to do something that's not possible, but just curious.
Thanks!
You translate your requirement into the SQL language. After all, SQL is a declarative language; the database engine then figures out how to process the request.
Your example sounds like:
SELECT name
FROM lenders
WHERE amount_available >= 400
ORDER BY interest_rate
FETCH FIRST ROW ONLY;
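Note that this finds a single lender whose available amount covers the whole $400. If you instead want to accumulate amounts across several lenders inside a JavaScript function, as described in the question, a sketch along these lines is possible (the node-postgres client and the exact column names are assumptions; to stop the database early rather than just stopping the loop, you would reach for a cursor such as the pg-cursor module, or a LIMIT):

const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from environment variables

async function findLenders(goal) {
  // Cheapest lenders first, so we can stop as soon as the goal is covered.
  const { rows } = await pool.query(
    'SELECT name, amount_available, interest_rate FROM lenders ' +
    'WHERE amount_available > 0 ORDER BY interest_rate'
  );

  const chosen = [];
  let total = 0;
  for (const lender of rows) {
    chosen.push(lender);
    total += Number(lender.amount_available);
    if (total >= goal) break;              // goal reached, stop looking
  }
  return total >= goal ? chosen : null;    // null if available funds are insufficient
}

// findLenders(400).then(console.log);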
So imagine I want to retrieve all orders for an array of customers.
The arrayList in the example below will hold an array of customer IDs.
This array will be passed into the get method below and processed asynchronously, retrieving the orders for each customer ID in the array.
Here's where I get lost: how can you paginate the database result set and pull only a small set of records at a time, without having to pull all the records across the network?
What confuses me is the asynchronous nature, as well as the fact that we don't know how many orders there are per customer. So how can you efficiently return a set page size at a time?
service.js
function callToAnotherService(id) {
  return new Promise((resolve, reject) => {
    // calls the service, passing id, then resolve(result) or reject(error)
  });
}

exports.get = arrayList => Promise.all(arrayList.map(callToAnotherService));
In MySQL there is more than one way to achieve this.
The method you choose depends on many variables:
- your actual pagination method (whether you just want "previous" and "next" buttons, or want to offer pages 1...n, where n is the total number of matching records divided by your per-page record count);
- the database design, planned growth, and any partitioning and/or sharding;
- current and predicted database load;
- possible hard query limits (for example, if you have years' worth of records, you might require the end user to choose a reasonable time range for the query: last month, last 3 months, last year, and so on, so they don't overload the database with unrestricted and overly broad queries).
To paginate:
- Using simple previous and next buttons, you can use the simple LIMIT [START_AT_RECORD,] NUMBER_OF_RECORDS method, as Rick James proposed in his answer.
- Using (all) page numbers, you need to know the number of matching records, so based on your page size you'd know how many total pages there'd be.
- Using a mix of the two methods above. For example, you could present a few clickable page numbers (the previous/next 5, for example), as well as first and last links/buttons.
If you choose one of the last two options, you'd definitely need to know the total number of found records.
As I said above, there is more than one way to achieve the same goal. The choice must be made depending on the circumstances. Below I'm describing a couple of simpler ideas:
FIRST:
If your solution is session based, and you can persist the session, then you can use a temporary table into which you select only order_id (assuming it's the primary key in the orders table). Optionally, if you want to get the counts (or otherwise filter) per customer, you can also add customer_id as a second column next to order_id. Once you have populated the temporary table with this minimal data, you can easily count its rows and build your pagination based on that number.
Now, as you start displaying pages, you only select a subset of these rows (using the LIMIT method above) and join the corresponding records (the rest of the columns) from orders on the temporary table's order_id.
This has two benefits: 1) browsing records page by page is fast, as it no longer queries the (presumably) large orders table; 2) you're not running aggregate queries on the orders table, which, depending on the number of records and the design, could perform quite badly and also hurt other concurrent users.
Just bear in mind that the initial temporary-table creation is a somewhat slower query, and it would be slower still if you didn't restrict the temporary table to only the essential columns. Either way, it's really advisable to set a reasonable maximum hard limit (a number of temporary-table records, or some time range) for the initial query.
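A rough sketch of this first idea, assuming a Node.js client such as mysql2/promise and the orders table from the question (the temporary-table and column names are illustrative, and a TEMPORARY table only lives as long as the connection that created it):

const mysql = require('mysql2/promise');

async function preparePagination(conn, customerIds) {
  // Populate the temporary table with only the essential columns.
  await conn.query(
    `CREATE TEMPORARY TABLE tmp_order_ids (INDEX (order_id))
     SELECT id AS order_id, customer_id FROM orders WHERE customer_id IN (?)`,
    [customerIds]
  );
  // Counting the slim temporary table is cheap; total pages = ceil(total / pageSize).
  const [[{ total }]] = await conn.query('SELECT COUNT(*) AS total FROM tmp_order_ids');
  return total;
}

async function getPage(conn, page, pageSize) {
  // Join back to orders only for the rows on the requested page.
  const [rows] = await conn.query(
    `SELECT o.*
     FROM tmp_order_ids t
     JOIN orders o ON o.id = t.order_id
     ORDER BY t.order_id
     LIMIT ?, ?`,
    [page * pageSize, pageSize]
  );
  return rows;
}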
SECOND:
This is my favourite, as with this method I've been able to solve customers' huge database (or specific queries) performance issues on more than one occasion. And we're talking about going from 50-55 sec query time down to milliseconds. This method is especially immune to database scalability related slow downs.
The main idea is that you can pre-calculate all kinds of aggregates (be that cumulative sum of products, or number of orders per customer, etc...). For that you can create an additional table to hold the aggregates (count of orders per customer in your example).
And now comes the most important part:
You must use custom database triggers, namely in your case you can use ON INSERT and ON DELETE triggers, which would update the aggregates table and would increase/decrease the order count for the specific customer, depending on whether an order was added/deleted. Triggers can fire either before or after the triggering table change, depending on how you set them up.
Triggers have virtually no overhead on the database, as they fire quickly once per (inserted/deleted) record (unless you do something stupid and, for example, run a COUNT(...) query against some big table inside the trigger, which would completely defeat the purpose anyway). I usually go even more granular, keeping counts/sums per customer per month, etc.
When done properly it's virtually impossible for the aggregate counts to go out of sync with the actual records. If your application allows an order's customer_id to change, you might also need an ON UPDATE trigger, so that a customer_id change on an order is automatically reflected in the aggregates table.
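For illustration only, the aggregates table and triggers for the order-count example could look something like the following, expressed as SQL strings a Node.js setup script might run once (the order_counts table and trigger names are made up; run each statement separately):

const setupSql = [
  `CREATE TABLE IF NOT EXISTS order_counts (
     customer_id INT PRIMARY KEY,
     order_count INT NOT NULL DEFAULT 0
   )`,

  // Keep the aggregate in sync on every insert...
  `CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
   FOR EACH ROW
     INSERT INTO order_counts (customer_id, order_count)
     VALUES (NEW.customer_id, 1)
     ON DUPLICATE KEY UPDATE order_count = order_count + 1`,

  // ...and on every delete.
  `CREATE TRIGGER orders_after_delete AFTER DELETE ON orders
   FOR EACH ROW
     UPDATE order_counts
     SET order_count = order_count - 1
     WHERE customer_id = OLD.customer_id`,
];

// Pagination totals then come from the tiny aggregates table:
// SELECT order_count FROM order_counts WHERE customer_id = ?;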
Of course there are many more ways you can go with this. But these two above have proven to be great. It's all depending on the circumstances...
I'm hoping that my somewhat abstract answer can lead you on the right path, as I could only answer based on the little information your question presented...
In MySQL, use ORDER BY ... LIMIT 30, 10 to skip 30 rows and grab 10.
Better yet, remember where you left off (let's say $left_off), then do:
WHERE id > $left_off
ORDER BY id
LIMIT 10
The last row you grab is the new 'left_off'.
Even better is the same query, but with LIMIT 11. Then you can show 10, but also discover whether there are more (by the existence of an 11th row returned from the SELECT).
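A hedged sketch of that "remember where you left off" pattern from Node.js, again assuming mysql2/promise and the orders table from the question (the extra row from LIMIT 11 is only used to detect whether a next page exists):

const mysql = require('mysql2/promise');

async function getOrdersPage(pool, customerId, lastSeenId = 0, pageSize = 10) {
  // Ask for one row more than the page size to learn whether more rows exist.
  const [rows] = await pool.query(
    'SELECT * FROM orders WHERE customer_id = ? AND id > ? ORDER BY id LIMIT ?',
    [customerId, lastSeenId, pageSize + 1]
  );
  const hasMore = rows.length > pageSize;
  const page = rows.slice(0, pageSize);
  return {
    rows: page,
    hasMore,
    // Pass this back as lastSeenId to fetch the next page.
    leftOff: page.length ? page[page.length - 1].id : lastSeenId,
  };
}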
I have a very simple dataset, consisting of:
- a date (object) for every day over multiple years, and
- a value that belongs to each date.
I need to look at that data at the level of days, months and years (and rolling time intervals, e.g. 30 days). The data is obtained from a CSV file.
What would you suggest is the best way to (conceptually) prepare the data? Should I nest it immediately when reading it (year > month > day > value) and also calculate everything I need to plot (like averages and the like), saving that alongside my data (e.g. data.year.monthXY.avg = z)?
Or should I leave the data as is (at its most basic form, a day and its value) and do all the calculations later in the script?
I have little experience in handling data and in D3's best practices, and would appreciate any advice you have for me on that topic.
I feel like there is no need to prep the data, because all the info is already contained in the date object and the value, and all the calculations can be done on the fly. (I have to admit, though, that I don't know exactly how to tell D3 to, say, compute the average of a month without first creating a new dataset that includes just that month. That might be another question, depending on how you suggest I should organize my data.) On the other hand, nesting the data where possible seems like a good idea, in particular for making use of D3's strengths in that regard and its benefits for other/future visual representations.
Background
I have a huge CSV file that has several million rows. Each row has a timestamp I can use to order it.
Naive approach
So, my first approach was obviously to read the entire thing into memory and then sort it. As you may guess, that didn't work out so well....
Naive approach v2
My second try was to loosely follow the idea behind MapReduce.
I would slice this huge file into several parts and sort each one. Then I would combine all the parts into the final file.
The issue here is that part B may have a message that should be in part A. So in the end, even though each part is sorted, I cannot guarantee the order of the final file....
Objective
My objective is to create a function that, given this huge unordered CSV file, produces an ordered CSV file with the same information.
Question
What are the popular solutions/algorithms for ordering data sets this big?
"What are the popular solutions/algorithms for ordering data sets this big?"
Since you've already concluded that the data is too large to sort/manipulate in the memory you have available, the popular solution is a database which will build disk-based structures for managing and sorting more data than can be in memory.
You can either build your own disk-based scheme or you can grab one that is already fully developed, optimized and maintained (e.g. a popular database). The "popular" solution that you asked about would be to use a database for managing/sorting large data sets. That's exactly what they're built for.
Database
You could set up a table that was indexed by your sort key, insert all the records into the database, then create a cursor sorted by your key and iterate the cursor, writing the now sorted records to your new file one at a time. Then, delete the database when done.
Chunked Memory Sort, Manual Merge
Alternatively, you could do your chunked sort, where you break the data into smaller pieces that fit in memory, sort each piece, and write each sorted block to disk. Then you do a merge of all the blocks: read the next record from each block into memory, find the lowest one across all the blocks, write it out to your final output file, read the next record from that block, and repeat. Using this scheme, the merge only ever has to hold N records in memory at a time, where N is the number of sorted chunks you have (probably fewer than in the original chunked block sort).
As juvian mentioned, here's an overview of how an "external sort" like this could work: https://en.wikipedia.org/wiki/External_sorting.
One key aspect of the chunked memory sort is determining how big to make the chunks. There are a number of strategies. The simplest may be to just decide how many records you can reliably fit and sort in memory based on a few simple tests or even just a guess that you're sure is safe (picking a smaller number to process at a time just means you will split the data across more files). Then, just read that many records into memory, sort them, write them out to a known filename. Repeat that process until you have read all the records and then are now all in temp files with known filenames on the disk.
Then, open each file, read the first record from each one, find the lowest record of each that you read in, write it out to your final file, read the next record from that file and repeat the process. When you get to the end of a file, just remove it from the list of data you're comparing since it's now done. When there is no more data, you're done.
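Here is a rough Node.js sketch of that chunk-then-merge idea. The chunk size, the timestamp being the first CSV column, and ISO-8601 timestamps (so plain string comparison sorts correctly) are all assumptions; a real implementation would also care about a header row and write-stream backpressure.

const fs = require('fs');
const os = require('os');
const path = require('path');
const readline = require('readline');

const CHUNK_SIZE = 100000; // lines per in-memory chunk; tune to your memory

async function splitIntoSortedChunks(inputPath) {
  const rl = readline.createInterface({ input: fs.createReadStream(inputPath), crlfDelay: Infinity });
  const chunkFiles = [];
  let lines = [];
  const flush = () => {
    // Sort this chunk in memory by the timestamp column and write it out.
    lines.sort((a, b) => {
      const ka = a.split(',')[0], kb = b.split(',')[0];
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    });
    const file = path.join(os.tmpdir(), `chunk-${chunkFiles.length}.csv`);
    fs.writeFileSync(file, lines.join('\n') + '\n');
    chunkFiles.push(file);
    lines = [];
  };
  for await (const line of rl) {
    lines.push(line);
    if (lines.length >= CHUNK_SIZE) flush();
  }
  if (lines.length > 0) flush();
  return chunkFiles;
}

async function mergeChunks(chunkFiles, outputPath) {
  const out = fs.createWriteStream(outputPath);
  // One async line iterator per sorted chunk, plus the current "head" line of each.
  const iterators = chunkFiles.map((file) =>
    readline.createInterface({ input: fs.createReadStream(file), crlfDelay: Infinity })[Symbol.asyncIterator]()
  );
  const heads = await Promise.all(iterators.map((it) => it.next()));
  while (heads.some((h) => !h.done)) {
    // Pick the chunk whose current line has the lowest timestamp.
    let min = -1;
    for (let i = 0; i < heads.length; i++) {
      if (heads[i].done) continue;
      if (min === -1 || heads[i].value.split(',')[0] < heads[min].value.split(',')[0]) min = i;
    }
    out.write(heads[min].value + '\n');
    heads[min] = await iterators[min].next(); // advance only that chunk
  }
  out.end();
}

// usage: splitIntoSortedChunks('huge.csv').then((files) => mergeChunks(files, 'sorted.csv'));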
Sort Keys only in Memory
If all the sort keys themselves would fit in memory, but not the associated data, then you could make and sort your own index. There are many different ways to do that, but here's one scheme.
Read through the entire original data set, capturing two things in memory for every record: the sort key and the file offset in the original file where that record is stored. Then, once you have all the sort keys in memory, sort them. Finally, iterate through the sorted keys one by one, seeking to the right spot in the file, reading that record, writing it to the output file, and advancing to the next key, repeating until the data for every key has been written in order.
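A sketch of that index-in-memory idea, assuming plain '\n' line endings and the timestamp in the first column (both assumptions; CRLF files would throw the offsets off):

const fs = require('fs');
const readline = require('readline');

async function sortViaKeyIndex(inputPath, outputPath) {
  // Pass 1: remember (sort key, byte offset, byte length) for every line.
  const index = [];
  const rl = readline.createInterface({ input: fs.createReadStream(inputPath) });
  let offset = 0;
  for await (const line of rl) {
    const byteLength = Buffer.byteLength(line, 'utf8');
    index.push({ key: line.split(',')[0], offset, byteLength });
    offset += byteLength + 1;                        // +1 for the assumed '\n'
  }

  // Sort only the keys (plus offsets) in memory.
  index.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

  // Pass 2: seek to each record in key order and append it to the output.
  const fd = fs.openSync(inputPath, 'r');
  const out = fs.createWriteStream(outputPath);
  for (const { offset: pos, byteLength } of index) {
    const buf = Buffer.alloc(byteLength);
    fs.readSync(fd, buf, 0, byteLength, pos);
    out.write(buf.toString('utf8') + '\n');
  }
  fs.closeSync(fd);
  out.end();
}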
BTree Key Sort
If all the sort keys won't fit in memory, then you can get a disk-based BTree library that will let you sort things larger than can be in memory. You'd use the same scheme as above, but you'd be putting the sort key and file offset into a BTree.
Of course, it's only one step further to put the actual data itself from the file into the BTree and then you have a database.
I would read the entire file row by row and output each line into a temporary folder, grouping lines into files by a reasonable time interval (whether the interval should be a year, a day, an hour, etc. is for you to decide based on your data). The temporary folder would then contain an individual file for each interval (for example, for a day-level split that would be 2018-05-20.tmp, 2018-05-21.tmp, 2018-05-22.tmp, etc.). Now we can read the files in order, sort each one in memory, and output them into the target sorted file.
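A minimal Node.js sketch of that bucketing approach, assuming an ISO timestamp in the first CSV column and day-level buckets (opening the bucket file per line keeps the sketch short; a real version would cache open write streams):

const fs = require('fs');
const path = require('path');
const readline = require('readline');

async function bucketSort(inputPath, tmpDir, outputPath) {
  fs.mkdirSync(tmpDir, { recursive: true });

  // Pass 1: append every line to its day's .tmp file.
  const rl = readline.createInterface({ input: fs.createReadStream(inputPath) });
  for await (const line of rl) {
    const day = line.split(',')[0].slice(0, 10);     // e.g. "2018-05-20"
    fs.appendFileSync(path.join(tmpDir, `${day}.tmp`), line + '\n');
  }

  // Pass 2: read the day files in order, sort each one in memory, concatenate.
  const out = fs.createWriteStream(outputPath);
  for (const file of fs.readdirSync(tmpDir).sort()) {
    const lines = fs.readFileSync(path.join(tmpDir, file), 'utf8').split('\n').filter(Boolean);
    lines.sort();                                    // each bucket is small enough for memory
    out.write(lines.join('\n') + '\n');
  }
  out.end();
}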
I'm designing a MongoDB database that works with a script that periodically polls a resource and gets back a response, which is stored in the database. Right now my database has one collection with four fields: id, name, timestamp and data.
I need to be able to find out which names had changes in the data field between script runs, and which did not.
In pseudocode,
if (data[name][timestamp] == data[name][timestamp + 1])  // data has not changed
    store data in collection 1
else  // data has changed between script runs for this name
    store data in collection 2
Is there a query that can do this without iterating and running JavaScript over each item in the collection? There are millions of documents, so this would be pretty slow.
Should I create a new collection named timestamp for every time the script runs? Would that make it faster/more organized? Is there a better schema that could be used?
The script runs once a day so I won't run into a namespace limitation any time soon.
OK, this is a neat question, because the short answer is basically: you will have to iterate and run JavaScript over each item.
The part where this gets "neat" is that this isn't really different from what an SQL solution would have to do. You're basically joining a table to itself on matching names and consecutive timestamps. Even if the relational DB can handle such a beast, it's definitely not going to be fast with millions of entries.
So the truth is, you're doing this the right way. Here are the extra details I would use to make this cleaner.
1. Ensure that you have an index on Name/Timestamp.
2. Run a db.mycollection.find().forEach() across the data set.
3. For each entry: a) perform the comparison; b) save appropriately; c) update a flag indicating that this record has been processed.
4. On future runs you should be able to add a query to your find: db.mycollection.find({flag: {$exists: false}}).forEach()
5. Use db.eval() to help with speed.
The reason for the "Name/Timestamp" index is that you're going to be looking up each "successor" by "Name/Timestamp", so you want to be quick here.
The reason for the "processed" flag is that you should never have to re-run the same item. If given timestamp 'n' you find 'n+1', then that's the only 'n+1' you're going to have.
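Putting those pieces together, a rough mongo-shell sketch might look like this (collection1/collection2 follow the question's pseudocode, the flag field follows the steps above, and the === comparison assumes data is a scalar; documents would need a deep comparison):

db.mycollection.createIndex({ name: 1, timestamp: 1 });

db.mycollection.find({ flag: { $exists: false } }).forEach(function (doc) {
  // a) look up this name's next (successor) record by timestamp
  var successor = db.mycollection
    .find({ name: doc.name, timestamp: { $gt: doc.timestamp } })
    .sort({ timestamp: 1 })
    .limit(1);
  if (!successor.hasNext()) return;        // no newer run yet; leave it unflagged

  var next = successor.next();

  // b) save into the appropriate collection
  if (next.data === doc.data) {
    db.collection1.insert(doc);            // data has not changed
  } else {
    db.collection2.insert(doc);            // data has changed between script runs
  }

  // c) flag this record so it is never re-processed
  db.mycollection.update({ _id: doc._id }, { $set: { flag: true } });
});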
Honestly, if you're only running this once a day, it's quite likely that the speed will be just fine, especially if you only run it on new records. Just assume that it's going to take several minutes.