Due to the reasons outlined in this question I am building my own client side search engine rather than using the ydn-full-text library which is based on fullproof. What it boils down to is that fullproof spawns "too freaking many records" in the order of 300.000 records whilst (after stemming) there are only about 7700 unique words. So my 'theory' is that fullproof is based on traditional assumptions which only apply to the server side:
Huge indices are fine
Processor power is expensive
(and the assumption of dealing with longer records which is just applicable to my case as my records are on average 24 words only1)
Whereas on the client side:
Huge indices take ages to populate
Processing power is still limited, but relatively cheaper than on the server side
Based on these assumptions I started of with an elementary inverted index (giving just 7700 records as IndexedDB is a document/nosql database). This inverted index has been stemmed using the Lancaster stemmer (most aggressive one of the two or three popular ones) and during a search I would retrieve the index for each of the words, assign a score based on overlap of the different indices and on similarity of typed word vs original (Jaro-Winkler distance).
Problem of this approach:
Combination of "popular_word + popular_word" is extremely expensive
So, finally getting to my question: How can I alleviate the above problem with a minimal growth of the index? I do understand that my approach will be CPU intensive, but as a traditional full text search index seems unusably big this seems to be the only reasonable road to go down on. (Pointing me to good resources or works is also appreciated)
1 This is a more or less artificial splitting of unstructured texts into small segments, however this artificial splitting is standardized in the relevant field so has been used here as well. I have not studied the effect on the index size of keeping these 'snippets' together and throwing huge chunks of texts at fullproof. I assume that this would not make a huge difference, but if I am mistaken then please do point this out.
This is a great question, thanks for bringing some quality to the IndexedDB tag.
While this answer isn't quite production ready, I wanted to let you know that if you launch Chrome with --enable-experimental-web-platform-features then there should be a couple features available that might help you achieve what you're looking to do.
IDBObjectStore.openKeyCursor() - value-free cursors, in case you can get away with the stem only
IDBCursor.continuePrimaryKey(key, primaryKey) - allows you to skip over items with the same key
I was informed of these via an IDB developer on the Chrome team and while I've yet to experiment with them myself this seems like the perfect use case.
My thought is that if you approach this problem with two different indexes on the same column, you might be able to get that join-like behavior you're looking for without bloating your stores with gratuitous indexes.
While consecutive writes are pretty terrible in IDB, reads are great. Good performance across 7700 entries should be quite tenable.
Related
I'm working on a MongoDB database and so far have stored some information as Numbers instead of Strings because I assumed that would be more efficient. For example, I store countries following ISO 3166-1 numeric and sex following ISO/IEC 5218. But so far I have not found a similar standard for languages, ISO 639 does not appear to have a matching list of numeric codes.
What would be the right way to do this? Should I just use the String codes?
Thanks!
If you're a fan of the numbers, you can use country calling codes, although they "only" represent the ITU members (193 countries according to Wikipedia). But hey, they have Somalia and Palestine, so that's a good hint about how global this is.
However, storing everything in an encoded format (numbers here) implies a decoding step on the fly when any piece of data is requested (with translation tables stored in RAM instead of DB's ROM). Probably on the server whose CPU is precious, but you might have deported the issue on the client, overworking the precious, time-critical server-client link in the process.
So, back in the 90s, when a 40MB HDD was expensive, that might have been interesting. Today, the cost of storing data vs. the cost of processing data is not on the same side of 1... Not counting the time it takes you to think and implement the transformations. All being said "IMHO", I think this level of efficiency actually kills efficiency. ;)
EDIT: Oops, just realized I misthought (does that verb even exist?) the country/language issue. Countries you have sorted out already, my bad. I know no numbered list of languages. However, the second part of the post might still be relevant...
If you are after raw performance and/or want to achieve really small data sizes, I would suggest you use either the three-letter (higher granularity) or the two-letter (lower granularity) codes from IOC ISO-639-1/2.
To my knowledge, there's no helper or anything for this standard built into any programming language that I know, so you'd need to build your own translator (code<->full name) which, however, should be trivial.
And as others already mentioned, you have to assess the cost involved with this (e.g. not being able to simply look at the data and understand it right away anymore) for yourself. I personally do recommend keeping data sizes small since BSON parsing and string operations are horribly expensive compared to dealing with numbers (or shorter strings for that matter). When dealing with small data sets, this won't make a noticeable difference. If, however, you need to churn through millions of documents or more optimizations like this can become mission critical.
I'm working about audio but I'm a newbie in this area. I would like to matching sound from microphone to my source audio(just only 1 sound) like Coke Ads from Shazam. Example Video (0.45 minute) However, I want to make it on website by JavaScript. Thank you.
Building something similar to the backend of Shazam is not an easy task. We need to:
Acquire audio from the user's microphone (easy)
Compare it to the source and identify a match (hmm... how do... )
How can we perform each step?
Aquire Audio
This one is a definite no biggy. We can use the Web Audio API for this. You can google around for good tutorials on how to use it. This link provides some good fundametal knowledge that you may want to understand when using it.
Compare Samples to Audio Source File
Clearly this piece is going to be an algorithmic challenge in a project like this. There are probably various ways to approach this part, and not enough time to describe them all here, but one feasible technique (which happens to be what Shazam actually uses), and which is also described in greater detail here, is to create and compare against a sort of fingerprint for smaller pieces of your source material, which you can generate using FFT analysis.
This works as follows:
Look at small sections of a sample no more than a few seconds long (note that this is done using a sliding window, not discrete partitioning) at a time
Calculate the Fourier Transform of the audio selection. This decomposes our selection into many signals of different frequencies. We can analyze the frequency domain of our sample to draw useful conclusions about what we are hearing.
Create a fingerprint for the selection by identifying critical values in the FFT, such as peak frequencies or magnitudes
If you want to be able to match multiple samples like Shazam does, you should maintain a dictionary of fingerprints, but since you only need to match one source material, you can just maintain them in a list. Since your keys are going to be an array of numerical values, I propose that another possible data structure to quickly query your dataset would be a k-d tree. I don't think Shazam uses one, but the more I think about it, the closer their system seems to an n-dimensional nearest neighbor search, if you can keep the amount of critical points consistent. For now though, just keep it simple, use a list.
Now we have a database of fingerprints primed and ready for use. We need to compare them against our microphone input now.
Sample our microphone input in small segments with a sliding window, the same way we did our sources.
For each segment, calculate the fingerprint, and see if it matches close to any from storage. You can look for a partial match here and there are lots of tweaks and optimizations you could try.
This is going to be a noisy and inaccurate signal so don't expect every segment to get a match. If lots of them are getting a match (you will have to figure out what lots means experimentally), then assume you have one. If there are relatively few matches, then figure you don't.
Conclusions
This is not going to be an super easy project to do well. The amount of tuning and optimization required will prove to be a challenge. Some microphones are inaccurate, and most environments have other sounds, and all of that will mess with your results, but it's also probably not as bad as it sounds. I mean, this is a system that from the outside seems unapproachably complex, and we just broke it down into some relatively simple steps.
Also as a final note, you mention Javascript several times in your post, and you may notice that I mentioned it zero times up until now in my answer, and that's because language of implementation is not an important factor. This system is complex enough that the hardest pieces to the puzzle are going to be the ones you solve on paper, so you don't need to think in terms of "how can I do X in Y", just figure out an algorithm for X, and the Y should come naturally.
I'm a game developer and I have been trying different structures to see what would give me the best results but it seems that JavaScript is mostly unaffected by data locality, meaning that processing times are within margin of error and memory usage is mostly as expected.
Does data locality matter at all in JavaScript or am I just wasting my time trying to improve certain structures?
Is it due to the sandboxed nature of the execution environment (i.e. it would matter outside the browser)?
This article is an interesting read. It says, in summary, low level locality optimizations are a lost cause, but big data structures, big arrays of data structures more so, should be accessed in linear order.
Coming from a C background, I tend to access a grid held in a 1D array in a Y-outer/X-inner pattern anyway, but when I have done otherwise, I can tell you from experience, it starts to lag on large grids.
So, to attempt to answer your question, and in part theorize: to the degree that this holds true, the classic structure-of-arrays rather than array-of-structures mentality might very well be performant for sufficiently large arrays of large structures with limited variable access in a given case. But I would definitely test both macro-structures if I were implementing a critical feature :)
I'm trying to create a tower defence game in Javascript.
It's all going well apart from the pathfinding..
I'm using the astar code from this website: http://www.briangrinstead.com/blog/astar-search-algorithm-in-javascript which uses a binary heap (which I believe is fairly optimal)
The problem i'm having is I want to allow people to block the path of the "attackers". This means that each "attacker" needs to be able to find its way to the exit on its own (as someone could just cut off a single "attacker" and it would need to find its own way to the exit). Now 5/6 attackers can pathfind at any one time with no issue. But say the path is blocked for 10+ attackers, all 10 of them will need to fire its pathfinding script at the same time which just drops the FPS to about 1/2 per sec.
This must be a common problem for anyone who has a lot of entities pathfinding at anyone time, so I imagine there must be a better way than my approach.
So my question is: What is the best way to implement mass pathfinding algorithm to multiple "bots" in the most efficient way.
Thanks,
James
Use Anti-objects, this is the only way to get cheap pathfinding, afaik :
http://www.cs.colorado.edu/~ralex/papers/PDF/OOPSLA06antiobjects.pdf
Anti-object basically mean that instead of bots having individual ai, you will have one "swarm ai", which is bound to your game map.
p.s.: Here is another link about pathfinding in general (possibly the best online reference available):
http://theory.stanford.edu/~amitp/GameProgramming/index.html
Just cache the result.
Store the path as the value in a hash table (object), give each node a UUID, concatenate the UUIDs to form a unique hash table key and insert the path into it.
When you retrieve the path back out of the hash table, walk the path, and see if it's still valid, if not, recalculate and insert the new one back in.
There are many optimization that you can do :)
Like c69 said swarm AI or hive mind come to mind :P
I have a ton of associated data revolving around a school, students, teachers, classes, locations, etc etc
I am faced with a challenge put fourth by my client; they want to have reports on everything. This means they want the ability to cross reference data points every which way and i think i'm just short of writing a pretty query builder. :/
This stack question is aimed at soliciting opinions on how to structure a reporting interface beautifully.
Any suggestions, references, examples, jQ plugins etc would be amazing.
Thank you!
I find the Trac's query builder rather acceptable for what it is meant to do.
But most probably your clients don't want everything, they are just too lazy to think about what they want now. You could help them decide by analyzing the use cases together, and come up at least with a few kinds of queries with just a few parts customizable -- in the worst case -- or just a few canned queries they really need -- in the best.
You should probably schedule a meeting with your client to determine what they need to do. This does not mean having them speculate about how great it would be if your software could do everything, was ultra-flexible yet totally easy to use, etc... but sit down and find out what they are doing right now. I'm saying this because that "oh, I'd like to be able to cross-reference everything with everything else!" sounds a bit too familiar, and might end in an ugly case of inner-platform effect.
I've found that rapid paper prototyping with the client is a great way to explore possible ideas, as it shifts their attention away from "can you make this button yellow?" issues to The Big Picture, to let them make up their minds what they actually need. Plus, it's ridiculously inexpensive to do.
Apart from that, for inspiration, there are UI pattern languages that address handling potentially large amounts of interconnected data. What's great about these is that you will often be able to use these patterns to communicate ideas to your client, since a well-structured pattern language will guide a non-expert through domain-relevant design decisions in increasing detail.
First, I can only support the other voices: work out with the clients what they actually need. A good argument is "I can do, but it will cost you X thousand dollars, every user will need Y hours of training, and you'll need a $100.000K/year developer to maintain it."
(Unfortunately, most clients at that point prefer to pick the guy who says "yes, can do cheaper!")
Only second, and only if the client says "yes we do need everything":
What works well is a list/grid view progressive filtering. Instead of buildign the SQL query, then running it, let the user directly work with the results: e.g. right clicking a cell, and selecting "limit to this value" could add a WHERE colN = <constant> constraint.
You can generate suggestions for columns from SELECT DISTINCT calls - if it returns less than, say, 20 values, you can offer checkboxes for a OR combination of possible values.
It would be interesting to discuss en elegant UI for the sea of remaining problems: OR'ed conditions across multiple columns, ordering by more than one column, grouping, ...