What data structure is best suited for quickly searchable text data?

When looking at products like DnD Insider and the Kindle app, users can quickly search for matching text strings in a large structure of text data. If I were to make a web application that allowed users to quickly search a "rulebook" (or similar text) for a matching entry and pull up the data to read, how should I organize the data?
I don't think it's a good idea to put all the data into memory. But if I stored it in some kind of database, what would be a good way to search the database and retrieve the appropriate matching entry?
So far, I believe I'm going to use the Boyer-Moore algorithm to actually do the searching. I can put the various sections of rule-text into different database entries. The user search will prioritize searching section titles over section body text. Since the text will be static and not user-editable, perhaps an array to store every word would work?

Typically some kind of inverted index is used for this purpose: https://en.wikipedia.org/wiki/Inverted_index
Basically this is a map from each word to a list of the places in which it appears. Each "place" could be a (document ID, occurrence count) pair, or something more precise if you want to support phrase searching or give more weight to matches in titles.
Search results are usually ranked with some variant of tf-idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
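To make that concrete, here is a minimal sketch of an inverted index with a tf-idf style ranking. The section data, the TITLE_WEIGHT boost, and the function names are all made up for illustration, and the scoring is only one simple tf-idf variant, not a definitive implementation:

```javascript
// Hypothetical rulebook sections; ids, titles, and text are invented for this sketch.
const sections = [
  { id: 1, title: "Grappling", body: "A grapple check replaces one melee attack." },
  { id: 2, title: "Melee attack", body: "Roll a melee attack against the target." },
  { id: 3, title: "Resting", body: "A long rest restores hit points." },
];

const TITLE_WEIGHT = 3; // assumed boost so title matches outrank body matches

// Lowercase the text and split it into alphanumeric words.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Build the inverted index: word -> Map(sectionId -> weighted occurrence count).
function buildIndex(docs) {
  const index = new Map();
  const add = (word, id, weight) => {
    if (!index.has(word)) index.set(word, new Map());
    const postings = index.get(word);
    postings.set(id, (postings.get(id) || 0) + weight);
  };
  for (const doc of docs) {
    for (const word of tokenize(doc.title)) add(word, doc.id, TITLE_WEIGHT);
    for (const word of tokenize(doc.body)) add(word, doc.id, 1);
  }
  return index;
}

// Score each section by summing tf * idf over the query words, best match first.
function search(index, docCount, query) {
  const scores = new Map();
  for (const word of tokenize(query)) {
    const postings = index.get(word);
    if (!postings) continue;
    const idf = Math.log(1 + docCount / postings.size); // rarer words count more
    for (const [id, tf] of postings) {
      scores.set(id, (scores.get(id) || 0) + tf * idf);
    }
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

const index = buildIndex(sections);
const results = search(index, sections.length, "melee attack");
console.log(results); // section ids, best match first
```

Because the text is static, the index can be built once ahead of time and stored; only the postings for the words in the query need to be loaded at search time.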

Related

How to create a search results web page and store a large list of items without an SQL database

I made a web page for selling items online. The website already has a lot of products and will probably have several thousand in the near future. It contains a search bar, and I want to create a search results page, but I am not sure what the best way of doing this is. I thought about using JavaScript to loop through the list of all the products until it finds a match, but this process is probably too slow. My question is: what is the best way to store a large list of items, and what is the best way to find matches in the list for the search query? I know that many people use SQL databases for storing lists, but is that method any better than simply storing everything in a JavaScript list, and why? Also, how do I find a match in the list? Can I use JavaScript, or is it necessary or better to use a language like PHP?

Cloudant search query not returning expected result

I'm using Cloudant and I created a search index. However, I'd like the index to return the term I'm querying. I mean that I want to get the data that has a specific date that I chose.
1.) I have created a Cloudant database and loaded it with some data.
2.) I have created a search index.
3.) Node set-up
4.) And the function content.
I expected to see the whole data for this exact "ts" value. But I got this:
I have been struggling with this for a few days and can't seem to get it working. I am sure it's just a newbie issue.
Many thanks in advance.

A search index uses the Apache Lucene library for text pre-processing and indexing. It's designed for chopping up sentences into words and words into stemmed tokens for "free text" search, i.e. finding the documents that best match a multi-word phrase. You can optionally choose the type of stemming that is performed by specifying the "analyzer" when creating the index; the analyzer is the text pre-processing algorithm used to break up the strings.
If you want to keep the string intact, then choose the "keyword" analyzer:
You may also want to investigate using Cloudant Query, whose type=json indexes would not pre-process your timestamp string.
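As a rough sketch of the "keyword" analyzer option, here is what such an index definition might look like when written as a JavaScript object. The design document name, the index name byTs, and the sample timestamp are assumptions made for illustration and are not taken from the question:

```javascript
// Hypothetical design document; "_design/search" and "byTs" are assumed names.
const designDoc = {
  _id: "_design/search",
  indexes: {
    byTs: {
      // "keyword" treats the whole field value as a single token (no splitting or stemming).
      analyzer: "keyword",
      // Cloudant stores the index function as a string inside the design document.
      index: "function (doc) { if (doc.ts) { index('ts', doc.ts); } }",
    },
  },
};

// Build a Lucene-style query for one exact timestamp. The value is quoted so
// that characters such as ":" inside it are not read as query syntax.
function exactTsQuery(ts) {
  return 'ts:"' + ts.replace(/(["\\])/g, "\\$1") + '"';
}

const body = JSON.stringify(designDoc); // what would be saved to the database
const q = exactTsQuery("2019-01-01T10:00:00Z"); // assumed timestamp format
console.log(q);
```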

Denormalization of data in MongoDB

I am learning MongoDB and I have a question regarding duplication of data. In the SQL world you try to normalize the data. For instance, I have a table with categories and another one with products. Each product may belong to many categories, so there is a join between these tables.
However, am I right that in MongoDB you don't think like this? Does each product instead have one or more embedded category documents? Is that just the way it is? You don't care if the data is duplicated?

"In the SQL world you try to normalize the data"
Not always; normalising to the point of death inflicts performance hits. But it is true that I personally do not apply the same normalisation to MongoDB as I do to SQL.
If you are aware of the normal forms ( http://en.wikipedia.org/wiki/Database_normalization ), I like to think of MongoDB as going to 1NF and then back down to denormalised again.
"You don't care if the data is duplicated?"
Oh yes we do. Updating is a pain if the data is duplicated wrongly.
Let me give you an example: category and product would be two separate entities, there is no denying it. These two entities are normalised (the repeating data of product has been separated from category). Another way of thinking of it is: are all products only going to exist in one category?
So for top-level entities, as you can see, much the same rules apply, and 1NF is easily applied to MongoDB.
On the front of duplication, you would of course not want to store each product separately within each category (I answered no to the question above), so you would naturally want to separate categories and products.
In SQL you would normally have a many-to-many relationship here, with a normalised middle table. This is where de-normalisation can come in. Since each category has its own list of products, you can de-normalise the many-to-many table into the category row as a list (or the other way around, into the product row). This will not generate duplication, since that list is (more than likely) unique to that category. It does mean that the category or product houses a list of the _ids of the related rows instead of the objects themselves.
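A small sketch of that shape, using plain JavaScript objects in place of real collections. The field name productIds and all the _id values are made up for illustration:

```javascript
// Hypothetical documents; each product is stored exactly once.
const products = [
  { _id: "p1", name: "Kettle" },
  { _id: "p2", name: "Toaster" },
  { _id: "p3", name: "Desk lamp" },
];

// The join table is folded into each category as a list of product _ids.
const categories = [
  { _id: "c1", name: "Kitchen", productIds: ["p1", "p2"] },
  { _id: "c2", name: "Gifts", productIds: ["p2", "p3"] },
];

// Resolve a category's products with a second lookup instead of a JOIN.
function productsIn(category, allProducts) {
  const byId = new Map(allProducts.map((p) => [p._id, p]));
  return category.productIds.map((id) => byId.get(id)).filter(Boolean);
}

const kitchen = productsIn(categories[0], products).map((p) => p.name);
console.log(kitchen);
```

Renaming a product touches one document only, because the categories hold references rather than copies.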
There are times when duplication is necessary, mainly for optimisation or as a workaround for not having JOINs; this rule also applies to SQL, if you have ever done a big enough site.
Typical uses of duplication are aggregated stat fields, such as the share and comment counts of a Facebook post; maybe even the 5 latest comments of that post would also be duplicated onto the post row.
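Here is a minimal sketch of that kind of deliberate duplication, again with plain JavaScript objects. The field names (commentCount, latestComments) and the helper addComment are hypothetical:

```javascript
const MAX_LATEST = 5; // how many recent comments are copied onto the post

// Hypothetical post document carrying duplicated, pre-aggregated fields.
const post = { _id: "post1", text: "Hello", commentCount: 0, latestComments: [] };

// The full comment history lives in its own collection.
const comments = [];

// Every write updates both places, which is the price of the duplication.
function addComment(post, comments, comment) {
  comments.push({ ...comment, postId: post._id });
  post.commentCount += 1;
  post.latestComments = [comment, ...post.latestComments].slice(0, MAX_LATEST);
}

for (let i = 1; i <= 7; i++) {
  addComment(post, comments, { author: "user" + i, text: "comment " + i });
}
console.log(post.commentCount, post.latestComments.length);
```

Rendering the post now needs a single read, at the cost of keeping the copies in sync on every write.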
So it is not a case of ignoring schema design but more of tuning it for MongoDB's characteristics. Normally, if you do that, you will find that you naturally design a good schema.
As an added reference you can refer here: http://docs.mongodb.org/manual/core/data-modeling

Building a Quote Database without Server Side

I want to build a searchable database of quotes. The idea is that I would type a keyword into a search box and get back the quotes with that keyword. I would assign keywords to the quotes. I am using a hosted CMS (Adobe Business Catalyst) and cannot use server-side scripting. What is the best way to go about this? Is it possible to do this with JavaScript and jQuery?

You could put all of the quotes on to the page statically in a JSON object, or even just as HTML elements, ready to be shown, but hidden. Then search through them with your keywords, and un-hide the ones that are relevant to the search.
Depending on how many quotes you have, the page could become large and take a long time to load, but that's just something to keep in mind for performance.
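A minimal sketch of the JSON approach, assuming each quote carries a keywords array. The quote texts here are placeholders and findQuotes is a made-up helper name:

```javascript
// Placeholder data; in practice this would be the JSON object embedded in the page.
const quotes = [
  { text: "Placeholder quote about courage.", keywords: ["courage", "fear"] },
  { text: "Placeholder quote about patience.", keywords: ["patience", "time"] },
  { text: "Placeholder quote about courage and time.", keywords: ["courage", "time"] },
];

// Return the quotes whose keywords match every term typed into the search box.
function findQuotes(allQuotes, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];
  return allQuotes.filter((quote) =>
    terms.every((term) =>
      quote.keywords.some((keyword) => keyword.toLowerCase().includes(term))
    )
  );
}

const matches = findQuotes(quotes, "courage time");
console.log(matches.map((quote) => quote.text));
```

In the page itself, the matching quotes would then be un-hidden (or rendered) and the rest left hidden; that DOM step is left out so the sketch runs outside a browser too.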

The way I would go about doing this is to build a quotes web app. Then construct a web app search form and only include a text box for searching by keyword. BC will automatically search the description of the item or the custom field in your web app, whichever you choose.
This WILL take less time than creating JSON objects or parsing HTML code. It uses server-side logic and only returns the results that match your criteria to the browser, so it will have better performance.
The only downside is that the results page will not be SEO-friendly. In the case where you want to create a pre-defined search, I would load the results of the search into your static page with Ajax.

After a bit more research I discovered that Business Catalyst will allow you to build "Web Apps". These can operate as a database, and you can incorporate a nice search into the web app that will allow you to search for keywords etc.
Other than that, I believe you would be required to follow @ctcherry's method.

Faster string search in a MySQL database

I have a large DB with more than 20,000 rows. I have two tables, songs and albums:
table songs contains songid, albumid, songname, and table albums contains albumid, albumname
Currently, when a user searches for a song, I give results instantly as soon as he starts typing, just like Google Instant.
What I am doing is this: every time the user types, I send the query string to my backend PHP file, where I execute the query against my DB like this:
SELECT * FROM songs, albums WHERE songs.albumid = albums.albumid AND songs.songname LIKE '%{$query_string}%';
But it's very inefficient to run a DB query on every keystroke, and it's also not scalable, as my DB is growing every day.
Therefore I want the same feature, but faster, more efficient and scalable.
Also, I don't want it to be exact pattern matching. For example:
if the user types "Rihana" instead of "Rihanna", it should still be able to give the results related to Rihanna.
Thanks.

You should index the songname column of the songs table on its first n chars, say 6, to get better performance for the query.
Trigger the query search only after n chars have been typed, say 3 (jQuery autocomplete has this option, for example).
You may also consider an in-memory DB if performance is truly crucial (sounds like it is) and the amount of data will not consume too much resident memory.
Google, btw, does not use a legacy RDBMS to perform its absurdly fast searches (continually amazed...)
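The suggestion above to hold back the query until n chars have been typed can be sketched as a small wrapper around whatever function sends the request. makeSearchTrigger and runSearch are hypothetical names, and the threshold of 3 is just the figure mentioned above:

```javascript
const MIN_CHARS = 3; // threshold suggested above; tune to taste

// Wrap the real search call so it only fires for long enough, changed input.
function makeSearchTrigger(runSearch, minChars = MIN_CHARS) {
  let lastTerm = "";
  return function onInput(value) {
    const term = value.trim();
    if (term.length < minChars || term === lastTerm) return false; // skip the request
    lastTerm = term;
    runSearch(term);
    return true;
  };
}

// Stand-in for the request to the backend: just record what would be sent.
const sent = [];
const onInput = makeSearchTrigger((term) => sent.push(term));

for (const typed of ["R", "Ri", "Rih", "Rih ", "Riha"]) onInput(typed);
console.log(sent);
```

A timer-based debounce could be layered on top so that fast typists trigger only one request, but that is a separate design choice.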

First of all, you should find MySQL's FULLTEXT search support far faster than your current approach.
Given the kind of speed you'd like from this solution, and the support for mis-spelled search terms, I suspect you'd be better off investigating a more fully featured full-text search engine. These include:
Sphinx Search
Solr
ElasticSearch
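To illustrate the kind of tolerance the question asks for ("Rihana" vs "Rihanna"), here is a small sketch based on Levenshtein edit distance. It is only meant to show the idea on a tiny in-memory list; it is not how any of the engines above are implemented or configured, and the artist names and the maxDistance default are assumptions:

```javascript
// Levenshtein distance: the number of single-character insertions, deletions,
// or substitutions needed to turn string a into string b (single-row DP).
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value from the previous row, previous column
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Keep the names within maxDistance edits of the query, closest first.
function fuzzyMatch(names, query, maxDistance = 2) {
  const needle = query.toLowerCase();
  return names
    .map((name) => ({ name, distance: editDistance(needle, name.toLowerCase()) }))
    .filter((entry) => entry.distance <= maxDistance)
    .sort((x, y) => x.distance - y.distance)
    .map((entry) => entry.name);
}

const artists = ["Rihanna", "Adele", "Madonna"]; // sample data for the sketch
const found = fuzzyMatch(artists, "Rihana");
console.log(found);
```

Comparing the query against every name like this does not scale to a large table, which is part of why a dedicated search engine is the better fit here.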

Try full text search.
Before MySQL 5.6, FULLTEXT indexing required MyISAM tables; from 5.6 onwards InnoDB supports it too.
If you need ACID and full text search on an older MySQL version, consider PostgreSQL.
