I want to implement full-text search for *.epub files, so I forked the epub-full-text-search module (https://github.com/friedolinfoerder/epub-full-text-search).
I will have many ebooks to search through, so I want a way to search within one specific ebook at a time.
How could I do this with search-index? I coded a solution that searches the fields filename (the unique filename of the epub) and body (the content of the chapters), but it doesn't feel like the right way to do this, and the performance is not ideal either.
Here is an example of how I do the search with search-index:
searchIndex.search({
  query: [{
    AND: [
      {body: ['epub']},
      {filename: ['accessible_epub_3']}
    ]
  }]
});
Is there a better way to do this? Maybe with buckets, categories and filters?
Thanks for your help!
Search-index, which epub-full-text-search is based on, returns one search result for each document/item that matches a given query. My guess is that you would like to know where in the epub file you get a hit. If a single paragraph is a good enough search result item, I would index paragraphs. Each paragraph would carry a unique book key to filter on, and maybe a reference to where it sits in the epub file (page, percentage, etc.).
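A minimal sketch of what paragraph-level documents could look like before they are handed to the indexer. It deliberately makes no search-index calls; the helper names (paragraphDocs, splitParagraphs) and the field names (id, bookKey, chapter, position, body) are assumptions for illustration, and it assumes each chapter is available as plain text with blank lines between paragraphs.

```javascript
// Hypothetical helper: splits plain chapter text on blank lines.
function splitParagraphs(text) {
  return text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// Hypothetical helper: turns one book's chapters into paragraph-level
// documents. Field names are illustrative, not part of any library API.
function paragraphDocs(bookKey, chapters) {
  const perChapter = chapters.map((ch) => splitParagraphs(ch.text));
  const total = perChapter.reduce((n, paras) => n + paras.length, 0);
  const docs = [];
  let seen = 0;
  perChapter.forEach((paras, chapterIndex) => {
    paras.forEach((body) => {
      docs.push({
        id: bookKey + ':' + seen,      // unique per paragraph
        bookKey: bookKey,              // the value to filter on per book
        chapter: chapterIndex,         // where the hit is in the epub
        position: seen / total,        // rough fraction through the book
        body: body
      });
      seen += 1;
    });
  });
  return docs;
}
```

With documents shaped like this, restricting a search to one book becomes a filter on bookKey rather than a second full-text clause on the filename.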
Disclaimer: I'm working on the search-index project.
Related
I have a set of documents, each annotated with a set of tags, which may contain spaces. The user supplies a set of possibly misspelled tags, and I want to find the documents with the highest number of matching tags (optionally weighted).
There are several thousand documents and tags, but at most 100 tags per document.
I am looking for a lightweight and performant solution where the search runs entirely on the client side in JavaScript, though some preprocessing of the index with node.js is possible.
My idea is to create an inverted index from tags to documents using a multiset, plus a fuzzy index that finds the correct spelling of a misspelled tag. Both are created in a preprocessing step in node.js and serialized as JSON files. In the search step, for each item of the query set I want to first consult the fuzzy index to get the most likely correct tag and, if one exists, consult the inverted index and add the result set to a bag (a set with counts). After doing this for all input tags, the contents of the bag, sorted in descending order of count, should give the best matching documents.
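The approach described above can be sketched in plain JavaScript roughly as follows. This is only an illustration under assumptions: the function names are made up, documents are assumed to look like { id, tags }, and the "fuzzy index" is replaced by a simple linear scan with Levenshtein distance, which should be acceptable for a few thousand tags but is not a real fuzzy index.

```javascript
// Levenshtein edit distance (two-row dynamic programming).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Preprocessing step: tag -> list of document ids (JSON-serializable).
function buildInvertedIndex(documents) {
  const index = Object.create(null); // no inherited keys such as "constructor"
  for (const doc of documents) {
    for (const tag of doc.tags) {
      (index[tag] = index[tag] || []).push(doc.id);
    }
  }
  return index;
}

// Stand-in for the fuzzy index: closest known tag within maxDistance, or null.
function closestTag(index, input, maxDistance) {
  let best = null;
  let bestDistance = maxDistance + 1;
  for (const tag of Object.keys(index)) {
    const d = editDistance(input, tag);
    if (d < bestDistance) {
      best = tag;
      bestDistance = d;
    }
  }
  return best;
}

// Search step: count matched tags per document, sorted by count descending.
function searchByTags(index, queryTags, maxDistance) {
  const bag = new Map();
  for (const q of queryTags) {
    const tag = closestTag(index, q, maxDistance);
    if (tag === null) continue;
    for (const id of index[tag]) {
      bag.set(id, (bag.get(id) || 0) + 1);
    }
  }
  return Array.from(bag, ([id, count]) => ({ id, count }))
    .sort((x, y) => y.count - x.count);
}
```

Weighting could be added by incrementing the bag by a per-tag weight instead of 1.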
My Questions
This seems like a common problem, is there already an implementation for it that I can reuse? I looked at lunr.js and fuse.js but they seem to have a different focus.
Is this a sensible approach to the problem? Do you see any obvious improvements?
Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?
You should be able to achieve what you want using Lunr. Here is a simplified example (and a jsfiddle):
var documents = [{
  id: 1, tags: ["foo", "bar"]
}, {
  id: 2, tags: ["hurp", "durp"]
}]

var idx = lunr(function (builder) {
  builder.ref('id')
  builder.field('tags')

  documents.forEach(function (doc) {
    builder.add(doc)
  })
})

console.log(idx.search("fob~1"))
console.log(idx.search("hurd~2"))
This takes advantage of a couple of features in Lunr:
If a document field is an array, Lunr assumes the elements are already tokenised. This lets you index tags that include spaces as-is, i.e. "foo bar" would be treated as a single tag (if this is what you wanted; it wasn't clear from the question).
Fuzzy search is supported, here using the query string format. The number after the tilde is the maximum edit distance; there is some more documentation that goes into the details.
The results are sorted by how well each document matches the query. In simple terms, documents that contain more matching tags will rank higher.
Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?
As ever, it depends. Lunr maintains two data structures, an inverted index and a graph. The graph is used for doing the wildcard and fuzzy matching. It keeps separate data structures to facilitate storing extra information about a term in the inverted index that is unrelated to matching.
Depending on your use case, it would be possible to combine the two. An interesting approach would be a finite state transducer, so long as the data you want to store is simple, e.g. an integer (think document id). There is an excellent article about this data structure, which is similar to what is used in Lunr: http://blog.burntsushi.net/transducers/
I'm trying to build a basic site fed with JSON that works as such…
So what I have set up is a table in MySQL structured like so…
Where name is the initial list of buttons; after choosing a name you get each successive choice as needed until you pull up the specific course detail. Hopefully I am straightforward so far.
What I have tried so far (and it does work-ish).
So I use getJSON with a little bit of PHP like so...
function getList() {
  $.getJSON(serviceURL + 'getProducts.php', function(data) {
    products = data.items;
  });
}
then I do some $.each with some ifs and some appends...
$.each(products, function(index, product) {
if (product.program === '' && product.platform === '') {
$('#theList').append('<div.....' + course etc.. + ');
};
if (product.program !== '' && product.platform === '') {
something else and so on....
And it works (mostly), but it completely sucks. I have a version of the site right now where I can click on the product and go through many of the options all the way to a course list and then course detail. I end up having to do additional getJSON calls with URL hashes to get the next level of choices, and because I have to wait for the data to load, I have to create crazy ids for the divs to append to, so that the data lands in the correct place when it is finally fetched. Like I said, it works, but I know it is the wrong way to do it. To be honest, the MySQL table is only 110 rows; I could have this all hardcoded a week or two faster than I am figuring this out. But I really need to learn this.
Is underscore.js my solution?
So in my travels of the interwebs I discovered underscore.js, and I tried this..
var Sample = _.groupBy(products, 'name');
console.log(Sample);
And I get a beautiful JSON array.
{"Fred":[
{"name":"Fred","Type":"Red","program":"basic","platform":"windows"},
{"name":"Fred","Type":"Red","program":"basic","platform":"osx"},
{"name":"Fred","Type":"Red","program":"basic","platform":"OS X"},
{"name":"Fred","Type":"Red","program":"basic","platform":"osx"},
{"name":"Fred","Type":"Red","program":"basic","platform":"osx"},
{...
And if I do another groupBy on Sample.Fred for Type or program or platform, I get nice little arrays that have the exact data I need. So now I have eliminated the need to do a separate getJSON for each level, but I know there must be a better way to do this. Is there a variable for the first level in objects? Sample.items doesn't get me anywhere and Sample.objects doesn't either. Sample.Fred works, but then I am going to have to write out each name... Sample.Fred, Sample.Jane, etc... I know that isn't right either.
How do I say…
for each object, remove duplicate children (assuming I don't get the courses in my initial getJSON) and then append the values to sequentially nested divs???
As it stands right now I will still have to do a ton of groupBy's and $.each calls to append each option to the appropriate div, but I have to believe there is a better and smarter way.
I hope this doesn't get nailed for being too localized, but I believe it is a basic concept and path I need here: being able to filter my way through children, getting the list of values I need from particular items.
I realize there are many ways to do this with PHP and a ton of other ways I have never used, thought of or tried. I don't really want to explore the PHP route as I want to expand my skills in JavaScript and jQuery.
Thanks to any who are willing to help.
It sounds like you might be better going with a full blown class and modeling the relationships. Then you can decide what owns what, and that will give you grouping as needed, as well as letting you easily add functionality and keep you from trying to manage a bunch of arrays.
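As a lighter-weight alternative to a full class model, the repeated groupBy calls from the question can be folded into one recursive helper. This is a rough sketch only: groupNested is a made-up name, it uses plain JavaScript rather than underscore.js, and the row fields (name, program, platform) are taken from the sample data in the question.

```javascript
// Builds a nested tree with one level per key; the matching rows end up
// in arrays at the leaves. Each distinct value appears once as a key,
// so duplicates at every level are collapsed automatically.
function groupNested(rows, keys) {
  if (keys.length === 0) return rows;
  const key = keys[0];
  const rest = keys.slice(1);
  const groups = Object.create(null); // avoid clashes with inherited keys
  for (const row of rows) {
    const value = row[key];
    (groups[value] = groups[value] || []).push(row);
  }
  for (const value of Object.keys(groups)) {
    groups[value] = groupNested(groups[value], rest);
  }
  return groups;
}

// Example with rows shaped like the ones in the question.
const sampleRows = [
  { name: 'Fred', program: 'basic', platform: 'windows' },
  { name: 'Fred', program: 'basic', platform: 'osx' },
  { name: 'Fred', program: 'basic', platform: 'osx' },
  { name: 'Jane', program: 'advanced', platform: 'osx' }
];
const tree = groupNested(sampleRows, ['name', 'program', 'platform']);

// Object.keys gives the first-level names without writing each one out.
const names = Object.keys(tree);
```

Object.keys(tree) answers the "variable for the first level" question, and Object.keys(tree[name]) gives the next level of choices for whichever name was clicked, so each level of buttons can be rendered from data that is already loaded.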
I have a firebase model where each object looks like this:
done: boolean
tags: array
text: string
Each object's tag array can contain any number of strings.
How do I obtain all objects with a matching tag? For example, find all objects where the tag contains "email".
Many of the more common search scenarios, such as searching by attribute (as your tag array would contain) will be baked into Firebase as the API continues to expand.
In the meantime, it's certainly possible to grow your own. One approach, based on your question, would be to simply "index" the list of tags with a list of records that match:
/tags/$tag/record_ids...
Then to search for records containing a given tag, you just do a quick query against the tags list:
new Firebase('URL/tags/' + tagName).once('value', function(snap) {
  var listOfRecordIds = snap.val();
});
This is a pretty common NoSQL mantra: put more effort into the initial write to make reads easy later. It's also a common denormalization approach (and one most SQL databases use internally, on a much more sophisticated level).
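To illustrate the "more effort on the initial write" side, here is a small sketch that computes which paths would need to be written for one record so that the tag index stays in sync. It makes no Firebase calls; buildTagIndexWrites is a hypothetical helper, and the records/ and tags/ path names are assumptions that mirror the /tags/$tag/record_ids layout above.

```javascript
// Hypothetical helper: returns a map of path -> value covering the record
// itself plus one index entry per tag. Paths are illustrative assumptions.
function buildTagIndexWrites(recordId, record) {
  const writes = {};
  writes['records/' + recordId] = record;
  for (const tag of record.tags || []) {
    // Storing `true` under the record id makes each tag node a set of ids.
    writes['tags/' + tag + '/' + recordId] = true;
  }
  return writes;
}
```

Each entry in the returned map would then be written with whatever write call your client uses. Removing a tag from a record needs the matching index entry removed as well, which is the price of the denormalization.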
Also see the post Frank mentioned as that will help you expand into more advanced search topics.
I'm doing a project using data from the World Bank. Their data is organized so that individual years are column names in the CSV. Aside from possibly changing the names, how would I go about accessing them using d3.csv and mapping them into a custom array, since I won't use all the values?
For example, the file I'm using is GDP per country. Each element/row is formatted like so:
"Country Name","Country Code","Indicator Name","Indicator Code", "1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013",
If I wanted the GDP values for countries like United Kingdom, Brazil, China, Russia, US, etc., and for the years 2004-2012, how would I go about it?
Would the code look something like this?
d3.csv(URL, function(d) {
  return {
    AttributeName : d.ColumnName
    //Continues for all columns I need
  };
});
Again, d.ColumnName won't work if the column name is an actual integer. How would I account for spaces in the column names, as shown above?
How would I also go about displaying the elements properly in, say, the document itself or the console?
I apologize for so many questions. Feel free to direct me to solutions. Thanks.
Thanks to Lars for the answer. I'm gonna append the next step of what I want to do. Now if I wanted to make a line graph using this data, how would I access the resulting array of objects? Again, I only want specific countries out of the 200+ element array.
You can solve all of these issues by using the alternative syntax for accessing attributes -- instead of
d.foo
you can write
d["foo"]
This also works with spaces, numbers, etc:
d["11111"]
d["string with spaces"]
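Applied to the World Bank file from the question, a row function using bracket notation might look like the sketch below. The names gdpRow, COUNTRIES and YEARS and the output shape are assumptions for illustration, and the country list would need to match the spellings used in the CSV.

```javascript
// Countries and years to keep (assumed spellings; adjust to the CSV).
const COUNTRIES = ['United Kingdom', 'Brazil', 'China'];
const YEARS = ['2004', '2005', '2006', '2007', '2008',
               '2009', '2010', '2011', '2012'];

// Row function: receives one parsed CSV row as an object of strings.
function gdpRow(d) {
  // Bracket notation handles the space in "Country Name".
  if (COUNTRIES.indexOf(d['Country Name']) === -1) return null;
  const values = {};
  for (const year of YEARS) {
    // Bracket notation also handles numeric column names like "2004".
    const raw = d[year];
    values[year] = raw === undefined || raw === '' ? null : Number(raw);
  }
  return {
    country: d['Country Name'],
    code: d['Country Code'],
    values: values
  };
}
```

gdpRow can be passed to d3.csv in place of the inline function from the question. d3's CSV parsing is documented to skip rows for which the row function returns null or undefined; if your version behaves differently, filter the nulls out of the resulting array. To inspect the result, console.log on the resulting array in the callback is usually enough.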
for example
http://www.sitename.com/section1/pagename.aspx
http://www.sitename.com/section2/pagename.aspx
I need a quick report of only the pages that have the same name, like "pagename.aspx" in the example.
Crawl your site, reverse the found URLs and sort?
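Once the crawl has produced a list of URLs, the report itself is small. The sketch below does not literally reverse and sort the URLs; it groups them by their last path segment instead, which gives the same "same page name ends up together" effect. duplicatePageNames is a made-up name, and treating names as case-insensitive is an assumption.

```javascript
// Groups URLs by their final path segment and keeps only names that
// occur more than once. Assumes every URL has a path after the host.
function duplicatePageNames(urls) {
  const byName = new Map();
  for (const url of urls) {
    const path = url.split(/[?#]/)[0];          // drop query string / hash
    const name = path
      .slice(path.lastIndexOf('/') + 1)
      .toLowerCase();                            // assumption: case-insensitive
    if (name === '') continue;                   // URL ends in "/"
    if (!byName.has(name)) byName.set(name, []);
    byName.get(name).push(url);
  }
  const report = {};
  for (const [name, list] of byName) {
    if (list.length > 1) report[name] = list;
  }
  return report;
}
```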
If this is a Sitecore site, as I'm led to believe, could you not just go into the Content Editor and search for the name you're looking for? Anything with the item name "pagename", for example, will have a URL ending with "pagename.aspx" in a standard Sitecore install. Just make sure you're looking at the actual item name and not the display name, as they can be quite different at times depending on how you've got the URLs set up.
I would write a quick utility program to just traverse the content tree programmatically, saving the names to a dictionary... when you hit an item that is already in the dictionary, you've got a name collision.
One thing you might consider is writing a filter (I think that is what it is called) for this
http://trac.sitecore.net/AdvancedSystemReporter
That would allow you to have a nice interface in Sitecore for viewing duplicate item names.
How big is your database, and how is your LINQ to Objects? Theoretically you could write a LINQ query against Database.Items that looks for items with the same name that descend from the same site branch of your content tree. This could be very memory intensive if your master DB is large, but would not be difficult to code.
Edit -- if you can loop over all your site items, you could do something like this (untested):
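// Requires "using System.Linq;"
var items = siteItem.Axes.GetDescendants();
// The self-join yields an item once per other item sharing its name,
// so Distinct() keeps each duplicate only once.
var dupes = (from item in items
             join item2 in items on item.Name equals item2.Name
             where item.ID != item2.ID
             select item).Distinct();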