Abstract Classification using NLP/ML - javascript

I need to autogenerate categories for a publication using its abstract, with support for synonyms. I have classification data for 800-900 articles which I can use for training. This classification data was generated by pharma experts by reading the unstructured publications.
Existing classification categories are like below for existing publications:
Drug : Some drug, Some other drug.
Diseases : Some Disease.
Authors : Some authors and so on..
These categories are currently generated by a human expert. I explored the Natural library in Node.js and LingPipe in Java. They have classifiers, but I am not able to figure out the most efficient way to train them so that I get 90% accuracy.
The following approaches are on my mind:
I can pass entire abstracts of publications one by one and tell it their categories, like below:
var natural = require('natural');
var classifier = new natural.BayesClassifier();
classifier.addDocument('This article is for paracetamol written by Techgyani. Article was written in 2012', 'year:2012');
classifier.addDocument('This article is for paracetamol written by Techgyani. Article was written in 2012', 'author:techgyani');
classifier.train();
I can pass it sentences one by one and tell it their categories, which will be a manual and time-consuming process, so that when I pass it an entire abstract, it will autogenerate a set of categories for me, like below:
var natural = require('natural');
var classifier = new natural.BayesClassifier();
classifier.addDocument('This article is for paracetamol written by Techgyani', 'drug:Paracetamol');
classifier.addDocument('This article is for paracetamol written by Techgyani', 'author:techgyani');
classifier.addDocument('Article was written in 2012', 'year:2012');
classifier.train();
I can also extract tokens from the publication, search my database, and figure out the categories on my own, without using any NLP/ML libraries.
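For example, something along these lines (drugDb stands in for my actual database; true synonyms like acetaminophen would map to the same canonical name):
var natural = require('natural');
var tokenizer = new natural.WordTokenizer();

// Placeholder for a real database lookup
var drugDb = { paracetamol: 'Paracetamol', acetaminophen: 'Paracetamol' };

function categoriesFor(abstract) {
  var tokens = tokenizer.tokenize(abstract.toLowerCase());
  return tokens
    .filter(function (t) { return drugDb[t]; })
    .map(function (t) { return 'drug:' + drugDb[t]; });
}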
Based on your experience, which is the most efficient way to solve this problem? I am open to solutions in any language, but I prefer JavaScript because the existing stack is in JavaScript.

I'd recommend using either the most frequent words or word frequencies as features in a Naive Bayes classifier.
There is no need to tag sentences individually. I'd expect reasonable accuracy at the document level, although that will depend on the nature of the documents being trained on and classified.
There is a great discussion of a Python implementation here:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
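In JavaScript terms, a minimal sketch of that document-level approach with natural (the trainingSet shape is invented; natural's BayesClassifier builds its bag-of-words features internally):
var natural = require('natural');
var classifier = new natural.BayesClassifier();

// Your 800-900 expert-classified articles: one addDocument call per (abstract, label) pair
var trainingSet = [
  { abstract: 'Paracetamol reduced fever in the treated cohort...', labels: ['drug:Paracetamol'] },
  { abstract: 'Malaria patients were observed over five years...', labels: ['disease:Malaria'] }
];

trainingSet.forEach(function (article) {
  article.labels.forEach(function (label) {
    classifier.addDocument(article.abstract, label);
  });
});
classifier.train();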

In my opinion, your second solution will work like a charm. You need to train your classifier in order for it to do this work.
You add each labelled example with classifier.addDocument(text, label) and then call classifier.train(). I know this will be manual work, but it will hardly take much time to train your classifier.
Once it is trained, you can pass it one of your sentences and see the output for yourself, as in the sketch below.
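For example (the sentence is made up; getClassifications returns every label with a score, which helps when one abstract should yield several categories):
console.log(classifier.classify('Paracetamol trial results published in 2012'));
// e.g. 'drug:Paracetamol'
console.log(classifier.getClassifications('Paracetamol trial results published in 2012'));
// e.g. [ { label: 'drug:Paracetamol', value: ... }, { label: 'year:2012', value: ... }, ... ]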

You should explore off-the-shelf Named Entity Recognition (NER) models first, before investing in training. spaCy is written in Python but has a JavaScript binding. The classifiers in natural use Naive Bayes and logistic regression and will not perform as well as a neural-network library like spaCy. I suspect that natural will not work well for new cases where it has not already seen the drug, disease, or author name in the training set.

Related

How to add grammar/hints to microsoft-cognitiveservices-speech-sdk?

I have a basic setup with the JavaScript library of microsoft-cognitiveservices-speech-sdk. I use the browser implementation, not the Node implementation. Overall it works fine, yet some issues do occur in which the transcription is a bit off.
Background
The project I am working on is a web application that uses speech recognition. Users interact with the application through business codes like A6, B12, ...
I use webkitSpeechRecognition whenever possible; in any other case I provide a fallback to microsoft-cognitiveservices-speech-sdk, which works very well the majority of the time.
Issue
The business codes are not always correctly transcribed by microsoft-cognitiveservices-speech-sdk. webkitSpeechRecognition does a better job with this.
Example (in French):
User > A20 (pronounced "a vingt")
STT > Avant
Expected: A20
This might seem close, but it isn't; webkitSpeechRecognition is able to solve this one correctly.
In the documentation, it seems that one can provide a dynamic grammar and suggestions/hints in order to help the STT, yet I wasn't able to find an example or a way to use this interface. I was wondering if some of you might have a lead on this.
To elaborate on this a bit more, I was thinking of providing an IDynamicGrammar object, but I don't know if this is the correct approach, nor do I know how to provide it.
Side note
I could use a mechanism like ElasticSearch to find the correct correspondence, yet this only takes me so far. I would really like to optimise the STT.
I cannot force all the users to use Chrome
I cannot change the business codes
Reading through this article:
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-phrase-lists?pivots=programming-language-javascript
The phrase list is currently applicable only to the English language.
Alternatively, you could train/customize your own model.
The articles below detail the same:
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-test-and-train
Please note that pronunciation mapping/hints in Azure Speech to Text are currently available only for English and German at this time.
Reference: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-test-and-train#related-text-data-for-training
However, I casually tried the related text (uttered sentences) approach mentioned in the article here, as this does not have any language restriction.
I created the sample sentences as related text, trained the model, and deployed it.
This had slightly better recognition of the codes/non-grammar words.
Sample sentences:
This is A 20 Business
There is going to be a B 6 Business Model
B 6 on the other hand is not doing good as a business
Please indicate the C 26 profits.
[Screenshots: out-of-the-box speech recognition vs. recognition after using the custom-trained model]
Having said that, I assume that if we train the model with more data (sentences, and audio with labeled text, as this also has no language restriction), the custom model will serve your requirement.
To consume the custom model in JavaScript, you could refer to this article:
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-specify-source-language?pivots=programming-language-more
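For reference, here is a minimal sketch of wiring up both options in JavaScript (the key, region, and endpoint ID are placeholders; note the phrase-list language limitation mentioned above):
import * as sdk from 'microsoft-cognitiveservices-speech-sdk';

const speechConfig = sdk.SpeechConfig.fromSubscription('<your-key>', '<your-region>');
speechConfig.speechRecognitionLanguage = 'fr-FR';
// Point the recognizer at the deployed custom model
speechConfig.endpointId = '<your-custom-endpoint-id>';

const recognizer = new sdk.SpeechRecognizer(
  speechConfig,
  sdk.AudioConfig.fromDefaultMicrophoneInput()
);

// Phrase lists bias recognition toward expected terms
// (per the docs above, English-only at the time of writing)
const phraseList = sdk.PhraseListGrammar.fromRecognizer(recognizer);
['A20', 'B12', 'C26'].forEach(code => phraseList.addPhrase(code));

recognizer.recognizeOnceAsync(result => console.log(result.text));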

Does anyone know how to retrain Object Detection (coco-ssd) of TFJS for object 91?

So far I have seen many discussions on this topic, using different approaches to achieve it (https://github.com/tensorflow/models/issues/1809), but I want to know if anyone has managed to use TensorFlow.js to achieve this successfully.
I know some have also achieved this using transfer learning, but that is not the same as being able to add my own new class.
The short answer: no, not yet. Though it is technically possible, I have not seen an implementation of this in the wild.
The longer answer - why:
Given that "transfer learning" essentially means reusing the existing knowledge in a trained model to help you then classify things of a similar nature without having to redo all the prior learning there are actually 2 ways to do that:
1) This is the easier route but may not be possible for some use cases: Use one of the high level layers of the frozen model that you have access to (eg the models that are released by TF.js are frozen models I believe - the ones on GitHub). This allows you to reuse some of its lower layers (or final output) which may already be good at picking out certain features that are useful for the use case you need eg object detection in a general sense, which you can then feed into your own unfrozen layers that sit on top of that output you are sampling from (which is where the new training would happen). This is faster as you are only updating weights etc for the new layers you have added, however because the original model is frozen, it means you would have to replicate in TF.js the layers you were bypassing to ensure you have the same resulting model architecture for COCO-SSD in this case if you wanted the architecture. This may not be trivial to do.
2) Retraining the original model - can think of tuning the original model - but this is only possible if you have access to the original unfrozen model and the data used to train that. This would take longer as you are essentially retraining the whole model on all the data + your new data. If you do not have the original unfrozen model, then the only way to do this would be to implement the said model in TF.js yourself using the layers / ops APIs as needed and then use that to train on your own data.
What?!
An easier-to-visualize example of this is PoseNet - the model that estimates where human joints/skeletons are.
Now, in this PoseNet example, imagine you wanted to make a new ML model that could detect when a person is in a certain position - e.g. waving a hand.
In this example you could use method 1: simply take the output of the existing PoseNet predictions for all the joints it has detected and feed that into a new layer - something simple like a multi-layer perceptron - which could then very quickly learn from example data when a hand is in a waving position. In this case we are simply adding to the existing architecture to achieve a new result - gesture prediction instead of the raw x-y point predictions for the joints themselves - as in the sketch below.
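A rough TF.js sketch of that method-1 idea (the package names are real, but the 34-feature layout and the two-class head are illustrative assumptions):
const tf = require('@tensorflow/tfjs');
const posenet = require('@tensorflow-models/posenet');

// Small trainable head on top of frozen PoseNet output:
// 17 keypoints * (x, y) = 34 input features
const head = tf.sequential();
head.add(tf.layers.dense({ inputShape: [34], units: 32, activation: 'relu' }));
head.add(tf.layers.dense({ units: 2, activation: 'softmax' })); // waving / not waving
head.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy' });

// PoseNet itself stays frozen; we only consume its predictions as features
async function poseFeatures(net, image) {
  const pose = await net.estimateSinglePose(image);
  return tf.tensor2d([pose.keypoints.flatMap(k => [k.position.x, k.position.y])]);
}

// Training then amounts to head.fit(features, labels) on your own example data.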
Now consider case 2 for PoseNet - you want to be able to recognise a new part of the body that it currently does not. For that to happen, you would need to retrain the original model so that it could learn to predict that new body part as part of its output.
This is much harder, as you would need to retrain the base model, which means you need access to the unfrozen model. If you didn't have access to the unfrozen model, you would have no choice but to attempt to recreate the PoseNet architecture entirely yourself and then train it with your own data. As you can see, this second use case is much harder and more involved.

String translation with dynamic text and inputs

I am working on a front-end-only React application and will soon be implementing internationalization. We only need one additional language... at this point. I would like to do it in a way that is maintainable, where adding a new language would ideally be as close as possible to merely providing a new config object with translations for the various strings.
The issue I know we will have is that we have dynamic inputs inside of sentences, as demonstrated below (where [] are inputs and ** is dynamically changing data). This is just an example sentence... there are lots of other similar things elsewhere in the app.
I am [23] years old. I was born in [ ______▾]. In 2055 I would be *65* years old.
We could break it out as 'I am', *age input*, 'years old. I was born in', *year dropdown*, etc. But depending on the language, the word order could change, or an input could be at the beginning of a sentence, and I feel like doing it that way would make for a really weird-looking and hard-to-maintain language file.
I'm looking to know if there are common patterns and/or libraries we can use to help with this challenge.
A React-specific library is react-intl by Yahoo, which is part of a larger project called FormatJS that offers many libraries and solutions for internationalization. These and their corresponding docs are a good starting point.
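For instance, a minimal sketch with react-intl (the birth message id and the AgeInput/YearDropdown components are made-up placeholders); because values may be React elements, each translation string controls its own word order and input placement:
import React from 'react';
import { IntlProvider, FormattedMessage } from 'react-intl';

const messages = {
  en: { birth: 'I am {age} years old. I was born in {year}.' },
  fr: { birth: "J'ai {age} ans. Je suis né en {year}." },
};

function BirthSentence({ locale }) {
  return (
    <IntlProvider locale={locale} messages={messages[locale]}>
      <FormattedMessage
        id="birth"
        // <AgeInput /> and <YearDropdown /> are your own input components;
        // they travel with each language's word order
        values={{ age: <AgeInput />, year: <YearDropdown /> }}
      />
    </IntlProvider>
  );
}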

An online service I could query for a name of a random historical figure

I decided that it's time to get rich and famous, so I'm building a tool that generates Hollywood movie titles. I plan to sell them for money.
Example movie name: Abraham Lincoln: Vampire Hunter.
Basically, I would take a name of a famous historical figure and combine it with a respectable profession to get the name of a movie.
There is a problem, though. I don't know all the names of historical figures, and I sure would mind writing them all down by myself. So, is there an online database or service I could query that would return the name of a random historical figure? How would I do it in JavaScript (Node.js)? The professions I can come up with myself.
Edit: I'm looking for something that has an API.
You might want to check out DBpedia. It offers a SPARQL endpoint interface to Wikipedia's data and returns machine-readable results formatted as RDF.
Don't be put off by RDF and semantic-webby stuff... for simple queries you can parse the RDF as XML and handle it that way (or ask the endpoint for JSON, as in the sketch below).
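For example, a rough Node.js sketch (Node 18+ for the global fetch; the endpoint can also return SPARQL results as JSON, which is simpler than XML from Node, and the person/birth-date filter here is just illustrative):
const endpoint = 'https://dbpedia.org/sparql';
const query = `
  SELECT ?name WHERE {
    ?person a dbo:Person ;
            dbo:birthDate ?born ;
            rdfs:label ?name .
    FILTER(lang(?name) = "en" && ?born < "1900-01-01"^^xsd:date)
  }
  ORDER BY RAND()
  LIMIT 1`;

async function randomHistoricalFigure() {
  const url = endpoint + '?query=' + encodeURIComponent(query) +
    '&format=' + encodeURIComponent('application/sparql-results+json');
  const res = await fetch(url);
  const data = await res.json();
  return data.results.bindings[0].name.value;
}

// e.g. "Abraham Lincoln" -> "Abraham Lincoln: Vampire Hunter"
randomHistoricalFigure().then(name => console.log(name + ': Vampire Hunter'));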
Why don't you try:
http://en.wikipedia.org/wiki/List_of_celebrities
http://simple.wikipedia.org/wiki/Biographies_of_famous_Americans
http://en.wikipedia.org/wiki/List_of_wealthiest_historical_figures
http://en.wikipedia.org/wiki/The_100
This one also looks pretty good:
http://www.bbc.co.uk/history/historic_figures/
Not sure if they have an API though! ;)

How to make a GOOD reporting Interface

I have a ton of associated data revolving around a school: students, teachers, classes, locations, etc.
I am faced with a challenge put forth by my client: they want to have reports on everything. This means they want the ability to cross-reference data points every which way, and I think I'm just short of writing a pretty query builder. :/
This question is aimed at soliciting opinions on how to structure a reporting interface beautifully.
Any suggestions, references, examples, jQuery plugins, etc. would be amazing.
Thank you!
I find Trac's query builder rather acceptable for what it is meant to do.
But most probably your clients don't want everything; they are just too lazy to think about what they want right now. You could help them decide by analyzing the use cases together and coming up with at least a few kinds of queries with just a few customizable parts -- in the worst case -- or just a few canned queries they really need -- in the best.
You should probably schedule a meeting with your client to determine what they need to do. This does not mean having them speculate about how great it would be if your software could do everything, was ultra-flexible yet totally easy to use, etc., but sitting down and finding out what they are doing right now. I'm saying this because that "oh, I'd like to be able to cross-reference everything with everything else!" sounds a bit too familiar, and might end in an ugly case of the inner-platform effect.
I've found that rapid paper prototyping with the client is a great way to explore possible ideas, as it shifts their attention away from "can you make this button yellow?" issues to The Big Picture, to let them make up their minds what they actually need. Plus, it's ridiculously inexpensive to do.
Apart from that, for inspiration, there are UI pattern languages that address handling potentially large amounts of interconnected data. What's great about these is that you will often be able to use these patterns to communicate ideas to your client, since a well-structured pattern language will guide a non-expert through domain-relevant design decisions in increasing detail.
First, I can only support the other voices: work out with the clients what they actually need. A good argument is "I can do that, but it will cost you X thousand dollars, every user will need Y hours of training, and you'll need a $100,000/year developer to maintain it."
(Unfortunately, most clients at that point prefer to pick the guy who says "yes, can do cheaper!")
Only second, and only if the client says "yes we do need everything":
What works well is a list/grid view with progressive filtering: instead of building the SQL query and then running it, let the user work directly with the results. E.g. right-clicking a cell and selecting "limit to this value" could add a WHERE colN = <constant> constraint; see the sketch after this answer.
You can generate suggestions for column values from SELECT DISTINCT calls - if a call returns fewer than, say, 20 values, you can offer checkboxes for an OR combination of the possible values.
It would be interesting to discuss an elegant UI for the sea of remaining problems: OR'ed conditions across multiple columns, ordering by more than one column, grouping, ...
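A hypothetical sketch of that progressive-filtering idea in JavaScript (report_view and the Postgres-style $n placeholders are illustrative; column names should come from your own grid's schema, never free text, and values stay parameterized to avoid SQL injection):
var filters = [];

// Called when the user right-clicks a cell and picks "limit to this value"
function limitToValue(column, value) {
  filters.push({ column: column, value: value });
  return buildQuery();
}

// Rebuild the query from the accumulated filters
function buildQuery() {
  var where = filters
    .map(function (f, i) { return f.column + ' = $' + (i + 1); })
    .join(' AND ');
  return {
    text: 'SELECT * FROM report_view' + (where ? ' WHERE ' + where : ''),
    values: filters.map(function (f) { return f.value; })
  };
}

// e.g. limiting the "teacher" column to "Smith" yields:
// { text: 'SELECT * FROM report_view WHERE teacher = $1', values: ['Smith'] }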
