Given a body of text and several keywords I want to determine which keyword is the most relevant. So I basically want to see which keyword occurs the most times but it’s a bit more complex than that because I want to search keywords in both their plural and non-plural forms and remove generic words like "and" and "the".
I could write a function to do a decent job at this but rather than reinventing the wheel I’m wondering if there’s a good nlp library, ideally in JS, that handles this sort of thing i.e., keyword relevance. Accuracy is more important than performance in this case but both are important.
To give a specific example of what this will be used for, of the three keywords highlighted in yellow at the top, "disney" should come out as most relevant as it occurs in the article the most number of times and is most specific to the article. https://www.guide.com/gift-guide-for-all-the-disney-fanatics-in-your-life/a
Natural is a good library for natural language processing. https://github.com/NaturalNode/natural. There is a good free course on it here https://egghead.io/courses/natural-language-processing-in-javascript-with-natural.
Related
Challenge
When using puppeteer page.click('something') one of the first challenges is to make sure that the right 'something' is provided.
I guess this is a very common challenge, yet I did not find any simple way to achieve this.
What I tried so far
In Google Chrome I inspect the element that I want to click. I then get an extensive element description with a class and such. Based on an example I found, my approach is now:
Take the class
Replace all spaces with dots
Try
If it fails, check what is around this and add it as a prefix, for example one or two instances of button.
This does not exactly feel like it is the best way (and sometimes also fails, perhaps due to inaccuracies from my side).
One thing that I notice is that Chrome actually often seems to give a hint hovering over the thing I want to click, I am not sure if that is right but I also did not see a way to copy that (and it can be quite long).
If there is a totally different recommended way (e.g. Looking in the browser for what the name roughly is, and then using puppeteer to list all possible things), that is also fine. I just want to get the right input for page.click()
If you need an example of what I am trying: If you open this question in an incognito tab, you get options like share or follow. Or if you go to a web shop like staples and want to add something to cart.
When using puppeteer page.click('something') one of the first challenges is to make sure that the right 'something' is provided.
Just to be clear, "something" is a CSS selector, so your question seems to reduce to how to write CSS selectors that are accurate. Or, since Puppeteer offers XPath and traditional DOM traversals, we could extend it to include those selection tools as well.
Broader still, if there's a data goal we're interested in, often times there are other routes to get the data that don't involve touching the document at all.
I guess this is a very common challenge, yet I did not find any simple way to achieve this.
That's because there is no simple way to achieve this. It's like asking for the one baseball swing that hits all pitches. Web pages have messy, complex, arbitrary structures that follow thousands of different conventions (or no conventions at all). They can serve up a slightly or completely different page structure on any request. There's no silver-bullet strategy for writing good CSS selectors, and no step-by-step algorithm you can apply to universally "solve" the problem of accurately and robustly selecting elements.
Your goal should be to learn the toolkit and then practice on many different pages to develop an intuition for which tools and tricks work in which contexts and be able to correctly discern the tradeoffs in different approaches. Writing a full guide to this is out of scope, and articles exist elsewhere that cover this in depth, but here are a few high-level rules of thumb:
Look at context: consider the goals of your project, the general structure of the page and patterns on the page. Too many questions on Stack Overflow regarding CSS selectors (but also in general) omit context, which severely constrains the recommendation space, often leading to an XY problem. A few factors that are often relevant:
Whether the scrape is intended to be one-off or a long-running script that should try to anticipate and be resillient to page changes over time
Development time/cost/goal tradeoffs
Whether the data can be obtained by other means than the DOM, like accessing an API, pulling a JSON blob from a <script> tag, accessing a global variable on the window or intercepting a network response.
Considering nesting: is the element in a frame or shadow DOM?
Considering whole-page context: which patterns does the site tend to follow? Are there parent elements that are useful to selecting a child? (often, this is a distant relationship, not visible in a screenshot as provided by OP)
Consider all capabilities provided by your toolkit. For example, OP asked for a selector to close a modal on Stack Overflow; it turns out that none of the elements have particularly great CSS selectors, so using Puppeteer to trigger an Esc key press might be more robust.
Keep it simple: since pages can change at any time, the more constraints you add to the selector, the more likely one of those assumptions will no longer be true, introducing unnecessary points of failure.
Look for unique identifiers first: ids are usually unique on a page (some Google pages seem to scoff at this rule), so those are usually the best bets. For elements without an id, my next steps are typically:
Look for an id in a close parent element and use that, then select the child based on its next-most-unique identifier, usually a class name or combination tag name and attribute (like an input field with a name attribute, for example).
If there are few ids or none nearby, check whether the class name or attribute that is unique. If so, consider using that, likely coupled with a parent container class.
When selecting between class names, pay attention to those that seem temporary or stateful and might be added and removed dynamically. For example, a class of .highlighted-tab might disappear when the element isn't highlighted.
Prefer "bespoke" class names that seem tied to role or logic over generic library class names associated with styling (bootstrap, semantic UI, material UI, tailwind, etc).
Avoid the > operator which can be too rigid, unless you need precision to disambiguate a tree where no other identifiers are available.
Avoid sibling selectors unless unavoidable. Siblings often have more tenuous relationships than parents and children.
Avoid nth-child and nth-of type to the extent possibe. Lists are often reordered or may have fewer or more elements than you expect.
When using anything related to text, generally trim whitespace, ignore case and special characters where appropriate and prefer substrings over exact equality. On the other hand, don't be too loose. Usually, text content and values are weak targets but sometimes necessary.
Avoid pointless steps in a selector, like body > div#container > p > .target which should just be #container .target or #container p .target. body says almost nothing, > is too rigid, div isn't necessary since we have an id (if it changes to a span our new selector will still work), and the p is generic--there are probably no .targets outside of ps anyway.
Avoid browser-generated selectors. These are usually the worst of both worlds: highly vague and rigid at the same time. The goal is to be the opposite: accurate and specific, yet as flexible as possible.
Feel free to break rules as appropriate.
My research has found me powerful libraries like paper.js which are quick to show all the next-level awesome stuff they can do, but I'm not sure how (or if) I can accomplish a basic task:
I want to present a simple Stick Figure to my user:
o
\|/
/ \
Then let them grab the ends of the lines ("hands and feet"), and drag them into different positions, leaving the connecting nodes ("shoulders and hips") intact.
Simple.
In short, what is the simplest way to draw something with lines and nodes that users can manipulate.
Requirements:
Some line lengths are fixed, others can be lengthened.
Some nodes are fixed, others can be moved.
I need to at least display the angles of the lines, but ideally can modify those angles (via text input).
I feel like this shouldn't be this hard, but as I said, most libraries skip the basics and go straight to their coolest features, while any search for "edit vector nodes" and the like brings up lots of irrelevant results about node.js...
Probably you must have found the solution, but since it is unanswered, I thought of posting about a fantastic library I know.
The library is konva.js. You can find demo application references on site itself.
It features powerful set of functions to create custom shapes and layering support(In fact there is lot more). You can easily modify the properties of any shape, add animation and more.
You can find documentation here.
Usually we search elements by selectors in libraries like jquery but what happens if we want to make the other way around: given an element we want to find all rules applied to it like firebug does it!
This kind of job is necessary if we want to bring a css file and update it so we can see live results in our webpage.
In firebug by selecting an element we can see in reverse order all the rules that were applied to it.
First, we have to find the document's style rules and retrieve their contents as a text so user can update them but the real problem is to find the relation between an element and all the rules....
Is this possible even with the help of a library like jquery?
I was looking for opinions about what tools can one use to tackle such a problem in the restrictions applied by a browser environment...
My opinion is that we have to make a brute search attack at the css files/selectors and construct the dependencies: suppose we have 100 rules for a page with 100 elements the possible combinations are 10^4 as the relations are many-to-many.
Then, we can build 'tables' in memory (hash arrays) that might excced 10^4 records if we want to keep the cascading order of rules for an element.
Anyway, my point is that I wouldn't dare to put jquery in such a pain! If we can 'transplant' the heart of jquery, it's search engine I mean.
It seems that it's heart listens to the name of 'sizzle' but it's a waste of time here (regular expressions? no thanks). I think the real 'heart' is a simple word: 'querySelectorAll' and now we are running with native speeds here...
just an opinion since I don't care about wide/old browser support.
This is probably a common pattern in coding, though I am not knowledgeable enough to recognize it or use the correct terminology (I would've googled it then).
When you are creating an object that will invoke a method in a loop, but the content of the loop can change according to a certain set of conditions, do you do (pseudo-code):
method _method($condition){
switch($condition){}
}
or
method init($condition){
switch($condition){
case 1:
this._method = function(){};break;
case 2:
this._method = function(){};break;
....
}
}
I prefer version 1 as it is more maintainable, but I feel version 2 is lighter. Am I correct?
I realize the answer can differ quite a lot depending on the use case, so here is mine: I have a javascript slider, implemented as a jQuery plugin. depending on options, when the slider comes to the last frame, it either scrolls back to start, or seamlessly loops. Now, if it must seamlessly loop, then when each slide advance, I must set the element that has just disappeared on the left to be queued at the far right. Which means I have to detect when the element has disappeared from the view, and do a helluva of other checks for scrollLeft values and whatnot that I don't need to do if I am simply scrolling.
So how should I go about that?
I'm sure you understand this, but it's worth repeating:
If it ain't broke, don't fix it.
You can almost never find a performance problem in non-toy software by just thinking about it.
When people learn how to solve performance problems (as, IMHO, I have),
they are often asked what the secret is.
The secret is simple - don't guess, do diagnose.
Here's the method I use.
Everybody knows that, but they go ahead guessing anyway.
For one example, review the answers to this question.
Your use case seems to be simple and doesn't need much computing time, though you should use the code which is more readable for you and easier to maintain.
When using performance critical code you should keep the code within a loop as short/inexpensive as possible.
I'm often conflicted about how to approach this problem in my applications. I've used any number of options including:
A generic multiselect - This is my least favorite and most rarely used option. I find the usability to be atrocious, a simple mis-click can screw up all your hard work.
An "autocomplete" solution - Downside: user must have spelling abilities to find the damn values they need, aren't exposed to ones they may not have in mind, and the potential backend performance of substring searching.
Two adjacent multiselects, with an add/remove button - Downsides: still "ugly" imo
Any number of fancy javascript solutions (http://livepipe.net/control/selectmultiple, http://loopj.com/2009/04/25/jquery-plugin-tokenizing-autocomplete-text-entry/, etc.)
I haven't been able to find any usability studies done on the best approach to this problem. Many of these alternate solutions are great when you're going from <10 elements to a hundred, but may break down completely when you are going from a hundred to a thousand.
What do you guys use? Why do you use it? Can you point me to usability case studies? Is there a "magic" solution that has yet to be discovered?
My advice is don't use generic multiple select controls. I've been running User Experience Research for my whole career, and every single time I test a site with multiple select controls, it always turns out to cause problems for end users.
I did a post on this a while back: Multiple Select Controls Must Evolve or Die
Sounds like you knew this anyway, though. Your real question is "what do I use instead?" Well, to answer this question you have to work out whether the user's task leans towards recall or recognition.
(i) By recall, I mean the user knows what they want to pick before they have even seen the list. In this case, it's probably easiest for them if you offer an autocomplete tool (as used very effectively on facebook, for example). This solution is even better when the list of options is also impossibly long to present on a page (e.g. location names, etc).
(ii) Moving on to recognition - by this I mean a task that involves the user not knowing what they want to pick until they've seen the list of options. In this case, autocomplete doesn't give them any hints. An array of checkboxes would be much more helpful. If you can show them all at once, this is helpful to the user. Scrolling DIVs are more compact but they create a memory load for the user - i.e. once they've scrolled down, they have to remember which items they picked. This is particularly evident when users save a form and come back to it later.
So - thinking about your problem, do you need a solution for recall or recognition?
I can't point you to any case studies, unfortunately, but what I can tell you is that I personally prefer large checkbox arrays in two-to-five column layouts. Sure, they take up a lot of space, but they are extremely precise and uncomplicated.
I think for any control - be it basic multiselect, double list, checkbox array, or any other solution - once you go over a certain threshold of items it's going to be challenging for the user no matter what.
Have a look at Dojo Toolkit's DataGrid control. It's by far the most flexible and powerful, and supports multiple row selections. It also has accessibility features built in.
The ExtJS library has some really good solutions for your issue. There a bunch of user extensions for neat-o combos and multi select boxes.
If you want a combo select list, you can add query searching and pagination, plus design the resulting drop-down using easy templating, like in this example:
http://extjs.com/deploy/dev/examples/form/forum-search.html
Here is a nice multiselect, in the style you seemed to describe:
*(main_site)/learn/Extension:Multiselect2
You can find all user extensions here:
*(main_site)/learn/Ext_Extensions
Plus, you can easily include it into an existing web page without alot of extra overhead. ExtJS's full stack is quite large, but to get only the JS files you need, they provide a nice builder tool to grab just those parts you need:
*(main_site)/products/extjs/build/
Just a warning: ExtJS has just released 3.0, but I'm not sure the user extensions have been upgraded. The "forum-search" was taken from a 3.0 example though, so it will work just fine with the latest and greatest.
(*) Apparently new users can only post one link...