How do I get the visible text of a range? (createRange) - javascript

The string returned by .toString() on a range created with document.createRange() also contains text that is not visible, such as the contents of script and style elements. (At least in the current version of Chrome.)
Is there a way to get just the visible text?

I found a solution that seems reasonable and, at least tentatively, standards-compliant. (My guess, without checking, is that the standards do not yet handle every detail of a case like this, but the current implementation in Chrome seems useful and might become standard.)
The solution is simply to first create a document fragment from the range:
var fragment = r.cloneContents();
Then just walk the fragment the way you would walk a sub tree in the DOM. Do not enter nodes like "SCRIPT" and "STYLE". Collect the "#text" nodes.
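A minimal sketch of that walk, under a few assumptions: the function names are made up for illustration, node types are written as plain numbers so the code does not depend on a global Node object, and only SCRIPT and STYLE are skipped (extend the set as needed). It does not account for elements hidden by CSS.

```javascript
// Node type constants from the DOM standard, written as numbers.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Elements whose text should not be collected (assumed list, extend as needed).
const SKIPPED_ELEMENTS = new Set(["SCRIPT", "STYLE"]);

// Recursively collect the values of all text nodes below `node`,
// without descending into skipped elements.
function collectVisibleText(node, parts = []) {
  if (node.nodeType === TEXT_NODE) {
    parts.push(node.nodeValue);
    return parts;
  }
  const skipped =
    node.nodeType === ELEMENT_NODE &&
    SKIPPED_ELEMENTS.has(node.nodeName.toUpperCase());
  if (!skipped) {
    for (const child of node.childNodes || []) {
      collectVisibleText(child, parts);
    }
  }
  return parts;
}

// `range` is a Range, for example one obtained from document.createRange().
function visibleTextOfRange(range) {
  const fragment = range.cloneContents();
  return collectVisibleText(fragment).join("");
}
```

In a page this would be called as visibleTextOfRange(r), with r being the range from the question.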

Related

Convenient way to get input for puppeteer page.click()

Challenge
When using puppeteer page.click('something') one of the first challenges is to make sure that the right 'something' is provided.
I guess this is a very common challenge, yet I did not find any simple way to achieve this.
What I tried so far
In Google Chrome I inspect the element that I want to click. I then get an extensive element description with a class and such. Based on an example I found, my approach is now:
Take the class
Replace all spaces with dots
Try
If it fails, check what is around this and add it as a prefix, for example one or two instances of button.
This does not exactly feel like it is the best way (and sometimes also fails, perhaps due to inaccuracies from my side).
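For reference, the manual steps above amount to roughly the following. This is only a sketch: the function name and the class names in the usage note are hypothetical, and it assumes the class names contain no characters that need escaping in CSS (in a browser, CSS.escape() can handle those).

```javascript
// Turn the value of a class attribute, as copied from the inspector,
// into a CSS selector that requires all of those classes.
// `tagName` is an optional prefix such as "button".
function classAttributeToSelector(classAttribute, tagName = "") {
  const classes = classAttribute.trim().split(/\s+/).filter(Boolean);
  return tagName + classes.map((name) => "." + name).join("");
}
```

For example, classAttributeToSelector("btn btn-primary", "button") gives "button.btn.btn-primary", which could then be passed to page.click().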
One thing I notice is that Chrome often seems to show a hint when I hover over the thing I want to click. I am not sure whether that hint is the right input, and I did not see a way to copy it (it can be quite long).
If there is a totally different recommended way (e.g. looking in the browser for roughly what the name is, and then using puppeteer to list all possible things), that is also fine. I just want to get the right input for page.click().
If you need an example of what I am trying: If you open this question in an incognito tab, you get options like share or follow. Or if you go to a web shop like staples and want to add something to cart.
"When using puppeteer page.click('something') one of the first challenges is to make sure that the right 'something' is provided."
Just to be clear, "something" is a CSS selector, so your question seems to reduce to how to write CSS selectors that are accurate. Or, since Puppeteer offers XPath and traditional DOM traversals, we could extend it to include those selection tools as well.
Broader still, if there's a data goal we're interested in, there are often other routes to get the data that don't involve touching the document at all.
"I guess this is a very common challenge, yet I did not find any simple way to achieve this."
That's because there is no simple way to achieve this. It's like asking for the one baseball swing that hits all pitches. Web pages have messy, complex, arbitrary structures that follow thousands of different conventions (or no conventions at all). They can serve up a slightly or completely different page structure on any request. There's no silver-bullet strategy for writing good CSS selectors, and no step-by-step algorithm you can apply to universally "solve" the problem of accurately and robustly selecting elements.
Your goal should be to learn the toolkit and then practice on many different pages to develop an intuition for which tools and tricks work in which contexts and be able to correctly discern the tradeoffs in different approaches. Writing a full guide to this is out of scope, and articles exist elsewhere that cover this in depth, but here are a few high-level rules of thumb:
Look at context: consider the goals of your project, the general structure of the page and patterns on the page. Too many questions on Stack Overflow regarding CSS selectors (but also in general) omit context, which severely constrains the recommendation space, often leading to an XY problem. A few factors that are often relevant:
Whether the scrape is intended to be one-off or a long-running script that should try to anticipate and be resilient to page changes over time
Development time/cost/goal tradeoffs
Whether the data can be obtained by other means than the DOM, like accessing an API, pulling a JSON blob from a <script> tag, accessing a global variable on the window or intercepting a network response.
Considering nesting: is the element in a frame or shadow DOM?
Considering whole-page context: which patterns does the site tend to follow? Are there parent elements that are useful to selecting a child? (often, this is a distant relationship, not visible in a screenshot as provided by OP)
Consider all capabilities provided by your toolkit. For example, OP asked for a selector to close a modal on Stack Overflow; it turns out that none of the elements have particularly great CSS selectors, so using Puppeteer to trigger an Esc key press might be more robust.
Keep it simple: since pages can change at any time, the more constraints you add to the selector, the more likely one of those assumptions will no longer be true, introducing unnecessary points of failure.
Look for unique identifiers first: ids are usually unique on a page (some Google pages seem to scoff at this rule), so those are usually the best bets. For elements without an id, my next steps are typically:
Look for an id in a close parent element and use that, then select the child based on its next-most-unique identifier, usually a class name or a combination of tag name and attribute (like an input field with a name attribute, for example).
If there are few ids or none nearby, check whether there is a class name or attribute that is unique. If so, consider using that, likely coupled with a parent container class.
When selecting between class names, pay attention to those that seem temporary or stateful and might be added and removed dynamically. For example, a class of .highlighted-tab might disappear when the element isn't highlighted.
Prefer "bespoke" class names that seem tied to role or logic over generic library class names associated with styling (Bootstrap, Semantic UI, Material UI, Tailwind, etc.).
Avoid the > operator which can be too rigid, unless you need precision to disambiguate a tree where no other identifiers are available.
Avoid sibling selectors unless unavoidable. Siblings often have more tenuous relationships than parents and children.
Avoid nth-child and nth-of-type to the extent possible. Lists are often reordered or may have fewer or more elements than you expect.
When using anything related to text, generally trim whitespace, ignore case and special characters where appropriate and prefer substrings over exact equality. On the other hand, don't be too loose. Usually, text content and values are weak targets but sometimes necessary.
Avoid pointless steps in a selector, like body > div#container > p > .target which should just be #container .target or #container p .target. body says almost nothing, > is too rigid, div isn't necessary since we have an id (if it changes to a span our new selector will still work), and the p is generic--there are probably no .targets outside of ps anyway.
Avoid browser-generated selectors. These are usually the worst of both worlds: highly vague and rigid at the same time. The goal is to be the opposite: accurate and specific, yet as flexible as possible.
Feel free to break rules as appropriate.
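As an illustration of the text-matching rule of thumb above (trim whitespace, ignore case, prefer substrings), here is a small sketch. The helper names are made up for this example, and items are assumed to expose their text through a textContent property, as DOM elements do.

```javascript
// Collapse runs of whitespace, trim, and lower-case so that cosmetic
// differences in the page text do not break a match.
function normalizeText(text) {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// True when `needle` occurs anywhere in `haystack` after normalization.
// An empty needle never matches, to avoid being too loose.
function looselyContains(haystack, needle) {
  const wanted = normalizeText(needle);
  return wanted !== "" && normalizeText(haystack).includes(wanted);
}

// Return the first item whose text loosely contains `needle`, or undefined.
// `getText` extracts the text from an item.
function findByText(items, needle, getText = (item) => item.textContent || "") {
  return items.find((item) => looselyContains(getText(item), needle));
}
```

With Puppeteer, logic like this would have to live inside the callback passed to page.evaluate(), since that callback runs in the page rather than in the Node.js process. Scoping the candidate list with a simple selector first (for example, the buttons inside a container with an id) keeps the text match from being the only constraint.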

Setting (ARIA) role for HTML custom elements without explicit attribute?

I have a web app that displays and passes around user-editable semantic markup. For a variety of reasons, including security, the markup consists entirely of custom elements (plus the i, b, and u tags). For regular rendering, I simply have styles for all the tags and stick them straight in the DOM. This looks and works great.
Now I'm doing screen-reader testing, and things aren't great. I have a bunch of graphical symbols I want to add labels for. I've got divs that I want to make into landmarks. I've got custom input fields and buttons.
It would be easy enough to just add role= to all the different tag instances. But part of the reason for the custom markup is to eliminate all the redundant information from the strings that get passed around (note: they're also compressed). My <ns-math> tag should always have role="math", and adding 11 bytes to what might be tags around a single character is an actual problem when multiplied out over a whole article. I know this because the app started with a bunch of <span class="... type elements instead of custom.
For the fields and buttons, I've used a shadow DOM approach. This was necessary anyway to get focus/keyboard semantics correct without polluting the semantic markup with a bunch of redundant attributes, so it's easy to also slap the ARIA stuff on the shadow elements. Especially since the inputs are all leaf nodes. But most of my custom tags amount to fancy spans, and are mostly not leaf nodes, so I don't really want to shadow them all when they're working so well in the light DOM.
After a bunch of searching, it seems like the internals.role property from "4.13.7.4 Accessibility semantics" of the HTML standard is maybe what I want. I may well be using it incorrectly (I'm a novice at front-end), but I can't seem to get this to work in recent versions of Firefox or Chrome. I don't get an error, but it seems to have no effect on the accessibility tree. My elements are not form-associated, but my reading is that the ARIAMixin should be functional anyway. This is maybe a working draft? If this is supposed to work in current browsers, does anybody have a code snippet or example?
Is there some other straightforward way to achieve my goal of accessibility-annotating my custom elements without adding a bunch of explicit attributes to the element instances?
So you want the benefit of adding a role or an aria-attribute without actually adding those attributes? The concept of an "accessibility object model" (AOM), which would let you access and modify the accessibility tree directly, has been bandied about a bit, but it's still in the works. Here's an article from a couple years ago that talks about it. Nothing official. Just one person's thoughts.
Further research shows that, as of this time, the abstracted accessibility options I'm asking for are not yet implemented.
For the time being: eliminating a number of page-owned enclosing divs from the accessibility hierarchy via role="presentation" significantly improved my overall tree. With those out of the way, the majority of my custom tags seem to be simply semantically ignored. This is mostly fine as the majority of my content is plain text.
Since I already mark up the vast majority of even single-character symbols, I've simply added all my symbols to the markup generator. Since everything is already in custom tags, I then use a shadow DOM span with role="img" and a character-specific aria-label to present the symbolic character.
My solution is still incomplete. I wish that I could convey the full richness of the semantic content I have available.

What kind of performance optimization is done when assigning a new value to the innerHTML attribute

I have a DOM element (let's call it #mywriting) which contains a bigger HTML subtree (a long sequence of paragraph elements). I have to update the content of #mywriting regularly (but only small things will change, the majority of the content remains unchanged).
I wonder what is the smartest way to do this. I see two options:
In my application code, I find out which child elements of #mywriting have changed and update only those.
I just assign the new content to the innerHTML property of #mywriting.
Is it worth developing the logic of approach one to find the changed child nodes, or will the browser perform this kind of optimization when I apply approach two?
No, the browser doesn't do such optimisation. When you reassign innerHTML, it will throw away the old contents, parse the HTML, and place the new elements in the DOM.
Doing a diff to only replace (or rather, update) the parts that need an update can be worth a lot, and is done with great success in rendering libraries that employ a so-called virtual DOM.
However, they're doing that diff on an element data structure, not an HTML string. Parsing that to find out which elements changed is going to be horribly inefficient. Don't use HTML strings. (If you're already sold on them, you might as well just use innerHTML).
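To make the idea of diffing a data structure rather than an HTML string concrete, here is a deliberately simple sketch. It assumes the content is available as an array of plain-text paragraph strings and that the container's children are exactly the corresponding p elements; the function names are made up. The comparison is purely positional, so an insertion in the middle shows up as updates to every later paragraph. Virtual DOM libraries use more elaborate matching.

```javascript
// Compare two arrays of paragraph strings position by position and
// return the operations needed to turn the old list into the new one.
function diffParagraphs(oldParagraphs, newParagraphs) {
  const operations = [];
  const shared = Math.min(oldParagraphs.length, newParagraphs.length);
  for (let i = 0; i < shared; i++) {
    if (oldParagraphs[i] !== newParagraphs[i]) {
      operations.push({ type: "update", index: i, text: newParagraphs[i] });
    }
  }
  for (let i = shared; i < newParagraphs.length; i++) {
    operations.push({ type: "append", index: i, text: newParagraphs[i] });
  }
  // Remove from the end so that earlier indices stay valid.
  for (let i = oldParagraphs.length - 1; i >= shared; i--) {
    operations.push({ type: "remove", index: i });
  }
  return operations;
}

// Apply the operations to a container element; unchanged paragraphs
// are never touched.
function applyOperations(container, operations) {
  for (const op of operations) {
    if (op.type === "update") {
      container.children[op.index].textContent = op.text;
    } else if (op.type === "append") {
      const paragraph = container.ownerDocument.createElement("p");
      paragraph.textContent = op.text;
      container.appendChild(paragraph);
    } else {
      container.children[op.index].remove();
    }
  }
}
```

The application keeps the previous array around, computes diffParagraphs(previous, next) on each update, and passes the result to applyOperations together with the #mywriting element.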
Without considering the overhead of calculating which child elements have to be updated, option 1 seems to be much faster (at least in Chrome), according to this simple benchmark:
https://jsbench.github.io/#6d174b84a69b037c059b6a234bb5bcd0

What's a good way to figure out which code is causing runaway DOM node creation?

The Chrome Dev Tools have unearthed some problems similar to those posted here: more DOM nodes are being created than I think there should be, given my design choices.
What's a good way to figure out what area of code is causing runaway DOM node creation? The information is really useful but figuring out what to do with it seems much less straightforward than, for example, dealing with a CPU profile.
Try taking two heap snapshots (the Profiles panel), one with few DOM nodes and one with lots of them, then compare and see if many nodes are retained. If yes, you will be able to detect the primary retainers.
I would suggest creating code that walks the DOM and collects some statistics about what nodes are in the DOM (tag type, class name, id value, parent, number of children, textContent, etc...). If you know what is supposed to be in your page, you should be able to look at this data dump and determine what's in there that you aren't expecting. You could even run the code at page load time, then run it again after your page has been exercised a bit and compare the two.
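A rough sketch of such a statistics walk follows. The function names and the grouping key (tag name plus sorted class names) are choices made for this example; other properties from the list above could be added in the same way.

```javascript
// Build a short description of an element: lower-cased tag name
// followed by its class names in sorted order, e.g. "div.item.row".
function describeElement(element) {
  const tag = element.tagName.toLowerCase();
  // className is not a string on SVG elements, so guard against that.
  const className = typeof element.className === "string" ? element.className : "";
  const classes = className.trim().split(/\s+/).filter(Boolean).sort();
  return tag + classes.map((name) => "." + name).join("");
}

// Count all elements under `root` (including `root`), grouped by description.
function collectDomStats(root) {
  const counts = new Map();
  const pending = [root];
  while (pending.length > 0) {
    const element = pending.pop();
    const key = describeElement(element);
    counts.set(key, (counts.get(key) || 0) + 1);
    for (const child of element.children) {
      pending.push(child);
    }
  }
  return counts;
}

// List the descriptions whose count grew between two snapshots,
// largest growth first.
function compareDomStats(before, after) {
  const growth = [];
  for (const [key, count] of after) {
    const delta = count - (before.get(key) || 0);
    if (delta > 0) {
      growth.push({ key, delta });
    }
  }
  return growth.sort((a, b) => b.delta - a.delta);
}
```

In the page, one would take a snapshot with collectDomStats(document.body) at load time, exercise the page, take a second snapshot, and inspect compareDomStats(first, second) for unexpected growth.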

Diff between two HTML chunks: structural instead of lines/chars?

I'm looking for a JavaScript diff engine that will return the difference in the structure of two chunks of HTML. That is, instead of, "at this line, at such and such character, something happened", it'd be, "this element was inserted after this element", or "this element was removed", or "this text node was altered", etc.
Cursory research suggests this is hard.
The specific scenario is that I have a live preview of Markdown text editor. It works well with just text, but as soon as a user posts in, say, a YouTube <iframe> embed, then it renders/reloads on every keystroke, which is absurdly expensive. Large images are difficult, too, because they cause a nauseating jittering effect as they load from the cache (at least in WebKit).
What would be beautiful is a replacement for jQuery.html() that instead of just replacing the HTML contents actually compared the old with the new, and selectively updated/inserted/appended so that unchanged elements are left alone.
Deep clone (via node.cloneNode(true)) both nodes if they're currently in use (i.e. if any child nodes are referenced in your JavaScript).
Normalize both nodes via node.normalize().
Iterate over every child node of both nodes and compare with node.isEqualNode(other_node).
For every non-equal node, iterate deeper to see if you can find any equal child nodes.
To be honest, you're much better off using a text diff lib instead of making your own DOM-based diff lib.
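For illustration, the clone, normalize and isEqualNode steps listed above could be sketched as follows. The function names and the shape of the returned change records are invented for this example. Children are paired by position only, so an insertion in the middle is reported as a cascade of changes; that is one reason a real structural diff is hard.

```javascript
// Walk two node trees in parallel and record where they differ.
// `path` is the list of child indices leading from the root to a node.
function diffNodes(oldNode, newNode, path = [], changes = []) {
  if (oldNode.isEqualNode(newNode)) {
    return changes; // identical subtrees, nothing to report
  }
  const oldChildren = Array.from(oldNode.childNodes);
  const newChildren = Array.from(newNode.childNodes);
  const comparable =
    oldNode.nodeType === newNode.nodeType &&
    oldNode.nodeName === newNode.nodeName &&
    (oldChildren.length > 0 || newChildren.length > 0);
  if (!comparable) {
    // Different kind of node, or a differing leaf such as a text node.
    changes.push({ type: "replace", path });
    return changes;
  }
  const countBefore = changes.length;
  const shared = Math.min(oldChildren.length, newChildren.length);
  for (let i = 0; i < shared; i++) {
    diffNodes(oldChildren[i], newChildren[i], [...path, i], changes);
  }
  for (let i = shared; i < newChildren.length; i++) {
    changes.push({ type: "insert", path: [...path, i] });
  }
  for (let i = oldChildren.length - 1; i >= shared; i--) {
    changes.push({ type: "remove", path: [...path, i] });
  }
  if (changes.length === countBefore) {
    // All children are equal, so the nodes themselves must differ
    // (for elements, that means their attributes).
    changes.push({ type: "attributes", path });
  }
  return changes;
}

// Clone and normalize both trees first so the live DOM is left untouched.
function diffTrees(oldRoot, newRoot) {
  const oldCopy = oldRoot.cloneNode(true);
  const newCopy = newRoot.cloneNode(true);
  oldCopy.normalize();
  newCopy.normalize();
  return diffNodes(oldCopy, newCopy);
}
```

A preview could then apply only the reported changes to the live tree, leaving an unchanged iframe or image in place.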
