What is a reliable way to clip the *content* of a web page? - javascript

I wonder how it is possible to (more or less) reliably clip the content from an arbitrary web site (using Ruby or JavaScript, it doesn't really matter).
Much like Evernote and Flipboard do.
What is the best way to determine where the actual content is within a page?
The purpose: given a URL - retrieve the actual content of that page and ignore all the layout and other unrelated information.
For example:
given http://ninemsn.com/ => the HTML of the main news topic that is in the middle part of the content.
given the http://news.cnet.com/8301-1035_3-20104048-94/a-beginners-guide-to-telecom-jargon-part-7 => the HTML of the main article.
Just use Evernote's "clip full page" option to see exactly what I mean.
Thanks.

My initial thoughts would be to DOM parse the page, then traverse the DOM tree to the content of a specific div and show that (via XPath, etc). For pages without clearly-defined sections it's going to be difficult regardless of which method you use. The AutoPager plugin for Firefox and Chrome implements XPath parsing behaviour. Get the latest version and open up the .xpi to see how he does it. It's a JavaScript implementation.
Pick the div by letting someone enter, per URL/site scheme, what the id or class of the content div is. For your ninemsn example, the div containing the article's title, share buttons, the author's image, and the post content is
<div class="post">
and the actual body of the text is
<div class="postBody txtWrap" section="txt">
So someone would enter that you need to parse the first h1 from <div class="post"> and that's the article title, and then get all the text from <div class="postBody"> and make that the article content (you might need to parse the class in such a way that it can match both postBody and txtWrap).
Another example (for funsies): Stack Overflow. A question's title is contained in
<div id="question-header">
A question's text is trickier, because it's in a div with the same class as an answer's text, and no id. You need to match <div id="question"> and then traverse down to
<div class="post-text">
Similarly for answers, each <div id="answer-[UINTEGER]"> contains a <div class="post-text"> with its respective text.
In both situations, you can traverse those top-level question and answer- divs for <div class="user-details"> to fetch usernames, reputation and badge counts, etc.
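A minimal sketch of the per-site rule idea described above, in JavaScript. The `SITE_RULES` table and the helper names are illustrative assumptions; the selectors are the ones quoted in this answer and may no longer match the live pages.

```javascript
// Hypothetical per-site rules: hostname -> CSS selectors for title and body.
// Selectors are taken from the examples above and may be out of date.
const SITE_RULES = new Map([
  ["ninemsn.com", { title: "div.post h1", body: "div.postBody" }],
  ["stackoverflow.com", { title: "#question-header", body: "#question .post-text" }],
]);

// Look up the rule for a URL, ignoring a leading "www.".
function findRule(url) {
  const host = new URL(url).hostname.replace(/^www\./, "");
  return SITE_RULES.get(host) ?? null;
}

// Apply a rule to any object exposing querySelector (a Document in the
// browser, or a DOM built by a server-side parser).
function clipContent(doc, url) {
  const rule = findRule(url);
  if (!rule) return null; // nobody has entered a rule for this site yet
  const titleEl = doc.querySelector(rule.title);
  const bodyEl = doc.querySelector(rule.body);
  return {
    title: titleEl ? titleEl.textContent.trim() : null,
    html: bodyEl ? bodyEl.innerHTML : null,
  };
}
```

Because `div.postBody` is a class selector, it matches the element whether or not `txtWrap` is also present in its class list.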

Related

Copy HTML Element with AHREF And Allowing Pasting to Word

I have a modal dialog that presents the user with a ticket number formatted as a clickable link. In the underlying HTML, the ticket number is wrapped in an A tag pointing to https:/server/form/id?ticket number. If the user highlights the ticket number, right-clicks and chooses Copy (not Copy link address), and then pastes into MS Word, the ticket number is pasted with the underlying URL embedded as a hyperlink. For some users, highlighting the value without picking up surrounding text can be challenging. I want to include a button that runs JavaScript to perform the same action for them. I have been able to write a script that gets the URL from the A tag and puts it on the user's clipboard, but pasting that into MS Word pastes the entire URL, which makes sense. Is there a way to copy to the clipboard what the user copies manually?
Notes
Firstly, when I answer questions, I first try to give a more generalized answer that is meant to be adapted. StackOverflow isn't a coding service and as such, the answers shouldn't answer 1 single use-case but rather, as many as possible so when other users come across a question similar to theirs, the answers can still be useful.
Relevant Answer
That being said, How do I copy to the clipboard in JavaScript seems to cover what OP is asking for. I am posting this answer to explain how/why it answers OP's question.
There are 3 methods given in the accepted answer on that post. The first method (Async Clipboard API) is going to typically be what people use. The second method is deprecated. The final method (Overriding the copy event) isn't covered in detail but applies to OP's question.
Applying the Answer
Using the first method, a button (or other actionable element) could be added to the page to copy data to the clipboard. Based on OP's question, the only other thing needed here is getting the user's selection. The MDN Web Doc for the user selection on a page covers this and a lot more, but the basic thing needed here is window.getSelection().toString(), which will get the currently selected text (text only, no element information).
Method 1
const _CopyText = () => {
  navigator.clipboard.writeText(window.getSelection().toString());
  console.log(window.getSelection().toString());
}
<div>
<p>This is random text just to fill space. Its <b>only purpose</b> is to create the appearance of content on a page. <span style="color: #00F;"><b><i>This is random text</i></b></span> just to fill space. Its only purpose is to create the appearance of content on a page.</p>
<p>This is random text just to fill space. Its only purpose is to create the appearance of content on a page.</p>
</div>
<input type="button" value="Copy Text" onclick="_CopyText()">
<br>
<textarea style="width: 40em; height: 5em;"></textarea>
Method 2
This method actually involves manipulating what is currently being copied by the user. For this example the focus is just to convert the copied information into text only (removing any additional information such as URLs or formatting). With this, we are using the 3rd method discussed in the linked post regarding Overriding the copy event.
document.addEventListener("copy", e => {
  e.clipboardData.setData("text", window.getSelection().toString());
  e.preventDefault();
});
<div>
<p>This is random text just to fill space. Its <b>only purpose</b> is to create the appearance of content on a page. <span style="color: #00F;"><b><i>This is random text</i></b></span> just to fill space. Its only purpose is to create the appearance of content on a page.</p>
<p>This is random text just to fill space. Its only purpose is to create the appearance of content on a page.</p>
</div>
<input type="button" value="Copy Text" onclick="_CopyText()">
<br>
<textarea style="width: 40em; height: 5em;"></textarea>
Additional Notes
Sites like CodePen and the Stack Overflow snippet runner have limited or no access to certain features (such as the Clipboard API). If a snippet 'doesn't work' there, that is not necessarily because the code is wrong. Please use the code and the references to learn, then adapt the code to your specific needs and use cases.
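For the original goal (pasting a clickable ticket number into Word), one possible approach is to put both an HTML and a plain-text representation on the clipboard with `navigator.clipboard.write` and `ClipboardItem`. This is a sketch and is not taken from the linked answer: the function names and the `#ticketLink` selector are assumptions, it needs a secure context and a user gesture, and how Word interprets the pasted HTML has not been verified here.

```javascript
// Escape text for safe use inside HTML attributes and element content.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the two clipboard flavors for a link: rich HTML and a text fallback.
function buildLinkPayload(url, label) {
  return {
    html: `<a href="${escapeHtml(url)}">${escapeHtml(label)}</a>`,
    text: label,
  };
}

// Browser only: call this from a click handler.
async function copyLink(anchor) {
  const { html, text } = buildLinkPayload(anchor.href, anchor.textContent);
  const item = new ClipboardItem({
    "text/html": new Blob([html], { type: "text/html" }),
    "text/plain": new Blob([text], { type: "text/plain" }),
  });
  await navigator.clipboard.write([item]);
}

// Usage (hypothetical element ids):
// button.addEventListener("click", () =>
//   copyLink(document.querySelector("#ticketLink")));
```

Applications that understand HTML on the clipboard can use the `text/html` flavor, while plain-text targets fall back to the label alone.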

Is it possible for Javascript to generate a DOM html that is unextractable by Cheerio?

I am trying to extract the price from this webpage: https://www.allbirds.com/products/mens-wool-runner-up-mizzles-natural-grey?size=13
I narrowed it down to these divs:
<div class="jsx-3947815802 Container">
<div class="jsx-526902087 Grid">
<div class="jsx-2943457050 Grid__cell Grid__cell--small-12 Grid__cell--medium-7 Grid__cell--large-up-8">...
The jsx-{random_number} class names look suspicious to me; they seem to be generated on the fly. The price I need is inside divs like these. However, these divs don't exist in the page source or in the cheerio object I am using at runtime. They just disappear.
How common is this technique? It seems like a pretty good way to stop web scrapers. How do I get around it?
If those classes are random, it might be annoying, but it's not a deal-breaker, because the other classes look to be static.
For example, the element that includes the price looks something like:
<p class="jsx-3188494938 Paragraph PdpMasterProductDetails__paragraph">$135</p>
The PdpMasterProductDetails__paragraph does not change. So, you can retrieve the text by using that as a selector:
$('.PdpMasterProductDetails__paragraph').text()
You can also retrieve the price from a meta tag:
<meta property="og:price:amount" content="135">
which can be selected via the selector string:
meta[property="og:price:amount"]
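Putting the two selectors together, here is a small sketch that prefers the meta tag and falls back to the static class. `extractPrice` is a hypothetical helper; `$` is assumed to be the function returned by `cheerio.load(html)`, and either selector only matches if the element is present in the HTML that was actually downloaded.

```javascript
// `$` is expected to be the result of cheerio.load(html).
function extractPrice($) {
  // Prefer the machine-readable meta tag.
  const meta = $('meta[property="og:price:amount"]').attr("content");
  if (meta) return Number(meta);

  // Fall back to the visible paragraph, e.g. "$135".
  const text = $(".PdpMasterProductDetails__paragraph").first().text();
  const match = text.match(/\d[\d,]*(?:\.\d+)?/);
  return match ? Number(match[0].replace(/,/g, "")) : null;
}
```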
How common is this technique?
Very.
Building websites as Single Page Applications with tools like React is very common.
It seems like a pretty good way to stop web scrapers.
It isn't.
How do I get around it?
Hit the web service the React code fetches the raw data from directly. It's easily discoverable via the Network tab in the browser's developer tools.

HTML Make code run independently from the rest of the page...iframe?

This is a little hard to explain. I'm creating a webpage that shows how other webpages will render in the browser. Here's a simplified version of the problem I'm having...
<div id="test">This is an example of page 1</div>
<div id="test">This is an example of page 2</div>
As you can see, both divs have the same ID. I can't change the ID in my situation and it's causing problems. I'm having various other CSS and javascript problems also. Each section of code is conflicting with the other. So, I was looking for a way to have a section render independently of everything else. One way to do it would be to create an iframe for each section of code. But that would require me to create a separate webpage for each section, right? Or, is there a way for an iframe to work just by entering code into it, rather than a URL.
An id must be unique within an HTML document, so each div needs a different id; otherwise you will run into problems. If multiple elements need to share a name, use classes instead: <div class="test" id="unique-id"></div> and <div class="test" id="another-id"></div>.
To answer your question with regards to iframes, yes, you need a separate page for each iframe. It is not possible to write code within the iframe tags to execute separately. See the iframe spec.
Edit: After reading the iframe spec myself, it appears you can use the srcdoc attribute to overwrite what is in the src attribute, but it looks like this isn't entirely accepted across browsers. MDN has more information about the attribute.
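As a rough sketch of the `srcdoc` approach mentioned in the edit: each snippet gets its own iframe, so duplicate ids, CSS, and scripts stay separate. `renderIsolated` and its parameters are hypothetical names; the `doc` parameter exists only so the function can be exercised outside a browser.

```javascript
// Render an HTML string in its own iframe so it cannot clash with the
// host page or with other snippets. Returns the created iframe.
function renderIsolated(container, html, doc = document) {
  const frame = doc.createElement("iframe");
  // srcdoc takes the markup directly; no separate page or URL is needed.
  frame.srcdoc = html;
  container.appendChild(frame);
  return frame;
}

// Usage in a browser:
// const host = document.body;
// renderIsolated(host, '<div id="test">This is an example of page 1</div>');
// renderIsolated(host, '<div id="test">This is an example of page 2</div>');
```

Setting the `srcdoc` property from script avoids having to escape quotes and ampersands, which would be needed if the markup were written into the attribute by hand.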

Showing text association of an image (an alt) on another HTML page

This is probably a simple lookup, but since I don't know how to word my question as a Google search, I brought it here. I have a number of divs, each containing an image. I gave each image an alt for word association. I might need to use title instead, but for now I am just going to use alt for the purpose of my question. I want the user to click each image and have all of them link to the same HTML page (kind of like an under construction page). To avoid duplication, I only want one of these 'under construction' pages. However, I want it slightly personalized to show that the image they clicked is the one noted as under construction. For example:
<div class="view view-first">
<img src = "img/storage.jpg" alt="Storage Corp">
<div class="mask">
<h2>Storage Rental Space</h2>
<p>Development, Server</p>
Read More"
</div>
</div>
So I would want them to click on Read More and go to server.html, where it says something like 'Sorry Storage Corp is under construction'.
Easy enough for one image, but let's say I have 20, and I want one server.html that spits out a different 'Sorry xxxx is under construction' for each. Do I create an empty div in server.html and fill it with the alt text of the image they clicked? If so, what is the proper syntax? Maybe I have been too knee deep in JS for the last few weeks, but I can't think of a proper way to do this without declaring a global variable to hold the string and setting it on an image click.
Thanks for any tips!
You’re likely to need either a server-side language or JavaScript to make the result page dynamic. The simplest way to do this would be to include the text you want in the href as part of the URI. For example "server.html?text=Sorry+wrong+page". In server.html, you could then grab this variable from the GET string and put it into the page.
All server-side languages give you access to GET variables. In JavaScript, it’s a little more complicated. See this question for ways to do it.
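A short sketch of the JavaScript side, using the standard `URLSearchParams` API. The parameter name `text` follows the example URL above; the function names and the `#notice` element id are made up for illustration.

```javascript
// Build the link target for one image, e.g. from its alt text.
function buildNoticeUrl(name) {
  const params = new URLSearchParams({
    text: `Sorry ${name} is under construction`,
  });
  return `server.html?${params}`;
}

// In server.html: read the message back out of the query string.
function readNotice(search) {
  return new URLSearchParams(search).get("text") || "";
}

// Usage in server.html (browser only, hypothetical element id):
// document.querySelector("#notice").textContent =
//   readNotice(window.location.search);
```

Assigning the message through `textContent` rather than `innerHTML` keeps anything typed into the URL from being interpreted as markup.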

Script to search for a value and create a variable dynamically?

Sorry if the question in the title is a bit vague, here's what I'm trying to accomplish:
Is there a script out there that can search a page (or page source) for a particular determined value (for example, a product ID "1234") and insert it dynamically or on-the-fly into a variable which can be used anywhere on the page if called?
For example:
I'm working on a site that uses a shopping cart/feed platform that is closed source, so I can't grab variables I need (such as the product ID, product price, and order ID), as they are "locked down" (so to speak). And I need to be able to grab them and dynamically insert them into click trackers/pixel URL strings.
I'm not sure if this is possible or if this is a much larger task at hand.
A webpage might have many ways of showing a value, and it takes some human interpretation to determine what values are important. Example:
<div id="couponDisplay" class="inset hidden">Enter your coupon code: <input name="coupon"/></div>
<div id="cartRegion">
<div class="cart">
<div class="lolbx_quantityControl" data-initialValue="1" onKey="lolbx_notify()">
</div>
</div>
</div>
It may be that you're looking at a shopping cart page, and though the whole HTML is hundreds of lines long, the most important part of that page for you is the data-initialValue="1" part. That's less obvious to a computer. The first step for you might be to work out the path the computer should follow to reach the value you want, then see if you can replicate that in a script.
I'm not sure I understand what sort of system you're connecting to, though. I will say that using outside web services through "hacks" like this without their permission may violate their terms of use (e.g., grabbing Google Maps data to make your own map control with no Google branding).
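To make the "find the path, then replicate it" step concrete, here is a sketch against the example markup above. The selector comes from that markup; `readInitialValue`, `buildPixelUrl`, and the tracker URL are hypothetical.

```javascript
// Read the value via the path found by inspecting the page:
// #cartRegion > .cart > .lolbx_quantityControl[data-initialValue]
function readInitialValue(doc) {
  const el = doc.querySelector("#cartRegion .cart .lolbx_quantityControl");
  // HTML attribute names are case-insensitive, so getAttribute is used
  // instead of guessing the dataset key.
  return el ? el.getAttribute("data-initialValue") : null;
}

// Append collected values to a tracking pixel URL as query parameters.
function buildPixelUrl(base, values) {
  const url = new URL(base);
  for (const [key, value] of Object.entries(values)) {
    if (value != null) url.searchParams.set(key, value);
  }
  return url.toString();
}

// Usage in a browser (hypothetical tracker URL):
// new Image().src = buildPixelUrl("https://tracker.example/pixel.gif",
//   { qty: readInitialValue(document) });
```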
