Web Scraping - What's a robust and extensible approach? [closed] - javascript

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have some limited experience with web scraping using tools like Beautiful Soup and Nokogiri.
My approach thus far when looking for information is to first inspect the HTML elements and CSS classes, then apply a selector. While this works, slight differences or changes among web sites render the code useless. Also, there have been situations where sites simply don't add ids or classes to their HTML elements, so I once had to resort to the hacky approach of selecting on the style property of the element.
How would one devise a scraper that works across multiple sites? I'm aware that the solution depends on the context, but is there a general good practice for doing it? I was actually asked this question in an interview once and I had no idea.
I have tried googling but much of what I found doesn't go past the basics, and I don't know where to look. Any help would be appreciated.

It's not clear from your question what exactly you are trying to accomplish. If you want the content of the page (like in an article), you should try goose, which should give you a leg up. You can also try looking for conventional web page signals such as meta tags.
Either way, you should remember that this is the World Wild Web, and HTML is a very forgiving language, which lets people design pages that are very hard for a machine to read. Even big sites sometimes break from conventions in their own proprietary ways, which forces exceptions into your code in order to read them. One site's logic may also conflict with conventional logic, or with that of another major site.
This means that your code would probably consist of a lot of special cases and exceptions.
My suggestion to you is to keep samples of pages of sites you want to scrape, and have a unit test which iterates over them and verifies the scraping results. This way, each time you find a new quirk, you can add it to your collection, and be certain that if the change you made broke some other site's scraping, you would know about it.
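As a rough illustration of that sample-driven test, here is a minimal sketch. Everything in it is an assumption made for the example: the extractTitle scraper, the fallback order (og:title meta tag, then title, then h1), and the inline samples, which in practice would be saved HTML files. Regular expressions are used only to keep it dependency-free; a real scraper would use a proper HTML parser.

```javascript
// Hypothetical extractor: try the most conventional signals first, then fall back.
// Regexes keep the sketch dependency-free; a real scraper would use an HTML parser.
function extractTitle(html) {
  const patterns = [
    /<meta[^>]+property=["']og:title["'][^>]+content=["']([^"']*)["']/i,
    /<title[^>]*>([^<]*)<\/title>/i,
    /<h1[^>]*>([^<]*)<\/h1>/i,
  ];
  for (const pattern of patterns) {
    const match = html.match(pattern);
    if (match && match[1].trim() !== "") {
      return match[1].trim();
    }
  }
  return null;
}

// Saved page samples with the result each one is expected to produce.
// In practice these would be HTML files on disk, one per site quirk.
const fixtures = [
  {
    name: "og-meta",
    html: '<head><meta property="og:title" content="Hello OG"><title>Ignored</title></head>',
    expected: "Hello OG",
  },
  { name: "title-only", html: "<head><title> Plain Title </title></head>", expected: "Plain Title" },
  { name: "h1-only", html: "<body><h1>Heading Only</h1></body>", expected: "Heading Only" },
  { name: "nothing", html: "<body><p>No title here</p></body>", expected: null },
];

// Runs the scraper over every sample and returns the ones that regressed.
function runFixtures(scrape, samples) {
  return samples
    .map((s) => ({ name: s.name, expected: s.expected, actual: scrape(s.html) }))
    .filter((r) => r.actual !== r.expected);
}

const failures = runFixtures(extractTitle, fixtures);
console.log(failures.length === 0 ? "all samples pass" : failures);
```

Each new quirk becomes one more entry in the samples list, and any change that breaks an older site shows up as a failure.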

Related

Can it be a good idea to build a website out of one line of HTML and fully Javascript? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
So what I did was create two files, index.html and script.js, and all I put in the HTML file was a script tag:
<script src="script.js"></script> and in the script.js file I just typed document.write("<html><h...the whole html in a line></html>"). OK, it may not seem like the best idea, but it seemed faster and the styling worked fine, so is it actually a good idea? And mainly, is it faster?
No, I can't think of any way that would be faster for a static page. I'm not sure how you were expecting it to work, but instead of just parsing the HTML, the browser has to run your script, inject the content into the page, and then parse the resulting HTML anyway.
Keep the JavaScript for when you need dynamic content.
My main concern with constructing a page in the manner you've described is that it may not get indexed by search engines as well as a page that has static HTML. The search engine's web crawler would need to be sophisticated enough to run your JavaScript before scraping the page's text content. I'm not sure they all do that currently, but they should in my opinion.
So, I wouldn't do this on a page you want to be found via the search engines.
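To make the indexing concern concrete, here is a small sketch of what a crawler that does not execute scripts would be able to read. The textWithoutScripts helper and both sample pages are made up for the illustration, and the tag stripping is deliberately naive.

```javascript
// Approximates what a crawler that does not execute JavaScript can read:
// drop <script> elements entirely, strip the remaining tags, collapse whitespace.
function textWithoutScripts(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

const staticPage =
  "<html><body><h1>Welcome</h1><p>Opening hours: 9 to 5</p></body></html>";
const scriptOnlyPage =
  '<html><body><script src="script.js"></script></body></html>';

console.log(JSON.stringify(textWithoutScripts(staticPage)));     // "Welcome Opening hours: 9 to 5"
console.log(JSON.stringify(textWithoutScripts(scriptOnlyPage))); // ""
```

The script-only page gives such a crawler nothing to index unless it also downloads and runs script.js.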
There are top-notch web applications built entirely out of JavaScript, using frontend frameworks like React.js, Angular.js, Vue.js, etc.
So it isn't actually a bad idea building your site fully out of JavaScript!
I don't know if you can compare HTML and Javascript frameworks which can be used to build websites (React, Angular, etc.), because they work very differently. Remember, HTML is a markup language, while things such as React and Angular are frameworks written in the JavaScript language.
So to answer your question, is it a good idea to build a website using only JavaScript? - The simple answer is it depends. Using React to make applications or even static websites can save you quite a bit of time, though using plain JavaScript really wouldn't have too many benefits as far as I can see. But if you just want to play around with whatever you're doing, then I'd say sure, go ahead!

Similar and simplified examples (newbie questions) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm currently studying web development. I don't know jQuery yet, but I have a little (basic) knowledge of JavaScript, HTML and CSS.
I've been looking at some examples on GitHub to improve my skills, and I've found this project:
https://github.com/stewilondanga/editables
I understand the theory perfectly, but I do not know how to put it into practice. I would like any similar examples (simplified alternatives), and I would like to know how to convert the exported code generated by JavaScript into an HTML5 table.
Any example would be appreciated! Thanks for your attention!
First of all, jQuery does not generate code. It's a library: you load it into a web page, and then you can use it from within JavaScript code in that page.
I suggest you start by looking at the source of https://stewilondanga.github.io/editables/, if an editable table is what you need. There are more general frameworks to do this, e.g. Aloha.
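If it helps to see the bare idea without any library, here is a minimal sketch that turns plain data into table markup whose cells the browser lets the user edit, via the standard contenteditable attribute. It does not reproduce the linked project; renderEditableTable, escapeHtml and the sample data are hypothetical names invented for this example.

```javascript
// Escapes text so that data values cannot inject markup into the table.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds table markup from a list of column names and a list of row objects.
// Every data cell is marked contenteditable so the browser allows in-place edits.
function renderEditableTable(columns, rows) {
  const head = columns.map((c) => `<th>${escapeHtml(c)}</th>`).join("");
  const body = rows
    .map((row) => {
      const cells = columns
        .map((c) => `<td contenteditable="true">${escapeHtml(row[c] ?? "")}</td>`)
        .join("");
      return `<tr>${cells}</tr>`;
    })
    .join("");
  return `<table><thead><tr>${head}</tr></thead><tbody>${body}</tbody></table>`;
}

const tableHtml = renderEditableTable(
  ["name", "age"],
  [{ name: "Ada", age: 36 }, { name: "<Bob>" }]
);
console.log(tableHtml);

// In a page, the markup could be placed with something like:
// document.getElementById("target").innerHTML = tableHtml;
```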
To try it yourself, I'd suggest you bite the bullet and equip yourself with some kind of web server, be it on a server somewhere or on your local machine, so you can easily try out things like this, copy the sources, alter the code, etc., and quickly hit reload in your browser.
While it may seem easier to run a local server and point your browser at http://localhost/something, IMHO it also takes more tinkering to get browsers to embrace that fully. You don't need the extra grief while already learning all those new concepts. If you want to tackle this seriously, consider getting a hosting service or a small VPS somewhere. If you don't know how to do that, get help for that first, but get it out of the way. It'll save you much grief.

Designing a web application with JavaScript [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I have a question about the design of a web application with JavaScript: Should a web application be designed to work without JavaScript, with JavaScript added later for users who have it? Or should I design a web application with JavaScript in mind and then add fallback functionality for users who do not have JavaScript?
I hope this question makes sense. Let me know if you need me to clarify something.
Thanks.
The terms that describe what you are looking for are "Progressive Enhancement" and "Graceful Degradation".
Here is a good article that describes, in more detail, what you already have in your question:
A List Apart: Understanding Progressive Enhancement
An article that could help you on your decision:
Dev.Opera: Graceful degradation versus progressive enhancement (The named reasons are still valid, despite the fact that the article is marked as outdated)
I favor progressive enhancement in most cases, since it is more accessible when it comes to different output devices, software and the capabilities of the user using that website.
Answers like "there are so few people with JavaScript disabled" are just one side of the coin. Not relying on JS could also improve your site experience for non-graphical clients like search engine robots (how should they load AJAX content when it is only accessible via JS?) or screen reader software. In fact, there are many more good reasons not to rely on JS.
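A minimal sketch of progressive enhancement, under assumed names: the baseline is an ordinary form that works through a full page load, and the script only takes over when the capability it needs is present. The enhanceSearchForm function, the element ids and the /search endpoint are all hypothetical.

```javascript
// Baseline markup (works with scripting disabled, via a normal page load):
//   <form id="search" action="/search" method="get"><input name="q"><button>Go</button></form>
//   <div id="results"></div>

// Upgrades the form to fetch results in place, but only when the capability exists.
// Returns true if the form was enhanced, false if the baseline was left untouched.
function enhanceSearchForm(form, fetchImpl, showResults) {
  if (!form || typeof fetchImpl !== "function") {
    return false; // no support for this feature: the plain form still works
  }
  form.addEventListener("submit", (event) => {
    event.preventDefault(); // take over from the full page load
    const url = form.action + "?q=" + encodeURIComponent(form.elements.q.value);
    fetchImpl(url)
      .then((response) => response.text())
      .then(showResults)
      .catch(() => form.submit()); // on failure, fall back to the baseline behaviour
  });
  return true;
}

// Browser usage (assumed element ids):
// enhanceSearchForm(
//   document.getElementById("search"),
//   window.fetch ? window.fetch.bind(window) : undefined,
//   (html) => { document.getElementById("results").innerHTML = html; }
// );
```

Graceful degradation would start from the scripted version and bolt the plain form on afterwards; here the plain form comes first and the script is optional.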
In this day and age there are so few people with JavaScript disabled that there is no significant benefit in creating a static version. Try to imagine who your visitors are and whether they would even know how to disable it.
I suggest you design the web application with JavaScript in mind and then add fallback functionality for users who do not have JavaScript.
Nowadays everything runs on JS. You should create some kind of services/API on the server side and a separate project for the UI; this is the trend being followed these days.
The UI project can be based on any JS framework, or it can even be a simple MVC/.NET project. This approach decouples things, and afterwards you can create two UI projects: one for JS users and one for users who do not have JS.
It seems like a bit of work, but it will pay off in the long run.

Building a JavaScript grid from scratch [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I am curious to know what it takes to build a JavaScript grid from scratch. The grid should have features like jqGrid http://www.trirand.com/blog/jqgrid/jqgrid.html.
Can anyone please give me inputs?
Thanks
What it takes to build something similar to jqGrid:
A huge, HUGE amount of time.
If something similar to what you want exists already, why would you want to spend lots and lots of time re-inventing the wheel? Anyhow, if you have nothing better to do, want to learn from it or if you are just curious, here is a list of skills that are needed to create a similar system:
1. HTML object manipulation.
2. Style manipulation.
3. Tons of different event handlers.
4. AJAX to grab (pages of) documents to display. Probably some server-side stuff too...
5. Creating a nice layout system which works in every browser.
6. Creating handlers to read and manage the different file types to support (XML, JSON, etc.).
7. Creating HTML forms, reading them out with JS, and then using AJAX to save an XML, JSON, etc. document back to the server.
8. An algorithm to allow searching in the data you display.
9. Keyboard manipulation and the toggling off of standard key events.
10. Tons and TONS of debugging to make sure it looks nice in all browsers.
Of course, this is only the tip of the iceberg, since I don't really know the jqGrid program myself. I created this list by looking at some of the examples and reading the Features page.
Again, I would not recommend rebuilding such a big system from scratch, but the choice is of course yours ;).
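To give a feel for just one small corner of that list (searching the displayed data and splitting it into pages), here is a minimal sketch of the data side of a grid with no DOM work at all. It is not how jqGrid does it; queryRows and its option names are invented for the example.

```javascript
// Filters, sorts and pages an array of row objects without touching the input.
function queryRows(rows, options) {
  const { search = "", sortBy = null, descending = false, page = 1, pageSize = 10 } =
    options || {};
  const needle = search.toLowerCase();

  // Searching: keep rows where any cell contains the search text.
  const matched = needle === ""
    ? rows.slice()
    : rows.filter((row) =>
        Object.values(row).some((v) => String(v).toLowerCase().includes(needle)));

  // Sorting: compare numbers numerically, everything else as strings.
  if (sortBy !== null) {
    matched.sort((a, b) => {
      const x = a[sortBy];
      const y = b[sortBy];
      const order = typeof x === "number" && typeof y === "number"
        ? x - y
        : String(x).localeCompare(String(y));
      return descending ? -order : order;
    });
  }

  // Paging: return only the slice the grid would draw.
  const start = (page - 1) * pageSize;
  return {
    total: matched.length,
    pageCount: Math.max(1, Math.ceil(matched.length / pageSize)),
    rows: matched.slice(start, start + pageSize),
  };
}

const data = [
  { id: 3, name: "Carol", city: "Oslo" },
  { id: 1, name: "alice", city: "Paris" },
  { id: 2, name: "Bob", city: "Oslo" },
];
const result = queryRows(data, { search: "oslo", sortBy: "id", page: 1, pageSize: 1 });
console.log(result.total, result.pageCount, result.rows);
```

Rendering, editing, keyboard handling and cross-browser layout would all sit on top of something like this, which is where most of the time goes.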

Is there a guide about how to create beautiful HTML that is prepped to be used with javascript? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
This might seem like an odd question, but I find that javascript is either easy or hard, depending on how you've coded the HTML. Is there a book or website that goes into detail about successful patterns and guidelines for coding HTML, so that it's very workable with jQuery, css and complex ajax applications? Like solid rules to live by.
Again, this seems like a weird question maybe, but I don't know a better way to ask it. I just find myself always having to change the markup as new things come up, like switching from a hidden input element to a data attribute, or adding more ids or taking ids away. I guess I arrive at the right way to do it eventually, but I'm curious whether someone has bothered to analyze this and come up with some great guidelines, standards and patterns so that the resulting HTML is right the first time.
Thanks
The first thing to do if you want clean HTML that is easy to work with is to make sure that your code is valid against an official DTD, HTML4 or XHTML.
Then use id and class properly (id only for unique sections and class for repeatable ones) and name them according to their context so they are easy to reach.
From my experience, I would actually suggest that, when it comes to large projects and professional JavaScript coding, the goal actually becomes to decouple the JavaScript code from whatever HTML it lives in.
As mentioned already, as long as you are using well-formed HTML (DTD compliant), a library like jQuery shouldn't have any trouble operating on it. However, as a best practice, I would recommend striving to isolate and encapsulate dependencies, whether they arise from the HTML structure or just from other chunks of JavaScript code.
The best way is to develop the HTML and JavaScript together. That way you can adjust the document structure to whatever you need.
This article seems to answer my question:
http://www.viget.com/inspire/extending-paul-irishs-comprehensive-dom-ready-execution/
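For readers who don't follow the link, here is a rough sketch of markup-driven dispatch in the spirit of that article, as I understand it: the page declares what it is through data attributes on the body, and one small dispatcher decides which setup code to run. The SITE namespace, the attribute names and the exec/dispatch helpers are assumptions made for the illustration, not code taken from the article.

```javascript
// Hypothetical application namespace: one object per page type, one function per action.
const calls = [];
const SITE = {
  common: { init: () => calls.push("common.init") },
  users: {
    init: () => calls.push("users.init"),
    show: () => calls.push("users.show"),
  },
};

// Calls namespace[controller][action]() if it exists; action defaults to "init".
function exec(namespace, controller, action = "init") {
  const group = controller ? namespace[controller] : undefined;
  if (group && typeof group[action] === "function") {
    group[action]();
    return true;
  }
  return false;
}

// Runs site-wide setup, then page-type setup, then page-specific setup.
function dispatch(namespace, controller, action) {
  exec(namespace, "common");
  exec(namespace, controller);
  if (action && action !== "init") {
    exec(namespace, controller, action);
  }
}

// In a browser, with <body data-controller="users" data-action="show">:
// document.addEventListener("DOMContentLoaded", () => {
//   const { controller, action } = document.body.dataset;
//   dispatch(SITE, controller, action);
// });

dispatch(SITE, "users", "show");
console.log(calls); // ["common.init", "users.init", "users.show"]
```

The markup only has to state which page it is, so the scripts no longer depend on incidental ids scattered through the document.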
