Parsing Unstructured Data


I am writing a bookmarklet that will pull information from a site and send it to a user's account to be saved for later use. This generally involves the problem of taking unstructured information and making it structured. Take, for example, a hobbyist who wants to save a project for later. There are a number of parts that they need to obtain and instructions to follow. On one blog, the writer could refer to the instructions as directions, a recipe, or any number of synonyms. One person may list the information with <li> tags to order the steps, while another may not.
What are general strategies for turning unstructured data into structured information? Are there other strategies for determining which content is relevant (e.g. the way Instapaper or Readability do)?

Hmm... maybe you could use this in conjunction with Google? Taking a look at the head and meta tags is a good idea too. You could also build a listing of how often each word is used. Heck, you could even have a popup that asks the user to enter data about the page.
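To make the meta-tag and word-frequency ideas a bit more concrete, here is a minimal sketch. The function names (`wordFrequencies`, `readMetaTags`) and the tiny stop-word list are made up for illustration, and the word pattern only handles unaccented English text, so treat it as a starting point rather than a finished extractor.

```javascript
// Hypothetical helper: count how often each word appears in a block of text.
// The default stop-word list is a tiny illustrative sample, not a real one.
function wordFrequencies(
  text,
  stopWords = new Set(["the", "a", "an", "and", "of", "to", "in"])
) {
  const counts = new Map();
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const word of words) {
    if (stopWords.has(word)) continue;
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Return [word, count] pairs, most frequent first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical helper: collect name/content pairs from <meta> tags.
// `doc` is assumed to be a browser Document (e.g. `document` in a bookmarklet).
function readMetaTags(doc) {
  const meta = {};
  for (const tag of doc.querySelectorAll("meta[name][content]")) {
    meta[tag.getAttribute("name").toLowerCase()] = tag.getAttribute("content");
  }
  return meta;
}
```

In a bookmarklet you might call `readMetaTags(document)` and `wordFrequencies(document.body.textContent)` and send both results along with the URL.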

It doesn't seem like there is a good computer science answer to this question, so I have decided to change the approach and have users organize the data as they see fit.

Related

How to use a PostgreSQL or SQLite (preferred) database in a website?

I searched a lot on the web but couldn't find an answer.
I'm creating a simple tech wiki/forum website and want to add a subscription service that would notify a group of people by e-mail whenever updates are made.
Simplified: the user would enter his/her e-mail address (no name, only the e-mail) in a JS/HTML form, and it needs to be stored in a (preferably) SQLite .db database, from which the addresses would be retrieved (manually) so that e-mails can be sent to them.
Note: The database is currently empty with no tables in it.

This question is probably too broad and not appropriate for Stack Overflow, since an answer would entail the design of an entire system covering a number of different technologies, including databases, website construction, and e-mail (which itself is a special form of hell, IMO, for the uninitiated).
I would recommend you first lay out in detail exactly what functionality your website will handle (can people make accounts? can arbitrary users make posts/pages?). From there, you will need to divide those features among the applicable technologies/concerns.
If you haven't already, I would choose a set of technologies and look for tutorials covering the areas of concern. Blog posts covering common features of web applications are a dime a dozen. If you don't know what to pick, I would just go with what the blog post uses. Admittedly, this is a terrible suggestion, but I am assuming this is just a side project for you to learn from and not for a paying client, so that approach is fine to a certain degree (if this is for a paying client, well, you're on your own).

Adding nodes to XML with a form

I've created a list using XML and have embedded an XSL stylesheet in it. Because this is a list of movies that constantly grows, I want to know: is there a way to create a form that will add child and grandchild nodes to the list?
I'm thinking there might be some JavaScript involved, but I have no idea what I'm doing when it comes to scripting.

If the intention is to write new entries into this XML file, you need to look at some form of server-side scripting. JavaScript running in the browser cannot write to files on the server.
PHP is probably the easiest to understand. Start by buying a book that explains the basics with a few examples; it's not that hard to grasp, and a form plus a simple function for writing to files shouldn't take you too long to figure out.
If you have no interest in that sort of thing, hire someone to do the job for you. It will probably only take a few hours for someone who knows what they are doing to get a complete system up and running, and unless you really would like to learn a server-side language, that is probably the best option.

If you really want to do it all on the client side, you can try HTML5 client-side storage. This means the data will be persistent only for the user that enters the data (and only for the browser used to enter it, and only on the computer that was used).
If you want the movies to be accessible to multiple people, or from multiple computers, you need to use server-side storage.
FYI, HTML5 storage is not yet very well standardized across browsers.
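For what the client-side option could look like, here is a minimal sketch assuming the Web Storage API (`localStorage`). The key name `"movies"` and the helper names are invented for illustration; the storage object is passed in as a parameter so the same code works with `window.localStorage` or any object that offers `getItem` and `setItem`.

```javascript
const STORAGE_KEY = "movies"; // hypothetical key name

// Read the saved movie list; fall back to an empty list if nothing
// is stored yet or the stored value is not a valid JSON array.
function loadMovies(storage) {
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return [];
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch (err) {
    return [];
  }
}

// Append one movie and write the whole list back as JSON.
function addMovie(storage, movie) {
  const movies = loadMovies(storage);
  movies.push(movie);
  storage.setItem(STORAGE_KEY, JSON.stringify(movies));
  return movies;
}

// In a browser, a form's submit handler might call:
//   addMovie(window.localStorage, { title: "Alien", year: 1979 });
```

As noted above, anything stored this way is visible only in the browser that saved it, so this does not replace server-side storage for a shared list.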

What is a good example of a strategy for achieving SEO-friendliness in a javascript-heavy application?

Intro
I know this has been asked before, but the questions I found were either too specific or too general to provoke the kind of answer I was looking for. The best possible answer I can imagine would be an example using Backbone and the least amount of server-side logic possible (no preferred language/framework there).
Problem
I am planning a JavaScript/Ajax-heavy (Backbone + mostly-JSON backend) application that implements a faceted search. Take, for example, a faceted search in a simple shoe shop application that lets you filter by color, brand, and type of shoe, and sort by price, size, or whatever else.
Assume I am using Backbone or a similar framework on the client and a JSON service as a backend.
What would be a good strategy (a reasonable tradeoff between effort and result) to achieve SEO-friendliness as well as a snappy interface?
Resources
A solution that came to my attention is Hijax, implemented by reusing client-side templates on the server side, as described here: http://duganchen.ca/single-page-web-app-architecture-done-right
Resources that I digested without final conclusion
http://code.google.com/intl/de-DE/web/ajaxcrawling/
https://stackoverflow.com/a/6194427/818846
http://www.quora.com/Search-Engine-Optimization-SEO/If-I-have-data-that-loads-using-json-JavaScript-will-it-get-indexed-by-Google?q=seo+javascript

The general point in SEO friendliness: It should work without JavaScript.
It's also good for accessibility, so build it so that it still works when JavaScript is not enabled (as is typically the case for a search engine crawler).
If the user has JavaScript enabled (like any sane human being does), it will work with all the nifty JavaScript features you've added.
As a general usability rule of thumb: If it works, it should also work without JavaScript!

The solution in your first link sounds right. The main issue with a single-page app is that you have to render your templates on both sides, the backend and the frontend. Using Mustache or Google Closure Templates would be a good solution for this.
The same approach was used for Google+: initially the page is rendered on the server and you load a static HTML page; after that, the page is rendered on the client side, but with the same templates as on the server.
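To illustrate the "same template on both sides" idea without pulling in a library, here is a rough sketch that uses a plain template-literal function instead of Mustache or Closure Templates. The function names, the `/shoes/<id>` URL scheme, and the shape of the shoe objects are assumptions made for this example.

```javascript
// Escape text before placing it in HTML so product data cannot inject markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical shared template: the server can call this to produce the
// initial HTML, and the client can call it again after fetching JSON.
function renderShoeList(shoes) {
  const items = shoes.map(
    (shoe) =>
      `<li><a href="/shoes/${encodeURIComponent(shoe.id)}">` +
      `${escapeHtml(shoe.brand)} ${escapeHtml(shoe.name)}</a></li>`
  );
  return `<ul class="shoes">${items.join("")}</ul>`;
}

// Export for a Node/CommonJS server while staying usable as a plain
// <script> in the browser, where `module` is not defined.
if (typeof module !== "undefined" && module.exports) {
  module.exports = { escapeHtml, renderShoeList };
}
```

Because the output contains ordinary links, a crawler that never runs the client code still sees the product list; the client-side app can intercept clicks on those links and re-render with the same function.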

Also remember that the search engines follow links much more often than they (ever?) complete forms.
This problem of enabling the crawlers to see your db contents is called the "dark web," "invisible web", "deep web" or "hidden web". Blog post
So re your problem statement:
a faceted search of a simple shoe shop application that lets you filter color, brand and type of shoes and sort by price and size or whatever else.
I'd suggest that you include searches via a hierarchy of links in addition to searching via forms with select fields.
E.g., on a secondary menu, include all the different brands as individual links. Each link should then lead to a list of the products sold by that brand. The trick is to arrange things so that the link to an individual shoe takes you back to the first page (the rich one-page app), but showing the specific shoe. The page should also implement the Google Ajax-crawling recommendations that you reference in the OP.
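One way to get such a hierarchy of links is to give every facet combination its own plain URL, so that menu links and the app's filter state are the same thing. A small sketch, assuming query-string URLs and a made-up facet list (`brand`, `color`, `type`, `sort`); the function names are hypothetical.

```javascript
// Hypothetical list of supported facets, in the order they appear in URLs.
const FACETS = ["brand", "color", "type", "sort"];

// Turn a facet selection into a crawlable link, e.g. for a brand menu.
function facetsToUrl(basePath, facets) {
  const params = new URLSearchParams();
  for (const key of FACETS) {
    if (facets[key]) params.set(key, facets[key]);
  }
  const query = params.toString();
  return query ? `${basePath}?${query}` : basePath;
}

// Recover the facet selection from a URL; unknown parameters are ignored.
function urlToFacets(url) {
  const queryStart = url.indexOf("?");
  const params = new URLSearchParams(
    queryStart === -1 ? "" : url.slice(queryStart + 1)
  );
  const facets = {};
  for (const key of FACETS) {
    if (params.has(key)) facets[key] = params.get(key);
  }
  return facets;
}
```

The server can answer such a URL with a rendered product list, while the client-side app reads the same URL through `urlToFacets` to restore its filters.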

Get html form from xsd

I have a quite complex XSD file that describes some objects (it's not important, but it's the DATEX II standard).
Do you know if there is an automatic way to create an HTML form that acts like a "wizard" to guide the user in creating an XML object as described in the XSD?

The answer to this depends on the intended user base, how you want your users to access your forms, and the technology stack you have in place already or you're willing to deploy.
If your users are quality control analysts, and so the intent is to have them use the generated UI to manage test cases, then a handful of commercial tools have this ability. A quick Google search for terms such as "generate ui forms from XSD to test web services" should show the major players in this space on the first page (I won't name names, to avoid conflicts of interest). There are differences in how vendors approach this, which have to do with the time it takes to generate these forms from large bodies of XML Schema, and which in turn translate into different degrees of usability. Given what I see in DATEX, from a complexity perspective, you may have a hard time finding a free tool for this...
If your users are rather data entry specialists, then the above are not the tools you want them to use. Without knowing much about your environment (I see your java-ee tag, but still not clear how it would relate to this task), one model could be a combination of InfoPath with SharePoint; while the process to generate the form is not fully automatic, it comes close to that. It is XSD driven, in the sense that at design time you drag and drop XSD on a design form, that allows you to build some really nice UI. Follow their competition on your particular technology stack and you may have your answer. Or, you can go to this site that lists XForms implementations; IBM's form designer, much like InfoPath, can use XML Schema for design, etc.
If this is for developers to get some XML, another alternative might be to go with an Excel-based (or SharePoint lists) approach and generate XML from that data (you trade the cost of acquiring a tool for the cost of building tooling specific to your requirements, and you assume users who are really familiar with spreadsheets).
Given what the DATEX model looks like, you'll have to do some manual customization anyway if you plan to use the extensibility model, or if you choose to build different forms for different scenarios, i.e. instead of one big form that gives you all descendants of the abstract payloadPublication in some drop-down, just a specific, simple form, e.g. for MeasurementSiteTablePublication.

I know this is an old question, but I ran into this issue myself and couldn't find a satisfying open-source solution. I ended up writing my own implementation, XSD2HTML2XML. To my knowledge it is the most complete tool out there. It supports automatic generation of HTML forms from XSD, including the population of a form with XML data.
If you prefer an out-of-the-box solution rather than writing your own, please see my implementation, XML Schema Form Generator.

Effective way of getting an article's published date/author dynamically?

I'm working on a referencing webapp as part of a course I am studying. Its aim is to allow students to quickly and easily reference the materials they find information in, and I'm running into a couple of issues.
The first is getting an article's or site's published date. When dealing with static HTML sites this is easy, as I can simply use document.lastModified to pull in the time it was last modified. Issues arise when dealing with the much more common CMS-powered website, as pages are dynamically generated, which causes document.lastModified to always return the equivalent of 'now'... which isn't accurate at all.
There are steps that developers of sites can take to make this a bit easier with the implementation of HTML5, namely with the addition of the <time> element, which can have additional attributes set to define it as the time a post was published. Sites like these are fine, but the vast majority of sites aren't using HTML5 and I don't really see this changing any time soon. Anyone out there got some ideas on how to accurately identify when a post was created?
The second is accurately identifying the author of a post or page. There are a couple of ways to identify this. The first is if a site has used the hAtom microformat to identify elements of the site, which makes things easy... but as with post dates isn't common.
The next is looking at the metadata of a site and identifying the author based on content stored there. This is uncommon, and the person named is generally the owner of the site or someone else not responsible for the post, which makes it somewhat unreliable as a resource.

If the website has an RSS feed and the article is recent enough to be included in it you could extract metadata about the article from it.

Sounds like a pretty tough thing to make, only because there is absolutely no standardization for this information that I know of. Some sites might put it in their keywords, others not.
I did some scraping as part of a media criticism class, and I found that pretty much each CMS has to be processed individually. Overall, making something that would find the author info on a random web page sounds very difficult.
You might be able to make something specifically for capturing this info from WordPress blogs, since those have so many commonalities. But something designed to just hit up any site and grab specific pieces of info, that's pretty tough.
Not trying to discourage you at all, just saying that you've set a pretty high goal, imho.

Sorry I can't help very much, but what about using regex to scan the page for 'By ___' or 'Source: ___' to get the author / source of the information?
As for the date last modified, as far as I know there's no easy way to grab this, as regex'ing for a date would also match recent articles in sidebars, links, etc. And yeah, as you said, document.lastModified wouldn't work. You could consider replacing this with a "date added" to your referencer, or similar.
Hope this helps you at least a little bit, and if not, gives you an idea or two.
Of course, if there's any API / RSS available, you could scan it for the last updated / posted date, and use that?
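Here is a rough sketch of the 'By ___' / 'Source: ___' regex idea, plus a best-effort lookup of a published date. Everything here is a guess at what pages tend to contain: the function names are made up, the byline pattern only matches a capitalized "By" or "Source:" followed by up to four capitalized words, and the date lookup assumes the page uses either an HTML5 `<time datetime>` element or an Open Graph `article:published_time` meta tag, which many pages will not.

```javascript
// Hypothetical helper: guess an author from visible page text.
// Returns null when no "By ..." or "Source: ..." byline is found.
function guessAuthor(text) {
  const pattern = /\b(?:By|Source:)\s+([A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*){0,3})/;
  const match = text.match(pattern);
  // Drop trailing punctuation such as the period ending a sentence.
  return match ? match[1].replace(/[.,;:]+$/, "") : null;
}

// Hypothetical helper: look for a machine-readable publication date.
// `doc` is assumed to be a browser Document; the two selectors are
// assumptions about common markup, not guarantees.
function findPublishedDate(doc) {
  const timeEl = doc.querySelector("time[datetime]");
  if (timeEl) return timeEl.getAttribute("datetime");
  const metaEl = doc.querySelector('meta[property="article:published_time"]');
  return metaEl ? metaEl.getAttribute("content") : null;
}
```

When both return null, falling back to a "date added" value, as suggested above, keeps the reference usable.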
