All I want for the new year is HTML 5
While there have been many wishes and predictions for 2009, mine is simple: HTML 5. The adoption of the HTML 5 specification by browsers, rendering engines and content publishers in the coming year would make 2009 a good year for me. As I have written in the past (here and here), content extraction is an often overlooked challenge that gets in the way of deriving web content semantics. Those of us who are passionate about extracting web content semantics understand how much it holds back the good work being done now. As we have seen recently, the problem does not affect only the small players. Some major applications that rely on content extraction, such as Google Alerts, are seeing a degradation in content quality: they return articles whose keyword hits come from navigation bars, ads and other non-content text on web pages.
HTML 5 has taken steps toward specifying how web content (e.g. a news story or blog entry) should be represented in a page. The specification attempts to structure a web page by separating its different parts, such as headers, footers, navigation and content. The elements of HTML 5 that will help with content extraction are <section> and <article>.
The <section> element is described in the HTML 5 specification as follows,
“The section element represents a generic document or application section. A section, in this context, is a thematic grouping of content, typically with a header, possibly with a footer. Examples of sections would be chapters, the various tabbed pages in a tabbed dialog box, or the numbered sections of a thesis. A Web site’s home page could be split into sections for an introduction, news items, contact information.”
Having an HTML element that groups content is very welcome. The <section> element can be used to contain content such as a news article. HTML 5 goes one step further by introducing the <article> element, which the specification describes as follows,
“The article element represents a section of a page that consists of a composition that forms an independent part of a document, page, or site. This could be a forum post, a magazine or newspaper article, a Web log entry, a user-submitted comment, or any other independent item of content.
An article element is “independent” in that its contents could stand alone, for example in syndication. However, the element is still associated with its ancestors; for instance, contact information that applies to a parent body element still covers the article as well.”
A structured implementation of the <section> and <article> elements by content publishers will go a long way toward making content extraction simpler, and that is a small step toward making semantic analysis of web content easier.
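To make this concrete, here is a minimal sketch in Python of what extraction could look like once publishers wrap each story in an <article> element. It uses only the standard library's html.parser module. The class name ArticleTextExtractor, the helper extract_articles and the sample page are my own illustrations and do not come from the specification, and real pages would need more care (scripts, malformed markup, nested comments inside an article).

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects text that appears inside <article> elements only."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <article> elements currently open
        self.chunks = []    # text fragments of the article being read
        self.articles = []  # one string per finished top-level <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <article> just closed: store its text.
                self.articles.append(" ".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        # Text outside any <article> (nav bars, footers, ads) is ignored.
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_articles(html):
    """Return the text of each top-level <article> found in the page."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.articles


# Hypothetical page: the keyword "Markets" appears only in non-content areas.
SAMPLE_PAGE = """
<body>
  <nav><a href="/markets">Markets news</a></nav>
  <section>
    <article>
      <h1>Local bakery wins award</h1>
      <p>The bakery on Main Street took first prize.</p>
    </article>
  </section>
  <footer>Markets data delayed by 15 minutes</footer>
</body>
"""

articles = extract_articles(SAMPLE_PAGE)
```

In this sketch a keyword alert for "Markets" would not fire on the sample page, because the word only appears in the navigation bar and footer, which the extractor never collects.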
The HTML 5 specification has been out for some time; it's time for rendering engines to start implementing some of its new semantics-oriented elements (some rendering engines have already started implementing parts of the specification). 2009 sounds like a good year for rendering engines, content publishers and content generation software to come together and help chart the course for applications based on web semantic analysis.
NOTE: HTML 5 contains other descriptive elements that help with the expression of semantics of textual data. I will get to those in future posts.
Even with the development of HTML 5, do you think that most developers and sites will follow this structure? And what about Flash and Silverlight? With the ever-developing RIA experience, do we lose indexable search and true content specifications? OPML is a standard, and look where that has gone. I really like your take on HTML 5 and look forward to more posts about how you are using it and what it has changed for you. Keep up the sweet posts, master of RSS time.
Fantastic post!!! Cheers!