Yahoo! BOSS – the answer for web semantic analysis-based applications?
A couple of weeks ago, I attended the Yahoo Open Hack Day at the Yahoo campus in Sunnyvale, CA. At Open Hack Day, Yahoo opened up all its technologies for a few chosen hackers to play with and evaluate for a weekend. The technology that I was most interested in was BOSS (Build your Own Search Service). BOSS is “Yahoo!’s open search web services platform”. Simply put, this means Yahoo has opened up its web index for anyone to use through the BOSS API. This is unprecedented and opens up a ton of opportunities to advance some of the topics that I have discussed on this blog, primarily in “NLP: Unstructured thinking for unstructured data” and “2008 Web Search is still in 1979”.
As I have said in the past, the goal of the semantic web is still a long way from being realized. However, rather than wait for every website owner to build semantic-web-conforming websites (or retrofit their past content to be semantic web compliant), we should seek to derive web semantics at the application level using a whole new set of applications: web semantic analysis-based applications. Yahoo’s BOSS can be one of the missing components that moves the ball forward towards this goal.
Surprisingly, one of the challenges of deriving web semantics is as simple as programmatically identifying and extracting the content from a web page (I have talked about this in a previous post: A case for standardizing blog templates). Before semantic analysis can be performed on a web page, the proper content must first be extracted from it. As humans, when we look at a web page, we can readily distinguish the “main content” from the navigation bar, header, links or ads. This is not so easy for computer programs to accomplish. At Filtrbox, we have developed algorithms that accomplish this with a very high success rate, but only because we have devoted time and resources to them; they are core to our business. Other application developers who want to leverage web content semantics may not have the time and resources to build such algorithms, because content extraction is not core to their business. This is where Yahoo BOSS comes into the picture. We know that Yahoo has built its massive Web index by indexing the “main content” extracted from web pages. Yahoo has invested time and resources to solve the content extraction problem. In addition, they have built a massive infrastructure to index and store web content. Therefore, instead of re-inventing the wheel, developers of applications that leverage web semantics could take advantage of Yahoo’s content extraction through the Yahoo BOSS API. However, Yahoo needs to open up a little more for this to be possible.
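To give a feel for why content extraction is harder than it sounds, here is a minimal sketch of one common heuristic: keep blocks of text that are reasonably long and are not mostly links. This is not the Filtrbox algorithm and not whatever Yahoo does internally; the tag lists, the names (BlockCollector, extract_main_content) and the thresholds (min_chars, max_link_ratio) are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Assumed tag lists for this sketch; real pages need far more care.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and tracks how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []         # text fragments of the block being built
        self._link_chars = 0    # characters of the current block inside <a>
        self._skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0    # > 0 while inside an <a> element

    def _flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore text inside navigation, scripts, footers, etc.
        self._text.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_main_content(html, min_chars=80, max_link_ratio=0.3):
    """Return the blocks that look like body text under the heuristic above."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = [
        text
        for text, link_chars in collector.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio
    ]
    return "\n\n".join(kept)
```

Even this toy version has to guess at thresholds, and it will fail on pages that lay out their content in unusual ways, which is exactly why a robust extractor takes real time and resources to build.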
Here is where Yahoo needs to open up: although Yahoo currently performs content extraction and content indexing, the Yahoo BOSS API is unfortunately not geared towards applications that analyze web data semantics. In its current form, the API is geared towards web searches: it is keyword query-based and returns at least TITLE, URL and ABSTRACT/EXCERPT. To move towards web semantic analysis-based applications, the ABSTRACT/EXCERPT alone is not enough. Instead, the Yahoo BOSS API should return the WHOLE “main content” (not links, ads, navigation, etc.) of a web page. Returning the whole content would enable applications to perform semantic analysis on the data from millions of web pages stored in Yahoo’s web index, thereby adding value to the data and moving the ball forward towards unlocking the hidden value in web data using web semantic analysis-based applications.
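To make the request concrete, here is a rough sketch of what an application could do if each result carried the full main content. The SearchResult type and its content field are hypothetical: they model the TITLE, URL and ABSTRACT fields mentioned above plus the field I am asking for, and they are not the actual BOSS response format. The term counting is only a crude stand-in for real semantic analysis.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A tiny, assumed stopword list, just enough for the illustration.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was"}


@dataclass
class SearchResult:
    """Hypothetical result record; not the real BOSS schema."""
    title: str
    url: str
    abstract: str
    content: Optional[str] = None  # the wished-for full "main content"


def text_for_analysis(result: SearchResult) -> str:
    # Prefer the full content when it is available; fall back to the abstract.
    return result.content if result.content else result.abstract


def top_terms(results: List[SearchResult], n: int = 5) -> List[Tuple[str, int]]:
    """Count the most frequent non-stopword terms across all results."""
    counts: Counter = Counter()
    for result in results:
        words = re.findall(r"[a-z']+", text_for_analysis(result).lower())
        counts.update(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(n)
```

With only the abstract to work from, an analysis like this sees a sentence or two per page; with the full content, it sees everything the author actually wrote.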