Last Thursday (04/24/2008), I had the privilege of talking to Dr. Jim Martin’s Natural Language Processing (NLP) graduate class at the University of Colorado at Boulder about the work that we are doing at Filtrbox and the role that current NLP students will play in the future of information technology. This blog post is the basis of my message to the class.
As I have written before, the problem that we face today is how to harness the data that is available on the web so that applications can interpret it meaningfully. This problem is rooted in the assumption that the data stored on the web is “unstructured”. Most of the data processed by applications today is stored in some form of structure, e.g. a relational database. The data on the web is not; it is perceived as discrete pieces of data scattered all over the web.
I told the class that part of what I am doing at Filtrbox is an attempt to prove that the data on the web is not as “unstructured” as we may think today. Within that data, there is a lot of structure, many relationships and a general interconnectedness, no matter how “discrete” we may think it is. With effective mining of the data and good applications, we can interpret the data and produce meaningful information. However, we are still far from applications that can interpret this data effectively. The reason is that we have to address the problem of information retrieval (IR) first, before we can get to writing applications.
To recognize where we are today on the continuum of web data information retrieval and applications, a look at the evolution of enterprise applications gives us a great analogy:
Enterprise applications are where they are today primarily because they have a structured data storage model (Relational Database or RDB) and a standard access model (Structured Query Language or SQL). Before there were the enterprise applications that we know today, there were only RDBs and SQL. While RDB work dates back to the 1960s, the RDBs that most people are familiar with today had their beginnings in the 1970s. The first commercially available implementation of RDB+SQL (or the one widely believed to be the first) was Oracle, from the company then known as Relational Software, in 1979. It provided the ability to query an RDB for data using SQL, but no applications as we know them today. By analogy, this is where we are with the web today. We can go to Google or our favorite RSS readers (RDB analogy) and query for web data using a weak REST API or search form (SQL analogy), but we have no applications comparable to what exists in the enterprise today to interpret that data. Simply put, today we are where enterprise applications were in 1979.
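To make the analogy concrete, here is a minimal sketch in Python of the two access models side by side. Everything in it is hypothetical and only for illustration: the `company` table, the company names and the sample pages are made up, and the keyword search is a crude stand-in for a search form, not how any real search engine works.

```python
import sqlite3

# Hypothetical sample data: similar facts, stored two different ways.
ROWS = [
    ("Acme", "Boulder", 2006),
    ("Globex", "Denver", 2001),
    ("Initech", "Boulder", 1999),
]
PAGES = [
    "Acme, a startup founded in 2006, is based in Boulder.",
    "Globex opened its Denver office in 2001.",
    "Initech has called Boulder home since 1999.",
    "A weekend hiking guide to the trails around Boulder.",
]


def structured_query(city):
    """RDB + SQL: a table with named columns and a precise query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT, city TEXT, founded INTEGER)")
    conn.executemany("INSERT INTO company VALUES (?, ?, ?)", ROWS)
    cursor = conn.execute(
        "SELECT name FROM company WHERE city = ? ORDER BY founded", (city,)
    )
    names = [row[0] for row in cursor]
    conn.close()
    return names


def keyword_search(term):
    """Search-form style access: every page that mentions the term."""
    return [page for page in PAGES if term.lower() in page.lower()]


if __name__ == "__main__":
    print(structured_query("Boulder"))
    for page in keyword_search("Boulder"):
        print(page)
```

The SQL query returns exactly the companies in Boulder, ordered by founding year, because the structure is explicit. The keyword search returns every page that mentions Boulder, including the hiking guide, and leaves the interpretation to the reader. The facts are in the text; what is missing is the structure that would let an application use them.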
My message to the class was that applications like Filtrbox are barely starting to scratch the surface when it comes to implementing applications on top of web data. That is because, although it’s 2008, we are still in 1979. The stumbling block is the perception that web data is “unstructured”. Today’s NLP students will play a large role tomorrow in identifying and establishing structure in the “unstructured” web data in order to move us beyond 1979.