NLP: Unstructured thinking for unstructured data
In my last blog post, I talked about how we have had to develop Natural Language Processing (NLP) algorithms to overcome the lack of standardization on the web. At Filtrbox, the deeper we dig into the web in search of information, the more I find that we have to use an NLP concept here or half an NLP concept there to mine unstructured data. NLP concepts increasingly figure into the majority of our algorithms. I have begun to notice that my thought process as a software architect, designer and developer shows the influence of NLP and machine learning concepts much more than before.
I think NLP fundamentals are essential for anyone who wishes to take on the challenge of building the next generation of web applications that process the unstructured data on the web. Yes, there are efforts to build a structured web through initiatives such as the Semantic Web and the various APIs being proposed. I respect these efforts; however, I would not rely on them alone. The proposed APIs provide access to structured data stored on various islands on the web. Users whose data does not live on those islands cannot be reached through the APIs. The Semantic Web is the initiative that will bring us closest to structured data on the web. However, given its painfully slow adoption, it looks like it's going to be a while before we have some structure on the web. The challenge is what to do now, while we wait for these initiatives to mature. I think that, instead of waiting for content publishers to structure their content, we should process publishers’ content as is and programmatically infer its structure.
Applying NLP concepts is one way to make these inferences about content structure. It takes us a step closer to programmatic input, processing and storage of unstructured data. We have traditionally thought in terms of structured data, programmed for structured data and stored structured data. The challenge posed by the web today is an opportunity for software engineers to break new ground and start thinking in terms of, programming for and storing unstructured data.
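To make the idea of inferring structure a little more concrete, here is a minimal Python sketch. It is not how Filtrbox does it; the function name `infer_structure`, the stopword list and the patterns are all assumptions made up for illustration, and real NLP would go well beyond regular expressions and word counts. It simply takes a blob of free text and returns a small structured record with the links, dates and most frequent terms it can find.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "and", "for", "see", "with", "that", "this", "new"}

URL_PATTERN = re.compile(r"https?://[^\s)>\"]+")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # only ISO-style dates


def infer_structure(text, top_n=3):
    """Infer a small structured record from free text using simple heuristics."""
    # Pull out links, trimming punctuation that often trails a URL in prose.
    urls = [u.rstrip(".,;:") for u in URL_PATTERN.findall(text)]
    dates = DATE_PATTERN.findall(text)

    # Count remaining words, ignoring URLs, short words and stopwords.
    body = URL_PATTERN.sub(" ", text).lower()
    words = [w for w in re.findall(r"[a-z]+", body)
             if len(w) > 2 and w not in STOPWORDS]
    keywords = [w for w, _ in Counter(words).most_common(top_n)]

    return {"urls": urls, "dates": dates, "keywords": keywords}


sample = ("Filtrbox launched a new filter on 2008-05-12. "
          "See http://example.com/launch for details. "
          "The filter mines blogs and the filter ranks news.")
record = infer_structure(sample)
print(record)
```

The point is not the heuristics themselves but the shape of the result: unstructured text goes in and a record that can be stored and queried comes out, without the publisher having to structure anything.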