Abstract
Prevailing wisdom assumes that there are well-defined, effective and efficient methods for accessing Deep Web content. Unfortunately, a host of technical and non-technical factors call this assumption into question. In this paper, we present findings from work on a software system commissioned by the British Broadcasting Corporation (BBC). The system requires stable, periodic extraction of Deep Web content from a number of online data sources. Insight from the project brings an important issue to the forefront and underscores the need for further research into access technology for the Deep Web.
Accessing Deep Web content still poses many significant unsolved problems, including the handling of dynamic, unlinked, private and non-HTML content. These concerns are further exacerbated by the rapid growth of Deep Web content, fueled by the success of online social networking, the proliferation of Web 2.0 content and the profitability of the companies ushering in this new era. Business models, resource management strategies, and long-term vision play a significant role in driving technical directions and influencing access methods.
As technology products transition to relying on Deep Web content, the gaps between assumption and reality become clearer. It then becomes apparent that the reliability and efficacy of the established (or preferred) information retrieval techniques fall below expectations and require improvement.
We discuss the lessons learned from an industry-specific application deployment, the Sound Index, which showcases a set of interesting issues that must be addressed.