Abstract:
Deep Web databases, whose content is presented as dynamically generated Web pages hidden behind forms, have mostly been left unindexed by search engine crawlers. To explore this mass of information automatically, many current techniques assume the existence of domain knowledge, which is costly to create and maintain. In this article, we present a new perspective on form understanding and deep Web data acquisition that does not require any domain-specific knowledge.
Unlike previous approaches, we do not perform the various steps in the process (e.g., form understanding, record identification, attribute labeling) independently but integrate them to achieve a complete understanding of deep Web sources.
Through information extraction techniques and using the form itself for validation, we reconcile input and output schemas in a labeled graph, which is further aligned with a generic ontology. The impact of this alignment is threefold: first, the resulting semantic infrastructure associated with the form can assist Web crawlers when probing the form for content indexing; second, attributes of response pages are labeled by matching known ontology instances, and relations between attributes are uncovered; and third, we enrich the generic ontology with facts from the deep Web.
The deep Web consists of dynamically generated Web pages that are reachable by issuing queries through HTML forms. A form is a section of a document with special control elements (e.g., checkboxes, text inputs) and associated labels. Users generally interact with a form by modifying its controls (entering text, selecting menu items) before submitting it to a Web server for processing.
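To make the notion of a form concrete, the following minimal sketch lists the controls of a small HTML form together with their labels, then encodes one possible submission as a query string. It is only an illustration and not the method presented in this article: the sample form, its field names, and the helper names `FormControlParser` and `extract_controls` are hypothetical, and labels are matched only through the `for` attribute, which real deep Web forms often omit.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormControlParser(HTMLParser):
    """Collects form controls and the text of <label for="..."> elements."""

    def __init__(self):
        super().__init__()
        self.controls = []       # one dict per control, in document order
        self.labels = {}         # control id -> label text
        self._label_for = None   # id targeted by the label being read
        self._label_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("input", "select", "textarea"):
            self.controls.append({
                "type": attrs.get("type", "text") if tag == "input" else tag,
                "name": attrs.get("name"),
                "id": attrs.get("id"),
            })
        elif tag == "label":
            self._label_for = attrs.get("for")
            self._label_text = []

    def handle_data(self, data):
        # Only text inside a <label for="..."> element is recorded.
        if self._label_for is not None:
            self._label_text.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._label_for is not None:
            self.labels[self._label_for] = "".join(self._label_text).strip()
            self._label_for = None


def extract_controls(html):
    """Return the form controls, each paired with its label (or None)."""
    parser = FormControlParser()
    parser.feed(html)
    parser.close()
    return [{**c, "label": parser.labels.get(c["id"])} for c in parser.controls]


# Hypothetical form, used only for illustration.
SAMPLE_FORM = """
<form action="/search" method="get">
  <label for="t">Title</label> <input type="text" id="t" name="title">
  <label for="s">In stock only</label> <input type="checkbox" id="s" name="stock">
  <input type="submit" value="Search">
</form>
"""

controls = extract_controls(SAMPLE_FORM)
for c in controls:
    print(c["type"], c["name"], c["label"])

# A user "modifies the controls" by choosing values; a GET submission then
# sends them to the server as an encoded query string.
query = urlencode({"title": "dune", "stock": "on"})
print("/search?" + query)
```

A crawler that probes a form has to make the same two moves: discover which controls exist and what they mean, then choose values to submit. The label text is usually the main hint about a control's meaning, which is why it is extracted alongside each control here.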