
    Sharad Seth. Analysis and Taxonomy of Column Header Categories for Web Tables

    We describe a component of a document analysis system for constructing ontologies for domain-specific web tables imported into Excel. This component automates extraction of the Wang notation for the column header of a table. Using column-header-specific rules for XY cutting, we convert the geometric structure of the column header to a linear string denoting cell attributes and directions of cuts. The string representation is parsed by a context-free grammar, and the parse tree is further processed to produce an abstract data-type representation (the Wang notation tree) of each column category. Experiments were carried out to evaluate this scheme on the original and edited column headers of Excel tables drawn from a collection of 200 used in our earlier work. The edited headers were obtained by editing the original column headers to conform to the format targeted by our grammar. Forty-four original headers and their reformatted versions were submitted as input to our software system. Our grammar was able to parse all the edited headers and extract their Wang notation trees, but it could do so for only four of the original headers. We suggest extensions to our table grammar that would enable processing a larger fraction of headers without manual editing.
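The idea of turning a header's geometry into a linear string by XY cutting can be illustrated with a minimal sketch. It is not the authors' rule set or grammar, which the abstract does not spell out. The cell representation (half-open rectangles on a grid), the function name `xy_cut`, and the `V(...)` / `H(...)` output notation are all assumptions made for this example, and labels are assumed to contain no spaces.

```python
from typing import List, Tuple

# Hypothetical cell format: (top, left, bottom, right, label), half-open grid coordinates.
Cell = Tuple[int, int, int, int, str]


def xy_cut(cells: List[Cell], top: int, left: int, bottom: int, right: int) -> str:
    """Recursively cut a header region and return a linear string.

    V(...) marks vertical cuts (side-by-side parts) and H(...) marks
    horizontal cuts (stacked parts). This notation is illustrative only.
    """
    # Keep only the cells that lie entirely inside the current region.
    inside = [c for c in cells
              if c[0] >= top and c[1] >= left and c[2] <= bottom and c[3] <= right]
    if len(inside) == 1:
        return inside[0][4]

    # A vertical cut at column boundary x is valid if no cell straddles it.
    v_cuts = [x for x in range(left + 1, right)
              if all(not (c[1] < x < c[3]) for c in inside)]
    if v_cuts:
        bounds = [left] + v_cuts + [right]
        parts = [xy_cut(inside, top, a, bottom, b) for a, b in zip(bounds, bounds[1:])]
        return "V(" + " ".join(parts) + ")"

    # Otherwise try horizontal cuts at row boundaries that no cell straddles.
    h_cuts = [y for y in range(top + 1, bottom)
              if all(not (c[0] < y < c[2]) for c in inside)]
    if h_cuts:
        bounds = [top] + h_cuts + [bottom]
        parts = [xy_cut(inside, a, left, b, right) for a, b in zip(bounds, bounds[1:])]
        return "H(" + " ".join(parts) + ")"

    raise ValueError("region cannot be cut further")


# A made-up two-row header: "Year" spans both rows; "Sales" spans "Q1" and "Q2".
header: List[Cell] = [
    (0, 0, 2, 1, "Year"),
    (0, 1, 1, 3, "Sales"),
    (1, 1, 2, 2, "Q1"),
    (1, 2, 2, 3, "Q2"),
]
print(xy_cut(header, 0, 0, 2, 3))  # prints: V(Year H(Sales V(Q1 Q2)))
```

A string of this kind is regular enough to be parsed by a small context-free grammar. Headers whose layout admits no clean cut raise an error in this sketch, which loosely mirrors the abstract's observation that many unedited headers could not be processed.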