    Entity Extraction from Unstructured Data on the Web

    Many web pages present information about entities in lists that are represented in textual form. Such textual lists contain implicit records of entities, but the field values of these records cannot easily be separated or extracted automatically, so this remains a challenging research problem. Previous studies relied mainly on probabilistic graph-based models to capture the attributes and likely structures of implicit records in a list. An important limitation of these methods is that the structures of the records in input lists were implicitly encoded via manually created training data. This thesis investigates novel techniques for automatically acquiring information about entities from implicit records embedded in textual lists on the web. It introduces a self-supervised learning framework that exploits both existing data in a knowledge base and the structural similarity between sequences in lists to build an extraction model automatically. In the proposed framework, initial labels for candidate field values are created and assigned to generate label sequences. Then, the structure of implici
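
The initial labelling step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy knowledge base, the delimiter-based splitting, the `UNK` label, and the names `KNOWLEDGE_BASE`, `split_candidates`, and `initial_labels` are all assumptions made for illustration. It only shows how existing knowledge-base values might be used to assign initial labels to candidate field values and produce one label sequence per list item.

```python
import re

# Hypothetical toy knowledge base: field label -> known values (lowercased).
KNOWLEDGE_BASE = {
    "NAME": {"alice smith", "bob jones"},
    "CITY": {"london", "paris"},
}


def split_candidates(line):
    """Split a textual list item into candidate field values on simple delimiters."""
    return [part.strip() for part in re.split(r"[,;|]", line) if part.strip()]


def initial_labels(line, kb=KNOWLEDGE_BASE):
    """Pair each candidate with the label of a matching KB value, else 'UNK'."""
    labelled = []
    for candidate in split_candidates(line):
        label = "UNK"  # assumed placeholder for values the KB does not cover
        for field, values in kb.items():
            if candidate.lower() in values:
                label = field
                break
        labelled.append((candidate, label))
    return labelled


# Two list items with the same implicit record structure but different delimiters.
items = ["Alice Smith, London, 1984", "Bob Jones; Paris; 1990"]
label_sequences = [[label for _, label in initial_labels(item)] for item in items]
```

In this toy input both items yield the same label sequence, which is the kind of structural similarity between sequences that the abstract says the framework exploits.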