    ON MONITORING LANGUAGE CHANGE WITH THE SUPPORT OF CORPUS PROCESSING

    One of the fundamental characteristics of language is that it changes over time. One way to monitor this change is to observe corpora, that is, structured language documentation. Recent developments in technology, especially in the field of Natural Language Processing, allow robust linguistic processing, which supports the description of diverse historical changes in corpora. The intervention of human linguists is inevitable, as it determines the gold standard, but computer assistance provides considerable support by incorporating computational approaches in exploring corpora, especially historical corpora. This paper proposes a model for corpus development in which corpora are annotated to support further computational operations such as lexicogrammatical pattern matching, automatic retrieval, and extraction. The corpus processing operations are performed by local-grammar-based corpus processing software on a contemporary Indonesian corpus. This paper concludes that data collection and data processing in a corpus are equally crucial for monitoring language change, and neither can be set aside.

    A Survey on Awesome Korean NLP Datasets

    English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests with English datasets are sufficient to demonstrate the performance of new models and methods, researchers still need to train and validate models on Korean-based datasets to produce a technology or product suitable for Korean processing. This paper introduces 15 popular Korean-based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide detailed instructions with samples or statistics of the datasets. The main characteristics of the datasets are presented in a single table to give researchers a rapid summary. Comment: 11 pages, 1 horizontal page for large tabl

    Content Recognition and Context Modeling for Document Analysis and Retrieval

    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missed in the published version due to the publication policies. Please contact Prof. Erik Cambria for detail

    Natural Language Processing: Emerging Neural Approaches and Applications

    This Special Issue highlights the most recent research being carried out in the NLP field and discusses related open issues, with a particular focus both on emerging approaches for language learning, understanding, production, and grounding, interactively or autonomously from data, in cognitive and neural systems, and on their potential or real applications in different domains.