39 research outputs found

    An XML Approach of Coding a Morphological Database for Arabic Language

    We present an XML approach to producing a morphological database for Modern Standard Arabic (MSA) that will be used in morphological analysis. Optimizing the production, maintenance, and extension of a morphological database is one of the crucial aspects impacting natural language processing (NLP). For Arabic, producing a morphological database is not an easy task, because the language has particularities such as agglutination and a high degree of morphological ambiguity. The method presented can be exploited by NLP applications such as syntactic analysis, semantic analysis, information retrieval, and orthographic correction.
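
    The abstract does not give the paper's schema, so the sketch below is only a hypothetical illustration of what one XML-coded lemma entry could look like, built with Python's standard xml.etree.ElementTree; the element and attribute names (lemma, root, pattern, clitics) are assumptions, not the authors' format.

    # Hypothetical XML-coded Arabic morphological entry (illustrative
    # element/attribute names; not the paper's schema).
    import xml.etree.ElementTree as ET

    entry = ET.Element("lemma", attrib={"id": "ktb-verb-001"})
    ET.SubElement(entry, "root").text = "ktb"        # triliteral root
    ET.SubElement(entry, "pattern").text = "faEala"  # morphological pattern
    ET.SubElement(entry, "pos").text = "verb"        # part of speech
    clitics = ET.SubElement(entry, "clitics")        # agglutination support
    ET.SubElement(clitics, "proclitic").text = "wa"  # conjunction "and"
    ET.SubElement(clitics, "enclitic").text = "hu"   # pronoun "him/his"

    print(ET.tostring(entry, encoding="unicode"))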

    Exploiting and assessing multi-source data for supervised biomedical named entity recognition

    Motivation: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such a scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potentially non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed. Results: Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs, and metabolites), identified entity class overlap, and performed a leave-corpus-out cross-validation strategy to test the efficiency of existing models. We demonstrate that the accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model "overtraining"), which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis, we further identified that performance is often not limited by the quantity of the annotated data.
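
    In code, the leave-corpus-out protocol is a short loop. The sketch below is a minimal outline under assumed placeholder helpers (train_ner_model, evaluate_f1) and dummy corpora; it is not the authors' pipeline.

    # Minimal leave-corpus-out cross-validation sketch (placeholder helpers
    # and dummy data; not the authors' pipeline).

    def train_ner_model(docs):
        """Stand-in for fitting a supervised NER model on annotated documents."""
        return {"n_train_docs": len(docs)}

    def evaluate_f1(model, docs):
        """Stand-in for entity-level F1 evaluation on a held-out corpus."""
        return 0.0

    corpora = {
        "corpus_a": ["doc1", "doc2"],
        "corpus_b": ["doc3"],
        "corpus_c": ["doc4", "doc5"],
    }

    scores = {}
    for held_out in corpora:
        # Train on the union of every corpus except the held-out one ...
        train_docs = [d for name, docs in corpora.items()
                      if name != held_out for d in docs]
        model = train_ner_model(train_docs)
        # ... and evaluate on the corpus the model never saw during training.
        scores[held_out] = evaluate_f1(model, corpora[held_out])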

    Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy

    The Endoscopy Computer Vision Challenge (EndoCV) is a crowd-sourcing initiative to address prominent problems in developing reliable computer-aided detection and diagnosis endoscopy systems and to suggest a pathway for clinical translation of technologies. Whilst endoscopy is a widely used diagnostic and treatment tool for hollow organs, there are several core challenges often faced by endoscopists, mainly: 1) the presence of multi-class artefacts that hinder visual interpretation, and 2) difficulty in identifying subtle precancerous precursors and cancer abnormalities. Artefacts often affect the robustness of deep learning methods applied to the gastrointestinal tract organs as they can be confused with tissue of interest. The EndoCV2020 challenges are designed to address research questions in these remits. In this paper, we present a summary of methods developed by the top 17 teams and provide an objective comparison of state-of-the-art methods and methods designed by the participants for two sub-challenges: i) artefact detection and segmentation (EAD2020), and ii) disease detection and segmentation (EDD2020). Multi-center, multi-organ, multi-class, and multi-modal clinical endoscopy datasets were compiled for both the EAD2020 and EDD2020 sub-challenges. The out-of-sample generalization ability of detection algorithms was also evaluated. Whilst most teams focused on accuracy improvements, only a few methods hold credibility for clinical usability. The best-performing teams provided solutions to tackle class imbalance and variability in size, origin, modality, and occurrence by exploring data augmentation, data fusion, and optimal class thresholding techniques.
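
    Of the techniques listed, class thresholding is the easiest to make concrete. The sketch below is an illustrative per-class confidence-threshold search that maximizes F1 on validation detections; it is an assumption about how such tuning can be done, not any team's actual method.

    # Illustrative per-class confidence-threshold search (not any team's code).
    import numpy as np

    def best_threshold(confidences, is_true_positive, n_ground_truth):
        """Grid-search the detection cutoff that maximizes F1 for one class."""
        confidences = np.asarray(confidences, dtype=float)
        is_true_positive = np.asarray(is_true_positive, dtype=bool)
        best_t, best_f1 = 0.5, 0.0
        for t in np.linspace(0.05, 0.95, 19):
            keep = confidences >= t
            tp = int(is_true_positive[keep].sum())
            fp = int(keep.sum()) - tp
            fn = n_ground_truth - tp
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)  # F1 = 2TP/(2TP+FP+FN)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        return best_t, best_f1

    # Tuned independently for each artefact/disease class on validation data.
    t, f1 = best_threshold([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0], n_ground_truth=3)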

    Splitting Arabic Texts into Elementary Discourse Units

    In this article, we propose the first work that investigates the feasibility of Arabic discourse segmentation into elementary discourse units within the Segmented Discourse Representation Theory framework. We first describe our annotation scheme that defines a set of principles to guide the segmentation process. Two corpora have been annotated according to this scheme: elementary school textbooks and newspaper documents extracted from the syntactically annotated Arabic Treebank. Then, we propose a multiclass supervised learning approach that predicts nested units. Our approach uses a combination of punctuation, morphological, lexical, and shallow syntactic features. We investigate how each feature contributes to the learning process. We show that an extensive morphological analysis is crucial to achieve good results in both corpora. In addition, we show that adding chunks does not boost the performance of our system.
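
    The feature combination described above can be pictured as a per-token feature map fed to a multiclass classifier. The sketch below is a hypothetical illustration with invented feature names; it is not the authors' feature set.

    # Hypothetical per-token features for discourse-boundary classification
    # (invented names; not the authors' feature set).
    def token_features(tokens, i):
        tok = tokens[i]
        return {
            "token": tok,                                  # lexical
            "is_punct": tok in {"،", "؛", ".", "!", "?"},  # punctuation
            "prefix": tok[:2],                             # shallow morphology
            "suffix": tok[-2:],
            "prev_token": tokens[i - 1] if i > 0 else "<s>",
            "next_token": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        }

    print(token_features(["قرأ", "الولد", "،", "ثم", "نام"], 2))

    # A multiclass learner (e.g. scikit-learn's DictVectorizer feeding
    # LogisticRegression) would then predict per-token labels that mark
    # the boundaries of (possibly nested) elementary discourse units.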

    OXENDONET: A dilated convolutional neural network for endoscopic artefact segmentation

    Medical image segmentation plays a key role in many generic applications such as population analysis and, more practically, can be made into a crucial tool in diagnosis and treatment planning. Its output can vary from extracting practical clinical information such as pathologies (detection of cancer) to measuring anatomical structures (kidney volume, cartilage thickness, bone angles). Many prior approaches to this problem are based on one of two main architectures: a fully convolutional network or a U-Net-based architecture. These methods rely on multiple pooling and striding layers to increase the receptive field size of neurons. For a segmentation task, however, such pooling layers reduce the feature map size and lead to the loss of important spatial information. In this paper, we propose a novel neural network, which we call OxEndoNet. Our network uses the pyramid dilated module (PDM), consisting of multiple dilated convolutions stacked in parallel. The PDM eliminates the need for striding layers and has a very large receptive field that maintains spatial resolution. We combine several pyramid dilated modules to form our final OxEndoNet network. The proposed network is able to capture small and complex variations in the challenging problem of Endoscopy Artefact Detection and Segmentation, where objects vary widely in scale and size.
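
    The pyramid dilated module lends itself to a short sketch. The PyTorch block below is one plausible reading of the description (parallel dilated 3x3 convolutions concatenated along channels); the dilation rates and channel counts are assumptions, not the paper's configuration.

    # Plausible sketch of a pyramid dilated module (PDM); dilation rates and
    # channel counts are assumed, not taken from the paper.
    import torch
    import torch.nn as nn

    class PyramidDilatedModule(nn.Module):
        def __init__(self, in_ch, branch_ch, dilations=(1, 2, 4, 8)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Sequential(
                    # padding=d keeps the spatial size fixed for a 3x3 kernel, so
                    # the receptive field grows without any striding or pooling.
                    nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d),
                    nn.BatchNorm2d(branch_ch),
                    nn.ReLU(inplace=True),
                )
                for d in dilations
            ])

        def forward(self, x):
            # Parallel dilated branches, concatenated along the channel axis.
            return torch.cat([branch(x) for branch in self.branches], dim=1)

    pdm = PyramidDilatedModule(in_ch=64, branch_ch=32)
    out = pdm(torch.randn(1, 64, 128, 128))  # -> shape (1, 128, 128, 128)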

    DOPNet: Densely Oriented Pooling Network for medical image segmentation

    Since manual annotation of medical images is time-consuming for clinical experts, reliable automatic segmentation would be the ideal way to handle large medical datasets. Deep learning-based models have been the dominant approach, achieving remarkable performance on various medical segmentation tasks. One challenge is that the size of the structure being segmented can vary significantly relative to the other features in the image. In this paper, we propose a Densely Oriented Pooling Network (DOPNet) to capture variation in feature size in medical images and preserve spatial interconnection. DOPNet is based on two interdependent ideas: dense connectivity and a pooling-oriented layer. When tested on three publicly available medical image segmentation datasets, the proposed model achieves leading performance.
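
    The abstract gives few architectural details, but the dense-connectivity idea it builds on is standard: each layer receives the concatenation of all earlier feature maps. The PyTorch sketch below shows that generic pattern only; it is not the DOPNet architecture.

    # Generic dense-connectivity block (the idea DOPNet builds on); not the
    # DOPNet architecture, whose details the abstract does not give.
    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        def __init__(self, in_ch, growth, n_layers=3):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Sequential(
                    nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True),
                )
                for i in range(n_layers)
            ])

        def forward(self, x):
            feats = [x]
            for layer in self.layers:
                # Each layer sees the concatenation of all earlier feature maps.
                feats.append(layer(torch.cat(feats, dim=1)))
            return torch.cat(feats, dim=1)

    block = DenseBlock(in_ch=32, growth=16)
    out = block(torch.randn(1, 32, 64, 64))  # -> shape (1, 32 + 3*16, 64, 64)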

    Empirical evaluation of word representations on Arabic sentiment analysis

    Sentiment analysis is the Natural Language Processing (NLP) task that aims to classify text into classes such as positive, negative, or neutral. In this paper, we focus on sentiment analysis for the Arabic language. Most previous works use machine learning techniques combined with hand-engineered features for Arabic sentiment analysis (ASA). More recently, Deep Neural Networks (DNNs) have been widely used for this task, especially for English. In this work, we developed a system called CNN-ASAWR, in which we investigate the use of Convolutional Neural Networks (CNNs) for ASA on two datasets: ASTD and SemEval 2017. We explore the importance of various unsupervised word representations learned from unannotated corpora. Experimental results showed that we were able to outperform the previous state-of-the-art systems on these datasets without using any kind of hand-engineered features.
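
    The general shape of such a CNN sentiment classifier is easy to sketch. The PyTorch model below is a conventional embedding-plus-convolution text classifier under assumed hyperparameters; it is not the CNN-ASAWR architecture, whose exact configuration the abstract does not give.

    # Conventional CNN text classifier over word embeddings (assumed
    # hyperparameters; not the CNN-ASAWR configuration).
    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, n_classes=3,
                     widths=(3, 4, 5), n_filters=100):
            super().__init__()
            # In the paper's setting, these weights would be initialized from
            # unsupervised word representations learned on unannotated corpora.
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.convs = nn.ModuleList(
                [nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths]
            )
            self.fc = nn.Linear(n_filters * len(widths), n_classes)

        def forward(self, token_ids):                # (batch, seq_len)
            x = self.emb(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
            # Max-over-time pooling for each convolution width.
            pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
            return self.fc(torch.cat(pooled, dim=1))  # class logits

    model = TextCNN(vocab_size=50_000)
    logits = model(torch.randint(0, 50_000, (8, 40)))  # 8 texts of 40 tokens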