17,837 research outputs found

    Assessing robustness of radiomic features by image perturbation

    Full text link
    Image features need to be robust against differences in positioning, acquisition and segmentation to ensure reproducibility. Radiomic models that only include robust features can be used to analyse new images, whereas models with non-robust features may fail to predict the outcome of interest accurately. Test-retest imaging is recommended to assess robustness, but may not be available for the phenotype of interest. We therefore investigated 18 methods to determine feature robustness based on image perturbations. Test-retest and perturbation robustness were compared for 4032 features that were computed from the gross tumour volume in two cohorts with computed tomography imaging: I) 31 non-small-cell lung cancer (NSCLC) patients; II): 19 head-and-neck squamous cell carcinoma (HNSCC) patients. Robustness was measured using the intraclass correlation coefficient (1,1) (ICC). Features with ICC≥0.90\geq0.90 were considered robust. The NSCLC cohort contained more robust features for test-retest imaging than the HNSCC cohort (73.5%73.5\% vs. 34.0%34.0\%). A perturbation chain consisting of noise addition, affine translation, volume growth/shrinkage and supervoxel-based contour randomisation identified the fewest false positive robust features (NSCLC: 3.3%3.3\%; HNSCC: 10.0%10.0\%). Thus, this perturbation chain may be used to assess feature robustness.Comment: 31 pages, 14 figures pre-submission versio

    BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages

    Full text link
    We introduce BlaBla, an open-source Python library for extracting linguistic features with proven clinical relevance to neurological and psychiatric diseases across many languages. BlaBla is a unifying framework for accelerating and simplifying clinical linguistic research. The library is built on state-of-the-art NLP frameworks and supports multithreaded/GPU-enabled feature extraction via both native Python calls and a command line interface. We describe BlaBla's architecture and clinical validation of its features across 12 diseases. We further demonstrate the application of BlaBla to a task visualizing and classifying language disorders in three languages on real clinical data from the AphasiaBank dataset. We make the codebase freely available to researchers with the hope of providing a consistent, well-validated foundation for the next generation of clinical linguistic research.Comment: 5 pages. 1 figure. Under revie

    Some Bibliographical References on Intonation and Intonational Meaning

    Full text link
    A by-no-means-complete collection of references for those interested in intonational meaning, with other miscellaneous references on intonation included. Additional references are welcome, and should be sent to [email protected]: 14 pp of text and citations, bibtex added as separate fil

    Substitute Based SCODE Word Embeddings in Supervised NLP Tasks

    Full text link
    We analyze a word embedding method in supervised tasks. It maps words on a sphere such that words co-occurring in similar contexts lie closely. The similarity of contexts is measured by the distribution of substitutes that can fill them. We compared word embeddings, including more recent representations, in Named Entity Recognition (NER), Chunking, and Dependency Parsing. We examine our framework in multilingual dependency parsing as well. The results show that the proposed method achieves as good as or better results compared to the other word embeddings in the tasks we investigate. It achieves state-of-the-art results in multilingual dependency parsing. Word embeddings in 7 languages are available for public use.Comment: 11 page

    State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"

    Full text link
    Several Networks of Excellence have been set up in the framework of the European FP5 research program. Among these Networks of Excellence, the NEMIS project focuses on the field of Text Mining. Within this field, document processing and visualization was identified as one of the key topics and the WG1 working group was created in the NEMIS project, to carry out a detailed survey of techniques associated with the text mining process and to identify the relevant research topics in related research areas. In this document we present the results of this comprehensive survey. The report includes a description of the current state-of-the-art and practice, a roadmap for follow-up research in the identified areas, and recommendations for anticipated technological development in the domain of text mining.Comment: 54 pages, Report of Working Group 1 for the European Network of Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS

    Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

    Full text link
    This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English (German, Romanian, Estonian, Finnish, Hungarian).Comment: update related wor

    Deep Learning applied to NLP

    Full text link
    Convolutional Neural Network (CNNs) are typically associated with Computer Vision. CNNs are responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today. More recently CNNs have been applied to problems in Natural Language Processing and gotten some interesting results. In this paper, we will try to explain the basics of CNNs, its different variations and how they have been applied to NLP

    Evaluating Neural Morphological Taggers for Sanskrit

    Full text link
    Neural sequence labelling approaches have achieved state of the art results in morphological tagging. We evaluate the efficacy of four standard sequence labelling models on Sanskrit, a morphologically rich, fusional Indian language. As its label space can theoretically contain more than 40,000 labels, systems that explicitly model the internal structure of a label are more suited for the task, because of their ability to generalise to labels not seen during training. We find that although some neural models perform better than others, one of the common causes for error for all of these models is mispredictions due to syncretism.Comment: Accepted to SIGMORPHON Workshop at ACL 202

    svcR: An R Package for Support Vector Clustering improved with Geometric Hashing applied to Lexical Pattern Discovery

    Full text link
    We present a new R package which takes a numerical matrix format as data input, and computes clusters using a support vector clustering method (SVC). We have implemented an original 2D-grid labeling approach to speed up cluster extraction. In this sense, SVC can be seen as an efficient cluster extraction if clusters are separable in a 2-D map. Secondly we showed that this SVC approach using a Jaccard-Radial base kernel can help to classify well enough a set of terms into ontological classes and help to define regular expression rules for information extraction in documents; our case study concerns a set of terms and documents about developmental and molecular biology

    Natural language technology and query expansion: issues, state-of-the-art and perspectives

    Full text link
    The availability of an abundance of knowledge sources has spurred a large amount of effort in the development and enhancement of Information Retrieval techniques. Users information needs are expressed in natural language and successful retrieval is very much dependent on the effective communication of the intended purpose. Natural language queries consist of multiple linguistic features which serve to represent the intended search goal. Linguistic characteristics that cause semantic ambiguity and misinterpretation of queries as well as additional factors such as the lack of familiarity with the search environment affect the users ability to accurately represent their information needs, coined by the concept intention gap. The latter directly affects the relevance of the returned search results which may not be to the users satisfaction and therefore is a major issue impacting the effectiveness of information retrieval systems. Central to our discussion is the identification of the significant constituents that characterize the query intent and their enrichment through the addition of meaningful terms, phrases or even latent representations, either manually or automatically to capture their intended meaning. Specifically, we discuss techniques to achieve the enrichment and in particular those utilizing the information gathered from statistical processing of term dependencies within a document corpus or from external knowledge sources such as ontologies. We lay down the anatomy of a generic linguistic based query expansion framework and propose its module-based decomposition, covering topical issues from query processing, information retrieval, computational linguistics and ontology engineering. For each of the modules we review state-of-the-art solutions in the literature categorized and analyzed under the light of the techniques used
    • …
    corecore