Assessing robustness of radiomic features by image perturbation
Image features need to be robust against differences in positioning,
acquisition and segmentation to ensure reproducibility. Radiomic models that
only include robust features can be used to analyse new images, whereas models
with non-robust features may fail to predict the outcome of interest
accurately. Test-retest imaging is recommended to assess robustness, but may
not be available for the phenotype of interest. We therefore investigated 18
methods to determine feature robustness based on image perturbations.
Test-retest and perturbation robustness were compared for 4032 features that
were computed from the gross tumour volume in two cohorts with computed
tomography imaging: I) 31 non-small-cell lung cancer (NSCLC) patients; II) 19
head-and-neck squamous cell carcinoma (HNSCC) patients. Robustness was measured
using the intraclass correlation coefficient, ICC(1,1). Features with an ICC
above a predefined threshold were considered robust. The NSCLC cohort contained
more robust features for test-retest imaging than the HNSCC cohort.
A perturbation chain consisting of noise addition, affine translation, volume
growth/shrinkage and supervoxel-based contour randomisation identified the
fewest false-positive robust features in both cohorts. Thus, this perturbation
chain may be used to assess feature robustness.
Comment: 31 pages, 14 figures, pre-submission version
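The robustness criterion above can be made concrete with a small sketch of ICC(1,1), the one-way random-effects intraclass correlation named in the abstract; the toy measurements below are invented for illustration.

```python
# Minimal sketch of ICC(1,1): one-way random-effects intraclass correlation,
# scoring how reproducible a feature is across repeated scans of the same
# subjects. The example data are hypothetical, not from the study.
def icc_1_1(ratings):
    """ratings: one list per subject, each with k repeated measurements."""
    n = len(ratings)          # subjects
    k = len(ratings[0])       # repeated measurements per subject
    grand = sum(sum(r) for r in ratings) / (n * k)
    subj_means = [sum(r) / k for r in ratings]
    # Between-subject and within-subject mean squares
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((x - m) ** 2
              for r, m in zip(ratings, subj_means)
              for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# A feature that barely changes between test and retest scores close to 1.
stable = [[10.0, 10.1], [20.0, 19.9], [30.0, 30.2], [40.0, 39.8]]
print(round(icc_1_1(stable), 2))  # 1.0
```

A feature whose retest values swing wildly relative to between-subject spread drops toward (or below) zero, which is why a high ICC cut-off filters out non-robust features.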
BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages
We introduce BlaBla, an open-source Python library for extracting linguistic
features with proven clinical relevance to neurological and psychiatric
diseases across many languages. BlaBla is a unifying framework for accelerating
and simplifying clinical linguistic research. The library is built on
state-of-the-art NLP frameworks and supports multithreaded/GPU-enabled feature
extraction via both native Python calls and a command line interface. We
describe BlaBla's architecture and clinical validation of its features across
12 diseases. We further demonstrate the application of BlaBla to the task of
visualizing and classifying language disorders in three languages, using real
clinical data from the AphasiaBank dataset. We make the codebase freely
available to researchers with the hope of providing a consistent,
well-validated foundation for the next generation of clinical linguistic
research.
Comment: 5 pages, 1 figure. Under review
Some Bibliographical References on Intonation and Intonational Meaning
A by-no-means-complete collection of references for those interested in
intonational meaning, with other miscellaneous references on intonation
included. Additional references are welcome, and should be sent to
[email protected].
Comment: 14 pp of text and citations, bibtex added as separate file
Substitute Based SCODE Word Embeddings in Supervised NLP Tasks
We analyze a word embedding method in supervised tasks. It maps words on a
sphere such that words co-occurring in similar contexts lie close together. The
similarity of contexts is measured by the distribution of substitutes that can
fill them. We compare word embeddings, including more recent representations,
on Named Entity Recognition (NER), Chunking, and Dependency Parsing. We also
examine our framework in multilingual dependency parsing. The results show that
the proposed method achieves results as good as or better than the other word
embeddings on the tasks we investigate, and it achieves state-of-the-art
results in multilingual dependency parsing. Word embeddings in 7 languages are
available for public use.
Comment: 11 pages
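Since SCODE places words on a sphere, contextual similarity between two words reduces to a dot product of unit vectors; the sketch below, with made-up 3-d vectors, illustrates nearest-neighbour lookup on the sphere.

```python
import math

# Toy sketch: with word embeddings projected onto the unit sphere (as in
# SCODE), the nearest neighbour under cosine similarity is simply the word
# with the largest dot product. All vectors here are invented.
def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

embeddings = {  # hypothetical 3-d vectors before projection to the sphere
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "parliament": [0.0, 0.1, 0.9],
}
unit = {w: normalize(v) for w, v in embeddings.items()}

def nearest(word):
    # On the unit sphere, maximal dot product == maximal cosine similarity.
    return max((w for w in unit if w != word),
               key=lambda w: sum(a * b for a, b in zip(unit[word], unit[w])))

print(nearest("cat"))  # dog
```

Words that occur in similar substitute distributions end up close on the sphere, so this lookup recovers distributional neighbours directly.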
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages. Report of Working Group 1 for the European Network of Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation
algorithm for tokenizing sentences into subword units. We view the subword
segmentation of output sentences as a latent variable that should be
marginalized out for learning and inference. A mixed character-subword
transformer is proposed, which enables exact log marginal likelihood estimation
and exact MAP inference to find target segmentations with maximum posterior
probability. DPE uses a lightweight mixed character-subword transformer as a
means of pre-processing parallel data to segment output sentences using dynamic
programming. Empirical results on machine translation suggest that DPE is
effective for segmenting output sentences and can be combined with BPE dropout
for stochastic segmentation of source sentences. DPE achieves an average
improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average
improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several
WMT datasets, translating between English and German, Romanian, Estonian,
Finnish, and Hungarian.
Comment: update related work
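The dynamic program at the heart of DPE can be illustrated independently of the transformer that supplies the scores: given fixed per-subword log-probabilities (the table below is hypothetical, standing in for model scores), a Viterbi-style recursion recovers the maximum-probability segmentation.

```python
# Sketch of the dynamic program behind maximum-posterior subword segmentation.
# DPE scores subwords with a mixed character-subword transformer; here a
# made-up log-probability table stands in for those scores.
LOGP = {
    "low": -1.0, "est": -1.2, "lo": -2.0, "west": -3.0,
    "l": -5.0, "o": -5.0, "w": -5.0, "e": -5.0, "s": -5.0, "t": -5.0,
}

def segment(word, logp):
    """best[i] = best score of any segmentation of word[:i]."""
    n = len(word)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i], back[i] = best[j] + logp[piece], j
    pieces, i = [], n
    while i > 0:  # follow backpointers to recover the argmax segmentation
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(segment("lowest", LOGP))  # ['low', 'est']
```

"low" + "est" (score -2.2) beats "lo" + "west" (score -5.0), so the DP picks the former; marginalizing instead of maximizing over the same lattice gives the log marginal likelihood used for training.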
Deep Learning applied to NLP
Convolutional Neural Networks (CNNs) are typically associated with Computer
Vision. CNNs are responsible for major breakthroughs in Image Classification
and are the core of most Computer Vision systems today. More recently, CNNs
have been applied to problems in Natural Language Processing and have obtained
some interesting results. In this paper, we will try to explain the basics of
CNNs, their different variations and how they have been applied to NLP.
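The basic CNN operation on text can be sketched as a 1-D convolution over token embeddings followed by max-over-time pooling; all values below are toy numbers.

```python
# Minimal sketch of the core CNN-for-NLP operation: slide a filter over a
# sequence of token embeddings, then max-pool over positions. The embeddings
# and filter weights below are toy values, not trained parameters.
def conv1d_maxpool(embeddings, filt):
    """embeddings: list of d-dim token vectors; filt: width-w list of
    d-dim weight rows. Returns the max-over-time filter response."""
    w = len(filt)
    scores = []
    for i in range(len(embeddings) - w + 1):
        window = embeddings[i:i + w]
        # Dot product of the filter with the w-token window
        scores.append(sum(f * x
                          for frow, erow in zip(filt, window)
                          for f, x in zip(frow, erow)))
    return max(scores)  # max-over-time pooling

# 4 tokens with 2-d embeddings; one filter of width 2.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
print(conv1d_maxpool(sent, filt))  # 2.0
```

A real model applies many such filters of several widths (capturing different n-gram patterns) and feeds the pooled responses to a classifier.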
Evaluating Neural Morphological Taggers for Sanskrit
Neural sequence labelling approaches have achieved state of the art results
in morphological tagging. We evaluate the efficacy of four standard sequence
labelling models on Sanskrit, a morphologically rich, fusional Indian language.
As its label space can theoretically contain more than 40,000 labels, systems
that explicitly model the internal structure of a label are more suited for the
task, because of their ability to generalise to labels not seen during
training. We find that although some neural models perform better than others,
one of the common causes for error for all of these models is mispredictions
due to syncretism.
Comment: Accepted to SIGMORPHON Workshop at ACL 2020
svcR: An R Package for Support Vector Clustering improved with Geometric Hashing applied to Lexical Pattern Discovery
We present a new R package which takes a numerical matrix format as data
input, and computes clusters using a support vector clustering method (SVC). We
have implemented an original 2D-grid labeling approach to speed up cluster
extraction. In this sense, SVC can be seen as an efficient cluster-extraction
method when clusters are separable in a 2-D map. Second, we show that this SVC
approach, using a Jaccard-Radial base kernel, can classify a set of terms into
ontological classes well enough to help define regular-expression rules for
information extraction in documents; our case study concerns a set of terms
and documents about developmental and molecular biology.
Natural language technology and query expansion: issues, state-of-the-art and perspectives
The availability of an abundance of knowledge sources has spurred a large
amount of effort in the development and enhancement of Information Retrieval
techniques. Users' information needs are expressed in natural language, and
successful retrieval is very much dependent on the effective communication of
the intended purpose. Natural language queries consist of multiple linguistic
features which serve to represent the intended search goal. Linguistic
characteristics that cause semantic ambiguity and misinterpretation of queries
as well as additional factors, such as a lack of familiarity with the search
environment, affect users' ability to accurately represent their information
needs, a phenomenon known as the intention gap. This gap directly affects the
relevance of the returned search results, which may not be to the user's
satisfaction, and is therefore a major issue impacting the effectiveness of
information retrieval systems. Central to our discussion is the identification
of the significant constituents that characterize the query intent and their
enrichment through the addition of meaningful terms, phrases or even latent
representations, either manually or automatically to capture their intended
meaning. Specifically, we discuss techniques to achieve the enrichment and in
particular those utilizing the information gathered from statistical processing
of term dependencies within a document corpus or from external knowledge
sources such as ontologies. We lay down the anatomy of a generic linguistic
based query expansion framework and propose its module-based decomposition,
covering topical issues from query processing, information retrieval,
computational linguistics and ontology engineering. For each of the modules we
review state-of-the-art solutions in the literature, categorized and analyzed
in light of the techniques used.
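One of the statistical enrichment techniques discussed, expansion from term dependencies in a corpus, can be sketched by ranking candidate terms by pointwise mutual information (PMI) with the query term; the corpus and query below are invented.

```python
import math
from collections import Counter
from itertools import combinations

# Hedged sketch of statistical query expansion: rank candidate expansion terms
# by PMI with the query term, estimated from within-document co-occurrence in
# a toy corpus.
docs = [
    "car engine repair", "car engine oil", "engine oil change",
    "apple pie recipe", "apple tart recipe",
]
term_counts = Counter()
pair_counts = Counter()
for d in docs:
    terms = sorted(set(d.split()))
    term_counts.update(terms)
    pair_counts.update(combinations(terms, 2))  # pairs stored in sorted order
n = len(docs)

def pmi(a, b):
    pair = tuple(sorted((a, b)))
    if pair_counts[pair] == 0:
        return float("-inf")
    return math.log((pair_counts[pair] / n) /
                    ((term_counts[a] / n) * (term_counts[b] / n)))

def expand(query_term, k=1):
    cands = [t for t in term_counts if t != query_term]
    return sorted(cands, key=lambda t: pmi(query_term, t), reverse=True)[:k]

print(expand("car", 2))  # ['repair', 'engine']
```

Note that plain PMI favours rare co-occurring terms ("repair" outranks the more frequent "engine" here); practical expansion modules typically damp this with frequency cut-offs or discounted PMI variants, or draw candidates from external knowledge sources such as ontologies instead.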