Conciseness: An Overlooked Language Task
We report on novel investigations into training models that make sentences
concise. We define the task and show that it is different from related tasks
such as summarization and simplification. For evaluation, we release two test
sets, consisting of 2000 sentences each, that were annotated by two and five
human annotators, respectively. We demonstrate that conciseness is a difficult
task for which zero-shot setups with large neural language models often do not
perform well. Given the limitations of these approaches, we propose a synthetic
data generation method based on round-trip translations. Using this data to
either train Transformers from scratch or fine-tune T5 models yields our
strongest baselines, which can be further improved by fine-tuning on an
artificial conciseness dataset that we derived from multi-annotator machine
translation test sets. Comment: EMNLP 2022 Workshop on Text Simplification, Accessibility, and
Readability (TSAR)
Meta-analysis of massively parallel reporter assays enables prediction of regulatory function across cell types.
Deciphering the potential of noncoding loci to influence gene regulation has been the subject of intense research, with important implications for understanding the genetic underpinnings of human diseases. Massively parallel reporter assays (MPRAs) can measure the regulatory activity of thousands of DNA sequences and their variants in a single experiment. With an increasing number of publicly available MPRA data sets, one can now develop data-driven models which, given a DNA sequence, predict its regulatory activity. Here, we performed a comprehensive meta-analysis of several MPRA data sets in a variety of cellular contexts. We first applied an ensemble of methods to predict MPRA output in each context and observed that the most predictive features are consistent across data sets. We then demonstrate that predictive models trained in one cellular context can be used to predict MPRA output in another, with loss of accuracy attributed to cell-type-specific features. Finally, we show that our approach achieves top performance in the Fifth Critical Assessment of Genome Interpretation "Regulation Saturation" Challenge for predicting effects of single-nucleotide variants. Overall, our analysis provides insights into how MPRA data can be leveraged to highlight functional regulatory regions throughout the genome and can guide effective design of future experiments by better prioritizing regions of interest.
Accessibility and urban design - Knowledge matters
Copyright @ 2009 Birmingham City University Publicatio
Accurator: Nichesourcing for Cultural Heritage
With more and more cultural heritage data being published online, their
usefulness in this open context depends on the quality and diversity of
descriptive metadata for collection objects. In many cases, existing metadata
is not adequate for a variety of retrieval and research tasks and more specific
annotations are necessary. However, eliciting such annotations is a challenge
since it often requires domain-specific knowledge. Where crowdsourcing can be
successfully used for eliciting simple annotations, identifying people with the
required expertise might prove troublesome for tasks requiring more complex or
domain-specific knowledge. Nichesourcing addresses this problem by tapping
into the expert knowledge available in niche communities. This paper presents
Accurator, a methodology for conducting nichesourcing campaigns for cultural
heritage institutions, by addressing communities, organizing events and
tailoring a web-based annotation tool to a domain of choice. The contribution
of this paper is threefold: 1) a nichesourcing methodology, 2) an annotation
tool for experts and 3) validation of the methodology and tool in three case
studies. The three domains of the case studies are birds on art, bible prints
and fashion images. We compare the quality and quantity of obtained annotations
in the three case studies, showing that the nichesourcing methodology in
combination with the image annotation tool can be used to collect high-quality
annotations in a variety of domains and annotation tasks. A user evaluation
indicates the tool is suitable and usable for domain-specific annotation tasks.
Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.
At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce, access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities
Secure Querying of Recursive XML Views: A Standard XPath-based Technique
Most state-of-the-art approaches for securing XML documents allow users to
access data only through authorized views defined by annotating an XML grammar
(e.g. DTD) with a collection of XPath expressions. To prevent improper
disclosure of confidential information, user queries posed on these views need
to be rewritten into equivalent queries on the underlying documents. This
rewriting enables us to avoid the overhead of view materialization and
maintenance. A major concern here is that query rewriting for recursive XML
views is still an open problem. To overcome this problem, some works have been
proposed to translate XPath queries into non-standard ones, called Regular
XPath queries. However, query rewriting under Regular XPath can be of
exponential size as it relies on an automaton model. Most importantly, Regular
XPath remains a theoretical achievement. Indeed, it is not commonly used in
practice as translation and evaluation tools are not available. In this paper,
we show that query rewriting is always possible for recursive XML views using
only the expressive power of the standard XPath. We investigate the extension
of the downward class of XPath, composed only of child and descendant axes,
with some axes and operators and we propose a general approach to rewrite
queries under recursive XML views. Unlike Regular XPath-based works, we provide
a rewriting algorithm which processes the query only over the annotated DTD
grammar and which can run in linear time in the size of the query. An
experimental evaluation demonstrates that our algorithm is efficient and scales
well. Comment: (2011
Towards Affordable Disclosure of Spoken Word Archives
This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of the World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition – supporting, e.g., within-document search – are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory, and requires additional research.