125,840 research outputs found
A detailed analysis of phrase-based and syntax-based machine translation: the search for systematic differences
This paper describes a range of automatic and manual comparisons of phrase-based and syntax-based statistical machine translation methods applied to English-German and
English-French translation of user-generated content. The syntax-based methods underperform the phrase-based models and the relaxation of syntactic constraints to broaden translation rule coverage means that these models do not necessarily generate output which is more grammatical than the output produced by the phrase-based models. Although the
systems generate different output and can potentially
be fruitfully combined, the lack of systematic difference between these models makes the combination task more challenging
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection is available online
The impact of sequence database choice on metaproteomic results in gut microbiota studies
Background: Elucidating the role of gut microbiota in physiological and pathological processes has recently emerged as a key research aim in life sciences. In this respect, metaproteomics, the study of the whole protein complement of a microbial community, can provide a unique contribution by revealing which functions are actually being expressed by specific microbial taxa. However, its wide application to gut microbiota research has been hindered by challenges in data analysis, especially related to the choice of the proper sequence databases for protein identification.
Results: Here, we present a systematic investigation of variables concerning database construction and annotation and evaluate their impact on human and mouse gut metaproteomic results. We found that both publicly available and experimental metagenomic databases lead to the identification of unique peptide assortments, suggesting parallel database searches as a mean to gain more complete information. In particular, the contribution of experimental metagenomic databases was revealed to be mandatory when dealing with mouse samples. Moreover, the use of a "merged" database, containing all metagenomic sequences from the population under study, was found to be generally preferable over the use of sample-matched databases. We also observed that taxonomic and functional results are strongly database-dependent, in particular when analyzing the mouse gut microbiota. As a striking example, the Firmicutes/Bacteroidetes ratio varied up to tenfold depending on the database used. Finally, assembling reads into longer contigs provided significant advantages in terms of functional annotation yields.
Conclusions: This study contributes to identify host- and database-specific biases which need to be taken into account in a metaproteomic experiment, providing meaningful insights on how to design gut microbiota studies and to perform metaproteomic data analysis. In particular, the use of multiple databases and annotation tools has to be encouraged, even though this requires appropriate bioinformatic resources
Language (Technology) is Power: A Critical Survey of "Bias" in NLP
We survey 146 papers analyzing "bias" in NLP systems, finding that their
motivations are often vague, inconsistent, and lacking in normative reasoning,
despite the fact that analyzing "bias" is an inherently normative process. We
further find that these papers' proposed quantitative techniques for measuring
or mitigating "bias" are poorly matched to their motivations and do not engage
with the relevant literature outside of NLP. Based on these findings, we
describe the beginnings of a path forward by proposing three recommendations
that should guide work analyzing "bias" in NLP systems. These recommendations
rest on a greater recognition of the relationships between language and social
hierarchies, encouraging researchers and practitioners to articulate their
conceptualizations of "bias"---i.e., what kinds of system behaviors are
harmful, in what ways, to whom, and why, as well as the normative reasoning
underlying these statements---and to center work around the lived experiences
of members of communities affected by NLP systems, while interrogating and
reimagining the power relations between technologists and such communities
Four-Quark Effective Operators at Hadron Colliders
The robustness of translating effective operator constraints to BSM theories
crucially depends on the mass and coupling of BSM particles. This is especially
relevant for hadron colliders where the partonic centre of mass energy is
around the typical energy scales of natural BSM theories. The caveats in
applying the limits are discussed using Z' and G' models, illustrating the
effects for a large class of models. This analysis shows that the applicability
of effective operators mainly depends on the ratio of the transfer energy in
the events and the mass scale of the full theory. Moreover, based on these
results a method is developed to recast existing experimental limits on
four-quark effective operators to the full theory parameter space.Comment: 28 pages, 10 figures, version 2 updated to JHEP 03 (2015) 09
- …