
    Substring-based Machine Translation

    Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.

    Molecular architecture of Gαo and the structural basis for RGS16-mediated deactivation

    Heterotrimeric G proteins relay extracellular cues from heptahelical transmembrane receptors to downstream effector molecules. Composed of an α subunit with intrinsic GTPase activity and a βγ heterodimer, the trimeric complex dissociates upon receptor-mediated nucleotide exchange on the α subunit, enabling each component to engage downstream effector targets for either activation or inhibition as dictated in a particular pathway. To mitigate excessive effector engagement and concomitant signal transmission, the Gα subunit's intrinsic activation timer (the rate of GTP hydrolysis) is regulated spatially and temporally by a class of GTPase accelerating proteins (GAPs) known as the regulator of G protein signaling (RGS) family. The array of G protein-coupled receptors, Gα subunits, RGS proteins and downstream effectors in mammalian systems is vast. Understanding the molecular determinants of specificity is critical for a comprehensive mapping of the G protein system. Here, we present the 2.9 Å crystal structure of the enigmatic, neuronal G protein Gαo in the GTP hydrolytic transition state, complexed with RGS16. Comparison with the 1.89 Å structure of apo-RGS16, also presented here, reveals plasticity upon Gαo binding, the determinants for GAP activity, and the structurally unique features of Gαo that likely distinguish it physiologically from other members of the larger Gαi family, affording insight into receptor, GAP and effector specificity.

    The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

    Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations. Comment: 19 pages

    Neutrino Oscillations and the Supernova 1987A Signal

    We study the impact of neutrino oscillations on the interpretation of the supernova (SN) 1987A neutrino signal by means of a maximum-likelihood analysis. We focus on oscillations between $\overline\nu_e$ and $\overline\nu_\mu$ or $\overline\nu_\tau$ with those mixing parameters that would solve the solar neutrino problem. For the small-angle MSW solution ($\Delta m^2\approx10^{-5}\,\mathrm{eV}^2$, $\sin^22\Theta_0\approx0.007$), there are no significant oscillation effects on the Kelvin-Helmholtz cooling signal; we confirm previous best-fit values for the neutron-star binding energy and average spectral $\overline\nu_e$ temperature. There is only marginal overlap between the upper end of the 95.4% CL inferred range of $\langle E_{\overline\nu_e}\rangle$ and the lower end of the range of theoretical predictions. Any admixture of the stiffer $\overline\nu_\mu$ spectrum by oscillations aggravates the conflict between experimentally inferred and theoretically predicted spectral properties. For mixing parameters in the neighborhood of the large-angle MSW solution ($\Delta m^2\approx10^{-5}\,\mathrm{eV}^2$, $\sin^22\Theta_0\approx0.7$) the oscillations in the SN are adiabatic, but one needs to include the regeneration effect in the Earth which causes the Kamiokande and IMB detectors to observe different $\overline\nu_e$ spectra. For the solar vacuum solution ($\Delta m^2\approx10^{-10}\,\mathrm{eV}^2$, $\sin^22\Theta_0\approx1$) the oscillations in the SN are nonadiabatic; vacuum oscillations take place between the SN and the detector. If either of the large-angle solutions were borne out by the upcoming round of solar neutrino experiments, one would have to conclude that the SN 1987A $\overline\nu_\mu$ and/or $\overline\nu_e$ spectra had been much softer than predicted by current ... Comment: Final version with very minor wording changes, to be published in Phys. Rev.

    Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation

    Many recent advances in natural language generation have been fueled by training large language models on internet-scale data. However, this paradigm can lead to models that generate toxic, inaccurate, and unhelpful content, and automatic evaluation metrics often fail to identify these behaviors. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation. First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention. Comment: Work in Progress

    High-Throughput Screening for Small-Molecule Inhibitors of LARG-Stimulated RhoA Nucleotide Binding via a Novel Fluorescence Polarization Assay

    Guanine nucleotide-exchange factors (GEFs) stimulate guanine nucleotide exchange and the subsequent activation of Rho-family proteins in response to extracellular stimuli acting upon cytokine, tyrosine kinase, adhesion, integrin, and G-protein coupled receptors (GPCRs). Upon Rho activation, several downstream events occur, such as morphological and cytoskeletal changes, motility, growth, survival, and gene transcription. The RhoGEF Leukemia-Associated RhoGEF (LARG) is a member of the Regulators of G-protein Signaling Homology Domain (RH) family of GEFs originally identified as a result of chromosomal translocation in acute myeloid leukemia. Using a novel fluorescence polarization guanine nucleotide binding assay utilizing BODIPY-Texas Red-GTPγS (BODIPY-TR-GTPγS), we performed a ten-thousand compound high-throughput screen for inhibitors of LARG-stimulated RhoA nucleotide binding. Five compounds identified from the high-throughput screen were confirmed in a non-fluorescent radioactive guanine nucleotide binding assay measuring LARG-stimulated [35S] GTPγS binding to RhoA, thus ruling out non-specific fluorescent effects. All five compounds selectively inhibited LARG-stimulated RhoA [35S] GTPγS binding, but had little to no effect upon RhoA or Gαo [35S] GTPγS binding. Therefore, these five compounds should serve as promising starting points for the development of small molecule inhibitors of LARG-mediated nucleotide exchange as both pharmacological tools and therapeutics. In addition, the fluorescence polarization guanine nucleotide binding assay described here should serve as a useful approach for both high-throughput screening and general biological applications.

    Tiny-Scale Molecular Structures in the Magellanic Clouds (Part 1)

    We report on the FUSE detections of the HD and CO molecules on the lines of sight towards three Large Magellanic Cloud stars: Sk -67D05, Sk -68D135, and Sk -69D246. HD is also detected for the first time on the lines of sight towards two Small Magellanic Cloud stars: AV 95 and Sk 159. While the HD and CO abundances are expected to be lower in the Large Magellanic Cloud where molecular fractions are a third of the Galactic value and where the photodissociation flux is up to thousands of times larger, we report an average HD/H$_2$ ratio of 1.4$\pm$0.5 ppm and CO/H$_2$ ratio ranging from 0.8 to 2.7 ppm similar to the Galactic ones. We tentatively identify a deuterium reservoir (hereafter D-reservoir) towards the Small Magellanic Cloud, along the light path to AV 95. We derive a D/H ratio ranging from $1\times10^{-6}$ to $1.1\times10^{-5}$. Comment: 34 pages, 10 tables, 12 figures, accepted for publication in A&A

    Highly Variable Chloroplast Markers for Evaluating Plant Phylogeny at Low Taxonomic Levels and for DNA Barcoding

    BACKGROUND: At present, plant molecular systematics and DNA barcoding techniques rely heavily on the use of chloroplast gene sequences. Because of the relatively low evolutionary rates of chloroplast genes, there are very few choices suitable for molecular studies on angiosperms at low taxonomic levels, and for DNA barcoding of species. METHODOLOGY/PRINCIPAL FINDINGS: We scanned the entire chloroplast genomes of 12 genera to search for highly variable regions. Sequence data for 9 genera were from GenBank and those for 3 genera were our own. We identified nearly 5% of the most variable loci from all variable loci in the chloroplast genomes of each genus, and then selected 23 loci that were present in at least three genera. The 23 loci included 4 coding regions, 2 introns, and 17 intergenic spacers. Of the 23 loci, the most variable (in order from highest variability to lowest) were intergenic regions ycf1-a, trnK, rpl32-trnL, and trnH-psbA, followed by trnS(UGA)-trnG(UCC), petA-psbJ, rps16-trnQ, ndhC-trnV, ycf1-b, ndhF, rpoB-trnC, psbE-petL, and rbcL-accD. Three loci, trnS(UGA)-trnG(UCC), trnT-psbD, and trnW-psaJ, showed very high nucleotide diversity per site (π values) across three genera. Other loci may have strong potential for resolving phylogenetic and species identification problems at the species level. The loci accD-psaI, rbcL-accD, rpl32-trnL, rps16-trnQ, and ycf1 are absent from some genera. To amplify and sequence the highly variable loci identified in this study, we designed primers from their conserved flanking regions. We tested the applicability of the primers to amplify target sequences in eight species representing basal angiosperms, monocots, eudicots, rosids, and asterids, and confirmed that the primers amplified the desired sequences of these species. SIGNIFICANCE/CONCLUSIONS: Chloroplast genome sequences contain regions that are highly variable. Such regions are the first consideration when screening for suitable loci to resolve closely related species or genera in phylogenetic analyses, and for DNA barcoding.