Vulgaris: Analysis of a Corpus for Middle-Age Varieties of Italian Language
Italian is a Romance language rooted in Vulgar Latin. Modern Italian was born in Tuscany around the 14th century, mainly through the works of Dante Alighieri, Francesco Petrarca and Giovanni Boccaccio, who are among the most acclaimed authors of the medieval age in Tuscany. However, Italy has long been characterized by a wide variety of dialects, often only loosely related to each other, owing to the past fragmentation of the territory. Italian has absorbed influences from many of these dialects, as well as from other languages, since portions of the country were ruled by other nations, such as Spain and France. In this work we present Vulgaris, a project aimed at studying a corpus of Italian textual resources from authors of different regions, spanning the period between 1200 and 1600. Each composition is associated with its author, and authors are grouped into families, i.e., groups sharing similar stylistic/chronological characteristics. Hence, the dataset is not only a valuable resource for studying the diachronic evolution of Italian and the differences between its dialects, but it is also useful for investigating stylistic differences between individual authors. We provide a detailed statistical analysis of the data, together with a corpus-driven study in dialectology and diachronic varieties.
ContraGen: Effective Contrastive Learning For Causal Language Model
Despite exciting progress in large-scale language generation, the
expressiveness of its representations is severely limited by the
anisotropy issue, where the hidden representations are distributed into
a narrow cone in the vector space. To address this issue, we present ContraGen,
a novel contrastive learning framework to improve the representation with
better uniformity and discrimination. We assess ContraGen on a wide range of
downstream tasks in natural and programming languages. We show that ContraGen
can effectively enhance both uniformity and discrimination of the
representations and lead to the desired improvement on various language
understanding tasks where discriminative representations are crucial for
attaining good performance. Specifically, we attain relative improvements
on the Semantic Textual Similarity tasks and on the Code-to-Code Search
tasks. Furthermore, by improving the expressiveness of the representations,
ContraGen also boosts the source code generation capability, with a relative
improvement in execution accuracy on the HumanEval benchmark.
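The uniformity/discrimination trade-off described above is commonly trained with an InfoNCE-style contrastive objective that pulls matched representation pairs together and pushes apart the rest of the batch. A minimal sketch of such a loss follows; it is not ContraGen's actual objective, and the in-batch-negatives setup and temperature value are assumptions:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: pull each anchor toward its
    positive pair and push it away from every other positive in the batch."""
    # L2-normalise so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The i-th anchor's true positive sits on the diagonal
    return -np.mean(np.diag(log_softmax))
```

Minimising this loss sharpens discrimination (matched pairs score higher than mismatched ones) while the normalisation and repulsion terms spread representations over the unit sphere, countering the narrow-cone anisotropy.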
Squeezing Bottlenecks: Exploring the Limits of Autoencoder Semantic Representation Capabilities
A definitive version of this work was published in Neurocomputing 175 (2016) 1001–1008, DOI 10.1016/j.neucom.2015.06.091. We present a comprehensive study on the use of autoencoders for modelling text data, in which, differently
from previous studies, we focus our attention on the following issues. We explore the suitability of
two different models, binary deep autoencoders (bDA) and replicated-softmax deep autoencoders (rsDA), for
constructing deep autoencoders for text data at the sentence level. We propose and evaluate two novel
metrics for better assessing the text-reconstruction capabilities of autoencoders. We propose an automatic
method to find the critical bottleneck dimensionality for text representations (below which
structural information is lost); and finally we conduct a comparative evaluation across different languages,
exploring the regions of critical bottleneck dimensionality and its relationship to language perplexity.
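The critical-bottleneck idea can be illustrated with a toy sweep over bottleneck sizes: shrink the bottleneck until reconstruction quality collapses. The sketch below uses a rank-k linear autoencoder (truncated SVD) as a stand-in for the paper's deep models, and the error-threshold criterion is an assumption, not the authors' method:

```python
import numpy as np

def reconstruction_error(X, k):
    """Mean squared reconstruction error of a rank-k linear autoencoder
    (truncated SVD) — a stand-in for the paper's deep models."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    X_hat = U[:, :k] * s[:k] @ Vt[:k] + mean
    return np.mean((X - X_hat) ** 2)

def critical_bottleneck(X, tolerance=0.05):
    """Smallest bottleneck dimensionality whose reconstruction error stays
    within `tolerance` of the full-rank error (hypothetical criterion)."""
    full = reconstruction_error(X, min(X.shape))
    for k in range(1, min(X.shape) + 1):
        if reconstruction_error(X, k) - full <= tolerance:
            return k
    return min(X.shape)
```

On data with a known low intrinsic dimensionality, the sweep recovers that dimensionality: below it the error jumps sharply, mirroring the loss of structural information the abstract describes.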
A significant part of this research work was conducted during the first author's attachment to the HLT department of I2R in Singapore. The work of the first and third authors was carried out in the framework of the WIQ-EI IRSES project (Grant no. 269180) within FP7 Marie Curie, the DIANA-APPLICATIONS "Finding Hidden Knowledge in Texts: Applications" project (TIN2012-38603-C02-01), and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. Gupta, PA.; Banchs, R.; Rosso, P. (2016). Squeezing Bottlenecks: Exploring the Limits of Autoencoder Semantic Representation Capabilities. Neurocomputing 175:1001–1008. https://doi.org/10.1016/j.neucom.2015.06.091
Paraphrastic language models
Natural languages are known for their expressive richness. Many sentences can be used to represent the same underlying meaning.
Only modelling the observed surface word sequence can result in poor context coverage and generalization, for example, when using
n-gram language models (LMs). This paper proposes a novel form of language model, the paraphrastic LM, that addresses these
issues. A phrase level paraphrase model statistically learned from standard text data with no semantic annotation is used to generate
multiple paraphrase variants. LM probabilities are then estimated by maximizing their marginal probability. Multi-level language
models estimated at both the word level and the phrase level are combined. An efficient weighted finite state transducer (WFST)
based paraphrase generation approach is also presented. Significant error rate reductions of 0.5–0.6% absolute were obtained over the
baseline n-gram LMs on two state-of-the-art recognition tasks for English conversational telephone speech and Mandarin Chinese
broadcast speech using a paraphrastic multi-level LM modelling both word and phrase sequences. When it is further combined with
word and phrase level feed-forward neural network LMs, significant error rate reductions of 0.9%
absolute (9% relative) and 0.5% absolute (5% relative) were obtained over the baseline n-gram and neural network LMs, respectively. The research leading to these results was supported by EPSRC grant EP/I031022/1 (Natural Speech Technology)
and by DARPA under the Broad Operational Language Translation (BOLT) program. The final published version is available on the publisher's website at http://www.sciencedirect.com/science/article/pii/S088523081400028X
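The marginalisation step at the heart of the paraphrastic LM — summing the base LM's probability over paraphrase variants weighted by the paraphrase model — can be sketched in a few lines. The lookup tables below are hypothetical stand-ins for the learned phrase-level paraphrase model and the baseline n-gram LM:

```python
def paraphrastic_prob(sentence, paraphrases, base_lm):
    """p(W) = sum over paraphrase variants W' of p(W' | W) * p_lm(W').
    `paraphrases` maps a sentence to (variant, weight) pairs whose weights
    sum to 1; `base_lm` maps sentences to probabilities.  Both tables are
    hypothetical stand-ins for the paper's statistically learned models."""
    variants = paraphrases.get(sentence, [(sentence, 1.0)])
    return sum(w * base_lm.get(v, 0.0) for v, w in variants)

# Toy example: two surface forms of the same meaning share probability mass,
# so a sentence that is rare on the surface borrows mass from a frequent
# paraphrase, improving context coverage over a plain n-gram LM.
base_lm = {"thanks a lot": 0.02, "thank you very much": 0.001}
paraphrases = {"thank you very much": [("thank you very much", 0.6),
                                       ("thanks a lot", 0.4)]}
p = paraphrastic_prob("thank you very much", paraphrases, base_lm)
# 0.6 * 0.001 + 0.4 * 0.02 = 0.0086
```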