Identifying Semantic Divergences in Parallel Text without Annotations
Recognizing that even correct translations are not always semantically
equivalent, we automatically detect meaning divergences in parallel sentence
pairs with a deep neural model of bilingual semantic similarity which can be
trained for any parallel corpus without any manual annotation. We show that our
semantic model detects divergences more accurately than models based on surface
features derived from word alignments, and that these divergences matter for
neural machine translation. (Accepted as a full paper to NAACL 2018.)
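The paper trains its own bilingual similarity model directly from parallel text. As a rough, hypothetical illustration of the detection step only, the sketch below substitutes an off-the-shelf multilingual sentence encoder for that model and uses an arbitrary similarity threshold; neither is the authors' method.

# Illustrative sketch only: the paper learns its own bilingual similarity
# model from parallel data; here an off-the-shelf multilingual encoder
# stands in for it, and the 0.8 threshold is an arbitrary assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def flag_divergent(src_sentences, tgt_sentences, threshold=0.8):
    """Return indices of sentence pairs whose cross-lingual similarity
    falls below the threshold, i.e. likely semantic divergences."""
    src_emb = model.encode(src_sentences, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sentences, convert_to_tensor=True)
    sims = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [i for i, s in enumerate(sims) if s < threshold]

# "She left the meeting early." vs. "She left the meeting angry." (in French):
pairs = [("She left the meeting early.", "Elle a quitté la réunion en colère.")]
print(flag_divergent([p[0] for p in pairs], [p[1] for p in pairs]))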
Exploring the State of the Art in Legal QA Systems
Answering questions in the legal domain is a complex task, primarily due to
the intricate nature and diverse range of legal document systems. Providing an
accurate answer to a legal query typically requires specialized knowledge of
the relevant domain, which makes the task challenging even for human experts.
Question answering (QA) systems are designed to generate answers to questions
posed in human language. They use natural language processing to understand
questions and search through information to find relevant answers. QA has
various practical applications, including customer service, education,
research, and cross-lingual communication, but QA systems still face
challenges such as improving natural language understanding and handling
complex or ambiguous questions. At present, there is a lack of surveys that
discuss legal question answering. To address this gap, we provide a
comprehensive survey that reviews 14 benchmark datasets for question answering
in the legal field and presents a review of state-of-the-art Legal Question
Answering deep learning models. We cover the different architectures and
techniques used in these studies, as well as the performance and limitations
of these models. Moreover, we maintain a public GitHub repository where we
regularly upload the most recent articles, open data, and source code. The
repository is available at:
https://github.com/abdoelsayed2016/Legal-Question-Answering-Review
Identifying Semantic Divergences Across Languages
Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, to train automatic machine translation systems, and to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case: words and sentences that are considered to be translations of each other frequently diverge in meaning, often in systematic ways.
In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning-preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision.
We support this claim through three main contributions. First, we show that a large fraction of the data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps separate equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data helps train a neural machine translation system twice as fast without sacrificing quality.
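The corpus-filtering step described in the last contribution can be summarized in a few lines. A minimal sketch, assuming some cross-lingual similarity scorer score_fn (the thesis uses its own learned divergence model; the function name and the 0.5 keep ratio here are illustrative assumptions):

# Hedged sketch of divergence-based corpus filtering, not the thesis's
# exact procedure: score each parallel pair with any cross-lingual
# similarity function and keep only the least divergent pairs for
# neural machine translation training.
def filter_parallel_corpus(pairs, score_fn, keep_ratio=0.5):
    """pairs: list of (source_sentence, target_sentence) tuples.
    Keep the `keep_ratio` highest-scoring (least divergent) pairs."""
    scored = sorted(pairs, key=lambda p: score_fn(p[0], p[1]), reverse=True)
    return scored[: int(len(scored) * keep_ratio)]

Training on the retained half of the data is what yields the reported speedup: the model sees fewer but cleaner examples per epoch.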
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
We present SeaEval, a benchmark for multilingual foundation models. In
addition to characterizing how these models understand and reason with natural
language, we also investigate how well they comprehend cultural practices,
nuances, and values. Alongside standard accuracy metrics, we investigate the
brittleness of foundation models in the dimensions of semantics and
multilinguality. Our analyses span both open-sourced and closed models, leading
to empirical results across classic NLP tasks, reasoning, and cultural
comprehension. Key findings indicate (1) Most models exhibit varied behavior
when given paraphrased instructions. (2) Many models still suffer from exposure
bias (e.g., positional bias, majority label bias). (3) For questions rooted in
factual, scientific, and commonsense knowledge, consistent responses are
expected across multilingual queries that are semantically equivalent. Yet,
most models surprisingly demonstrate inconsistent performance on these queries.
(4) Multilingually-trained models have not attained "balanced multilingual"
capabilities. Our endeavors underscore the need for more generalizable semantic
representations and enhanced multilingual contextualization. SeaEval can serve
as a launchpad for more thorough investigations and evaluations for
multilingual and multicultural scenarios. (15 pages, 7 figures.)
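Finding (3) amounts to a cross-lingual consistency check. A minimal sketch in the spirit of SeaEval, though not its official scorer: a model is counted as consistent on a question only if it returns the same answer for every semantically equivalent variant (paraphrase or translation) of that question.

# Hedged illustration of a consistency metric; the aggregation rule
# (exact answer match across all variants) is an assumption, not
# necessarily SeaEval's published formula.
def consistency_score(answers_per_question):
    """answers_per_question: one inner list of model answers per question,
    one answer per paraphrase/translation of that question. Returns the
    fraction of questions answered identically across all variants."""
    consistent = sum(1 for ans in answers_per_question if len(set(ans)) == 1)
    return consistent / len(answers_per_question)

print(consistency_score([["B", "B", "B"], ["A", "C", "A"]]))  # 0.5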
BERTimbau: Pre-trained BERT Models for Brazilian Portuguese
Advisors: Roberto de Alencar Lotufo, Rodrigo Frassetto Nogueira. Master's thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: Recent advances in language representation using neural networks and deep learning have made it viable to transfer the learned internal states of large pretrained language models (LMs) to downstream natural language processing (NLP) tasks. This transfer learning approach improves the overall performance on many tasks and is highly beneficial when labeled data is scarce, making pretrained LMs valuable resources, especially for languages with few annotated training examples. In this work, we train BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, which we nickname BERTimbau. We evaluate our models on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition. Our models improve the state of the art in all of these tasks, outperforming Multilingual BERT and confirming the effectiveness of large pretrained LMs for Portuguese. We release our models to the community in the hope of providing strong baselines for future NLP research.
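A minimal usage sketch, assuming the released checkpoint is the one commonly published on the Hugging Face Hub as neuralmind/bert-base-portuguese-cased (the widely cited BERTimbau identifier; verify against the authors' release):

# Loading BERTimbau to extract contextual embeddings; the model id below
# is an assumption based on the commonly distributed checkpoint.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased")

inputs = tokenizer("Tinha uma pedra no meio do caminho.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)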
Attentive Deep Neural Networks for Legal Document Retrieval
Legal text retrieval serves as a key component in a wide range of legal text
processing tasks such as legal question answering, legal case entailment, and
statute law retrieval. The performance of legal text retrieval depends, to a
large extent, on the representation of text, both query and legal documents.
Based on good representations, a legal text retrieval model can effectively
match the query to its relevant documents. Because legal documents often
contain long articles and only some parts are relevant to queries, it is quite
a challenge for existing models to represent such documents. In this paper, we
study the use of attentive neural network-based text representation for statute
law document retrieval. We propose a general approach using deep neural
networks with attention mechanisms. Based on it, we develop two hierarchical
architectures with sparse attention to represent long sentences and articles,
and we name them Attentive CNN and Paraformer. The methods are evaluated on
datasets of different sizes and characteristics in English, Japanese, and
Vietnamese. Experimental results show that: i) Attentive neural methods
substantially outperform non-neural methods in terms of retrieval performance
across datasets and languages; ii) Pretrained transformer-based models achieve
better accuracy on small datasets at the cost of high computational
complexity, while the lighter-weight Attentive CNN achieves better accuracy on
large datasets; and iii) our proposed Paraformer outperforms state-of-the-art
methods on the COLIEE dataset, achieving the highest recall and F2 scores in
the top-N retrieval task. (Preprint version; the official version will be
published in the Artificial Intelligence and Law journal.)
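The hierarchical idea underlying both architectures can be conveyed with a simple query-conditioned attention pooler. A hedged sketch, not the paper's Attentive CNN or Paraformer: encode each paragraph of a long article separately, let attention weights (conditioned on the query) pool the paragraph vectors into one document vector, and score that vector against the query. All dimensions and the dot-product scorer are illustrative assumptions.

# Sketch of query-conditioned attention pooling over paragraph encodings,
# so long articles are scored mostly on their query-relevant parts.
import torch
import torch.nn.functional as F

def attentive_pool(query_vec, para_vecs):
    """query_vec: (d,); para_vecs: (n_paragraphs, d).
    Returns a scalar relevance score for the whole document."""
    scores = para_vecs @ query_vec        # (n_paragraphs,) raw relevance
    weights = F.softmax(scores, dim=0)    # attention over paragraphs
    doc_vec = weights @ para_vecs         # (d,) pooled document vector
    return torch.dot(query_vec, doc_vec)  # final query-document score

q = torch.randn(64)          # toy query encoding
paras = torch.randn(10, 64)  # toy paragraph encodings for one article
print(attentive_pool(q, paras))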