
    Identifying Semantic Divergences in Parallel Text without Annotations

    Recognizing that even correct translations are not always semantically equivalent, we automatically detect meaning divergences in parallel sentence pairs with a deep neural model of bilingual semantic similarity, which can be trained for any parallel corpus without any manual annotation. We show that our semantic model detects divergences more accurately than models based on surface features derived from word alignments, and that these divergences matter for neural machine translation. Comment: Accepted as a full paper to NAACL 2018.
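    The paper trains its own bilingual similarity model from synthetic divergent examples; as a simplified proxy for the idea, the sketch below flags divergent pairs by thresholding the cosine similarity of off-the-shelf multilingual sentence embeddings. The model name and threshold are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: flag meaning divergences in parallel sentence pairs by
# scoring cross-lingual similarity with a multilingual sentence encoder.
# This is a proxy for the paper's approach, not its trained model.
from sentence_transformers import SentenceTransformer, util

# Assumed encoder; any multilingual sentence-embedding model would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def flag_divergences(pairs, threshold=0.75):
    """Return (src, tgt, score, is_divergent) for each parallel pair."""
    src, tgt = zip(*pairs)
    src_emb = model.encode(list(src), convert_to_tensor=True, normalize_embeddings=True)
    tgt_emb = model.encode(list(tgt), convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [(s, t, float(sc), float(sc) < threshold)
            for s, t, sc in zip(src, tgt, scores)]

pairs = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),        # equivalent
    ("The meeting was cancelled.", "La réunion a été reportée à lundi."),  # divergent
]
for src, tgt, score, divergent in flag_divergences(pairs):
    print(f"{score:.2f}  {'DIVERGENT' if divergent else 'ok'}  {src} ||| {tgt}")
```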

    Exploring the State of the Art in Legal QA Systems

    Answering questions related to the legal domain is a complex task, primarily due to the intricate nature and diverse range of legal document systems. Providing an accurate answer to a legal query typically necessitates specialized knowledge in the relevant domain, which makes this task all the more challenging, even for human experts. Question answering (QA) systems are designed to generate answers to questions asked in human languages. They use natural language processing to understand questions and search through information to find relevant answers. QA has various practical applications, including customer service, education, research, and cross-lingual communication. However, QA systems face challenges such as improving natural language understanding and handling complex and ambiguous questions. At this time, there is a lack of surveys that discuss legal question answering. To address this problem, we provide a comprehensive survey that reviews 14 benchmark datasets for question answering in the legal field and presents a comprehensive review of the state-of-the-art legal question answering deep learning models. We cover the different architectures and techniques used in these studies, as well as the performance and limitations of these models. Moreover, we have established a public GitHub repository where we regularly upload the most recent articles, open data, and source code. The repository is available at: https://github.com/abdoelsayed2016/Legal-Question-Answering-Review
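    Many of the surveyed systems follow a retrieve-then-read pattern: first find the candidate legal passages, then extract an answer span from them. The sketch below illustrates that pattern under stated assumptions; the corpus, question, and models are placeholders and not taken from any surveyed system.

```python
# Minimal retrieve-then-read sketch of a QA pipeline over legal passages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

corpus = [
    "A contract requires offer, acceptance, and consideration to be enforceable.",
    "A tenant must receive written notice at least 30 days before eviction.",
]
question = "How much notice must a tenant receive before eviction?"

# Stage 1: lexical retrieval over the passage corpus.
vectorizer = TfidfVectorizer().fit(corpus)
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(corpus))[0]
best_passage = corpus[scores.argmax()]

# Stage 2: extractive reading over the retrieved passage
# (default English QA model; a legal-domain reader would be substituted in practice).
reader = pipeline("question-answering")
print(reader(question=question, context=best_passage))
```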

    Identifying Semantic Divergences Across Languages

    Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, train automatic machine translation systems, as well as to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case---words and sentences that are considered to be translations of each other frequently diverge in meaning, often in systematic ways. In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision. We support this claim through three main contributions. First, we show that a large fraction of data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps in separating equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data helps in training a neural machine translation system twice as fast without sacrificing quality.
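    The final contribution amounts to a corpus-filtering step: score each parallel pair with a divergence model and train NMT only on pairs judged equivalent. The sketch below shows that filtering step under stated assumptions; the scorer, file paths, and threshold are illustrative, not the thesis's exact setup.

```python
# Minimal sketch: filter a parallel corpus with any cross-lingual
# equivalence/divergence scorer before NMT training.
def filter_parallel_corpus(src_path, tgt_path, out_prefix, score_fn, threshold=0.5):
    """Keep only pairs whose equivalence score meets the threshold; return count kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_prefix + ".src", "w", encoding="utf-8") as out_src, \
         open(out_prefix + ".tgt", "w", encoding="utf-8") as out_tgt:
        for src, tgt in zip(fs, ft):
            if score_fn(src.strip(), tgt.strip()) >= threshold:
                out_src.write(src)
                out_tgt.write(tgt)
                kept += 1
    return kept

# Usage (hypothetical scorer returning a semantic-equivalence probability):
# kept = filter_parallel_corpus("train.en", "train.fr", "train.filtered",
#                               score_fn=my_divergence_model.predict_equivalence)
```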

    SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

    We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate that: (1) Most models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained "balanced multilingual" capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios. Comment: 15 pages, 7 figures.
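    Finding (3) rests on a cross-lingual consistency check: ask semantically equivalent questions in several languages and measure how often the model returns the same answer. The sketch below shows that check in its simplest form; `ask_model` is a placeholder for any foundation-model API, and SeaEval's own metrics are more elaborate.

```python
# Minimal sketch of a cross-lingual consistency rate over equivalent queries.
def consistency_rate(ask_model, multilingual_queries):
    """multilingual_queries: list of lists, each holding one question in N languages."""
    consistent = 0
    for variants in multilingual_queries:
        answers = [ask_model(q).strip().lower() for q in variants]
        # Count the query as consistent only if every language yields the same answer.
        if len(set(answers)) == 1:
            consistent += 1
    return consistent / len(multilingual_queries)

# Usage with a trivial stand-in model:
queries = [[
    "What is the boiling point of water in Celsius?",
    "¿Cuál es el punto de ebullición del agua en grados Celsius?",
    "水的沸点是多少摄氏度？",
]]
print(consistency_rate(lambda q: "100", queries))  # -> 1.0 for a perfectly consistent model
```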

    BERTimbau: Pretrained BERT Models for Brazilian Portuguese

    Advisors: Roberto de Alencar Lotufo, Rodrigo Frassetto Nogueira. Master's dissertation, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: Recent advances in language representation using neural networks and deep learning have made it viable to transfer the learned internal states of large pretrained language models (LMs) to downstream natural language processing (NLP) tasks. This transfer learning approach improves the overall performance on many tasks and is highly beneficial when labeled data is scarce, making pretrained LMs valuable resources especially for languages with few annotated training examples. In this work, we train BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, which we nickname BERTimbau. We evaluate our models on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition. Our models improve the state of the art in all of these tasks, outperforming Multilingual BERT and confirming the effectiveness of large pretrained LMs for Portuguese. We release our models to the community hoping to provide strong baselines for future NLP research. Degree: Master in Electrical Engineering, concentration in Computer Engineering.
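    The transfer-learning recipe described above boils down to loading the pretrained Portuguese checkpoint and adding a task head to fine-tune. A minimal sketch follows; the Hub id is the commonly published BERTimbau base checkpoint (assumed available), and the two-label entailment head and example pair are illustrative, not the thesis's exact configuration.

```python
# Minimal sketch: reuse a pretrained Portuguese BERT for a downstream task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "neuralmind/bert-base-portuguese-cased"  # assumed Hub id for BERTimbau base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

premise = "O gato dorme no sofá."
hypothesis = "Há um animal dormindo."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits  # classification head is untrained: fine-tune on RTE data before use
print(logits.softmax(dim=-1))
```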

    Attentive Deep Neural Networks for Legal Document Retrieval

    Legal text retrieval serves as a key component in a wide range of legal text processing tasks such as legal question answering, legal case entailment, and statute law retrieval. The performance of legal text retrieval depends, to a large extent, on the representation of text, both query and legal documents. Based on good representations, a legal text retrieval model can effectively match the query to its relevant documents. Because legal documents often contain long articles and only some parts are relevant to queries, it is quite a challenge for existing models to represent such documents. In this paper, we study the use of attentive neural network-based text representation for statute law document retrieval. We propose a general approach using deep neural networks with attention mechanisms. Based on it, we develop two hierarchical architectures with sparse attention to represent long sentences and articles, which we name Attentive CNN and Paraformer. The methods are evaluated on datasets of different sizes and characteristics in English, Japanese, and Vietnamese. Experimental results show that: i) Attentive neural methods substantially outperform non-neural methods in terms of retrieval performance across datasets and languages; ii) Pretrained transformer-based models achieve better accuracy on small datasets at the cost of high computational complexity, while the lighter-weight Attentive CNN achieves better accuracy on large datasets; and iii) Our proposed Paraformer outperforms state-of-the-art methods on the COLIEE dataset, achieving the highest recall and F2 scores in the top-N retrieval task. Comment: Preprint version. The official version will be published in the Artificial Intelligence and Law journal.
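    The hierarchical idea behind such architectures is to encode each paragraph of a long article separately and then combine paragraph vectors with attention, so only query-relevant parts dominate the document representation. The PyTorch sketch below illustrates that pattern; it is an illustration with made-up dimensions, not the paper's Attentive CNN or Paraformer.

```python
# Minimal sketch: hierarchical encoding of a long article with attention pooling.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, paragraph_vecs):                               # (num_paragraphs, dim)
        weights = torch.softmax(self.score(paragraph_vecs), dim=0)  # (num_paragraphs, 1)
        return (weights * paragraph_vecs).sum(dim=0)                 # (dim,)

dim = 128
paragraph_encoder = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
pooler = AttentivePooling(dim)

# Toy article: 5 paragraphs, each a sequence of 20 token embeddings of size dim.
article = torch.randn(5, 20, dim)
_, (h_n, _) = paragraph_encoder(article)  # h_n: (1, 5, dim), one vector per paragraph
doc_vec = pooler(h_n.squeeze(0))          # single article representation
print(doc_vec.shape)                      # torch.Size([128])
```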