Predicting the Semantic Textual Similarity with Siamese CNN and LSTM
Semantic Textual Similarity (STS) underlies many applications in Natural Language Processing (NLP). Our system combines convolutional and recurrent neural networks to measure the semantic similarity of sentences: a convolutional network captures the local context of words, and an LSTM captures the global context of the sentence. This combination of networks preserves the relevant information of the sentences and improves the computation of inter-sentence similarity. Our model achieves good results and is competitive with the best state-of-the-art systems.
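The local-plus-global intuition can be illustrated with a toy sketch. This is not the paper's neural model: word bigrams stand in for the convolutional "local context" features, position-weighted words stand in for the LSTM's order-sensitive "global context", and similarity is the cosine of the combined feature vectors.

```python
from collections import Counter
import math

def features(sentence, n=2):
    """Toy stand-in for a CNN+LSTM encoder: word bigrams capture
    local context, position-weighted unigrams capture word order."""
    words = sentence.lower().split()
    feats = Counter()
    for i in range(len(words) - n + 1):           # "local" n-gram features
        feats[" ".join(words[i:i + n])] += 1.0
    for i, w in enumerate(words):                 # "global" order-aware features
        feats[w] += 1.0 + i / max(len(words), 1)
    return feats

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(s1, s2):
    return cosine(features(s1), features(s2))
```

With this sketch, paraphrase pairs score well above unrelated pairs, which is the behaviour the learned encoders deliver at much higher quality.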
Leveraging BERT Language Models for Multi-Lingual ESG Issue Identification
Environmental, Social, and Governance (ESG) criteria are used to measure companies' negative impacts and to encourage positive outcomes in
areas such as the environment, society, and governance. Recently, investors
have increasingly recognized the significance of ESG criteria in their
investment choices, leading businesses to integrate ESG principles into their
operations and strategies. The Multi-Lingual ESG Issue Identification (ML-ESG)
shared task encompasses the classification of news documents into 35 distinct
ESG issue labels. In this study, we explored multiple strategies harnessing
BERT language models to achieve accurate classification of news documents
across these labels. Our analysis revealed that the RoBERTa classifier emerged
as one of the most successful approaches, securing the second-place position
for the English test dataset, and sharing the fifth-place position for the
French test dataset. Furthermore, our SVM-based binary model tailored for the
Chinese language exhibited exceptional performance, earning the second-place
rank on the test dataset.
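The task setup (assigning a news document one of 35 ESG issue labels) can be illustrated with a deliberately simple keyword-overlap baseline. The labels and keyword lists below are invented for illustration only; the systems described above are fine-tuned RoBERTa and SVM classifiers, not this heuristic.

```python
# Toy nearest-label baseline for single-label news classification.
# Labels and keyword sets are invented placeholders, not the ML-ESG taxonomy.
LABEL_KEYWORDS = {
    "Climate Change": {"emissions", "carbon", "climate", "warming"},
    "Labor Practices": {"workers", "union", "wages", "strike"},
    "Corporate Governance": {"board", "shareholders", "audit", "executive"},
}

def classify(document):
    """Score each label by keyword overlap with the document; pick the best."""
    tokens = set(document.lower().split())
    scores = {label: len(tokens & kws) for label, kws in LABEL_KEYWORDS.items()}
    return max(scores, key=scores.get)
```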
Microblog Contextualization Using Continuous Space Vectors: Multi-Sentence Compression of Cultural Documents
In this paper, we describe our work for the MC2 CLEF 2017 lab. We participated in the content-analysis task, which involves filtering, language recognition, and summarization. We combine Information Retrieval with Multi-Sentence Compression methods to contextualize microblogs using Wikipedia pages.
Automatic Text Summarization with a Reduced Vocabulary Using Continuous Space Vectors
In this paper, we propose a new method that uses continuous vectors to map words to a reduced vocabulary, in the context of Automatic Text Summarization (ATS). The method is evaluated on the MultiLing corpus with the ROUGE measures and four ATS systems. Our experiments show that the reduced vocabulary improves the performance of state-of-the-art systems.
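The vocabulary-reduction idea — mapping every word to its nearest neighbour in a smaller vocabulary using continuous vectors — can be sketched as follows. The tiny hand-made 2-d embeddings are placeholders; a real system would use trained word vectors.

```python
import math

# Toy 2-d "continuous space" vectors (placeholders for trained embeddings).
EMBEDDINGS = {
    "car": (0.9, 0.1), "automobile": (0.88, 0.12), "vehicle": (0.85, 0.2),
    "happy": (0.1, 0.9), "glad": (0.12, 0.88),
}
REDUCED_VOCAB = ["car", "happy"]  # the smaller target vocabulary

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reduce_word(word):
    """Map a word to its nearest neighbour in the reduced vocabulary."""
    if word not in EMBEDDINGS:
        return word  # out-of-vocabulary words pass through unchanged
    vec = EMBEDDINGS[word]
    return max(REDUCED_VOCAB, key=lambda w: cosine(vec, EMBEDDINGS[w]))
```

Applying `reduce_word` to every token before summarization collapses near-synonyms onto shared vocabulary entries, which is what lets frequency-based ATS systems see "automobile" and "car" as the same concept.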
SASI: sumarizador automático de documentos baseado no problema do subconjunto independente de vértices (SASI: an automatic document summarizer based on the independent vertex subset problem)
XLVI Simpósio Brasileiro de Pesquisa Operacional. This article presents a document summarizer system named SASI. The system takes an innovative approach to automatic summarization based on determining the maximum independent subset of vertices, modeling the problem as a graph of sentences (vertices) and the relationships between them (edges). We describe the concepts and operation of the proposed summarizer, along with a series of tests comparing the results of SASI with other summarizer systems. Initial results are promising, both for the informativeness of the produced summaries and for running time and algorithmic complexity.
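The underlying idea — sentences as vertices, high-redundancy links as edges, and a summary as a large set of mutually non-adjacent sentences — can be sketched with a greedy approximation. Note that SASI computes a maximum independent set; the greedy heuristic below is only an illustration of the graph model.

```python
def greedy_independent_set(num_sentences, redundancy_edges):
    """Pick a set of mutually non-adjacent sentences: no two selected
    sentences are linked by a redundancy edge."""
    neighbours = {i: set() for i in range(num_sentences)}
    for a, b in redundancy_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    selected, blocked = [], set()
    # Greedy heuristic: consider low-degree (least redundant) sentences first.
    for v in sorted(neighbours, key=lambda v: len(neighbours[v])):
        if v not in blocked:
            selected.append(v)
            blocked |= neighbours[v]
    return sorted(selected)
```

Because selected vertices share no redundancy edge, the resulting summary avoids pairs of near-duplicate sentences by construction.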
A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming
Multi-Sentence Compression (MSC) aims to generate a short sentence with the
key information from a cluster of similar sentences. MSC enables summarization
and question-answering systems to generate outputs combining fully formed
sentences from one or several documents. This paper describes an Integer Linear
Programming method for MSC using a vertex-labeled graph to select different
keywords, with the goal of generating more informative sentences while
maintaining their grammaticality. Our system produces good-quality compressions and outperforms
the state of the art in evaluations conducted on news datasets in three languages:
French, Portuguese, and Spanish. We conducted both automatic and manual evaluations to
determine the informativeness and the grammaticality of the compressions for each
dataset. In additional tests, which take advantage of the fact that the length
of the compressions can be modulated, we still improve ROUGE scores with shorter
output sentences.
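The word-graph idea behind MSC can be sketched with a simplified shortest-path compression in the style of Filippova's word graphs. This is not the paper's method: the paper solves an Integer Linear Program over a vertex-labeled graph with keyword constraints, which this toy version does not model.

```python
from collections import defaultdict
import heapq

def compress(sentences):
    """Build a word graph from similar sentences (shared surface words merge
    into one node) and return the shortest start-to-end path as the compression."""
    graph = defaultdict(set)
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            graph[a].add(b)
    # Dijkstra with unit edge weights, i.e. the fewest-word path.
    queue, seen = [(0, ["<s>"])], set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == "</s>":
            return " ".join(path[1:-1])
        if node in seen:
            continue
        seen.add(node)
        for nxt in graph[node]:
            heapq.heappush(queue, (cost + 1, path + [nxt]))
    return ""
```

On the pair "the cat sat on the mat" / "the cat sat quietly on the mat", the shortest path degenerates to "the mat" — exactly the kind of uninformative output that motivates the keyword and informativeness constraints in the ILP formulation.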
Cross-Language Text Summarization using Sentence and Multi-Sentence Compression
Cross-Language Automatic Text Summarization produces a summary in a language different from the language of the source documents. In this paper, we propose a French-to-English cross-lingual summarization framework that analyzes the information in both languages to identify the most relevant sentences. To generate more informative cross-lingual summaries, we introduce the use of chunks and two compression methods, at the sentence and multi-sentence levels. Experimental results on the MultiLing 2011 dataset show that our framework improves on state-of-the-art approaches according to ROUGE metrics.
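The bilingual scoring idea — ranking each sentence by its relevance in both the source and the target language before selection — can be sketched as follows. The frequency-based scorer is a stand-in for the framework's actual chunk-based analysis, and the (source, translation) pairs are assumed to be given.

```python
from collections import Counter

def score(sentence, doc_freq):
    """Relevance = sum of document-level frequencies of the sentence's words."""
    return sum(doc_freq[w] for w in sentence.lower().split())

def select_sentences(pairs, k=1):
    """pairs: (source_sentence, target_translation) tuples.
    Rank by combined relevance in BOTH languages, keep the top k translations."""
    src_freq = Counter(w for s, _ in pairs for w in s.lower().split())
    tgt_freq = Counter(w for _, t in pairs for w in t.lower().split())
    ranked = sorted(pairs,
                    key=lambda p: score(p[0], src_freq) + score(p[1], tgt_freq),
                    reverse=True)
    return [t for _, t in ranked[:k]]
```

Scoring in both languages guards against sentences that look central in one language but lose their key content in translation.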