Corpus-based and Knowledge-based Measures of Text Semantic Similarity
This article discusses corpus-based and knowledge-based measures of text semantic similarity.
Multi-Ontology Refined Embeddings (MORE): A Hybrid Multi-Ontology and Corpus-based Semantic Representation for Biomedical Concepts
Objective: Currently, a major limitation for natural language processing (NLP) analyses in clinical applications is that a concept can be referenced in various forms across different texts. This paper introduces Multi-Ontology Refined Embeddings (MORE), a novel hybrid framework that incorporates domain knowledge from various ontologies into a distributional semantic model learned from a corpus of clinical text. This approach generates word embeddings that are more accurate and extensible for computing the semantic similarity of biomedical concepts than previous methods. Materials and Methods: We use the RadCore and MIMIC-III free-text datasets for the corpus-based component of MORE. For the ontology-based component, we use the Medical Subject Headings (MeSH) ontology and two state-of-the-art ontology-based similarity measures. We propose a new learning objective, modified from the sigmoid cross-entropy objective function, to incorporate domain knowledge into the process of generating the word embeddings. Results and Discussion: We evaluate the quality of the generated word embeddings using an established dataset of semantic similarities among biomedical concept pairs. The similarity scores produced by MORE have the highest average correlation (60.2%) with similarity scores established by multiple physicians and domain experts, which is 4.3% higher than that of the word2vec baseline model and 6.8% higher than that of the best ontology-based similarity measure. Conclusion: MORE incorporates knowledge from biomedical ontologies into an existing distributional semantics model (i.e., word2vec), improving both the flexibility and accuracy of the learned word embeddings. We demonstrate that MORE outperforms the baseline word2vec model, as well as the individual UMLS-Similarity ontology-based measures.
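The evaluation the abstract describes, correlating embedding-derived similarity scores against expert ratings, can be sketched in a few lines. The embedding vectors and physician ratings below are invented toy values, not data from MORE:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(xs, ys):
    """Pearson correlation between model scores and expert ratings."""
    x, y = np.asarray(xs, float), np.asarray(ys, float)
    x -= x.mean()
    y -= y.mean()
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Toy embeddings for three biomedical concept pairs (hypothetical vectors).
pairs = [
    (np.array([1.0, 0.2, 0.1]), np.array([0.9, 0.3, 0.0])),  # near-synonyms
    (np.array([0.1, 1.0, 0.3]), np.array([0.2, 0.8, 0.4])),  # related concepts
    (np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])),  # unrelated concepts
]
model_scores = [cosine(u, v) for u, v in pairs]
expert_scores = [4.0, 3.5, 0.5]  # hypothetical physician similarity ratings

print(round(pearson(model_scores, expert_scores), 3))
```

A higher correlation with the expert scores indicates embeddings that better capture the human notion of concept similarity, which is the metric the paper reports.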
Sentence Similarity Analysis with Applications in Automatic Short Answer Grading
In this dissertation, I explore unsupervised techniques for the task of automatic short answer grading. I compare a number of knowledge-based and corpus-based measures of text similarity, evaluate the effect of domain and size on the corpus-based measures, and introduce a novel technique that improves system performance by integrating automatic feedback from the student answers. I then combine graph alignment features with lexical semantic similarity measures and employ machine learning techniques to show that grade assignment error can be reduced compared to a system that considers only lexical semantic measures of similarity. I also detail a preliminary attempt to align the dependency graphs of student and instructor answers in order to utilize a structural component that is necessary to simulate human-level grading of student answers. Finally, I explore the utility of these techniques for several related tasks in natural language processing, including the detection of text similarity, paraphrase, and textual entailment.
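The core corpus-based grading idea can be illustrated with a minimal sketch: score a student answer by its bag-of-words cosine similarity to the instructor answer, then map that similarity onto a point scale. The reference answer and the five-point scale below are hypothetical, and real systems add the semantic measures the abstract describes:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words counts of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(c * c for c in ca.values()))
    nb = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def grade(instructor_answer, student_answer, max_points=5.0):
    """Map lexical similarity onto a 0..max_points grade scale."""
    return round(bow_cosine(instructor_answer, student_answer) * max_points, 1)

reference = "a stack is a last in first out data structure"
print(grade(reference, "a stack is a data structure that is last in first out"))
print(grade(reference, "it stores files on disk"))
```

A purely lexical score like this is the baseline the dissertation improves on by adding knowledge-based measures and graph alignment features.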
Automated Identification of National Implementations of European Union Directives with Multilingual Information Retrieval Based on Semantic Textual Similarity
The effective transposition of European Union (EU) directives into Member States is important to achieve the policy goals defined in the Treaties and secondary legislation. National Implementing Measures (NIMs) are the legal texts officially adopted by the Member States to transpose the provisions of an EU directive. The measures undertaken by the Commission to monitor NIMs are time-consuming and expensive, as they resort to manual conformity checking studies and legal analysis. In this thesis, we developed a legal information retrieval system using semantic textual similarity techniques to automatically identify the transposition of EU directives into the national law at a fine-grained provision level. We modeled and developed various text similarity approaches such as lexical, semantic, knowledge-based, embeddings-based and concept-based methods. The text similarity systems utilized both textual features (tokens, N-grams, topic models, word and paragraph embeddings) and semantic knowledge from external knowledge bases (EuroVoc, IATE and Babelfy) to identify transpositions. This thesis work also involved the development of a multilingual corpus of 43 directives and their corresponding NIMs from Ireland (English legislation), Italy (Italian legislation) and Luxembourg (French legislation) to validate the text similarity based information retrieval system. A gold standard mapping (prepared by two legal researchers) between directive articles and NIM provisions was prepared to evaluate the various text similarity models. The results show that the lexical and semantic text similarity techniques were more effective in identifying transpositions as compared to the embeddings-based techniques. We also observed that the unsupervised text similarity techniques had the best performance in case of the Luxembourg Directive-NIM corpus.
We also developed a concept recognition system based on conditional random fields (CRFs) to identify concepts in European directives and national legislation. The results indicate that the concept recognition system improved over the dictionary lookup program by tagging concepts that dictionary lookup missed. The concept recognition system was extended into a concept-based text similarity system using word-sense disambiguation and dictionary concepts. The performance of the concept-based text similarity measure was competitive with the best-performing text similarity measure. The labeled corpus of 43 directives and their corresponding NIMs was utilized to develop supervised text similarity systems using machine learning classifiers. We modeled three machine learning classifiers with different textual features to identify transpositions. The results show that support vector machines (SVMs) with term frequency-inverse document frequency (TF-IDF) features had the best overall performance over the multilingual corpus. Among the unsupervised models, the best performance was achieved by the TF-IDF cosine similarity model, with macro-average F-scores of 0.8817, 0.7771 and 0.6997 for the Luxembourg, Italian and Irish corpora, respectively. These results demonstrate that the system was able to identify transpositions in different national jurisdictions with good performance. Thus, it has the potential to be useful as a support tool for legal practitioners and Commission officials involved in the transposition monitoring process.
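The best-performing unsupervised model, TF-IDF cosine similarity, can be sketched in a few lines: weight each provision's terms by TF-IDF and rank provisions by cosine similarity to a directive article. The article and provisions below are invented examples, not text from the actual Directive-NIM corpus, and the smoothed-IDF formula is one common choice:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors (smoothed IDF) for a small document collection."""
    toks = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(w for t in toks for w in set(t))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    return [{w: c * idf[w] for w, c in Counter(t).items()} for t in toks]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical directive article and candidate national provisions.
article = "member states shall ensure consumers receive clear contract information"
provisions = [
    "consumers shall receive clear information before signing a contract",
    "the ministry shall publish an annual report on fisheries quotas",
]
vecs = tfidf([article] + provisions)
scores = [cosine(vecs[0], v) for v in vecs[1:]]
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # index of the provision ranked most likely to transpose the article
```

Ranking candidate NIM provisions this way, per directive article, is the provision-level retrieval setting the thesis evaluates.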
Data Collection and Normalization for Building the Scenario-Based Lexical Knowledge Resource of a Text-to-Scene Conversion System
WordsEye is a system for converting English text into three-dimensional graphical scenes that represent that text. It works by performing syntactic and semantic analyses on the input text, producing a description of the arrangement of objects in a scene. At the core of WordsEye is the Scenario-Based Lexical Knowledge Resource (SBLR), a unified knowledge base and representational system for expressing the lexical and real-world knowledge needed to depict scenes from text. This paper explores information collection methods for building the SBLR, using Amazon’s Mechanical Turk (AMT) and manual normalization of raw AMT data. The paper then describes a manual review of existing relations in the SBLR and the classification of the AMT data into existing and new semantic relations. Since manual annotation is a time-consuming and expensive approach, we also explored the use of automatic normalization of AMT data through log-odds and log-likelihood ratios extracted from the English Gigaword corpus, as well as through WordNet similarity measures.
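The log-odds association idea can be illustrated with a small sketch: compare the odds of word b occurring near word a against b's baseline odds in the corpus. The co-occurrence counts below are hypothetical, and the additive smoothing is one common choice, not necessarily the paper's exact formulation:

```python
import math

def log_odds(cooccur, occur_a, occur_b, total, alpha=0.5):
    """Smoothed log-odds ratio: odds of b given a vs. baseline odds of b.
    cooccur: co-occurrence count of (a, b); occur_a, occur_b: marginal
    counts; total: corpus size; alpha: additive smoothing constant."""
    p_b_given_a = (cooccur + alpha) / (occur_a + 2 * alpha)
    p_b = (occur_b + alpha) / (total + 2 * alpha)
    odds = lambda p: p / (1 - p)
    return math.log(odds(p_b_given_a)) - math.log(odds(p_b))

# Hypothetical counts: "table" co-occurs with "leg" far more often than chance,
# so the pair gets a large positive score; a chance-level pair scores near zero.
print(round(log_odds(cooccur=80, occur_a=1000, occur_b=500, total=1_000_000), 2))
print(round(log_odds(cooccur=1, occur_a=1000, occur_b=500, total=1_000_000), 2))
```

A high score flags candidate part-of or located-near relations worth keeping; low-scoring AMT responses can be filtered out automatically.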
Short-text semantic similarity (STSS): Techniques, challenges and future perspectives
In natural language processing, short-text semantic similarity (STSS) is a very prominent field. It has a significant impact on a broad range of applications, such as question-answering systems, information retrieval, entity recognition, text analytics, and sentiment classification. Despite their widespread use, many traditional machine learning techniques are incapable of identifying the semantics of short text. Traditional methods are based on ontologies, knowledge graphs, and corpus-based approaches, and their performance depends on manually defined rules. Applying such measures remains difficult, since short text poses various semantic challenges. The existing literature does not cover the most recent advances in STSS research. This study presents a systematic literature review (SLR) that aims to (i) explain the barriers short sentences pose for semantic similarity, (ii) identify the most appropriate deep learning techniques for the semantics of short text, (iii) classify the language models that produce high-level contextual semantic information, (iv) determine appropriate datasets intended specifically for short text, and (v) highlight research challenges and proposed future improvements. To the best of our knowledge, we have provided an in-depth, comprehensive, and systematic review of short-text semantic similarity trends, which will assist researchers in reusing and enhancing semantic information.
Funding: Yayasan UTP Pre-commercialization grant (YUTP-PRG) [015PBC-005]; Computer and Information Science Department of Universiti Teknologi PETRONAS.
Collecting Semantic Data by Mechanical Turk for the Lexical Knowledge Resource of a Text-to-Picture Generating System
WordsEye is a system for automatically converting natural language text into 3D scenes representing the meaning of that text. At the core of WordsEye is the Scenario-Based Lexical Knowledge Resource (SBLR), a unified knowledge base and representational system for expressing the lexical and real-world knowledge needed to depict scenes from text. To enrich a portion of the SBLR, we need to fill in contextual information about its objects, including their typical parts, typical locations, and typical nearby objects. This paper describes our proposed methodology for achieving this goal. First, we collect semantic information using Amazon’s Mechanical Turk (AMT). Then, we manually filter and classify the collected data, and finally we compare the manual results with the output of several automatic filtering techniques that use WordNet similarity and corpus association measures.
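A WordNet-style path similarity measure, of the kind used to filter the collected data, can be sketched over a toy is-a hierarchy. The hierarchy below is invented and stands in for WordNet's hypernym graph; the measure is the standard 1 / (1 + shortest is-a path) formulation:

```python
# Toy is-a hierarchy (hypothetical, in place of WordNet hypernym links).
hypernym = {
    "cat": "feline", "feline": "carnivore", "dog": "canine",
    "canine": "carnivore", "carnivore": "mammal", "mammal": "animal",
    "animal": "entity", "car": "vehicle", "vehicle": "artifact",
    "artifact": "entity",
}

def ancestors(word):
    """Return the word and each of its hypernyms, mapped to their depth."""
    chain, depth = {}, 0
    while word is not None:
        chain[word] = depth
        word, depth = hypernym.get(word), depth + 1
    return chain

def path_similarity(a, b):
    """Path similarity: 1 / (1 + length of shortest is-a path between a and b)."""
    ca, cb = ancestors(a), ancestors(b)
    common = set(ca) & set(cb)
    if not common:
        return 0.0
    dist = min(ca[n] + cb[n] for n in common)
    return 1.0 / (1.0 + dist)

print(path_similarity("cat", "dog"))  # related via "carnivore"
print(path_similarity("cat", "car"))  # only meet at "entity"
```

AMT responses whose similarity to the target object falls below a threshold can then be flagged for manual review rather than accepted automatically.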