A Real-Time N-Gram Approach to Choosing Synonyms Based on Context
Synonymy is an important part of all natural language, but not all synonyms are created equal. Two words being synonymous usually does not mean they can always be interchanged. The problem we address is that of near-synonymy: choosing the right word based purely on its surrounding words. Unlike previous methods applied to this problem, this new computational method can make multiple word suggestions, which more accurately models human choice. It covers a large number of words, requires no training, and runs in real time. On previous test data, when allowed to make multiple suggestions, it improved on the previous best method by over 17 percentage points, and on the human annotators' near-synonym choices by 4.5 percentage points on average, with a maximum of 14 percentage points. In addition, this thesis presents new synonym sets and human-annotated test data that more accurately fit this problem.
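The core idea of ranking near-synonyms by the n-grams they would form with the surrounding words can be sketched roughly as follows. The bigram counts and the example words here are hypothetical toy data; the thesis draws on a far larger real-time n-gram resource, and the exact scoring it uses may differ.

```python
import math

# Toy bigram counts standing in for a large n-gram corpus (hypothetical
# data, for illustration only).
bigram_counts = {
    ("very", "strong"): 40, ("strong", "coffee"): 120,
    ("very", "powerful"): 35, ("powerful", "coffee"): 3,
}

def rank_synonyms(candidates, left, right):
    """Rank near-synonym candidates by how often the bigrams they would
    form with the surrounding words occur in the corpus. Returning the
    whole ranking allows multiple suggestions rather than a single pick."""
    def score(word):
        # Log damping so one very frequent bigram does not dominate;
        # unseen bigrams contribute log1p(0) = 0.
        return (math.log1p(bigram_counts.get((left, word), 0)) +
                math.log1p(bigram_counts.get((word, right), 0)))
    return sorted(candidates, key=score, reverse=True)

suggestions = rank_synonyms(["powerful", "strong"], left="very", right="coffee")
```

Because the full ranking is returned, a caller can surface the top k words as alternatives, mirroring the multiple-suggestion evaluation described above.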
Extraction of temporal networks from term co-occurrences in online textual sources
A stream of unstructured news can be a valuable source of hidden relations between different entities, such as financial institutions, countries, or persons. We present an approach to continuously collect online news, recognize relevant entities in them, and extract time-varying networks. The nodes of the network are the entities, and the links are their co-occurrences. We present a method to estimate the significance of co-occurrences, and a benchmark model against which their robustness is evaluated. The approach is applied to a large set of financial news collected over a period of two years. The entities we consider are 50 countries which issue sovereign bonds and which in turn are insured by Credit Default Swaps (CDS). We compare the country co-occurrence networks to the CDS networks constructed from the correlations between the CDS. The results show a relatively small but significant overlap between the networks extracted from the news and those from the CDS correlations.
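A minimal sketch of the co-occurrence network construction might look like the following. The news items, entities, and the independence benchmark are illustrative assumptions; the paper's actual significance estimate and benchmark model are more elaborate than this simple expected-count ratio.

```python
from itertools import combinations
from collections import Counter

# Toy stream of news items, each reduced to the set of entities recognized
# in it (hypothetical data; the study uses two years of financial news
# covering 50 countries).
news = [
    {"Greece", "Germany"}, {"Greece", "Germany"}, {"Greece", "Portugal"},
    {"Germany", "France"}, {"Greece", "Germany", "France"}, {"Portugal"},
]

occ = Counter(e for item in news for e in item)        # entity mention counts
co = Counter(frozenset(p) for item in news
             for p in combinations(sorted(item), 2))   # link (pair) counts
N = len(news)

def significant_links(min_ratio=1.5):
    """Keep a link only if its observed co-occurrence count exceeds the
    benchmark expectation for independently mentioned entities by min_ratio."""
    links = {}
    for pair, obs in co.items():
        a, b = tuple(pair)
        expected = occ[a] * occ[b] / N  # naive independence benchmark
        if obs / expected >= min_ratio:
            links[pair] = obs
    return links
```

Running this per time slice of the stream, rather than over the whole collection at once, yields the time-varying networks described above.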
Introspective knowledge acquisition for case retrieval networks in textual case based reasoning
Textual Case Based Reasoning (TCBR) aims at effective reuse of the information contained in unstructured documents. The key advantage of TCBR over traditional Information Retrieval systems is its ability to incorporate domain-specific knowledge to facilitate case comparison beyond simple keyword matching. However, substantial human intervention is needed to acquire this knowledge and transform it into a form suitable for a TCBR system. In this research, we present automated approaches that exploit statistical properties of document collections to alleviate this knowledge acquisition bottleneck. We focus on two important knowledge containers: relevance knowledge, which shows the relatedness of features to cases, and similarity knowledge, which captures the relatedness of features to each other. The terminology is derived from the Case Retrieval Network (CRN) retrieval architecture in TCBR, which is used as the underlying formalism in this thesis, applied to text classification. Concepts generated by Latent Semantic Indexing (LSI) are a useful resource for relevance knowledge acquisition for CRNs. This thesis introduces a supervised LSI technique called sprinkling that exploits class knowledge to bias LSI's concept generation. An extension of this idea, called Adaptive Sprinkling (AS), is proposed to handle inter-class relationships in complex domains such as hierarchical (e.g. Yahoo directory) and ordinal (e.g. product ranking) classification tasks. Experimental evaluation shows the superiority of CRNs created with sprinkling and AS, not only over LSI on its own, but also over state-of-the-art classifiers such as Support Vector Machines (SVM). Statistical approaches based on feature co-occurrences can be used to mine similarity knowledge for CRNs. However, related words often do not co-occur in the same document, though they do co-occur with similar words. We introduce an algorithm to efficiently mine such indirect associations, called higher-order associations.
Empirical results show that CRNs created with the acquired similarity knowledge outperform both LSI and SVM. Incorporating the acquired knowledge into the CRN transforms it into a densely connected network. While improving retrieval effectiveness, this has the unintended effect of slowing down retrieval. We propose a novel retrieval formalism called the Fast Case Retrieval Network (FCRN), which eliminates redundant run-time computations to improve retrieval speed. Experimental results show FCRN's ability to scale up to high-dimensional textual casebases. Finally, we investigate novel ways of visualizing and estimating the complexity of textual casebases that can help explain performance differences across casebases. Visualization provides qualitative insight into a casebase, while complexity is a quantitative measure that characterizes the classification or retrieval hardness intrinsic to a dataset. We study correlations between experimental results from the proposed approaches and complexity measures over diverse casebases.
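The sprinkling step described above, namely augmenting the term-document matrix with artificial class terms before LSI's decomposition, can be sketched as follows. The toy matrix, labels, and weight are hypothetical; the thesis's actual term weighting and the subsequent SVD step are omitted here.

```python
# Toy term-document matrix: each row is a term, each column a training
# document (hypothetical counts, for illustration only).
docs = [
    [2, 1, 0, 0],  # term "engine"
    [1, 2, 0, 0],  # term "wheel"
    [0, 0, 2, 1],  # term "pixel"
    [0, 0, 1, 2],  # term "screen"
]
labels = [0, 0, 1, 1]  # class of each training document (one per column)

def sprinkle(matrix, labels, weight=2):
    """Append one artificial class-term row per class, non-zero only in the
    columns (documents) belonging to that class. Running LSI's SVD on the
    sprinkled matrix biases concept generation toward class-discriminating
    directions; the sprinkled rows are dropped again after decomposition."""
    n_docs = len(matrix[0])
    class_rows = []
    for cls in sorted(set(labels)):
        class_rows.append(
            [weight if labels[d] == cls else 0 for d in range(n_docs)])
    return matrix + class_rows

sprinkled = sprinkle(docs, labels)
```

Adaptive Sprinkling, as described above, would vary how many sprinkled terms each class pair receives to reflect inter-class relationships, rather than using a single uniform weight.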
The Hermeneutics Of The Hard Drive: Using Narratology, Natural Language Processing, And Knowledge Management To Improve The Effectiveness Of The Digital Forensic Process
In order to protect the safety of our citizens and to ensure a civil society, we ask our law enforcement, judiciary and intelligence agencies, under the rule of law, to seek probative information which can be acted upon for the common good. This information may be used in court to prosecute criminals, or it can be used to conduct offensive or defensive operations to protect our national security. As the citizens of the world store more and more information in digital form, and as they live an ever-greater portion of their lives online, law enforcement, the judiciary and the Intelligence Community will continue to struggle with finding, extracting and understanding the data stored on computers. But this trend also affords greater opportunity for law enforcement. This dissertation describes how several disparate approaches (knowledge management, content analysis, narratology, and natural language processing) can be combined in an interdisciplinary way to address the growing difficulty of developing useful, actionable intelligence from the ever-increasing corpus of digital evidence. After exploring how these techniques might apply to the digital forensic process, I suggest two new theoretical constructs, the Hermeneutic Theory of Digital Forensics and the Narrative Theory of Digital Forensics, linking existing theories of forensic science, knowledge management, content analysis, narratology, and natural language processing in order to identify and extract narratives from digital evidence. An experimental approach is described and prototyped. The results of these experiments demonstrate the potential of natural language processing techniques for digital forensics.
Approaches to Using Word Collocation in Information Retrieval
The thesis explores long-span collocation and its application in information retrieval. The basic research question of the thesis is whether the use of long-span collocates can improve the performance of a probabilistic model of IR. The model used in the project is the Robertson & Sparck Jones probabilistic model.
The basic research question was explored by investigating three different ways of integrating collocation information with the probabilistic model:
1. Global collocation analysis. The method consists of expanding the original query with long-span global collocates of the query terms. Global collocates of a query term are selected from large fixed-size windows around all occurrences of the term in the corpus and ranked by the statistical measures of Mutual Information (MI) and Z score. A fixed number of top-ranked collocates is used in query expansion.
Query expansion with global collocates did not prove superior to the original queries; a possible reason is that query terms often have fairly broad meanings and, hence, rather semantically heterogeneous patterns of occurrence.
2. Local collocation analysis. This method is a form of iterative query expansion following relevance or pseudo-relevance (blind) feedback. The original query is expanded with the query terms' collocates, which are extracted from long-span windows around all occurrences of the query terms in the known relevant documents and selected using the statistical measures of MI and Z score. Parameters whose effects were systematically studied in this set of experiments include window size, the measure of collocation significance used for collocate ranking, the number of query expansion collocates, and the categories of terms in the expanded queries.
Some results showed a tendency towards a performance gain over relevance feedback in the probabilistic model; however, the gain was not significant enough to conclude that this method is superior to the existing relevance feedback used in the model.
3. Lexical cohesion analysis using local collocations. This set of experiments aimed to explore whether the level of lexical cohesion between query terms in a document can be linked to the document's relevance and, if so, whether it can be used to predict documents' relevance to the query. Lexical cohesion between different query terms is estimated from the number of collocates they have in common.
The experiments showed a statistically significant association between the level of lexical cohesion of the query terms in documents and relevance. Another set of experiments, aimed at using lexical cohesion to improve probabilistic document ranking, showed that result sets re-ranked by their lexical cohesion scores perform similarly to the original ranking.
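The windowed collocate extraction, MI ranking, and shared-collocate cohesion measure used across the three experiment sets can be sketched as follows. The corpus, window size, and PMI normalization here are illustrative assumptions, not the thesis's exact setup, which also uses a Z-score measure and much larger windows.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (hypothetical; the thesis extracts
# collocates from large fixed-size windows over a full collection).
docs = [
    "the bank raised interest rates on loans".split(),
    "interest rates fell as the bank cut loans".split(),
    "the river bank flooded after heavy rain".split(),
]
WINDOW = 3  # words considered on each side of an occurrence

def collocates(term):
    """Count words found within WINDOW positions of any occurrence of term."""
    found = Counter()
    for doc in docs:
        for i, w in enumerate(doc):
            if w == term:
                span = doc[max(0, i - WINDOW):i] + doc[i + 1:i + 1 + WINDOW]
                found.update(span)
    return found

def mutual_information(term, collocate):
    """Pointwise-MI-style score for ranking candidate collocates; the
    normalization by window size here is a rough illustrative choice."""
    total = sum(len(d) for d in docs)
    f_xy = collocates(term)[collocate]
    if f_xy == 0:
        return float("-inf")
    f_x = sum(d.count(term) for d in docs)
    f_y = sum(d.count(collocate) for d in docs)
    return math.log2((f_xy * total) / (f_x * f_y * 2 * WINDOW))

def cohesion(term_a, term_b):
    """Lexical cohesion of two query terms: the number of collocates
    they have in common (experiment set 3)."""
    return len(set(collocates(term_a)) & set(collocates(term_b)))
```

In the query-expansion experiments the top-ranked collocates would be appended to the query; in the cohesion experiments the shared-collocate count would be computed per document rather than over the whole collection, as here.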