    Design and Construction of Semantic Document Networks Using Concept Extraction

    Processing of unstructured documents according to their content is required in many disciplines, e.g., machine translation, text analysis and mining, and information extraction and retrieval. Whilst research in fields such as text analysis, conceptualisation, and the design of semantic networks has progressed considerably in recent years, there is still a gap between state-of-the-art algorithms for extracting concepts from documents and methods for linking these concepts effectively and efficiently. This paper proposes a framework that stores processed documents in a specialised semantic network database to enhance the retrieval and analysis of common concepts across documents. We apply natural language reduction to calculate semantic cores for the concept-based indexing of stored documents. The developed prototype demonstrates advanced document storage as well as fast semantic retrieval of documents based on given key concepts.
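    As a rough illustration of the concept-based linking this abstract describes, the sketch below indexes documents by extracted concepts and connects documents that share them. The extract_concepts() helper is a hypothetical stand-in for the paper's natural-language-reduction step, and the sample documents are invented:

```python
# Minimal sketch of concept-based document linking, assuming a crude
# tokenizer in place of the paper's semantic-core extraction.
from collections import defaultdict
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is"}

def extract_concepts(text):
    # Hypothetical stand-in for concept extraction: lowercase content words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS and len(t) > 3}

docs = {
    "d1": "Semantic networks support retrieval of related documents.",
    "d2": "Concept extraction links documents in a semantic network.",
}

# Invert: concept -> documents, then link document pairs sharing a concept.
index = defaultdict(set)
for doc_id, text in docs.items():
    for concept in extract_concepts(text):
        index[concept].add(doc_id)

edges = defaultdict(set)
for concept, doc_ids in index.items():
    for a in doc_ids:
        for b in doc_ids:
            if a < b:
                edges[(a, b)].add(concept)

for (a, b), shared in edges.items():
    print(a, b, sorted(shared))  # e.g. d1 d2 ['documents', 'semantic']
```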

    A New Similarity Measure for Document Classification and Text Mining

    Accurate, efficient, and fast processing of textual data and classification of electronic documents have become a key factor in knowledge management and related businesses in today's world. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries and electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, copyright infringement, and plagiarism detection, all of which directly affect economies, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, two well-known algorithms for document classification, information retrieval, and other text mining purposes. We have tested our novel similarity measure on several structured textual data sets and compared the results against standard distance metrics and similarity measures such as cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The promising results show that the proposed similarity measure could serve as an alternative within suitable algorithms, methods, and models for text mining, document classification, and related knowledge management systems.
    Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
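    The abstract does not reproduce the proposed measure itself, so the sketch below shows the k-NN classifier it would plug into, with cosine similarity as the baseline; any similarity function with the same signature, including the paper's, could be passed in. The toy training set is invented:

```python
# k-NN document classification over term-frequency vectors; the sim
# parameter is the pluggable similarity measure the paper targets.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, train, k=3, sim=cosine):
    # train: list of (term-frequency vector, label) pairs.
    ranked = sorted(train, key=lambda dv: sim(query, dv[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]  # majority vote among k nearest

train = [
    (Counter("stocks market trading".split()), "finance"),
    (Counter("match goal league".split()), "sports"),
    (Counter("bank interest loan".split()), "finance"),
]
print(knn_classify(Counter("market loan interest".split()), train, k=2))
# -> "finance"
```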

    Visualization and Clustering of Text Retrieval

    In a fast-transforming world in which ever more objects generate data, dealing with large data collections has become a major concern for data scientists. Among the main challenges they face are representing such data well and communicating the hidden information it contains to users. Accordingly, many data analysis and data visualization techniques have been proposed, and depending on the nature of the data to visualize and the type of information to communicate, a number of data processing techniques must be considered. In this work, we analyze and visualize a sample of TREC-6 data from the TREC (Text Retrieval Conference) collections. TREC document collections comprise full text from newspaper articles and US government records, and are intended primarily for researchers developing Information Retrieval (IR) and Natural Language Processing systems. First, documents are parsed and words extracted to build a corpus in the form of a matrix. Then, Principal Component Analysis is applied to the corpus matrix to reduce its dimensionality to 2. Finally, the unsupervised K-means algorithm is used to partition the data into clusters, which are interactively visualized using popular visualization tools such as the Pie Chart, Stacked Bar Chart, and Scatter Chart. The diversity of the information contained in TREC-6 can be observed through the most frequent words of each cluster, which appear on the Bar Chart when the corresponding cluster is clicked on the Pie Chart.
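    A minimal sketch of the described pipeline on invented toy documents (TREC-6 itself is distributed under license): build a term matrix, project it to two dimensions, and cluster with K-means. TruncatedSVD stands in for PCA here because it works directly on sparse text matrices:

```python
# Term matrix -> 2-D projection -> K-means, mirroring the abstract's steps.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "government budget report",
    "federal budget records",
    "football match report",
    "league match results",
]

X = TfidfVectorizer().fit_transform(docs)            # sparse corpus matrix
X2 = TruncatedSVD(n_components=2).fit_transform(X)   # reduce to 2 dimensions
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X2)
for doc, label in zip(docs, labels):
    print(label, doc)  # X2 can be fed directly to a scatter plot
```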

    Learning to Disambiguate Syntactic Relations

    Many extensions to text-based, data-intensive knowledge management approaches, such as Information Retrieval or Data Mining, focus on integrating the impressive recent advances in language technology. For this, they need fast, robust parsers that deliver linguistic data meaningful for the subsequent processing stages. This paper introduces such a parsing system and discusses some of its disambiguation techniques, which are based on learning from a large syntactically annotated corpus. The paper is organized as follows. Section 2 explains the motivations for writing the parser and why it profits from Dependency Grammar assumptions. Section 3 gives a brief introduction to the parsing system and to evaluation questions. Section 4 presents the probabilistic models and the experiments conducted in detail.
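    As an illustration only (not the paper's model), the sketch below shows the simplest form such a corpus-trained disambiguator can take: maximum-likelihood selection of a dependency relation from treebank counts. The toy treebank triples are invented:

```python
# MLE disambiguation of syntactic relations from annotated-corpus counts.
from collections import Counter, defaultdict

# (head, dependent) -> Counter of observed relation labels.
counts = defaultdict(Counter)

treebank = [  # hypothetical (head, dependent, relation) triples
    ("eat", "fork", "instrument"),
    ("eat", "fork", "instrument"),
    ("eat", "fork", "object"),
    ("see", "telescope", "instrument"),
]
for head, dep, rel in treebank:
    counts[(head, dep)][rel] += 1

def disambiguate(head, dep):
    # Pick the relation label most frequently seen for this word pair.
    observed = counts[(head, dep)]
    return observed.most_common(1)[0][0] if observed else None

print(disambiguate("eat", "fork"))  # -> "instrument"
```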

    Flexible and efficient IR using array databases

    The Matrix Framework is a recent proposal by IR researchers to flexibly represent all important information retrieval models in a single multi-dimensional array framework. Computational support for exactly this framework is provided by the array database system SRAM (Sparse Relational Array Mapping), which works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules and demonstrate their effect on text retrieval in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing, including compression, score materialization, and quantization, as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, which provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
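    For reference, a minimal sketch of the Okapi BM25 scoring the abstract uses as its example, written as plain Python over a toy corpus; SRAM would express the same computation as an array comprehension, which is not reproduced here:

```python
# Okapi BM25: idf-weighted, length-normalized term-frequency scoring.
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc.count(term)                              # term frequency
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl)
        )
    return score

corpus = [
    ["array", "database", "query"],
    ["text", "retrieval"],
    ["query", "optimization"],
]
print(bm25_score(["query"], corpus[0], corpus))
```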

    Compressing High-Dimensional Data Spaces Using Non-Differential Augmented Vector Quantization

    Ever-growing data volumes increase query processing times and space requirements. Database compression has been shown to alleviate the I/O bottleneck, reduce disk space, improve disk access speed, speed up queries, reduce overall retrieval time, and increase the effective I/O bandwidth. However, random access to individual tuples in a compressed database is very difficult to achieve with most available compression techniques. We propose a lossless compression technique called non-differential augmented vector quantization, a close variant of the novel augmented vector quantization. The technique is applicable to a collection of tuples and is especially effective for tuples with many low- to medium-cardinality fields. In addition, the technique supports standard database operations and permits very fast random access and atomic decompression of tuples in large collections. The technique maps a database relation into a static bitmap index cached access structure. Consequently, we were able to achieve substantial savings in space by storing each database tuple as a bit value in computer memory. Important distinguishing characteristics of our technique are that (a) individual tuples can be compressed and decompressed, rather than a full page or an entire relation at a time, and (b) the information needed for tuple compression and decompression can reside in memory or, at worst, in a single page. Promising application domains include decision support systems, statistical databases, and life databases with low-cardinality fields and possibly no text fields.
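    As an illustration of the general idea (not the paper's exact scheme), the sketch below dictionary-encodes low-cardinality fields into fixed-width codes, so that any single tuple can be decoded by index without touching its neighbours:

```python
# Dictionary encoding with per-tuple random access: each field value is
# replaced by a small fixed-width code into a per-column codebook.
tuples = [("NY", "retail"), ("CA", "tech"), ("NY", "tech"), ("CA", "retail")]

# Build one codebook per column from the distinct values.
columns = list(zip(*tuples))
codebooks = [sorted(set(col)) for col in columns]
encoders = [{v: i for i, v in enumerate(book)} for book in codebooks]

# Compressed store: one small integer code per field.
compressed = [tuple(enc[v] for enc, v in zip(encoders, t)) for t in tuples]

def decompress(i):
    # Atomic random access: decode tuple i alone, in O(number of fields).
    return tuple(book[code] for book, code in zip(codebooks, compressed[i]))

print(decompress(2))  # -> ("NY", "tech")
```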

    Video browsing interfaces and applications: a review

    We present a comprehensive review of the state of the art in video browsing and retrieval systems, with special emphasis on interfaces and applications. The past decade has seen a significant increase in activity involving video data (e.g., storage, retrieval, and sharing), both for personal and professional use. The ever-growing amount of video content available for human consumption, together with the inherent characteristics of video data, which is rather unwieldy and costly to present in raw format, has become a driving force for the development of more effective solutions for presenting video content and allowing rich user interaction. As a result, there are many contemporary research efforts toward developing better video browsing solutions, which we summarize. We review more than 40 different video browsing and retrieval interfaces and classify them into three groups: applications that use video-player-like interaction, video retrieval applications, and browsing solutions based on video surrogates. For each category, we present a summary of existing work, highlight the technical aspects of each solution, and compare the solutions against each other.