
Using noun phrases extraction for the improvement of hybrid clustering with text- and citation-based components. The example of “Information Systems Research”

Author: Glänzel Wolfgang
Meyer Martin S.
Thijs Bart
Publication date: 29/06/2015

The hybrid clustering approach combining lexical and link-based similarities suffered for a long time from the different properties of the underlying networks. We propose a method based on noun phrase extraction using natural language processing to improve the measurement of the lexical component. Term shingles of different lengths are created from each of the extracted noun phrases. Hybrid networks are built based on a weighted combination of the two types of similarities with seven different weights. We conclude that removing all single-term shingles provides the best results in terms of computational feasibility, comparability with bibliographic coupling, and a community detection application.
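The shingling step described in this abstract can be illustrated with a minimal sketch. It assumes that a term shingle is a contiguous sequence of words inside one noun phrase and that the weighted combination is a simple linear blend. The function names `term_shingles` and `hybrid_similarity`, the `min_len` parameter, and the linear form are illustrative assumptions, not the authors' implementation.

```python
def term_shingles(noun_phrase, min_len=1):
    """Return all contiguous word sequences (shingles) of a noun phrase.

    min_len=2 drops single-term shingles, the variant the abstract
    reports as giving the best results.
    """
    terms = noun_phrase.lower().split()
    shingles = []
    for length in range(min_len, len(terms) + 1):
        for start in range(len(terms) - length + 1):
            shingles.append(" ".join(terms[start:start + length]))
    return shingles


def hybrid_similarity(lexical, citation, weight):
    """Hypothetical linear blend of a lexical and a citation-based similarity.

    weight is the share given to the lexical component (0.0 to 1.0).
    """
    return weight * lexical + (1.0 - weight) * citation


if __name__ == "__main__":
    print(term_shingles("hybrid clustering approach"))
    print(term_shingles("hybrid clustering approach", min_len=2))
    print(hybrid_similarity(lexical=0.8, citation=0.4, weight=0.5))
```

Raising `min_len` to 2 shrinks the lexical vocabulary considerably, which is one plausible reading of the computational-feasibility gain the abstract mentions.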

Kent Academic Repository

Natural language processing

Author: Adams
Amsler
Bangalore
Barker
Benoît
Bian
Bondale
Carrick
Ceric
Chandrasekar
Chang
Charniak
Chen
Chowdhury
Chowdhury
Costantino
Cowie
Craven
Craven
Craven
Dogru
Evans
Feldman
Fernandez
Gaizauskas
Glasgow
Haas
Hayes
Hayes
Hedlund
Herath
Ide
Isahara
Jelinek
Jeong
Jurafsky
Kazakov
Kehler
Khoo
Kim
King
Lange
Lee
Lehmam
Lehtokangas
Lewis
Liddy
Liddy
Lovis
Ma
Magnini
Mani
Manning
Marquez
Martinez
Martinez
McMurchie
Meyer
Mihalcea
Mock
Moens
Morin
Narita
Nerbonne
Oard
Ogura
Oudet
Owei
Paris
Pasero
Pedersen
Perez-Carballo
Petreley
Pirkola
Poesio
Rosenfield
Roux
Say
Scarlett
Schenker
Silber
Smeaton
Smeaton
Smith
Sokol
Song
Sparck Jones
Staab
Stock
Tolle
Trybula
Tsuda
Vickery
Waldrop
Warner
Weigard
Wilks
Wong
Yang
Yang
Zadrozny
Zweigenbaum
Publication venue: Wiley
Publication date: 01/01/2003

Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.

Crossref

University of Strathclyde Institutional Repository

OPUS - University of Technology Sydney

Term Clustering of Syntactic Phrases

Author: David D. Lewis
W. Bruce Croft
Publication date: 01/01/1990

Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for document retrieval. Since the strengths of these methods are complementary, we have explored combining them to produce superior representations. In this paper we discuss our implementation of a syntactic phrase generator, as well as our preliminary experiments with producing phrase clusters. These experiments show small improvements in retrieval effectiveness resulting from the use of phrase clusters, but it is clear that corpora much larger than standard information retrieval test collections will be required to thoroughly evaluate the use of this technique.

CiteSeerX

Crossref

Logical-Linguistic Model and Experiments in Document Retrieval

Author: Sembok Tengku Mohd Tengku
Publication venue: ProQuest Dissertations & Theses
Publication date: 01/01/1989

Conventional document retrieval systems have relied on the extensive use of the keyword approach with statistical parameters in their implementations. Now, it seems that such an approach has reached its upper limit of retrieval effectiveness, and therefore, new approaches should be investigated for the development of future systems. With current advances in hardware, programming languages and techniques, natural language processing and understanding, and generally, in the field of artificial intelligence, there are now attempts being made to include linguistic processing into document retrieval systems. A few attempts have been made to include parsing or syntactic analysis into document retrieval systems, and the results reported show some improvements in the level of retrieval effectiveness. The first part of this thesis sets out to investigate further the use of linguistic processing by including translation, instead of only parsing, into a document retrieval system. The translation process implemented is based on unification categorial grammar and uses C-Prolog as the building tool. It is used as the main part of the indexing process of documents and queries into a knowledge base predicate representation. Instead of using the vector space model to represent documents and queries, we have used a kind of knowledge base model which we call the logical-linguistic model. The development of a robust parser-translator to perform the translation is discussed in detail in the thesis. A method of dealing with ambiguity is also incorporated in the parser-translator implementation. The retrieval process of this model is based on a logical implication process implemented in C-Prolog. In order to handle uncertainty in evaluating similarity values between documents and queries, meta-level constructs are built upon the C-Prolog system. A logical meta-language, called UNIL (UNcertain Implication Language), is proposed for controlling the implication process.
Using UNIL, one can write a set of implication rules and a thesaurus to define the matching function of a particular retrieval strategy. Thus, we have demonstrated and implemented the matching operation between a document and a query as an inference using unification. An inference from a document to a query is done in the context of global information represented by the implication rules and the thesaurus. A set of well-structured experiments is performed with various retrieval strategies on a test collection of documents and queries in order to evaluate the performance of the system. The results obtained are analysed and discussed. The second part of the thesis sets out to implement and evaluate the imaging retrieval strategy as originally defined by van Rijsbergen. The imaging retrieval is implemented as a relevance feedback retrieval with nearest neighbour information which is defined as follows. One of the best retrieval strategies from the earlier experiments is chosen to perform the initial ranking of the documents, and a few top ranked documents will be retrieved and identified as relevant or not by the user. From this set of retrieved and relevant documents, we can obtain all other unretrieved documents which have any of the retrieved and relevant documents as their nearest neighbour. These unretrieved documents have the potential of also being relevant since they are 'close' to the retrieved and relevant ones, and thus their initial similarity values to the query will be updated according to their distances from their nearest neighbours. From the updated similarity values, a new ranking of documents can be obtained and evaluated.
A few sets of experiments using the imaging retrieval strategy are performed for the following objectives: to search for an appropriate updating function in order to produce a new ranking of documents, to determine an appropriate nearest neighbour set, to find the relationship of the retrieval effectiveness to the number of documents shown to the user for relevance judgement, and lastly, to find the effectiveness of a multi-stage imaging retrieval. The results obtained are analysed and discussed. Generally, the thesis sets out to define the logical-linguistic model in document retrieval and demonstrates it by building an experimental system which will be referred to as SILOL (a Simple Logical-linguistic document retrieval system). A set of retrieval strategies will be experimented with and the results obtained will be analysed and discussed.
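The nearest-neighbour feedback step outlined in this abstract can be sketched roughly as follows. The abstract leaves the updating function open (finding one is an objective of the experiments), so the additive rule below, the `boost` parameter, and the names `nearest_neighbour` and `imaging_update` are purely hypothetical choices for illustration. Distances are assumed to lie between 0 and 1.

```python
def nearest_neighbour(doc, distances):
    """Return the document closest to `doc` under the given distance table."""
    others = distances[doc]
    return min(others, key=others.get)


def imaging_update(scores, distances, retrieved, relevant, boost=0.5):
    """Raise the score of each unretrieved document whose nearest neighbour
    is a retrieved document that the user judged relevant.

    The update rule (add boost * closeness) is an assumed placeholder.
    """
    updated = dict(scores)
    for doc in scores:
        if doc in retrieved:
            continue  # only unretrieved documents are re-scored
        nn = nearest_neighbour(doc, distances)
        if nn in retrieved and nn in relevant:
            closeness = 1.0 - distances[doc][nn]
            updated[doc] = scores[doc] + boost * closeness
    return updated


if __name__ == "__main__":
    scores = {"d1": 0.9, "d2": 0.8, "d3": 0.3, "d4": 0.35}
    distances = {
        "d1": {"d2": 0.5, "d3": 0.2, "d4": 0.9},
        "d2": {"d1": 0.5, "d3": 0.6, "d4": 0.3},
        "d3": {"d1": 0.2, "d2": 0.6, "d4": 0.7},
        "d4": {"d1": 0.9, "d2": 0.3, "d3": 0.7},
    }
    new_scores = imaging_update(scores, distances,
                                retrieved={"d1", "d2"}, relevant={"d1"})
    ranking = sorted(new_scores, key=new_scores.get, reverse=True)
    print(ranking)
```

In this toy data, `d3` is nearest to the relevant document `d1` and so moves ahead of `d4`, whose nearest neighbour `d2` was judged non-relevant.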

Glasgow Theses Service

From Frequency to Meaning: Vector Space Models of Semantics

Author: Pantel Patrick
Turney Peter D.
Publication venue: AI Access Foundation
Publication date: 01/01/2010

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
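As a small illustration of the first matrix class named in this abstract, the sketch below builds a term-document matrix from raw counts and compares documents by cosine similarity. It assumes whitespace tokenisation and unweighted frequencies. The helper names are illustrative and are not taken from the survey or from any of the open source projects it discusses.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Build a term-document count matrix: one row per term, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted({term for c in counts for term in c})
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def document_vector(matrix, j):
    """Extract column j, the vector representing document j."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = ["cats chase mice", "dogs chase cats", "stocks fall"]
    vocab, matrix = term_document_matrix(docs)
    d0, d1, d2 = (document_vector(matrix, j) for j in range(3))
    print(vocab)
    print(cosine(d0, d1), cosine(d0, d2))
```

Documents that share terms end up with a higher cosine than documents with disjoint vocabularies, which is the basic intuition behind term-document VSMs.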

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

Crossref

Document highlighting - message classification in printed business letters

Author: Dengel Andreas
Hoch Rainer
Publication venue: Other institutions. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz
Publication date: 01/01/1993

This paper presents the INFOCLAS system, which applies statistical methods of information retrieval primarily to the classification of German business letters into corresponding message types such as order, offer, confirmation, etc. INFOCLAS is a first step towards document understanding. Currently, it is composed of three modules: the central indexer (extraction and weighting of indexing terms), the classifier (classification of business letters into given types) and the focuser (highlighting relevant letter parts). The system employs several knowledge sources including a database of about 100 letters, word frequency statistics for German, message type specific words, morphological knowledge as well as the underlying document model. As output, the system evaluates a set of weighted hypotheses about the type of letter at hand, or highlights relevant text (text focus), respectively. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.

Universaar

Acronym

Computer Supported Indexing: A History and Evaluation of NASA's MAI System

Author: Silvester June P.

Computer supported or machine aided indexing (MAI) can be categorized in multiple ways. The system used by the National Aeronautics and Space Administration's (NASA's) Center for AeroSpace Information (CASI) is described as semantic and computational. It is based on the co-occurrence of domain-specific terminology in parts of a sentence, and the probability that an indexer will assign a particular index term when a given word or phrase is encountered in text. The NASA CASI system is run on demand by the indexer and responds in 3 to 9 seconds with a list of suggested, authorized terms. The system was originally based on a syntactic system used in the late 1970s by the Defense Technical Information Center (DTIC). The NASA mainframe-supported system consists of three components: two programs and a knowledge base (KB). The evolution of the system is described and flow charts illustrate the MAI procedures. Tests used to evaluate NASA's MAI system were limited to those that would not slow production. A very early test indicated that MAI saved about 3 minutes and provided several additional terms for each document indexed. It was also determined that time and other resources spent in careful construction of the KB pay off with high-quality output and indexer acceptance of MAI results.
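The probabilistic idea mentioned in this abstract (how likely an indexer is to assign a term when a phrase appears) can be sketched in a few lines. This is not NASA's MAI system: the knowledge-base layout, the relative-frequency estimate, the `threshold` parameter, and the function names `build_kb` and `suggest_terms` are all assumptions made for illustration, and the sample records are invented.

```python
from collections import Counter, defaultdict


def build_kb(records):
    """Estimate P(index term | phrase) from past indexing decisions.

    records: iterable of (phrases_in_text, terms_assigned_by_indexer) pairs.
    Returns {phrase: {term: probability}} using relative frequencies.
    """
    phrase_count = Counter()
    pair_count = defaultdict(Counter)
    for phrases, terms in records:
        for phrase in set(phrases):
            phrase_count[phrase] += 1
            for term in set(terms):
                pair_count[phrase][term] += 1
    return {
        phrase: {term: n / phrase_count[phrase]
                 for term, n in pair_count[phrase].items()}
        for phrase in phrase_count
    }


def suggest_terms(kb, phrases, threshold=0.5):
    """Suggest index terms whose estimated probability meets the threshold,
    ordered by probability (highest first), then alphabetically."""
    best = {}
    for phrase in phrases:
        for term, prob in kb.get(phrase, {}).items():
            if prob >= threshold:
                best[term] = max(best.get(term, 0.0), prob)
    return sorted(best, key=lambda t: (-best[t], t))


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    records = [
        (["wind tunnel"], ["WIND TUNNELS", "AERODYNAMICS"]),
        (["wind tunnel"], ["WIND TUNNELS"]),
        (["boundary layer"], ["BOUNDARY LAYERS"]),
    ]
    kb = build_kb(records)
    print(suggest_terms(kb, ["wind tunnel"], threshold=0.6))
```

A higher threshold trades recall for precision, which mirrors the abstract's point that careful KB construction affects indexer acceptance of the suggestions.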

NASA Technical Reports Server