14 research outputs found

    On-line learning for adaptive text filtering.

    Yu Kwok Leung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. Includes bibliographical references (leaves 91-96). Abstracts in English and Chinese.
    Chapter 1 Introduction --- p.1
    1.1 The Problem --- p.1
    1.2 Information Filtering --- p.2
    1.3 Contributions --- p.7
    1.4 Organization Of The Thesis --- p.10
    Chapter 2 Related Work --- p.12
    Chapter 3 Adaptive Text Filtering --- p.22
    3.1 Representation --- p.22
    3.1.1 Textual Document --- p.23
    3.1.2 Filtering Profile --- p.28
    3.2 On-line Learning Algorithms For Adaptive Text Filtering --- p.29
    3.2.1 The Sleeping Experts Algorithm --- p.29
    3.2.2 The EG-based Algorithms --- p.32
    Chapter 4 The REPGER Algorithm --- p.37
    4.1 A New Approach --- p.37
    4.2 Relevance Prediction By RElevant feature Pool --- p.42
    4.3 Retrieving Good Training Examples --- p.45
    4.4 Learning Dissemination Threshold Dynamically --- p.49
    Chapter 5 The Threshold Learning Algorithm --- p.50
    5.1 Learning Dissemination Threshold Dynamically --- p.50
    5.2 Existing Threshold Learning Techniques --- p.51
    5.3 A New Threshold Learning Algorithm --- p.53
    Chapter 6 Empirical Evaluations --- p.55
    6.1 Experimental Methodology --- p.55
    6.2 Experimental Settings --- p.59
    6.3 Experimental Results --- p.62
    Chapter 7 Integrating With Feature Clustering --- p.76
    7.1 Distributional Clustering Algorithm --- p.79
    7.2 Integrating With Our REPGER Algorithm --- p.82
    7.3 Empirical Evaluation --- p.84
    Chapter 8 Conclusions --- p.87
    8.1 Summary --- p.87
    8.2 Future Work --- p.88
    Bibliography --- p.91
    Appendix A Experimental Results On The AP Corpus --- p.97
    A.1 The EG Algorithm --- p.97
    A.2 The EG-C Algorithm --- p.98
    A.3 The REPGER Algorithm --- p.100
    Appendix B Experimental Results On The FBIS Corpus --- p.102
    B.1 The EG Algorithm --- p.102
    B.2 The EG-C Algorithm --- p.103
    B.3 The REPGER Algorithm --- p.105
    Appendix C Experimental Results On The WSJ Corpus --- p.107
    C.1 The EG Algorithm --- p.107
    C.2 The EG-C Algorithm --- p.108
    C.3 The REPGER Algorithm --- p.11

    Supporting the Chinese Language in Oracle Text

    This work examines the problems of Chinese Information Retrieval (IR) and the factors that can influence the performance of a Chinese IR system. Experiments were conducted within the evaluation framework of the "TREC-5 Chinese Track", using a large corpus of over 160,000 Chinese news articles on an Oracle10g (beta version) database. Finally, the performance of Oracle® Text was compared against the results of the TREC-5 participants in a benchmarking process. The main findings of this work are: (a) the effectiveness of a Chinese IR system is strongly influenced by the way a query is formulated; in particular, when formulating a query one should take into account the large number of abbreviations and the regional differences in the Chinese language, as well as the various transcriptions of non-Chinese proper names; (b) stopwords have no influence on the performance of a Chinese IR system; (c) users tend to formulate shorter queries, and retrieval results are particularly poor when feedback and query expansion are not used; (d) compared to the Chinese_Vgram_Lexer, the Chinese_Lexer has the advantage of producing real words and a smaller index, and of achieving higher precision in the retrieval results; and (e) the performance of Oracle® Text for Chinese IR is comparable to the TREC-5 results

    Interrelation of the Term Extraction and Query Expansion techniques applied to the retrieval of textual documents

    Tese (doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-graduação em Engenharia e Gestão do Conhecimento. According to Singhal (2006), people recognize the importance of storing and searching information, and with the advent of computers it became possible to store large quantities of it in databases. Consequently, cataloguing the information in these databases became indispensable. In this context the field of Information Retrieval emerged in the 1950s, with the aim of promoting the construction of computational tools that would allow users to make more efficient use of these databases. The main objective of the present research is to develop a computational model that enables the retrieval of textual documents ranked by semantic similarity, based on the intersection of the Term Extraction and Query Expansion techniques

    Utilisation of metadata fields and query expansion in cross-lingual search of user-generated Internet video

    Recent years have seen significant efforts in the area of Cross-Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion of online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has, to date, been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual metadata associated with the UGC, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from translation errors common to all CLIR tasks, but also from recognition errors introduced by the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the user-created metadata for each video. This work proposes and evaluates techniques to improve CLIR effectiveness for such noisy UGC content. Our experimental investigation shows that different sources of evidence, e.g. the content of different fields of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that each metadata field has a varying robustness to query expansion (QE), which can hence have a negative impact on CLIR effectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion and shows how this technique can be effective in improving CLIR effectiveness for UGC content
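    The abstract above does not specify the adaptive QE mechanism itself. As general background, the classical pseudo-relevance-feedback form of query expansion that such work builds on can be sketched as follows; this is a minimal illustration, and all function and variable names are hypothetical, not taken from the thesis:

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_k=2, num_terms=2):
    """Pseudo-relevance feedback: assume the top-k retrieved documents
    are relevant, and append their most frequent novel terms to the query."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        # Count only terms not already in the query.
        counts.update(t for t in doc.split() if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return list(query_terms) + expansion

# Example: expanding a one-word query using the top two retrieved documents.
docs = ["cat video funny cat", "cat meme funny", "dog park"]
print(expand_query(["cat"], docs))  # ['cat', 'funny', 'video']
```

    The adaptive element described in the abstract would then amount to choosing, per query, which metadata field the expansion terms are drawn from, rather than expanding from all fields uniformly.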

    Automatic indexing incorporating semantic relations: results of the retrieval test for the MILOS II project

    Within MILOS II, the first MILOS project on the automatic indexing of title data was extended by a semantic component: thesaurus relations from the Schlagwortnormdatei (the German subject headings authority file) were incorporated. The retrieval test carried out at the end of the project for its evaluation, together with its results, is the focus of this text. In addition, an overview of retrieval tests already conducted (predominantly in the Anglo-American world) is given, and the fundamental questions that must be considered when carrying out a retrieval test in practice are explained

    Financial information extraction using pre-defined and user-definable templates in the Lolita system

    Get PDF
    Financial operators today have access to an extremely large amount of data, both quantitative and qualitative, real-time or historical, and can use this information to support their decision-making process. Quantitative data are largely processed by automatic computer programs, often based on artificial intelligence techniques, that produce quantitative analyses such as historical price analysis or technical analysis of price behaviour. By contrast, little progress has been made in the processing of qualitative data, which mainly consist of financial news articles from financial newspapers or on-line news providers. As a result, financial market players are overloaded with qualitative information which is potentially extremely useful but, owing to lack of time, is often ignored. The goal of this work is to reduce the qualitative data-overload of financial operators. The research involves identifying the information in the source financial articles that is relevant to financial operators' investment decision-making process, and implementing the associated templates in the LOLITA system. The system should process a large number of source articles and extract specific templates according to the relevant information located in the source articles. The project also involves the design and implementation in LOLITA of a user-definable template interface that allows users to easily design new templates using sentences in natural language. This enables user-defined information extraction from source texts, in contrast to most existing information extraction systems, which require developers to code templates directly into the system. The results of the research have shown that the system performed well in extracting financial templates from source articles, which would allow financial operators to reduce their qualitative data-overload. The results have also shown that the user-definable template interface is a viable approach to user-defined information extraction. A trade-off has been identified between the ease of use of the user-definable template interface and the loss of performance compared to hand-coded templates

    Towards effective cross-lingual search of user-generated internet speech

    The very rapid growth in user-generated social spoken content on online platforms is creating new challenges for Spoken Content Retrieval (SCR) technologies. There are many potential choices for how to design a robust SCR framework for user-generated speech (UGS) content, but the current lack of detailed investigation means that the specific challenges are poorly understood and there is little or no guidance available to inform these choices. This thesis investigates the challenges of effective SCR for UGS content and proposes novel SCR methods designed to cope with them. The work presented in this thesis makes three contributions. The first is a critique of the issues and challenges that influence the effectiveness of searching UGS content in both mono-lingual and cross-lingual settings. The second is the development of an effective Query Expansion (QE) method for UGS. This research reports that the variation in the length, quality and structure of the relevant documents encountered in UGS content can harm the effectiveness of QE techniques across different queries. Seeking to address this issue, this work examines the use of Query Performance Prediction (QPP) techniques for improving QE in UGS, and presents a novel framework specifically designed for predicting the effectiveness of QE. Thirdly, this work extends the use of QPP in UGS search to improve cross-lingual search for UGS by predicting translation effectiveness. The thesis proposes novel methods to estimate the quality of translation for cross-lingual UGS search. An empirical evaluation demonstrates the quality of the proposed method on alternative translation outputs extracted from several Machine Translation (MT) systems developed for this task. The research then shows how this framework can be integrated into cross-lingual UGS search to find relevant translations for improved retrieval performance
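    The abstract above does not give the details of the QPP framework. As background, a common family of post-retrieval predictors estimates query difficulty from the spread of the top-ranked retrieval scores (a peaked distribution tends to signal an easier query), and such a predictor can gate whether QE is applied. The sketch below is illustrative only, with hypothetical function names and an arbitrary threshold:

```python
import statistics

def qpp_score(scores, k=10):
    """Post-retrieval query performance predictor: the standard deviation
    of the top-k retrieval scores, in the spirit of score-dispersion
    baselines from the QPP literature."""
    top = sorted(scores, reverse=True)[:k]
    if len(top) < 2:
        return 0.0
    return statistics.pstdev(top)

def should_expand(scores, threshold=0.05):
    """Apply query expansion only when the predictor suggests the initial
    retrieval was good enough for its top documents to seed expansion."""
    return qpp_score(scores) >= threshold

# A flat score distribution (hard query) suppresses expansion;
# a peaked one (easy query) permits it.
print(should_expand([0.9, 0.2, 0.1]))  # True
print(should_expand([0.5, 0.5, 0.5]))  # False
```

    For the cross-lingual case described in the abstract, the same kind of predictor would be computed per candidate translation, and the translation scoring highest would be selected for retrieval.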