24 research outputs found

    On-line learning for adaptive text filtering.

    Yu Kwok Leung. Thesis (M.Phil.), Chinese University of Hong Kong, 1999. Includes bibliographical references (leaves 91-96). Abstracts in English and Chinese.

    Contents:
    Chapter 1: Introduction
        1.1 The Problem
        1.2 Information Filtering
        1.3 Contributions
        1.4 Organization Of The Thesis
    Chapter 2: Related Work
    Chapter 3: Adaptive Text Filtering
        3.1 Representation
            3.1.1 Textual Document
            3.1.2 Filtering Profile
        3.2 On-line Learning Algorithms For Adaptive Text Filtering
            3.2.1 The Sleeping Experts Algorithm
            3.2.2 The EG-based Algorithms
    Chapter 4: The REPGER Algorithm
        4.1 A New Approach
        4.2 Relevance Prediction By RElevant feature Pool
        4.3 Retrieving Good Training Examples
        4.4 Learning Dissemination Threshold Dynamically
    Chapter 5: The Threshold Learning Algorithm
        5.1 Learning Dissemination Threshold Dynamically
        5.2 Existing Threshold Learning Techniques
        5.3 A New Threshold Learning Algorithm
    Chapter 6: Empirical Evaluations
        6.1 Experimental Methodology
        6.2 Experimental Settings
        6.3 Experimental Results
    Chapter 7: Integrating With Feature Clustering
        7.1 Distributional Clustering Algorithm
        7.2 Integrating With Our REPGER Algorithm
        7.3 Empirical Evaluation
    Chapter 8: Conclusions
        8.1 Summary
        8.2 Future Work
    Bibliography
    Appendix A: Experimental Results On The AP Corpus
        A.1 The EG Algorithm
        A.2 The EG-C Algorithm
        A.3 The REPGER Algorithm
    Appendix B: Experimental Results On The FBIS Corpus
        B.1 The EG Algorithm
        B.2 The EG-C Algorithm
        B.3 The REPGER Algorithm
    Appendix C: Experimental Results On The WSJ Corpus
        C.1 The EG Algorithm
        C.2 The EG-C Algorithm
        C.3 The REPGER Algorithm

    Support Vector Machines and Kernel Functions for Text Processing

    This work presents kernel functions that can be used with the Support Vector Machine (SVM) learning algorithm to solve the automatic text classification task. First, the Vector Space Model for text processing is presented. In this model, text is seen as a set of vectors in a high-dimensional space; extensions and alternative models are then derived, and some preprocessing procedures are discussed. The SVM learning algorithm, widely used for text classification, is outlined: its decision procedure is obtained as the solution of an optimization problem. The "kernel trick", which allows the algorithm to be applied in non-linearly separable cases, is presented, along with some kernel functions currently used in text applications. Finally, some text classification experiments using the SVM classifier are conducted to illustrate some text preprocessing techniques and the presented kernel functions.

    Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu

    University of Minnesota Ph.D. dissertation. February 2018. Major: Computer Science. Advisors: Vipin Kumar, Blake Howald. 1 computer file (PDF); xi, 236 pages.

    Search is not a solved problem, even in the world of Google's and Bing's state-of-the-art engines. Google and similar search engines are keyword based. Keyword-based searching suffers from the vocabulary mismatch problem: the terms in a document and in the user's information request do not overlap, as with "cars" and "automobiles". This phenomenon is called synonymy. Similarly, the user's term may be polysemous: a user is inquiring about a river's bank, but documents about financial institutions are matched. Vocabulary mismatch is exacerbated when the search occurs in a Morphologically Rich Language (MRL). Concept search techniques like dimensionality reduction do not improve search in Morphologically Rich Languages. Names occur frequently in news text and determine the "what," "where," "when," and "who" in the news text. Named Entity Recognition (NER) attempts to recognize names automatically in text, but these techniques are far from mature in MRLs, especially in Arabic-script languages. Urdu is one of the focus MRLs of this dissertation, alongside Arabic, Farsi, Hindi, and Russian, but it does not have the enabling technologies for NER and search. A corpus, a stop word generation algorithm, a light stemmer, a baseline, and an NER algorithm are created so that NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search on Arabic, Russian, Urdu, and English shows significant improvement over the baseline. Furthermore, this dissertation highlights the challenges of research in low-resource MRLs.

    Effective retrieval techniques for Arabic text

    Arabic is a major international language, spoken in more than 23 countries, and the lingua franca of the Islamic world. The number of Arabic-speaking Internet users in the Middle East grew more than nine-fold between 2000 and 2007, yet research in Arabic Information Retrieval (AIR) has not advanced as it has in other languages such as English. In this thesis, we explore techniques that improve the performance of AIR systems. Stemming is considered one of the most important factors in improving the retrieval effectiveness of AIR systems. Most current stemmers remove affixes without checking whether the removed letters are actually affixes. We propose lexicon-based improvements to light stemming that distinguish core letters from proper Arabic affixes. We devise rules to stem most affixes and show their effects on retrieval effectiveness. Using the TREC 2001 test collection, we show that applying relevance feedback with our rules produces significantly better results than light stemming. Techniques for Arabic information retrieval have been studied in depth on clean collections of newswire dispatches. However, the effectiveness of such techniques is not known on noisy collections in which text is generated using automatic speech recognition (ASR) systems and queries are generated using machine translation (MT). Using noisy collections, we show that normalisation, stopping and light stemming improve results as in normal text collections, but that n-grams and root stemming decrease performance. Most recent AIR research has been undertaken using collections that are far smaller than the collections used for English text retrieval; consequently, the significance of some published results is debatable. Using the LDC Arabic GigaWord collection, which contains more than 1 500 000 documents, we create a test collection of 90 topics with their relevance judgements. Using this test collection, we show empirically that for a large collection, root stemming is not competitive. Of the approaches we have studied, lexicon-based stemming approaches perform better than light stemming approaches alone. Arabic text commonly includes foreign words transliterated into Arabic characters. Several transliterated forms may be in common use for a single foreign word, but users rarely use more than one variant during search tasks. We test the effectiveness of lexicons, Arabic patterns, and n-grams in distinguishing foreign words from native Arabic words. We introduce rules that help filter foreign words and improve the n-gram approach used in language identification. Our combined n-grams and lexicon approach successfully identifies 80% of all foreign words with a precision of 93%. To find variants of a specific foreign word, we apply phonetic and string similarity techniques and introduce novel algorithms to normalise them in Arabic text. We modify phonetic techniques used for English to suit the Arabic language, and compare several techniques to determine their effectiveness in finding foreign word variants. We show that our algorithms significantly improve recall. We also show that expanding queries using variants identified by our Soutex4 phonetic algorithm results in a significant improvement in precision and recall. Together, the approaches described in this thesis represent an important step towards realising highly effective retrieval of Arabic text.

    Conferentie informatiewetenschap 1999 : Centrum voor Wiskunde en Informatica, 12 november 1999 : proceedings


    Public Key Infrastructure


    Congenial Web Search : A Conceptual Framework for Personalized, Collaborative, and Social Peer-to-Peer Retrieval

    Traditional information retrieval methods fail to address the fact that information consumption and production are social activities. Most Web search engines do not consider the social-cultural environment of users' information needs or the collaboration between users. This dissertation addresses a new search paradigm for Web information retrieval, denoted Congenial Web Search. It emphasizes personalization, collaboration, and socialization methods in order to improve effectiveness. The client-server architecture of Web search engines only allows the consumption of information. A peer-to-peer system architecture has been developed in this research to improve information seeking. Each user is involved in an interactive process to produce meta-information. Based on a personalization strategy on each peer, the user is supported in giving explicit feedback on relevant documents. The user's information need is expressed by a query that is stored in a Peer Search Memory. On the one hand, query-document associations are incorporated in a personalized ranking method for repeated information needs. The performance is shown in a known-item retrieval setting. On the other hand, the explicit feedback of each user is useful for discovering collaborative information needs. A new method for a controlled grouping of query terms, links, and users was developed to maintain Virtual Knowledge Communities. The quality of this grouping represents the effectiveness of grouped terms and links. Both strategies, personalization and collaboration, tackle the problem of missing socialization among searchers. Finally, a concept for integrated information seeking was developed. This incorporates an integrated representation to improve the effectiveness of information retrieval and information filtering. An integrated information retrieval process explores a virtual search network of Peer Search Memories in order to accomplish a reputation-based ranking. In addition, the community structure is considered by an integrated information filtering process. Both concepts have been evaluated and shown to perform better than traditional techniques. The methods presented in this dissertation offer the potential for more transparency and control in Web search.