11 research outputs found

    Deduction over Mixed-Level Logic Representations for Text Passage Retrieval

    A system is described that uses a mixed-level representation of (part of) the meaning of natural language documents (based on standard Horn Clause Logic) and a variable-depth search strategy that distinguishes between the different levels of abstraction in the knowledge representation to locate specific passages in the documents. Mixed-level representations as well as variable-depth search strategies are applicable in fields outside that of NLP.
    Comment: 8 pages, Proceedings of the Eighth International Conference on Tools with Artificial Intelligence (TAI'96), Los Alamitos C
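A toy sketch of the general idea, assuming nothing beyond the abstract: knowledge about a document is encoded as Horn-style clauses tagged with an abstraction level, and a backward-chaining search spends a depth budget only on clauses whose level it can afford. The clause format, the example predicates and the depth_budget parameter are illustrative assumptions, not the paper's actual representation or search strategy.

```python
# Illustrative only: propositional Horn-style clauses tagged with an
# abstraction level, searched by a depth-limited backward chainer.
from dataclasses import dataclass

@dataclass
class Clause:
    head: str      # conclusion atom
    body: tuple    # premise atoms (empty tuple = fact)
    level: int     # abstraction level: 0 = surface facts, higher = deeper semantics

# Invented example knowledge about one document passage.
KB = [
    Clause("mentions(doc1, retrieval)", (), 0),
    Clause("mentions(doc1, logic)", (), 0),
    Clause("about(doc1, logic_based_ir)",
           ("mentions(doc1, retrieval)", "mentions(doc1, logic)"), 1),
    Clause("relevant(doc1, query_logic_ir)", ("about(doc1, logic_based_ir)",), 2),
]

def prove(goal: str, depth_budget: int) -> bool:
    """Backward chaining; deeper abstraction levels consume more of the budget."""
    for clause in KB:
        if clause.head != goal or clause.level > depth_budget:
            continue
        remaining = depth_budget - clause.level
        if all(prove(premise, remaining) for premise in clause.body):
            return True
    return False

print(prove("relevant(doc1, query_logic_ir)", 0))  # False: shallow search stops at surface facts
print(prove("relevant(doc1, query_logic_ir)", 3))  # True: deeper search chains through abstract clauses
```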

    Information retrieval (Part 2): Document representations


    Information retrieval (Part I): Introduction


    Text mining for social sciences: new approaches

    The rise of the Internet has brought about an important change in the way we look at the world, and consequently in the way we measure it. As of June 2018, more than 55% of the world's population had Internet access. It follows that, every day, we are able to quantify what more than four billion people do, and how and when they do it. This means data. The availability of all these data raises more than one question: How do we manage them? How do we treat them? How do we extract information from them? Now, more than ever before, we need to think about new rules, new methods and new procedures for handling this huge amount of data, which is characterized by being unstructured, raw and messy. One of the most interesting challenges in this field concerns the implementation of processes for deriving information from textual sources; this process is known as Text Mining. Born in the mid-90s, Text Mining is a prolific field which has evolved, thanks to technological progress, from Automatic Text Analysis, a set of methods for the description and analysis of documents.

    Textual data, even when transformed into a structured format, present several critical issues, as they are characterized by high dimensionality and noise. Moreover, online texts, such as social media posts or blog comments, are most of the time very short, which makes the encoded matrices even sparser. All of this poses the problem of finding new and advanced solutions for treating Web data that are able to overcome these issues and, at the same time, return the information contained in these texts. The objective is to propose a fast and scalable method, able to deal with the characteristics of online texts, and therefore with big and sparse matrices. To that end, we propose a procedure that runs from the collection of texts to the interpretation of the results. The innovative parts of this procedure are the choice of the weighting scheme for the term-document matrix and the co-clustering approach for data classification. To verify the validity of the procedure, we test it on two real applications: one concerning the topic of safety and health at work and another regarding the Brexit vote. It is shown how the technique works on different types of texts, allowing us to obtain meaningful results.

    For the reasons described above, in this research work we implement and test on real datasets a new procedure for content analysis of textual data, using a two-way approach in the Text Clustering field. As will be shown in the following pages, Text Clustering is an unsupervised classification process that reproduces the internal structure of the data by dividing the texts into different groups on the basis of lexical similarities. Text Clustering is mostly used for content analysis, and it can be applied to the classification of words, documents or both; in the latter case we speak of two-way clustering, which is the specific approach implemented within this research work for the treatment of the texts. To better organize the research work, we divided it into two parts: a first part on theory and a second on application. The first part contains a preliminary chapter reviewing the literature on Automatic Text Analysis in the context of the data revolution, and a second chapter where the new procedure for text co-clustering is proposed. The second part concerns the application of the proposed techniques to two different sets of texts, one composed of news items and the other of tweets. The idea is to test the same procedure on different types of texts, in order to verify the validity and robustness of the method.
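The thesis's own weighting scheme and co-clustering algorithm are not reproduced here, but a minimal sketch of a comparable two-way pipeline can be assembled from standard components, assuming TF-IDF weighting and scikit-learn's SpectralCoclustering; the toy documents merely mirror the two application domains mentioned above.

```python
# Minimal two-way (co-)clustering sketch: TF-IDF weighting of a sparse
# term-document matrix, then simultaneous clustering of documents and terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralCoclustering

docs = [
    "workers safety rules at the workplace",
    "health and safety inspections at work",
    "brexit vote and the referendum campaign",
    "uk politics after the brexit referendum",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # sparse, high-dimensional matrix

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

terms = vectorizer.get_feature_names_out()
for k in range(2):
    cluster_docs = list((model.row_labels_ == k).nonzero()[0])
    cluster_terms = [t for t, in_k in zip(terms, model.column_labels_ == k) if in_k]
    print(f"cluster {k}: docs {cluster_docs}, terms {cluster_terms}")
```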

    Measuring the Stability of Query Term Collocations and Using it in Document Ranking

    Delivering the right information to the user is fundamental in an information retrieval system. Many traditional information retrieval models assume word independence and view a document as a bag of words; however, getting the right information requires a deep understanding of the content of the document and of the relationships that exist between words in the text. This study focuses on developing two new document ranking techniques, which are based on the lexical cohesive relationship of collocation. Collocation is a semantic relationship that exists between words that co-occur in the same lexical environment. Two types of collocation relationship have been considered: collocation in the same grammatical structure (such as a sentence), and collocation in the same semantic structure, where query terms occur in different sentences but co-occur with the same words. In the first technique, only the first type of collocation is considered when calculating the document score: the positional frequency of query term co-occurrence is used to identify collocation relationships between query terms and to calculate query term weights. In the second technique, both types of collocation are considered: the co-occurrence frequency distribution within a predefined window is used to determine query term collocations and to compute query term weights. Evaluation of the proposed techniques shows performance gains for some of the collocations over the chosen baseline runs.
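The thesis's actual weighting formulas are not reproduced here; the toy sketch below only illustrates the window-based notion of collocation, with an invented scoring function that counts query-term pairs co-occurring within a fixed window.

```python
# Toy sketch: count query-term pairs that co-occur within a fixed window and
# use the count as a (simplistic, invented) collocation boost for a document.
from itertools import combinations

def collocation_score(doc_tokens, query_terms, window=5):
    positions = {q: [i for i, t in enumerate(doc_tokens) if t == q] for q in query_terms}
    score = 0
    for q1, q2 in combinations(query_terms, 2):
        score += sum(1 for i in positions[q1] for j in positions[q2] if abs(i - j) <= window)
    return score

doc = ("the query terms information and retrieval often appear close together "
       "in information retrieval texts").split()
print(collocation_score(doc, ["information", "retrieval"]))  # 2
```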

    Text mining techniques for patent analysis.

    Patent documents contain important research results. However, they are lengthy and rich in technical terminology, so analysing them takes a great deal of human effort. Automatic tools for assisting patent engineers or decision makers in patent analysis are in great demand. This paper describes a series of text mining techniques that conform to the analytical process used by patent analysts. These techniques include text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and information mapping. The issues of efficiency and effectiveness are considered in the design of these techniques. Important features of the proposed methodology include a rigorous approach to verifying the usefulness of segment extracts as document surrogates, a corpus- and dictionary-free algorithm for keyphrase extraction, an efficient co-word analysis method that can be applied to large volumes of patents, and an automatic procedure for creating generic cluster titles for ease of result interpretation. An evaluation of these techniques was conducted. The results confirm that the machine-generated summaries preserve more important content words than some other sections for classification. To demonstrate feasibility, the proposed methodology was applied to a real-world patent set for domain analysis and mapping, which shows that our approach is more effective than existing classification systems. The attempt in this paper to automate the whole process not only helps create final patent maps for topic analyses, but also facilitates or improves other patent analysis tasks such as patent classification, organization, knowledge sharing, and prior art searches.
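As an illustration of just one of the listed steps, the sketch below shows a bare-bones co-word count over patent segments; the segments and phrases are invented, and the paper's actual co-word analysis and clustering methods are not reproduced.

```python
# Bare-bones co-word analysis sketch: count how often pairs of keyphrases
# appear in the same patent segment (illustrative data, not the paper's method).
from collections import Counter
from itertools import combinations

segments = [
    {"touch screen", "capacitive sensor", "display panel"},
    {"touch screen", "display panel"},
    {"capacitive sensor", "fingerprint reader"},
]

cooccurrence = Counter()
for seg in segments:
    for pair in combinations(sorted(seg), 2):
        cooccurrence[pair] += 1

# Frequently co-occurring phrases hint at clusters of related technical topics.
for pair, count in cooccurrence.most_common(3):
    print(pair, count)
```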

    Information extraction for ontology-based semantic search on the Web

    Master's dissertation - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia Elétrica.
    Information Retrieval (IR) systems play a fundamental role in searching for pages on the Web. However, the results returned by these systems are not very precise, bringing back a lot of information that does not match the user's interest. This happens because of the lack of semantics in Web pages and in the search criteria adopted by IR systems. In this work we propose an Information Extraction (IE) system based on ontologies. The goal is to extract information from pages previously classified semantically by MASTER-Web, a cognitive multi-agent system for retrieval, classification and extraction of information on the Web. Ontologies are employed as the knowledge representation formalism and allow the knowledge to be separated into three types: knowledge about the domain, knowledge about the Web page, and knowledge about the information to be extracted. Production rules are used to represent knowledge about the extraction process. Information is treated as a set of data items that are extracted individually and then combined so that they form consistent information. These two steps define the two phases of extraction: individual extraction and integration. In the first phase the data items are extracted individually; in the second phase, the items that relate to one another in some way are joined to form the information. The proposed system allows portability and reusability of knowledge, as well as flexibility in the representation and maintenance of the knowledge about extraction. Experiments were carried out to evaluate the system. To validate the experiments, the results obtained were compared with those of another IE system, with quite satisfactory results.
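A very small sketch of the two-phase idea described above, under clearly labelled assumptions: the page text, the slot names and the pattern-based rules are invented, and MASTER-Web's ontologies and production-rule engine are not modelled.

```python
# Invented example of the two extraction phases: individual extraction of data
# items with rule-like patterns, then integration into one consistent record.
import re

page = "Call for Papers: ICWE 2004. Deadline: 15 March 2004. Location: Munich, Germany."

# Phase 1: individual extraction, one pattern per slot (hypothetical rules).
rules = {
    "event":    r"Call for Papers:\s*([^.]+)",
    "deadline": r"Deadline:\s*([^.]+)",
    "location": r"Location:\s*([^.]+)",
}
extracted = {slot: m.group(1).strip()
             for slot, pattern in rules.items()
             if (m := re.search(pattern, page))}

# Phase 2: integration, combining the individual data items into one record.
record = {"type": "conference_announcement", **extracted}
print(record)
```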

    A stemming algorithm for Latvian

    The thesis covers the construction, application and evaluation of a stemming algorithm for advanced information searching and retrieval in Latvian databases. Its aim is to examine the following two questions: Is it possible to apply to Latvian a suffix removal algorithm originally designed for English? Can stemming in Latvian produce the same or better information retrieval results than manual truncation? In order to achieve these aims, the role and importance of automatic word conflation for both document indexing and information retrieval are characterised. A review of the literature, which analyses and evaluates different types of stemming techniques and the historical development of stemming algorithms, justifies the necessity of applying this advanced IR method to Latvian as well. A comparative analysis of the morphological structures of English and Latvian determined the selection of Porter's suffix removal algorithm as the basis for the Latvian stemmer. An extensive list of Latvian stopwords, including conjunctions, particles and adverbs, was designed and added to the initial stemmer in order to eliminate insignificant words from further processing. A number of specific modifications and changes related to the Latvian language were made to the structure and rules of the original stemming algorithm. An analysis of word stemming based on a Latvian electronic dictionary and Latvian text fragments confirmed that the suffix removal technique can be successfully applied to the Latvian language as well. An evaluation study of user search statements revealed that the stemming algorithm can, to a certain extent, improve the effectiveness of information retrieval.
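By way of illustration only, the sketch below strips a handful of assumed Latvian noun endings in the Porter spirit; the suffix list is a made-up subset and does not reflect the rule set actually developed and evaluated in the thesis.

```python
# Toy suffix-removal stemmer (illustrative assumption: this tiny suffix list is
# not the rule set developed in the thesis).
SUFFIXES = ["ajam", "iem", "am", "as", "ai", "is", "us", "es", "em", "a", "e", "i", "s", "u"]

def stem(word: str, min_stem_len: int = 3) -> str:
    # Try longer endings first; never leave a stem shorter than min_stem_len.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["grāmatas", "grāmatai", "grāmatu"]:   # inflected forms of "grāmata" (book)
    print(w, "->", stem(w))                      # all conflate to "grāmat"
```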