
    Text Mining for English-Language Document Search Using Suffix Tree Clustering

    Searching a document collection generally returns excerpts of the matching documents, arranged by rank in a long list. A search often yields tens or even hundreds of document fragments, forcing the user to scroll up and down to examine the snippets one by one. This makes it difficult for the user to determine which documents are relevant to the topic of interest. This final project develops a web-based document segmentation application using the suffix tree clustering method. The basic concept of the method is to group the documents in the search results into clusters based on the words or phrases they contain. The application takes a search query as input and outputs clusters containing the corresponding documents. Clusters can be nested, depending on the words or phrases that distinguish documents within the same parent cluster. The generated clusters are displayed to the user. When a final cluster is selected, the application displays its collection of documents, each shown with its title and snippet. With this method, search results are expected to be easier to browse. Keywords: text mining, suffix tree, suffix tree clustering, document grouping
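
    The grouping idea in this abstract (clustering search results by the words or phrases they share) can be illustrated with a minimal sketch. It is not the project's implementation: a dictionary of word n-grams stands in for a real suffix tree, and the function name `base_clusters`, the parameters `max_len` and `min_docs`, and the sample snippets are all hypothetical. Nesting or merging of clusters is left out.

```python
from collections import defaultdict


def base_clusters(snippets, max_len=3, min_docs=2):
    """Map each shared word phrase to the set of snippet indices containing it.

    Assumption: a plain dictionary of word n-grams (up to max_len words)
    replaces the suffix tree; it finds the same shared phrases on small
    inputs but lacks the suffix tree's efficiency.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        # Enumerate every phrase of 1..max_len consecutive words.
        for start in range(len(words)):
            last = min(start + max_len, len(words))
            for end in range(start + 1, last + 1):
                phrase_docs[tuple(words[start:end])].add(doc_id)
    # Keep only phrases shared by at least min_docs snippets.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


# Hypothetical search-result snippets.
snippets = [
    "suffix tree clustering groups search results",
    "search results grouped by suffix tree clustering",
    "vector space models of semantics",
]
clusters = base_clusters(snippets)
for phrase, docs in sorted(clusters.items()):
    print(" ".join(phrase), sorted(docs))
```

    Each surviving phrase acts as a cluster label, and the associated set lists the snippets that would be shown when that cluster is selected.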

    From Frequency to Meaning: Vector Space Models of Semantics

    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
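
    As a small illustration of the first matrix class named in this abstract, the sketch below builds a term-document matrix of raw word counts and compares documents by the cosine of their column vectors. It is a toy example under stated assumptions, not code from the survey: the helper names (`term_document_matrix`, `column`, `cosine`) and the sample documents are hypothetical, and no weighting such as tf-idf is applied.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted({term for c in counts for term in c})
    # Rows are terms, columns are documents.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical documents.
docs = [
    "dutch audio indexing",
    "audio indexing for dutch archives",
    "vector space semantics",
]
vocab, matrix = term_document_matrix(docs)
sim_01 = cosine(column(matrix, 0), column(matrix, 1))
sim_02 = cosine(column(matrix, 0), column(matrix, 2))
print(round(sim_01, 3), sim_02)
```

    Documents that share vocabulary get a cosine close to 1, while documents with no terms in common get 0.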

    Towards Affordable Disclosure of Spoken Word Archives

    This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of the World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition (supporting, e.g., within-document search) are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory and requires additional research.

    Robust audio indexing for Dutch spoken-word collections

    Whereas the growth of storage capacity is in accordance with widely acknowledged predictions, the possibilities to index and access the archives created are lagging behind. This is especially the case in the oral history domain, and much of the rich content in these collections runs the risk of remaining inaccessible for lack of robust search technologies. This paper addresses the history and development of robust audio indexing technology for searching Dutch spoken-word collections and compares Dutch audio indexing in the well-studied broadcast news domain with an oral-history case study. It is concluded that despite significant advances in Dutch audio indexing technology and demonstrated applicability in several domains, further research is indispensable for successful automatic disclosure of spoken-word collections.