10 research outputs found

    Efficient String Dictionary Compression Using String Dictionaries

    A string dictionary is a data structure that stores a set of strings. In recent years, many applications have been reported in which the size of the string dictionary is a critical problem. In response, compressed string dictionaries have been proposed that combine efficient dictionary implementation techniques, such as Trie and Front-Coding, with powerful text compression techniques, such as Re-Pair. In this paper, we aim to improve existing compressed string dictionaries and propose new dictionary structures based on the strategy of using string dictionaries to compress a string dictionary. Experiments on real-world datasets show that the proposed dictionaries can be constructed in a shorter time than the Re-Pair versions while offering a comparable trade-off between space usage and lookup and decoding speed.
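    As a rough illustration of the Front-Coding technique named in this abstract, the sketch below stores each string of a sorted list as the length of the prefix it shares with its predecessor plus the remaining suffix. It is a minimal textbook version, not the structure proposed in the paper; the function names `front_encode` and `front_decode` are hypothetical.

```python
from os.path import commonprefix


def front_encode(sorted_strings):
    """Encode a sorted list of strings as (shared_prefix_length, suffix) pairs."""
    encoded = []
    previous = ""
    for s in sorted_strings:
        # Length of the longest common prefix with the previous string.
        lcp = len(commonprefix([previous, s]))
        encoded.append((lcp, s[lcp:]))
        previous = s
    return encoded


def front_decode(encoded):
    """Rebuild the original sorted list from (shared_prefix_length, suffix) pairs."""
    decoded = []
    previous = ""
    for lcp, suffix in encoded:
        s = previous[:lcp] + suffix
        decoded.append(s)
        previous = s
    return decoded


words = sorted(["tea", "team", "teapot", "ten"])
pairs = front_encode(words)
# pairs == [(0, 'tea'), (3, 'm'), (3, 'pot'), (2, 'n')]
restored = front_decode(pairs)
```

    Practical implementations usually split the list into fixed-size buckets and store the first string of each bucket in full, so that a lookup only decodes one bucket; that detail is omitted here.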

    An Efficient Method of Summarizing Documents Using Impression Measurements

    Automatic generic document summarization based on unsupervised schemes is a very useful approach because it does not require training data. Although techniques using latent semantic analysis (LSA) and non-negative matrix factorization (NMF) have been applied to determine the topics of documents, no research has addressed reducing the matrix size and speeding up the computation of the NMF method. To achieve this, this paper uses generic impressive expressions from newspapers to extract important sentences as a summary. The method therefore requires neither stemming nor stop-word filtering. Novels are typical documents that leave a sentimental impression on readers. Newspapers, in contrast, leave different impressions tied to new knowledge, because they inform readers about current events, informative articles and diverse features. The proposed method introduces impressive expressions for newspapers and applies their measurements to the NMF method. Experiments on 100 KB of text data show that the proposed method reduces the matrix size by 80% and makes the NMF computation 7 times faster than the original method, without degrading the relevancy of the extracted sentences.

    Insights into the genomic evolution of insects from cricket genomes

    Most of our knowledge of insect genomes comes from Holometabolous species, which undergo complete metamorphosis and have genomes typically under 2 Gb with little sign of DNA methylation. In contrast, Hemimetabolous insects undergo the presumed ancestral process of incomplete metamorphosis, and have larger genomes with high levels of DNA methylation. Hemimetabolous species from the Orthopteran order (grasshoppers and crickets) have some of the largest known insect genomes. What drives the evolution of these unusual insect genome sizes remains unknown. Here we report the sequencing, assembly and annotation of the 1.66-Gb genome of the Mediterranean field cricket Gryllus bimaculatus, and the annotation of the 1.60-Gb genome of the Hawaiian cricket Laupala kohalensis. We compare these two cricket genomes with those of 14 additional insects and find evidence that hemimetabolous genomes expanded due to transposable element activity. Based on the ratio of observed to expected CpG sites, we find higher conservation and stronger purifying selection of methylated genes than non-methylated genes. Finally, our analysis suggests an expansion of the pickpocket class V gene family in crickets, which we speculate might play a role in the evolution of cricket courtship, including their characteristic chirping.
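    The ratio of observed to expected CpG sites mentioned in this abstract is commonly computed from base counts alone. The sketch below uses one widely used normalization, (number of CG dinucleotides × sequence length) / (number of C × number of G); whether the paper uses exactly this formula is an assumption, and the function name `cpg_observed_expected` is hypothetical.

```python
def cpg_observed_expected(sequence):
    """Return the observed/expected CpG ratio of a DNA sequence.

    Assumed formula: (count of 'CG' * length) / (count of 'C' * count of 'G').
    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = sequence.upper()
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact
    if c_count == 0 or g_count == 0:
        return 0.0
    return cpg_count * length / (c_count * g_count)


ratio = cpg_observed_expected("ACGTCG")  # 2 CpG, 2 C, 2 G, length 6
```

    A ratio well below 1 is usually read as a sign of historical methylation, since methylated cytosines in CpG sites tend to mutate away over time.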

    A METHOD OF EXTRACTING AND EVALUATING GOOD AND BAD REPUTATIONS FOR NATURAL LANGUAGE EXPRESSIONS

    Although users' opinions and firsthand comments are very useful information for business text mining, it is difficult to extract users' good and bad reputations from texts written in natural language. The good and bad reputations discussed here depend on users' claims, interests and demands. This paper presents a method of determining these reputations in commodity review sentences. A multi-attribute rule is introduced to extract the reputations from sentences, and four-stage rules are defined to evaluate good and bad reputations step by step. A deterministic multi-attribute pattern-matching algorithm is used to determine the reputations efficiently. Simulation results for 2,240 review comments verify that the multi-attribute pattern-matching algorithm is 63.1 times faster than the Aho-Corasick method. The precision and recall of the extracted reputations for each commodity are 94% and 93%, respectively. Moreover, the precision and recall of the resulting reputations for each rule are 95% and 95%, respectively.
    Keywords: good and bad reputations, text mining, natural language understanding, multi-attribute rules, deterministic multi-attribute pattern matching