90 research outputs found

    Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection

    Summarizing multiple Thai documents poses several challenges, since the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and then presents our three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives, unit-graph formation using iterative unit weighting and cosine similarity, and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, a number of experiments are conducted on fifty sets of Thai news articles with manually constructed reference summaries. By three common evaluation measures, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) our TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) the highest-weight priority without centroid preference and unit redundancy consideration helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
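    The iterative unit weighting over a cosine-similarity unit graph described above can be sketched roughly as follows. This is a minimal TextRank-style illustration assuming bag-of-words term frequencies; it is not the paper's exact formulation, and all function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two term-frequency Counters
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def iterative_unit_weights(units, d=0.85, iters=50):
    # units: list of token lists (e.g., TEDUs); edges are cosine similarities
    tf = [Counter(u) for u in units]
    n = len(units)
    sim = [[cosine(tf[i], tf[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    w = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                out = sum(sim[j])  # total outgoing edge weight of unit j
                if sim[j][i] > 0 and out > 0:
                    s += sim[j][i] / out * w[j]
            new.append((1 - d) / n + d * s)
        w = new
    return w
```

    Units that are similar to many other units accumulate weight; a highest-weight-priority selector would then pick units in descending weight order.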

    Effect of Term Weighting on Keyword Extraction in Hierarchical Category Structure

    While there have been several studies on the effect of term weighting on classification accuracy, relatively little work has examined how term weighting affects the quality of keywords extracted to characterize a document or a category (i.e., a document collection). Moreover, many tasks require a more complicated category structure, such as a hierarchical or network category structure, rather than a flat one. This paper presents a qualitative and quantitative study of how term weighting affects keyword extraction in a hierarchical category structure, in comparison to a flat category structure. A hierarchical structure introduces special characteristics when assigning a set of keywords or tags to represent a document or a document collection, supported by statistics from the hierarchy: the category itself, its parent category, its child categories, and its sibling categories. An enhancement of term weighting is proposed, in the form of a series of modified TF-IDF variants, to improve keyword extraction. A text collection of public-hearing opinions is used to evaluate the TF and IDF variants and to identify which types of information in the hierarchical category structure are useful. The experiments show that for the most effective IDF family, namely TF-IDFr, the usefulness ordering is identity > sibling > child > parent. TF-IDFr outperforms the vanilla version of TF-IDF with a centroid-based classifier.
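    One way to picture an IDF that draws on sibling categories, as explored above, is the sketch below. The actual TF-IDFr formula is not given in the abstract, so this is only a hypothetical variant that computes document frequency over the target category plus its siblings; terms shared with siblings are down-weighted, so the surviving keywords discriminate the category from its neighbors in the hierarchy.

```python
import math
from collections import Counter

def tfidf_sibling(category_docs, target, siblings):
    # category_docs: {category: list of token lists}
    # TF comes from the target category; the IDF-like factor counts in how
    # many of {target} + siblings each term appears (hypothetical variant,
    # not the paper's exact TF-IDFr).
    tf = Counter(t for doc in category_docs[target] for t in doc)
    cats = [target] + list(siblings)
    df = Counter()
    for c in cats:
        for t in set(t for doc in category_docs[c] for t in doc):
            df[t] += 1
    n = len(cats)
    return {t: tf[t] * math.log((n + 1) / (df[t] + 1)) for t in tf}
```

    Analogous variants could use the parent or child categories in place of the siblings, which is the kind of comparison the ordering identity > sibling > child > parent summarizes.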

    A Structure-Shared Trie Compression Method


    Khmer Treebank Construction via Interactive Tree Visualization

    Although a number of studies have worked on the Khmer language in the field of Natural Language Processing, along with some resources for word segmentation and POS tagging, we still lack high-level syntactic resources such as Treebanks and grammars. This paper presents a semi-automatic framework for constructing a Khmer Treebank and extracting Khmer grammar rules from a set of sentences taken from Khmer grammar books. These sentences are first manually annotated; once the Treebank is obtained, it is processed to generate a number of grammar rules with their probabilities. In our experiments, the annotated trees and the extracted grammar rules are analyzed both quantitatively and qualitatively. Finally, the results are evaluated in three evaluation settings (Self-Consistency, 5-Fold Cross-Validation, and Leave-One-Out Cross-Validation) using three metrics (Precision, Recall, and F1-Measure). According to these evaluations, Self-Consistency shows the best result with more than 92%, followed by Leave-One-Out Cross-Validation and 5-Fold Cross-Validation with averages of 88% and 75%, respectively. On the other hand, the crossing-bracket data show that Leave-One-Out Cross-Validation holds the highest average with 96%, while the other two reach 85% and 89%, respectively.
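    Extracting probabilistic grammar rules from an annotated Treebank can be illustrated with a small sketch. Assuming trees are nested tuples with string leaves, each rule's probability is its count divided by the count of its left-hand side, the standard PCFG relative-frequency estimate; the paper's exact procedure may differ.

```python
from collections import Counter

def extract_rules(trees):
    # trees: nested tuples like ("S", ("NP", "john"), ("VP", "runs"));
    # leaves are plain strings.
    rule_counts = Counter()
    lhs_counts = Counter()

    def walk(node):
        if isinstance(node, str):
            return
        lhs = node[0]
        # right-hand side: child nonterminal labels, or the word itself
        rhs = tuple(c if isinstance(c, str) else c[0] for c in node[1:])
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
        for c in node[1:]:
            walk(c)

    for t in trees:
        walk(t)
    # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}
```

    Evaluation settings such as 5-fold cross-validation would then split the trees, estimate rules on the training folds, and parse the held-out fold.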

    Design and Synthesis of Potent in Vitro and in Vivo Anticancer Agents Based on 1-(3′,4′,5′-Trimethoxyphenyl)-2-Aryl-1H-Imidazole

    A novel series of tubulin polymerization inhibitors, based on the 1-(3',4',5'-trimethoxyphenyl)-2-aryl-1H-imidazole scaffold and designed as cis-restricted combretastatin A-4 analogues, was synthesized with the goal of evaluating the effects of various substitution patterns on the phenyl at the 2-position of the imidazole ring on biological activity. A chloro and an ethoxy group at the meta- and para-positions, respectively, produced the most active compound in the series (4o), with IC50 values of 0.4-3.8 nM against a panel of seven cancer cell lines. Except in HL-60 cells, 4o had greater antiproliferative activity than CA-4, indicating that the 3'-chloro-4'-ethoxyphenyl moiety is a good surrogate for the CA-4 B-ring. Experiments carried out in a mouse syngeneic model demonstrated the high antitumor activity of 4o, which significantly reduced the tumor mass at a dose thirty times lower than that required for CA-4P, used as a reference compound. Altogether, our findings suggest that 4o is a promising anticancer drug candidate that warrants further preclinical evaluation.

    Learning a Grammar from a Bracketed Corpus

    In this paper, we propose a method to group brackets in a bracketed corpus (with lexical tags) according to their local contextual information, as a first step towards the automatic acquisition of a context-free grammar. Using a bracketed corpus, the learning task is reduced to the problem of determining the nonterminal label of each bracket in the corpus. In the grouping process, a single nonterminal label is assigned to each group of similar brackets. Two techniques, distributional analysis and hierarchical Bayesian clustering, are applied to exploit local contextual information for computing the similarity between two brackets. We also show a technique for determining the appropriate number of bracket groups based on the concept of entropy analysis. Finally, we present a set of experimental results and evaluate them against a model solution given by humans. Key Words: grammar acquisition, distribution analysis, hierarchical Bayesian clustering, local c..
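    The idea of comparing brackets by their local contexts might be sketched as follows: each bracket is summarized by a distribution over (left tag, right tag) context pairs, and a symmetric divergence between those distributions serves as a dissimilarity for clustering. Jensen-Shannon divergence is used here purely as a stand-in for the paper's distributional-analysis and Bayesian-clustering measures; it is an assumption, not the original method.

```python
import math
from collections import Counter

def context_distribution(occurrences):
    # occurrences: list of (left_tag, right_tag) pairs observed around
    # instances of one bracket type
    c = Counter(occurrences)
    total = sum(c.values())
    return {k: v / total for k, v in c.items()}

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric, and finite even when the two
    # distributions have disjoint support (unlike plain KL divergence)
    keys = set(p) | set(q)
    m = {k: (p.get(k, 0) + q.get(k, 0)) / 2 for k in keys}

    def kl(a):
        return sum(a.get(k, 0) * math.log(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)

    return (kl(p) + kl(q)) / 2
```

    Brackets whose context distributions have low divergence would be merged into one group and share a nonterminal label.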

    Passage-Based Web Text Mining

    The large amount of textual information on the Web is a very useful information resource. In the past, traditional text mining research treated a text document as a single piece of information. However, some Web documents are long and heterogeneous in content. This paper presents a new approach that applies the concept of a passage to Web text mining: a single Web text document is treated as several passages instead of a single text. The effectiveness is investigated using real Thai Web documents. As a preliminary step, we explore the influence of the passage-based method on the construction of association rules by comparing the rules generated by the passage-based method with those generated by the non-passage-based method.
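    The passage-as-transaction idea can be sketched as follows. Fixed-length passages stand in for whatever segmentation the paper actually uses, each passage becomes a transaction of terms, and a simple support/confidence filter yields pairwise association rules; all thresholds and names are illustrative.

```python
from collections import Counter
from itertools import combinations

def passage_rules(documents, passage_len=3, min_support=2, min_conf=0.5):
    # documents: list of term lists; each fixed-length slice of a document
    # becomes one transaction (a stand-in for real passage segmentation)
    transactions = []
    for doc in documents:
        for i in range(0, len(doc), passage_len):
            transactions.append(set(doc[i:i + passage_len]))
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        for x in t:
            item_count[x] += 1
        for x, y in combinations(sorted(t), 2):
            pair_count[(x, y)] += 1
    rules = []
    for (x, y), c in pair_count.items():
        if c >= min_support:
            if c / item_count[x] >= min_conf:
                rules.append((x, y, c / item_count[x]))  # x => y
            if c / item_count[y] >= min_conf:
                rules.append((y, x, c / item_count[y]))  # y => x
    return rules
```

    Running the same miner with each whole document as a single transaction gives the non-passage-based baseline the abstract compares against.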

    KEY WORDS: Statistical Parsing, Grammar Acquisition, Clustering Analysis, Local Contextual

    ABSTRACT: This paper proposes a new method for learning a context-sensitive conditional-probability context-free grammar from an unlabeled bracketed corpus based on clustering analysis, and describes a natural language parsing model that uses a probability-based scoring function over the grammar to rank the parses of a sentence. By grouping the brackets in a corpus into a number of similar bracket groups based on their local contextual information, the corpus is automatically labeled with nonterminal labels, and consequently a grammar with conditional probabilities is acquired. The statistical parsing model provides a framework for finding the most likely parse of a sentence based on these conditional probabilities. Experiments using Wall Street Journal data show that our approach achieves relatively high accuracy: 88% recall, 72% precision, and 0.7 crossing brackets per sentence for sentences shorter than 10 words, and 71% recall, 51% precision, and 3.4 crossing brackets for sentences of 10-19 words. This result supports the assumption that local contextual statistics obtained from an unlabeled bracketed corpus are effective for learning a useful grammar and for parsing.
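    Ranking parses by rule probabilities can be illustrated with a minimal CKY search over a PCFG in Chomsky Normal Form. This is a textbook sketch under simplifying assumptions (plain rule probabilities, CNF), not the paper's context-sensitive conditional-probability model.

```python
def cky_best_parse(words, lexicon, rules):
    # lexicon: {(tag, word): prob}; rules: {(lhs, (b, c)): prob}, CNF only.
    # Returns the probability of the best "S" parse and the backpointer table.
    n = len(words)
    best = {}   # (i, j, sym) -> best probability of sym spanning words[i:j]
    back = {}   # backpointers for recovering the tree
    for i, w in enumerate(words):
        for (tag, word), p in lexicon.items():
            if word == w:
                best[(i, i + 1, tag)] = p
                back[(i, i + 1, tag)] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (lhs, (b, c)), p in rules.items():
                    lp = best.get((i, k, b))
                    rp = best.get((k, j, c))
                    if lp and rp:               # probabilities are > 0
                        prob = p * lp * rp
                        if prob > best.get((i, j, lhs), 0.0):
                            best[(i, j, lhs)] = prob
                            back[(i, j, lhs)] = (k, b, c)
    return best.get((0, n, "S"), 0.0), back
```

    With probabilities estimated from the automatically labeled corpus, the highest-scoring entry for the full span gives the most likely parse.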