22 research outputs found

    Improving text classification using local latent semantic indexing

    No full text
    Latent Semantic Indexing (LSI) has been shown to be extremely useful in information retrieval, but it is not an optimal representation for text classification. Applied to the whole training set (global LSI), it always degrades text classification performance, because this completely unsupervised method ignores class discrimination and concentrates only on representation. Some local LSI methods have been proposed to improve classification by utilizing class discrimination information. However, their performance improvements over the original term vectors are still very limited. In this paper, we propose a new local LSI method called “Local Relevancy Weighted LSI” to improve text classification by performing a separate Singular Value Decomposition (SVD) on the transformed local region of each class. Experimental results show that our method is much better than global LSI and traditional local LSI methods on classification while using a much smaller LSI dimension.

    Affinity Rank: A New Scheme for Efficient Web Search

    No full text
    Maximizing only the relevance between queries and documents will not satisfy users who want the top search results to cover a wide range of topics with a few representative documents. In this paper, we propose two new metrics to evaluate the performance of information retrieval: diversity, which measures the topic coverage of a group of documents, and information richness, which measures the amount of information contained in a document. We then present a novel ranking scheme, Affinity Rank, which utilizes these two metrics to improve search results. We demonstrate how Affinity Rank works on a toy data set, and verify our method by experiments on real-world data sets.

    Quantitative Analysis and Stability Study on Iridoid Glycosides from Seed Meal of Eucommia ulmoides Oliver

    No full text
    Eucommia ulmoides Oliver (E. ulmoides Oliv.), used in traditional Chinese medicine, is an important medicinal plant whose bark, male flowers, leaves, and fruits are of high utilization value. The seed meal of E. ulmoides Oliv. is the waste residue produced after oil extraction from its seeds. Though the seed meal is an ideal feed additive, its medicinal value is far from being developed and utilized. We identified six natural iridoid compounds from the seed meal of E. ulmoides Oliv., namely geniposidic acid (GPA), scyphiphin D (SD), ulmoidoside A (UA), ulmoidoside B (UB), ulmoidoside C (UC), and ulmoidoside D (UD). All six compounds were validated to have anti-inflammatory activities. Hence, the six compounds were quantified in the seed meal at the optimum extraction conditions by an established ultra-performance liquid chromatography (UPLC) method. A systematic stability study performed under different temperatures and pH levels uncovered some interesting conversion phenomena among the six tested compounds. GPA was found to be stable. SD, UA, and UC were hydrolyzed only in strong alkaline solution. UB and UD were affected by high temperature, alkaline, and strong acid conditions. Our findings reveal the active compounds and explore their quantitative analysis, contributing to the rational utilization of the seed residue of E. ulmoides Oliv.

    Web-page Classification through Summarization

    No full text
    Web-page classification is much more difficult than pure-text classification due to the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement compared to the pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about a 12.9% improvement over pure-text-based methods.

    Mining ratio rules via principal sparse non-negative matrix factorization

    No full text
    Association rules are traditionally designed to capture statistical relationships among itemsets in a given database. To additionally capture quantitative association knowledge, F. Korn et al. recently proposed a paradigm named Ratio Rules [4] for quantifiable data mining. However, their approach is mainly based on Principal Component Analysis (PCA) and, as a result, it cannot guarantee that the ratio coefficients are non-negative. This may lead to serious problems when the rules are applied. In this paper, we propose a new method, called Principal Sparse Non-Negative Matrix Factorization (PSNMF), for learning the associations between itemsets in the form of Ratio Rules. In addition, we provide a support measurement to weigh the importance of each rule for the entire dataset.

    Improving web search results using affinity graph

    No full text
    In this paper, we propose a novel ranking scheme named Affinity Ranking (AR) to re-rank search results by optimizing two metrics: (1) diversity, which indicates the variance of topics in a group of documents; and (2) information richness, which measures how well a single document covers its topic. Both metrics are calculated from a directed link graph named the Affinity Graph (AG). AG models the structure of a group of documents based on the asymmetric content similarities between each pair of documents. Experimental results on Yahoo! Directory, ODP data, and Newsgroup data demonstrate that our proposed ranking algorithm significantly improves search performance. Specifically, the algorithm achieves a 31% relative improvement in diversity and a 12% relative improvement in information richness within the top 10 search results.
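
    The directed graph built from asymmetric pairwise similarities can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so the normalization used here (dot product divided by the norm of the source document only), the `threshold` parameter, and the names `affinity` and `build_affinity_graph` are all assumptions made for illustration, not the authors' definition.

```python
import math


def affinity(di, dj):
    """Assumed asymmetric affinity of di toward dj: dot(di, dj) / ||di||.

    Documents are sparse term-weight dicts. Normalizing by the source
    document only is an illustrative assumption; it makes
    affinity(di, dj) differ from affinity(dj, di) in general.
    """
    dot = sum(w * dj.get(term, 0.0) for term, w in di.items())
    norm = math.sqrt(sum(w * w for w in di.values()))
    return dot / norm if norm else 0.0


def build_affinity_graph(docs, threshold=0.0):
    """Return {i: {j: weight}} with a directed edge i -> j when the
    affinity exceeds the (hypothetical) threshold."""
    graph = {i: {} for i in range(len(docs))}
    for i, di in enumerate(docs):
        for j, dj in enumerate(docs):
            if i == j:
                continue
            a = affinity(di, dj)
            if a > threshold:
                graph[i][j] = a
    return graph


# Toy data: two related documents and one unrelated document.
docs = [
    {"web": 1.0, "search": 1.0},
    {"web": 2.0, "search": 2.0, "ranking": 1.0},
    {"protein": 1.0},
]
graph = build_affinity_graph(docs)
```

    In this toy example the edge weights between the first two documents differ by direction, and the unrelated third document has no edges. A ranking step such as a random walk over the graph would sit on top of this structure and is not shown.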

    Mathematical

    No full text
    {byzhang, zhengc

    OCFS: Optimal orthogonal centroid feature selection for text categorization

    No full text
    Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion. Moreover, the performance of these greedy methods may deteriorate when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1), and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI with smaller computation time, especially when the reduced dimension is extremely small.
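
    A centroid-based feature score of the kind described can be sketched as follows. The abstract does not state the scoring formula, so this sketch assumes that each feature is scored by the class-size-weighted squared distance between each class centroid and the global centroid, and that the top-k features are kept. The function name `ocfs_select` and the toy data are hypothetical.

```python
def ocfs_select(X, y, k):
    """Score each feature by sum_c (n_c / n) * (mean_c[i] - mean[i])**2
    and return the indices of the k highest-scoring features.

    The scoring rule is an assumption about the Orthogonal Centroid
    objective restricted to feature selection, not a confirmed formula.
    X is a list of dense feature rows; y is the list of class labels.
    """
    n = len(X)
    d = len(X[0])
    global_mean = [sum(row[i] for row in X) / n for i in range(d)]
    scores = [0.0] * d
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        n_c = len(rows)
        for i in range(d):
            mean_c = sum(r[i] for r in rows) / n_c
            scores[i] += (n_c / n) * (mean_c - global_mean[i]) ** 2
    ranked = sorted(range(d), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data: features 0 and 1 separate the classes, feature 2 is constant.
X = [
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.5],
    [0.0, 1.0, 0.5],
    [0.1, 0.9, 0.5],
]
y = [0, 0, 1, 1]
selected, scores = ocfs_select(X, y, k=2)
```

    Each score depends only on per-class and global means, so the selection needs one pass over the data per class and no iterative search, which is consistent with the low computation time the abstract reports.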