17 research outputs found

    Latin Etymologies as Features on BNC Text Categorization

    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    MVMR-FS : Non-parametric feature selection algorithm based on Maximum inter-class Variation and Minimum Redundancy

    How to accurately measure the relevance and redundancy of features is an age-old challenge in the field of feature selection. However, existing filter-based feature selection methods cannot directly measure redundancy for continuous data. In addition, most methods rely on manually specifying the number of features, which may introduce errors in the absence of expert knowledge. In this paper, we propose a non-parametric feature selection algorithm based on maximum inter-class variation and minimum redundancy, abbreviated as MVMR-FS. We first introduce supervised and unsupervised kernel density estimation on the features to capture their similarities and differences in inter-class and overall distributions. Subsequently, we present the criteria for maximum inter-class variation and minimum redundancy (MVMR), wherein the inter-class probability distributions are employed to reflect feature relevance and the distances between overall probability distributions are used to quantify redundancy. Finally, we employ an adaptive genetic algorithm (AGA) to search for the feature subset that best satisfies the MVMR criteria. Compared with ten state-of-the-art methods, MVMR-FS achieves the highest average accuracy and improves accuracy by 5% to 11%.
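The two criteria described in the abstract can be sketched in a few lines: class-conditional kernel density estimates measure how differently a feature is distributed across classes (relevance), and the distance between two features' overall densities measures their redundancy. This is a rough illustration only; the function names, the bandwidth, and the L1-style distance are assumptions, not the paper's exact formulation.

```python
import numpy as np

def kde(samples, grid, bw=0.5):
    # Gaussian kernel density estimate of `samples`, evaluated on `grid`.
    z = (grid[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (bw * np.sqrt(2 * np.pi))

def interclass_variation(x, y, grid_size=100):
    # Relevance proxy: how much the class-conditional densities of a
    # feature differ (larger means more discriminative).
    grid = np.linspace(x.min(), x.max(), grid_size)
    densities = np.vstack([kde(x[y == c], grid) for c in np.unique(y)])
    return densities.var(axis=0).mean()

def redundancy(x1, x2, grid_size=100):
    # Redundancy proxy: closeness of two features' overall densities
    # (closer distributions -> higher redundancy).
    lo, hi = min(x1.min(), x2.min()), max(x1.max(), x2.max())
    grid = np.linspace(lo, hi, grid_size)
    dist = np.abs(kde(x1, grid) - kde(x2, grid)).mean()
    return 1.0 / (1.0 + dist)
```

A feature whose per-class densities are well separated scores high on the first criterion, while a near-duplicate of an already chosen feature scores high on the second and would be penalized by the search.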

    Conditional Dynamic Mutual Information-Based Feature Selection

    With the emergence of new techniques, data in many fields are growing ever larger, especially in dimensionality. The high dimensionality of data may pose great challenges to traditional learning algorithms. In fact, many features in large volumes of data are redundant and noisy. Their presence not only degrades the performance of learning algorithms, but also confuses end-users in the post-analysis process. Thus, it is necessary to eliminate irrelevant features from data before they are fed into learning algorithms. Many endeavors have been made in this field, and many outstanding feature selection methods have been developed. Among the different evaluation criteria, mutual information has been widely used in feature selection because of its good capability of quantifying the uncertainty of features in classification tasks. However, mutual information estimated on the whole dataset cannot exactly represent the correlation between features. To cope with this issue, in this paper we first re-estimate mutual information dynamically on identified instances, and then introduce a new feature selection method based on conditional mutual information. Performance evaluations on sixteen UCI datasets show that our proposed method achieves performance comparable to other well-established feature selection algorithms in most cases.
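A generic greedy selector driven by conditional mutual information (in the spirit of CMIM-style methods, not the dynamic re-estimation scheme proposed in this paper) can be sketched for discrete features as follows; the function names and the min-over-selected scoring rule are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    # Joint entropy of one or more discrete columns.
    joint = list(zip(*cols))
    n = len(joint)
    return -sum((c / n) * np.log2(c / n) for c in Counter(joint).values())

def cond_mi(x, y, z):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)

def cmim_select(X, y, k):
    # Greedy selection: first take the feature with maximum MI with the
    # class, then repeatedly add the feature whose worst-case conditional
    # MI given any already-selected feature is largest.
    const = np.zeros(len(y), int)  # conditioning on a constant gives plain MI
    selected = [int(np.argmax([cond_mi(X[:, j], y, const)
                               for j in range(X.shape[1])]))]
    while len(selected) < k:
        scores = [-np.inf if j in selected
                  else min(cond_mi(X[:, j], y, X[:, s]) for s in selected)
                  for j in range(X.shape[1])]
        selected.append(int(np.argmax(scores)))
    return selected
```

Conditioning is what lets such a selector keep a feature that is useless on its own but informative jointly (e.g. the second input of an XOR-like target), which plain marginal MI would discard.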

    Natural Image Statistics and Low-Complexity Feature Selection


    Task-based user profiling for query refinement (toque)

    The information needs of search engine users vary in complexity. Some simple needs can be satisfied by using a single query, while complicated ones require a series of queries spanning a period of time. A search task, consisting of a sequence of search queries serving the same information need, can be treated as an atomic unit for modeling a user’s search preferences and has been applied to improving the accuracy of search results. However, existing studies on user search tasks mainly focus on applying users’ interests to re-ranking search results. Only a few studies have examined the effects of utilizing search tasks to assist users in obtaining effective queries. Moreover, fewer studies have examined the dynamic characteristics of a user’s search interests within a search task. Furthermore, even fewer have examined approaches to selective personalization, applied only to the candidate refined queries that are expected to benefit from it. This study proposes a framework for modeling a user’s task-based dynamic search interests to address these issues and makes the following contributions. First, task identification: a cross-session method is proposed to discover tasks by modeling the best-link structure of queries, based on commonly shared clicked results. A graph-based representation method is introduced to improve the effectiveness of link prediction in a query sequence. Second, dynamic task-level search interest representation: a four-tuple user profiling model is introduced to represent long- and short-term user interests extracted from search tasks and sessions. It models user interests at the task level to re-rank candidate queries through modules of task identification and update. Third, selective personalization: a two-step personalization algorithm is proposed to improve the rankings of candidate queries for query refinement by assessing task dependency via a latent task space.
Experimental results show that the proposed TOQUE framework contributes to increased precision of candidate queries and thus shorter search sessions.
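The task-identification step can be approximated with a simple sketch: treat queries as belonging to the same task when their clicked-result sets overlap sufficiently. This is a deliberate simplification of the best-link structure described above; the Jaccard measure, the threshold value, and the function names are assumptions.

```python
from itertools import combinations

def group_queries_into_tasks(clicks, threshold=0.3):
    # clicks: dict mapping query string -> set of clicked URLs.
    # Returns a list of tasks, each a set of queries, built by linking
    # query pairs whose clicked-result overlap exceeds the threshold.
    queries = list(clicks)
    parent = {q: q for q in queries}  # union-find over queries

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    for q1, q2 in combinations(queries, 2):
        if jaccard(clicks[q1], clicks[q2]) >= threshold:
            parent[find(q1)] = find(q2)  # merge the two tasks

    tasks = {}
    for q in queries:
        tasks.setdefault(find(q), set()).add(q)
    return list(tasks.values())
```

Because linking is transitive through the union-find, two queries with no directly shared clicks can still land in the same task via an intermediate query, which mirrors how tasks span sessions.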

    Modeling multivariate financial time series based on correlation clustering.

    Zhou, Tu. Thesis (M.Phil.), Chinese University of Hong Kong, 2008. Includes bibliographical references (leaves 61-70). Abstracts in English and Chinese.

    Contents:
    Chapter 1: Introduction
        1.1 Motivation and Objective
        1.2 Major Contribution
        1.3 Thesis Organization
    Chapter 2: Measurement of Relationship between Financial Time Series
        2.1 Linear Correlation
            2.1.1 Pearson Correlation Coefficient
            2.1.2 Rank Correlation
        2.2 Mutual Information
            2.2.1 Approaches of Mutual Information Estimation
        2.3 Copula
        2.4 Analysis from Experimental Data
            2.4.1 Experiment 1: Nonlinearity
            2.4.2 Experiment 2: Sensitivity of Outliers
            2.4.3 Experiment 3: Transformation Invariance
        2.5 Chapter Summary
    Chapter 3: Clustered Dynamic Conditional Correlation Model
        3.1 Background Review
            3.1.1 GARCH Model
            3.1.2 Multivariate GARCH Model
        3.2 DCC Multivariate GARCH Models
            3.2.1 DCC GARCH Model
            3.2.2 Generalized DCC GARCH Model
            3.2.3 Block-DCC GARCH Model
        3.3 Clustered DCC GARCH Model
            3.3.1 Minimum Distance Estimation (MDE)
            3.3.2 Clustered DCC (CDCC) based on MDE
        3.4 Clustering Method Selection
        3.5 Model Estimation and Testing Method
            3.5.1 Maximum Likelihood Estimation
            3.5.2 Box-Pierce Statistic Test
        3.6 Chapter Summary
    Chapter 4: Experimental Results and Applications of CDCC
        4.1 Model Comparison and Analysis
        4.2 Portfolio Selection Application
        4.3 Value at Risk Application
        4.4 Chapter Summary
    Chapter 5: Conclusion
    Bibliography

    Improving Feature Selection Techniques for Machine Learning

    As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We proposed a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features, i.e., a wrapper method. We call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that ours is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We proposed a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distributed most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.
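The wrapper idea behind a GA-based framework like HGFS can be illustrated with a minimal genetic algorithm over binary feature masks. This is a sketch only: the operators, parameter values, and the `evaluate` callback (which would wrap the target learner's cross-validated accuracy, possibly minus a subset-size penalty) are assumptions, not the HGFS framework itself.

```python
import numpy as np

def ga_feature_select(n_features, evaluate, pop_size=20, generations=30,
                      mutation_rate=0.05, seed=0):
    # Evolve boolean masks over features; `evaluate(mask)` returns the
    # fitness assigned by the target learning algorithm (wrapper style).
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_features)) < 0.5     # random initial masks
    for _ in range(generations):
        fitness = np.array([evaluate(ind) for ind in pop])
        order = np.argsort(fitness)[::-1]
        parents = pop[order[: pop_size // 2]]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)          # one-point crossover
            child = np.concatenate([p1[:cut], p2[cut:]])
            child ^= rng.random(n_features) < mutation_rate  # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    fitness = np.array([evaluate(ind) for ind in pop])
    return pop[int(np.argmax(fitness))]
```

Because the top half of the population is carried over unchanged, the best mask found so far is never lost, and any fitness function can be plugged in without changing the search.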