312,753 research outputs found

    New techniques for Arabic document classification

    Text classification (TC) concerns automatically assigning a class (category) label to a text document, and has increasingly many applications, particularly in organizing large document collections for browsing. It is typically achieved via machine learning, where a model is built on the basis of a typically large collection of document features. Feature selection is critical in this process, since there are typically several thousand potential features (distinct words or terms). In text classification, feature selection aims to improve computational efficiency and classification accuracy by removing irrelevant and redundant terms (features), while retaining features (words) that contain sufficient information to help with the classification task. This thesis proposes binary particle swarm optimization (BPSO) hybridized with either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature selection in Arabic text classification tasks. Feature selection approaches are compared by using the selected features in conjunction with SVM, Decision Trees (C4.5), and Naive Bayes (NB) to classify a held-out test set. Using publicly available Arabic datasets, results show that the BPSO/KNN and BPSO/SVM techniques are promising in this domain. The sets of selected features (words) are also analyzed to consider the differences between the types of features that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning the appropriate feature selection strategy, based on the relationship between the classes in the document categorization task at hand. The thesis also investigates the use of statistically extracted phrases of length two as terms in Arabic text classification. In comparison with the Bag of Words text representation, results show that using phrases alone as terms in an Arabic TC task decreases the classification accuracy of Arabic TC classifiers significantly, while combining bag-of-words and phrase-based representations may increase the classification accuracy of the SVM classifier slightly.
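
The BPSO wrapper idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the inertia and acceleration constants, the velocity clamp, and the function names are assumptions. The thesis scores a feature subset by KNN or SVM accuracy, whereas this sketch uses a hypothetical toy fitness so that it stays self-contained.

```python
import math
import random

# Hypothetical stand-in for the wrapper fitness. In the thesis the fitness
# would be KNN or SVM accuracy on the selected features; here features 0 and 2
# are declared "informative" and every selected feature costs a small penalty.
INFORMATIVE = {0, 2}

def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

def bpso_select(fitness, n_features, n_particles=20, n_iters=30, seed=0,
                w=0.7, c1=1.5, c2=1.5, v_max=4.0):
    """Binary PSO: each particle is a 0/1 mask over the features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                # Standard PSO velocity update, clamped to [-v_max, v_max].
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))
                vel[i][d] = v
                # Binary step: a sigmoid turns velocity into P(bit = 1).
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

mask, fit = bpso_select(toy_fitness, n_features=5)
print("selected mask:", mask, "fitness:", fit)
```

To move toward the setting described in the abstract, `toy_fitness` would be replaced by a function that trains a classifier on the masked term matrix and returns its validation accuracy.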

    From Review to Rating: Exploring Dependency Measures for Text Classification

    Various text analysis techniques exist, which attempt to uncover unstructured information from text. In this work, we explore using statistical dependence measures for textual classification, representing text as word vectors. Student satisfaction scores on a 3-point scale and their free text comments written about university subjects are used as the dataset. We have compared two textual representations: a frequency word representation and term frequency relationship to word vectors, and found that word vectors provide a greater accuracy. However, these word vectors have a large number of features, which aggravates the burden of computational complexity. Thus, we explored using a non-linear dependency measure for feature selection by maximizing the dependence between the text reviews and corresponding scores. Our quantitative and qualitative analysis on a student satisfaction dataset shows that our approach achieves comparable accuracy to the full feature vector, while being an order of magnitude faster in testing. These text analysis and feature reduction techniques can be used for other textual data applications such as sentiment analysis. Comment: 8 pages
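
Dependence-based feature selection of the kind described above can be sketched as follows. The abstract does not name its non-linear dependency measure, so this sketch assumes the Hilbert-Schmidt Independence Criterion (HSIC) with a Gaussian kernel on each feature and a delta kernel on the labels; the function names, the kernel width, and the toy data are illustrative assumptions, not the paper's setup.

```python
import math

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC, tr(K H L H) / (n - 1)^2, for one feature."""
    n = len(xs)
    # Gaussian kernel on the feature values.
    K = [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]
    # Delta kernel on the labels: 1 if two samples share a score, else 0.
    L = [[1.0 if a == b else 0.0 for b in ys] for a in ys]
    row_mean = [sum(r) / n for r in K]
    total_mean = sum(row_mean) / n
    acc = 0.0
    for i in range(n):
        for j in range(n):
            # (HKH)_ij is K double-centered; K is symmetric, so column
            # means equal row means.
            centered = K[i][j] - row_mean[i] - row_mean[j] + total_mean
            acc += centered * L[i][j]
    return acc / (n - 1) ** 2

def rank_features(columns, ys):
    """Return feature indices sorted by decreasing HSIC with the labels."""
    scores = [hsic(col, ys) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy data (made up): feature 0 is constant, feature 1 tracks the score.
labels = [0, 0, 1, 1, 2, 2]
columns = [[1, 1, 1, 1, 1, 1], [0, 0, 1, 1, 2, 2]]
order, scores = rank_features(columns, labels)
print(order, scores)
```

Keeping only the top-ranked columns yields the reduced feature vector; a constant column scores zero because its centered kernel matrix vanishes.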

    Attribute Selection for Classification

    The selection of attributes used to construct a classification model is crucial in machine learning, in particular with instance similarity methods. We present a new algorithm to select and rank attributes based on weighting features according to their ability to help class prediction. The algorithm uses the same structure that holds training records for classification. Attribute values and their classes are projected into a one-dimensional space, to account for various degrees of the relationship between them. With the user deciding on the degree of this relation, any of several potential solutions can be used as the criterion to determine attribute relevance. This low-complexity algorithm increases classification predictive accuracy and also helps to reduce the feature dimension problem.

    Multi-modal feature selection with self-expression topological manifold for end-stage renal disease associated with mild cognitive impairment

    Effectively selecting discriminative brain regions in multi-modal neuroimages is one way to reveal the neuropathological mechanism of end-stage renal disease associated with mild cognitive impairment (ESRDaMCI). Existing multi-modal feature selection methods usually depend on the Euclidean distance to measure the similarity between data, which tends to ignore the implied data manifold. A self-expression topological manifold based multi-modal feature selection method (SETMFS) is proposed to address this issue. First, a dynamic brain functional network is established using functional magnetic resonance imaging (fMRI), after which the betweenness centrality is extracted. The feature matrix of fMRI is constructed based on this centrality measure. Second, the feature matrix of arterial spin labeling (ASL) is constructed by extracting the cerebral blood flow (CBF). Then, for each of the two feature matrices, a topological relationship matrix is constructed by calculating the topological relationship between each pair of data points, to measure the intrinsic similarity between the features. Subsequently, graph regularization is used to embed the self-expression model into topological manifold learning to identify the linear self-expression of the features. Finally, the selected well-represented feature vectors are fed into a multi-kernel support vector machine (MKSVM) for classification. The experimental results show that the classification performance of SETMFS is significantly superior to several state-of-the-art feature selection methods; in particular, its classification accuracy reaches 86.10%, which is at least 4.34% higher than other comparable methods. This method fully considers the topological correlation between the multi-modal features and provides a reference for ESRDaMCI auxiliary diagnosis.

    Exclusive lasso-based k-nearest-neighbor classification

    Conventionally, k-nearest-neighbor (kNN) classification is implemented with Euclidean distance-based measures, which mainly capture one-to-one similarity relationships and so lose the connections between different samples. As a strategy to alleviate this issue, the coefficients coded by sparse representation (SR) have also served as a similarity gauge for nearest-neighbor classification. Although SR coefficients are highly discriminative as a one-to-many relationship, SR carries out variable selection at the individual level, so that any inherent group structure is ignored. In order to make the most of the information implied in the group structure, this paper employs the exclusive lasso (EL) strategy to perform the similarity evaluation in two novel nearest-neighbor classification methods. Experimental results on both benchmark data sets and the face recognition problem demonstrate that the EL-based kNN method outperforms certain state-of-the-art classification techniques and existing representation-based nearest-neighbor approaches, in terms of both the size of feature reduction and the classification accuracy.
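
The abstract does not state its objective function, so the following is only a sketch of the exclusive lasso penalty as it is commonly defined: the sum, over groups, of the squared L1 norm of the coefficients in that group. The function name, the grouping, and the coefficient values are illustrative assumptions.

```python
def exclusive_lasso_penalty(weights, groups):
    """Sum over groups of (sum of |w_i| within the group) squared."""
    return sum(sum(abs(weights[i]) for i in group) ** 2 for group in groups)

# Two hypothetical groups of coefficients, e.g. training samples grouped
# by class: indices 0-1 form one group and indices 2-3 the other.
groups = [[0, 1], [2, 3]]

# Both vectors have the same L1 norm (2.0), but the penalty differs.
concentrated = [1.0, 1.0, 0.0, 0.0]  # all mass in one group
spread = [1.0, 0.0, 1.0, 0.0]        # one active coefficient per group

print(exclusive_lasso_penalty(concentrated, groups))  # 4.0
print(exclusive_lasso_penalty(spread, groups))        # 2.0
```

Because the L1 norm is squared per group, coefficients in the same group compete with each other, which favors a few active coefficients within each group. A plain L1 penalty would score both vectors identically.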

    Predicting students' happiness from physiology, phone, mobility, and behavioral data

    In order to model students' happiness, we apply machine learning methods to data collected from undergrad students monitored over the course of one month each. The data collected include physiological signals, location, smartphone logs, and survey responses to behavioral questions. Each day, participants reported their wellbeing on measures including stress, health, and happiness. Because of the relationship between happiness and depression, modeling happiness may help us to detect individuals who are at risk of depression and guide interventions to help them. We are also interested in how behavioral factors (such as sleep and social activity) affect happiness positively and negatively. A variety of machine learning and feature selection techniques are compared, including Gaussian Mixture Models and ensemble classification. We achieve 70% classification accuracy of self-reported happiness on held-out test data. Funding: MIT Media Lab Consortium; Robert Wood Johnson Foundation (Wellbeing Initiative); National Institutes of Health (U.S.) (Grant R01GM105018); Samsung (Firm); Natural Sciences and Engineering Research Council of Canada

    Various Feature Selection Techniques in Type 2 Diabetic Patients for the Prediction of Cardiovascular Disease

    Cardiovascular disease (CVD) is a serious but preventable complication of type 2 diabetes mellitus (T2DM) that results in substantial disease burden, increased health services use, and higher risk of premature mortality [10]. People with diabetes are also at a greatly increased risk of cardiovascular disease resulting in sudden death, a risk that increases year by year. Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. Medical databases of type 2 diabetic patients are usually high dimensional. If a training dataset contains irrelevant and redundant features (i.e., attributes), classification analysis may produce less accurate results. For data mining algorithms to perform efficiently and effectively on high-dimensional data, it is imperative to remove irrelevant and redundant features. Feature selection is one of the important and frequently used data preprocessing techniques for data mining applications in medicine. Much data mining research has improved the predictive accuracy of classifiers by applying various feature selection techniques. This paper illustrates how applying feature selection to medical databases makes it possible to find a small number of informative features, leading to potential improvement in medical diagnosis. It is proposed to find an optimal feature subset of the PIMA Indian Diabetes Dataset using the Artificial Bee Colony technique with Differential Evolution, the Symmetrical Uncertainty Attribute Set Evaluator, and the Fast Correlation-Based Filter (FCBF). Mutual-information-based feature selection is then performed using normalized mutual information feature selection (NMIFS), and valid classes of input features are selected by applying the Hybrid Fuzzy C Means algorithm (HFCM).
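
Symmetrical uncertainty, the relevance score that FCBF is built on, is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below computes it for discrete features. It is not the paper's pipeline: the function names and the toy values are assumptions, and continuous attributes such as those in the PIMA dataset would need to be discretized first.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetrical_uncertainty(xs, ys):
    """SU in [0, 1]: 0 for independent variables, 1 for a perfect match."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    return 2.0 * mutual_information(xs, ys) / (hx + hy)

# Toy values (made up): one feature copies the class, one is independent.
diagnosis = [0, 0, 1, 1]
copy_feature = [0, 0, 1, 1]
independent_feature = [0, 1, 0, 1]

print(symmetrical_uncertainty(copy_feature, diagnosis))         # 1.0
print(symmetrical_uncertainty(independent_feature, diagnosis))  # 0.0
```

A filter in the style of FCBF would rank features by their SU with the class and then drop features that are more strongly associated with an already selected feature than with the class.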

    An ensemble of machine learning and anti-learning methods for predicting tumour patient survival rates

    This paper primarily addresses a dataset relating to cellular, chemical and physical conditions of patients gathered at the time they are operated upon to remove colorectal tumours. This data provides a unique insight into the biochemical and immunological status of patients at the point of tumour removal, along with information about tumour classification and post-operative survival. The relationship between severity of tumour, based on TNM staging, and survival is still unclear for patients with TNM stage 2 and 3 tumours. We ask whether it is possible to predict survival rate more accurately using a selection of machine learning techniques applied to subsets of data to gain a deeper understanding of the relationships between a patient's biochemical markers and survival. We use a range of feature selection and single classification techniques to predict the 5-year survival rate of TNM stage 2 and 3 patients, which initially produces less than ideal results. The performance of each model individually is then compared with subsets of the data where agreement is reached for multiple models. This novel method of selective ensembling demonstrates that significant improvements in model accuracy on an unseen test set can be achieved for patients where agreement between models is achieved. Finally, we point at a possible method to identify whether a patient's prognosis can be accurately predicted or not.
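
The selective ensembling idea above, scoring models only on the cases where they agree, can be sketched in a few lines. The predictions and labels below are made-up toy values, not results from the paper, and the function names are illustrative assumptions.

```python
def agreement_indices(predictions):
    """Indices of samples on which every model predicts the same label."""
    n_samples = len(predictions[0])
    return [i for i in range(n_samples)
            if len({model[i] for model in predictions}) == 1]

def accuracy(pred, truth, indices):
    """Accuracy of one model restricted to the given sample indices."""
    if not indices:
        return None  # no samples to score
    return sum(pred[i] == truth[i] for i in indices) / len(indices)

# Hypothetical survival labels and predictions from three models.
truth   = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 1]
model_b = [1, 0, 0, 1, 0, 1]
model_c = [1, 0, 1, 1, 0, 1]

agreed = agreement_indices([model_a, model_b, model_c])
coverage = len(agreed) / len(truth)
overall = accuracy(model_a, truth, list(range(len(truth))))
on_agreed = accuracy(model_a, truth, agreed)
print(agreed, coverage, overall, on_agreed)
```

Reporting `coverage` next to the accuracy on the agreed subset matters: any accuracy gain applies only to the patients for whom the models agree, and the remaining patients are the ones flagged as hard to predict.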