New techniques for Arabic document classification
Text classification (TC) concerns automatically assigning a class (category) label to
a text document, and has increasingly many applications, particularly in the domain
of organizing and browsing large document collections. It is typically achieved
via machine learning, where a model is built from a large collection
of document features. Feature selection is critical in this process, since there
are typically several thousand potential features (distinct words or terms). In text
classification, feature selection aims to improve computational efficiency and
classification accuracy by removing irrelevant and redundant terms (features), while
retaining features (words) that contain sufficient information to help with the
classification task.
This thesis proposes binary particle swarm optimization (BPSO) hybridized with
either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature
selection in Arabic text classification tasks. Feature selection approaches are
compared by using the selected features in conjunction with
SVM, Decision Trees (C4.5), and Naive Bayes (NB) to classify a hold-out test
set. Using publicly available Arabic datasets, results show that the BPSO/KNN and
BPSO/SVM techniques are promising in this domain. The sets of selected features
(words) are also analyzed to consider the differences between the types of features
that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning
the appropriate feature selection strategy, based on the relationship between
the classes in the document categorization task at hand.
The thesis also investigates the use of statistically extracted phrases of length
two as terms in Arabic text classification. In comparison with the Bag of Words text
representation, results show that using phrases alone as terms in Arabic TC tasks
significantly decreases classification accuracy, while combining
bag of words and phrase-based representations may slightly increase the classification
accuracy of the SVM classifier.
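The wrapper idea in this abstract (BPSO proposing feature subsets, a classifier scoring them) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the thesis's implementation: the fitness function (leave-one-out 1-NN accuracy), the swarm parameters `w`, `c1`, `c2`, the velocity clamp, and the toy data are all hypothetical.

```python
import math
import random

def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out 1-NN accuracy using only features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        best_d, best_label = math.inf, None
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if d < best_d:
                best_d, best_label = d, y[k]
        correct += best_label == y[i]
    return correct / len(X)

def bpso_select(X, y, n_particles=8, n_iters=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: each particle is a 0/1 mask over features (assumed settings)."""
    rng = random.Random(seed)
    n = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_particles)]
    vel = [[0.0] * n for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [loo_1nn_accuracy(X, y, p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for j in range(n):
                vel[i][j] = (w * vel[i][j]
                             + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                             + c2 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-6.0, min(6.0, vel[i][j]))  # clamp velocity
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))   # sigmoid transfer
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = loo_1nn_accuracy(X, y, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Hypothetical toy data: feature 0 tracks the class, features 1-3 are noise.
data_rng = random.Random(1)
X, y = [], []
for label in (0, 1):
    for _ in range(10):
        X.append([label + data_rng.gauss(0, 0.1)]
                 + [data_rng.gauss(0, 1) for _ in range(3)])
        y.append(label)

mask, fit = bpso_select(X, y)
print(mask, fit)
```

In a text setting each bit would correspond to a term in the vocabulary, and the fitness could equally be computed with an SVM instead of a nearest-neighbour classifier.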
From Review to Rating: Exploring Dependency Measures for Text Classification
Various text analysis techniques exist, which attempt to uncover unstructured
information from text. In this work, we explore using statistical dependence
measures for textual classification, representing text as word vectors. Student
satisfaction scores on a 3-point scale and their free text comments written
about university subjects are used as the dataset. We have compared two textual
representations: a frequency word representation and term frequency
relationship to word vectors, and found that word vectors provide a greater
accuracy. However, these word vectors have a large number of features which
aggravates the burden of computational complexity. Thus, we explored using a
non-linear dependency measure for feature selection by maximizing the
dependence between the text reviews and corresponding scores. Our quantitative
and qualitative analysis on a student satisfaction dataset shows that our
approach achieves comparable accuracy to the full feature vector, while being
an order of magnitude faster in testing. These text analysis and feature
reduction techniques can be used for other textual data applications such as
sentiment analysis.
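The abstract does not name its non-linear dependency measure. As an illustration only, the sketch below ranks features by the Hilbert-Schmidt Independence Criterion (HSIC) with Gaussian kernels, which is one common choice and is swapped in here as an assumption. The kernel width `sigma`, the biased estimator with 1/n^2 normalization, and the toy data are likewise hypothetical.

```python
import math

def gaussian_kernel_matrix(values, sigma=1.0):
    """Pairwise Gaussian kernel between scalar values."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]

def center(K):
    """Double-center a symmetric kernel matrix (equivalent to H K H)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / n^2."""
    n = len(x)
    Kc = center(gaussian_kernel_matrix(x, sigma))
    L = gaussian_kernel_matrix(y, sigma)
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2

def rank_features(X, y):
    """Score each feature column against the labels; highest dependence first."""
    n_features = len(X[0])
    scores = [hsic([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores

# Hypothetical toy data: 3-point scores; feature 0 follows the score,
# feature 1 is constant, feature 2 alternates independently of the score.
X = [[1, 5, 0], [1, 5, 1], [2, 5, 0], [2, 5, 1], [3, 5, 0], [3, 5, 1]]
y = [1, 1, 2, 2, 3, 3]
order, scores = rank_features(X, y)
print(order, scores)
```

Keeping only the top-ranked columns is the feature reduction step; the classifier is then trained on the reduced vectors.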
Attribute Selection for Classification
The selection of attributes used to construct a classification model is crucial in machine learning, in particular with instance similarity methods. We present a new algorithm to select and rank attributes based on weighting features according to their ability to help class prediction. The algorithm uses the same structure that holds training records for classification. Attribute values and their classes are projected into a one-dimensional space, to account for various degrees of the relationship between them. With the user deciding on the degree of this relation, any of several potential solutions can be used as the criterion to determine attribute relevance. This low-complexity algorithm increases classification predictive accuracy and also helps to reduce the feature dimension problem.
Multi-modal feature selection with self-expression topological manifold for end-stage renal disease associated with mild cognitive impairment
Selecting discriminative brain regions in multi-modal neuroimages is an effective means of revealing the neuropathological mechanism of end-stage renal disease associated with mild cognitive impairment (ESRDaMCI). Existing multi-modal feature selection methods usually depend on the Euclidean distance to measure the similarity between data, which tends to ignore the implied data manifold. A self-expression topological manifold based multi-modal feature selection method (SETMFS) is proposed to address this issue. First, a dynamic brain functional network is established using functional magnetic resonance imaging (fMRI), after which the betweenness centrality is extracted. The feature matrix of fMRI is constructed based on this centrality measure. Second, the feature matrix of arterial spin labeling (ASL) is constructed by extracting the cerebral blood flow (CBF). Then, a topological relationship matrix is constructed for each of the two feature matrices by calculating the topological relationship between each pair of data points, in order to measure the intrinsic similarity between the features. Subsequently, graph regularization is used to embed the self-expression model into topological manifold learning to identify the linear self-expression of the features. Finally, the selected well-represented feature vectors are fed into a multi-kernel support vector machine (MKSVM) for classification. The experimental results show that the classification performance of SETMFS is significantly superior to several state-of-the-art feature selection methods; in particular, its classification accuracy reaches 86.10%, which is at least 4.34% higher than other comparable methods. This method fully considers the topological correlation between the multi-modal features and provides a reference for ESRDaMCI auxiliary diagnosis.
Exclusive lasso-based k-nearest-neighbor classification
Conventionally, k-nearest-neighbor (kNN) classification is implemented with Euclidean distance-based measures, which capture mainly one-to-one similarity relationships and therefore lose the connections between different samples. As a strategy to alleviate this issue, the coefficients produced by sparse representation (SR) have also been used as a similarity measure for nearest-neighbor classification. Although SR coefficients are remarkably discriminative as a one-to-many relationship, SR carries out variable selection at the individual level, so any inherent group structure is ignored. In order to make the most of the information implied in the group structure, this paper employs the exclusive lasso (EL) strategy to perform the similarity evaluation in two novel nearest-neighbor classification methods. Experimental results on both benchmark data sets and the face recognition problem demonstrate that the EL-based kNN method outperforms certain state-of-the-art classification techniques and existing representation-based nearest-neighbor approaches, in terms of both the size of feature reduction and the classification accuracy.
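For readers unfamiliar with the exclusive lasso, its penalty is commonly written as the sum over groups of the squared L1 norm of the coefficients in each group. The short sketch below computes that penalty; the function name, the grouping, and the example weights are hypothetical, and how the paper plugs the penalty into its two kNN variants is not stated in the abstract.

```python
def exclusive_lasso_penalty(weights, groups):
    """Sum over groups of (L1 norm of the group's coefficients) squared.

    `groups` is a list of index lists partitioning the coefficient vector.
    """
    return sum(sum(abs(weights[i]) for i in group) ** 2 for group in groups)

# Hypothetical example: four coefficients split into two groups.
weights = [1.0, -2.0, 0.0, 3.0]
groups = [[0, 1], [2, 3]]
penalty = exclusive_lasso_penalty(weights, groups)
print(penalty)  # (1 + 2)^2 + (0 + 3)^2
```

Because the L1 norm acts inside each group and the square acts across groups, the penalty tends to promote sparsity within a group while discouraging any whole group from being zeroed out.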
Predicting students' happiness from physiology, phone, mobility, and behavioral data
In order to model students' happiness, we apply machine learning methods to data collected from undergrad students monitored over the course of one month each. The data collected include physiological signals, location, smartphone logs, and survey responses to behavioral questions. Each day, participants reported their wellbeing on measures including stress, health, and happiness. Because of the relationship between happiness and depression, modeling happiness may help us to detect individuals who are at risk of depression and guide interventions to help them. We are also interested in how behavioral factors (such as sleep and social activity) affect happiness positively and negatively. A variety of machine learning and feature selection techniques are compared, including Gaussian Mixture Models and ensemble classification. We achieve 70% classification accuracy of self-reported happiness on held-out test data.
Sponsors: MIT Media Lab Consortium; Robert Wood Johnson Foundation (Wellbeing Initiative); National Institutes of Health (U.S.) (Grant R01GM105018); Samsung (Firm); Natural Sciences and Engineering Research Council of Canada
Various Feature Selection Techniques in Type 2 Diabetic Patients for the Prediction of Cardiovascular Disease
Cardiovascular disease (CVD) is a serious but preventable complication of type 2 diabetes mellitus (T2DM) that results in substantial disease burden, increased health services use, and higher risk of premature mortality [10]. People with diabetes are also at a greatly increased risk of cardiovascular disease resulting in sudden death, a risk that increases year by year. Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. Medical databases of type 2 diabetic patients are usually high dimensional. If a training dataset contains irrelevant and redundant features (i.e., attributes), classification analysis may produce less accurate results. For data mining algorithms to perform efficiently and effectively on high-dimensional data, it is imperative to remove irrelevant and redundant features. Feature selection is one of the important and frequently used data preprocessing techniques for data mining applications in medicine. Much research in data mining has improved the predictive accuracy of classifiers by applying various feature selection techniques. This paper illustrates how applying feature selection techniques to medical databases makes it possible to find a small number of informative features, leading to potential improvement in medical diagnosis. It is proposed to find an optimal feature subset of the PIMA Indian Diabetes Dataset using the Artificial Bee Colony technique with Differential Evolution, the Symmetrical Uncertainty Attribute set Evaluator, and the Fast Correlation-Based Filter (FCBF). Mutual-information-based feature selection is then performed using normalized mutual information feature selection (NMIFS), and valid classes of input features are selected by applying the Hybrid Fuzzy C Means algorithm (HFCM).
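Symmetrical uncertainty, the relevance score that FCBF relies on, is defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is entropy and I is mutual information. The sketch below computes it for discrete variables. The function names and toy values are hypothetical, and continuous attributes such as those in the PIMA dataset would need to be discretized first, a step this sketch does not cover.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU = 2 * I(X; Y) / (H(X) + H(Y)); ranges from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    mutual_info = hx + hy - entropy(list(zip(x, y)))  # I = H(X) + H(Y) - H(X, Y)
    return 2.0 * mutual_info / (hx + hy)

# Hypothetical discretized attribute values and class labels.
labels = [0, 0, 1, 1]
su_identical = symmetrical_uncertainty([0, 0, 1, 1], labels)
su_independent = symmetrical_uncertainty([0, 1, 0, 1], labels)
print(su_identical, su_independent)
```

A filter in the style of FCBF would keep attributes whose SU with the class exceeds a threshold, then drop attributes that are more strongly tied to an already selected attribute than to the class.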
An ensemble of machine learning and anti-learning methods for predicting tumour patient survival rates
This paper primarily addresses a dataset relating to cellular, chemical and physical conditions of patients gathered at the time they are operated upon to remove colorectal tumours. This data provides a unique insight into the biochemical and immunological status of patients at the point of tumour removal along with information about tumour classification and post-operative survival. The relationship between severity of tumour, based on TNM staging, and survival is still unclear for patients with TNM stage 2 and 3 tumours. We ask whether it is possible to predict survival rate more accurately using a selection of machine learning techniques applied to subsets of data to gain a deeper understanding of the relationships between a patient's biochemical markers and survival. We use a range of feature selection and single classification techniques to predict the 5-year survival rate of TNM stage 2 and 3 patients, which initially produces less than ideal results. The performance of each model individually is then compared with its performance on subsets of the data where multiple models agree. This novel method of selective ensembling demonstrates that significant improvements in model accuracy on an unseen test set can be achieved for patients where the models agree. Finally, we point to a possible method for identifying whether a patient's prognosis can be accurately predicted or not.
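The selective ensembling idea described above (trusting a prediction only where models agree) can be sketched as follows. This is an illustrative reading that assumes unanimous agreement among all models; the paper may use a different agreement rule, and the function names and toy predictions are hypothetical.

```python
def agreement_subset(predictions_per_model):
    """Return {case_index: label} for cases where every model predicts the same label."""
    n_cases = len(predictions_per_model[0])
    agreed = {}
    for i in range(n_cases):
        votes = {preds[i] for preds in predictions_per_model}
        if len(votes) == 1:
            agreed[i] = votes.pop()
    return agreed

def accuracy_on(agreed, truth):
    """Accuracy restricted to the agreed cases; None if no case is agreed."""
    if not agreed:
        return None
    return sum(truth[i] == label for i, label in agreed.items()) / len(agreed)

# Hypothetical predictions from three models for five patients (1 = survives).
model_a = [1, 0, 1, 1, 0]
model_b = [1, 0, 0, 1, 0]
model_c = [1, 1, 0, 1, 0]
truth = [1, 0, 1, 1, 1]

agreed = agreement_subset([model_a, model_b, model_c])
subset_accuracy = accuracy_on(agreed, truth)
print(agreed, subset_accuracy)
```

Cases outside the agreed subset are the ones for which, in the abstract's terms, the prognosis may not be accurately predictable.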