82 research outputs found
Concept learning of text documents
Concept learning of text documents can be viewed as the problem of acquiring the definition of a general category of documents. To define the category of a text document, a conjunction of keywords is usually used. These keywords should be few and comprehensible. A naïve method is to enumerate all combinations of keywords and extract the suitable ones. However, because of the enormous number of keyword combinations, it is impossible to extract the most relevant keywords for describing document categories by enumerating all possible combinations. Many heuristic methods have been proposed, such as GA-based and immune-based algorithms. In this work, we introduce a pruning-power technique and propose a robust enumeration-based concept learning algorithm. Experimental results show that the rules produced by our approach are more comprehensible and simpler than those of other methods.
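The pruning idea can be illustrated with a minimal sketch (the function names and document representation here are hypothetical, not the paper's actual implementation): since adding a keyword to a conjunction can only shrink its coverage, any conjunction that already covers zero positive documents can be discarded together with all of its supersets.

```python
from itertools import combinations

def learn_concept(pos_docs, neg_docs, vocabulary, max_len=3):
    """Enumerate keyword conjunctions, shortest first, and return those
    that cover at least one positive document and no negative document.
    Pruning power: any superset of a conjunction with empty positive
    coverage is skipped, since adding keywords only shrinks coverage."""
    rules = []
    dead = []  # conjunctions whose positive coverage is already empty
    for length in range(1, max_len + 1):
        for combo in combinations(sorted(vocabulary), length):
            rule = frozenset(combo)
            if any(d <= rule for d in dead):
                continue                        # pruned: superset of a dead conjunction
            if not any(rule <= doc for doc in pos_docs):
                dead.append(rule)               # remember for future pruning
                continue
            if not any(rule <= doc for doc in neg_docs):
                rules.append(rule)              # consistent rule found
    return rules
```

Documents are modeled as keyword sets, and `rule <= doc` tests whether the conjunction's keywords all appear in the document.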
Finding short patterns to classify text documents
Many classification methods have been proposed to find patterns in text documents. However, following Occam's razor principle, "the explanation of any phenomenon should make as few assumptions as possible", short patterns are usually more explainable and meaningful for classifying text documents. In this paper, we propose a depth-first pattern generation algorithm that finds short patterns in text documents more effectively than a breadth-first algorithm.
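A depth-first search over keyword conjunctions can be sketched as follows (a simplified illustration under assumed set-of-keywords documents, not the paper's algorithm): each branch stops extending as soon as the current pattern is consistent, so only short patterns are emitted.

```python
def dfs_patterns(pos_docs, neg_docs, vocabulary, max_depth=3):
    """Depth-first generation of keyword conjunctions.  A branch stops
    as soon as its pattern covers some positives and no negatives, so
    every emitted pattern is minimal along its branch (Occam's razor).
    Branches with empty positive coverage are pruned immediately."""
    results = []

    def extend(pattern, candidates):
        if not any(pattern <= doc for doc in pos_docs):
            return                       # empty positive coverage: prune
        if pattern and not any(pattern <= doc for doc in neg_docs):
            results.append(pattern)      # short consistent pattern: stop here
            return
        if len(pattern) >= max_depth:
            return
        for i, kw in enumerate(candidates):
            extend(pattern | {kw}, candidates[i + 1:])

    extend(frozenset(), sorted(vocabulary))
    return results
```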
Finding coverage using incremental attribute combinations
Coverage is a region that covers only positive samples in the attribute (or feature) space. Finding coverage is the kernel problem in induction algorithms, because coverage can be used as rules to describe positive samples. To reflect the characteristics of the training samples, large coverage that covers more positive samples is desirable. However, finding large coverage is difficult because the attribute space is usually of very high dimensionality. Many heuristic methods, such as ID3, AQ, and CN2, have been proposed to find large coverage. A robust algorithm has also been proposed to find the largest coverage, but its time and space complexity become costly as the dimensionality grows. To overcome this drawback, this paper proposes an algorithm that adopts incremental feature combinations to find the largest coverage effectively. In this algorithm, irrelevant coverage can be pruned away at an early stage because potentially large coverage is found earlier. Experiments show that the space and time needed to find the largest coverage are significantly reduced.
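One way to picture incremental attribute combinations is the following toy sketch (hypothetical names; boxes bounding all positives stand in for general coverage regions): subsets with fewer attributes leave the remaining dimensions unconstrained and therefore describe larger regions, so trying small subsets first finds large coverage early and lets larger subsets be skipped.

```python
from itertools import combinations

def bounding_box(samples, attrs):
    """Per-attribute (min, max) interval over the given samples."""
    return {a: (min(s[a] for s in samples), max(s[a] for s in samples))
            for a in attrs}

def inside(sample, box):
    return all(lo <= sample[a] <= hi for a, (lo, hi) in box.items())

def find_coverage(pos, neg, n_attrs):
    """Try attribute subsets in increasing size.  The box bounding all
    positives on a subset is valid coverage if no negative lies inside;
    the first valid subset found uses the fewest attributes and hence
    describes the largest region."""
    for k in range(1, n_attrs + 1):
        for attrs in combinations(range(n_attrs), k):
            box = bounding_box(pos, attrs)
            if not any(inside(s, box) for s in neg):
                return box
    return None
```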
Finding rule groups to classify high dimensional gene expression datasets
Microarray data provide quantitative information about the transcription profiles of cells. To analyze microarray datasets, machine learning methodology has increasingly attracted bioinformatics researchers, and several machine learning approaches are widely used to classify and mine biological datasets. However, many gene expression datasets are of extremely high dimensionality, so traditional machine learning methods cannot be applied effectively and efficiently. This paper proposes a robust algorithm to find rule groups for classifying gene expression datasets. Unlike most classification algorithms, which select dimensions (genes) heuristically to form rule groups for identifying classes such as cancerous and normal tissues, our algorithm guarantees finding the best k dimensions (genes), those most discriminative for separating samples of different classes, to form rule groups for the classification of expression datasets. Our experiments show that the rule groups obtained by our algorithm achieve higher accuracy than other classification approaches.
Keyword extraction for text categorization
Text categorization (TC) is one of the main applications of machine learning. Many methods have been proposed, such as the Rocchio method, naive-Bayes-based methods, and SVM-based text classification. These methods learn from labeled text documents and then construct a classifier, so the category of a newly arriving text document can be predicted. However, they do not give a description of each category. In the machine learning field there are many concept learning algorithms, such as ID3 and CN2. This paper proposes a more robust algorithm to induce concepts from training examples, based on enumeration of all possible keyword combinations. Experimental results show that the rules produced by our approach have higher precision and greater simplicity than those of other methods.
One-step Multi-view Clustering with Diverse Representation
Multi-view clustering has attracted broad attention due to its capacity to
utilize consistent and complementary information among views. Although
tremendous progress has been made recently, most existing methods suffer from high
complexity, preventing them from being applied to large-scale tasks. Multi-view
clustering via matrix factorization is a representative to address this issue.
However, most of them map the data matrices into a fixed dimension, which
limits the expressiveness of the model. Moreover, a range of methods suffer
from a two-step process, i.e., multimodal learning and the subsequent
k-means, inevitably causing a sub-optimal clustering result. In light of
this, we propose a one-step multi-view clustering with diverse representation
method, which incorporates multi-view learning and k-means into a unified
framework. Specifically, we first project original data matrices into various
latent spaces to attain comprehensive information and auto-weight them in a
self-supervised manner. Then we directly use the information matrices under
diverse dimensions to obtain consensus discrete clustering labels. The unified
work of representation learning and clustering boosts the quality of the final
results. Furthermore, we develop an efficient optimization algorithm to solve
the resultant problem with proven convergence. Comprehensive experiments on
various datasets demonstrate the promising clustering performance of our
proposed method.
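The one-step idea, i.e., updating view-specific representations, view weights, and a single shared label vector in one alternating loop rather than clustering after the fact, can be sketched in miniature (this is a loose toy illustration with inverse-error weighting standing in for the paper's self-supervised auto-weighting, not the actual optimization):

```python
def one_step_mvc(views, k, iters=10):
    """Toy one-step multi-view k-means.  `views` is a list of views, each
    a list of per-sample feature vectors (dimensions may differ across
    views).  Each view keeps its own centers; one shared label vector is
    updated from view-weighted distances, and views with lower
    within-cluster error receive larger weight."""
    n = len(views[0])
    labels = [i % k for i in range(n)]            # crude initialization
    weights = [1.0 / len(views)] * len(views)
    for _ in range(iters):
        # per-view centers from the shared labels
        centers = []
        for X in views:
            cs = []
            for j in range(k):
                members = [X[i] for i in range(n) if labels[i] == j]
                cs.append([sum(col) / len(members) for col in zip(*members)]
                          if members else list(X[j % n]))
            centers.append(cs)
        # shared labels from weighted distances across all views
        def cost(i, j):
            return sum(w * sum((a - b) ** 2 for a, b in zip(X[i], C[j]))
                       for w, X, C in zip(weights, views, centers))
        labels = [min(range(k), key=lambda j: cost(i, j)) for i in range(n)]
        # auto-weight: inverse within-cluster error, normalized to sum to 1
        errs = [sum(sum((a - b) ** 2 for a, b in zip(X[i], C[labels[i]]))
                    for i in range(n)) + 1e-9
                for X, C in zip(views, centers)]
        inv = [1.0 / e for e in errs]
        weights = [v / sum(inv) for v in inv]
    return labels, weights
```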
Scalable Incomplete Multi-View Clustering with Structure Alignment
The success of existing multi-view clustering (MVC) relies on the assumption
that all views are complete. However, samples are usually partially available
due to data corruption or sensor malfunction, which raises the research of
incomplete multi-view clustering (IMVC). Although several anchor-based IMVC
methods have been proposed to process the large-scale incomplete data, they
still suffer from the following drawbacks: i) Most existing approaches neglect
the inter-view discrepancy and enforce cross-view representation to be
consistent, which would corrupt the representation capability of the model; ii)
Due to the sample disparity between different views, the learned anchors might
be misaligned, which we refer to as the Anchor-Unaligned Problem for Incomplete
data (AUP-ID). The AUP-ID causes inaccurate graph fusion and degrades
clustering performance. To tackle these issues, we propose a novel incomplete
anchor graph learning framework termed Scalable Incomplete Multi-View
Clustering with Structure Alignment (SIMVC-SA). Specifically, we construct the
view-specific anchor graph to capture the complementary information from
different views. In order to solve the AUP-ID, we propose a novel structure
alignment module to refine the cross-view anchor correspondence. Meanwhile, the
anchor graph construction and alignment are jointly optimized in our unified
framework to enhance clustering quality. Through anchor graph construction
instead of full graphs, the time and space complexity of the proposed SIMVC-SA
is proven to be linearly correlated with the number of samples. Extensive
experiments on seven incomplete benchmark datasets demonstrate the
effectiveness and efficiency of our proposed method. Our code is publicly
available at https://github.com/wy1019/SIMVC-SA.
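The source of the linear complexity is the anchor graph itself: instead of an n-by-n affinity matrix, each sample is related only to a small set of m anchors, giving an n-by-m matrix in O(nm) time and space. A minimal sketch (Gaussian-kernel similarities; the kernel choice and `sigma` are illustrative assumptions, not SIMVC-SA's construction):

```python
import math

def anchor_graph(samples, anchors, sigma=1.0):
    """Build an n-by-m similarity matrix between n samples and m anchors
    (m << n) with a Gaussian kernel, then row-normalize so each row is a
    distribution over anchors.  Downstream clustering on this matrix
    costs O(n*m) rather than the O(n^2) of a full pairwise graph."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    Z = [[math.exp(-d2(x, a) / (2 * sigma ** 2)) for a in anchors]
         for x in samples]
    return [[v / sum(row) for v in row] for row in Z]
```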
Easy Begun is Half Done: Spatial-Temporal Graph Modeling with ST-Curriculum Dropout
Spatial-temporal (ST) graph modeling, such as traffic speed forecasting and
taxi demand prediction, is an important task in the deep learning field. However,
for the nodes in a graph, their ST patterns can vary greatly in modeling
difficulty, owing to the heterogeneous nature of ST data. We argue that
unveiling the nodes to the model in a meaningful order, from easy to complex,
can provide performance improvements over traditional training procedure. The
idea has its root in Curriculum Learning, which suggests that in the early
stages of training, models can be sensitive to noise and difficult samples. In this paper,
we propose ST-Curriculum Dropout, a novel and easy-to-implement strategy for
spatial-temporal graph modeling. Specifically, we evaluate the learning
difficulty of each node in high-level feature space and drop those difficult
ones out to ensure the model only needs to handle fundamental ST relations at
the beginning, before gradually moving to hard ones. Our strategy can be
applied to any canonical deep learning architecture without extra trainable
parameters, and extensive experiments on a wide range of datasets are conducted
to illustrate that, by controlling the difficulty level of ST relations as the
training progresses, the model is able to capture better representation of the
data and thus yields better generalization.
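The easy-to-hard schedule can be sketched as a simple keep-set function (hypothetical names; in the paper the per-node difficulty would come from distances in a high-level feature space, which is abstracted here as a given score list):

```python
def curriculum_keep(difficulty, epoch, total_epochs, floor=0.5):
    """Return the indices of nodes kept at this epoch: training starts
    on the easiest `floor` fraction of nodes and the kept fraction grows
    linearly to 1.0 by the final epoch, so difficult nodes are dropped
    out early and introduced gradually."""
    frac = floor + (1.0 - floor) * min(1.0, epoch / max(1, total_epochs - 1))
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    keep = max(1, int(round(frac * len(difficulty))))
    return sorted(order[:keep])
```

During training, loss terms for nodes outside the returned index set would simply be masked out, so the strategy adds no trainable parameters.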