1,909 research outputs found
Large-Scale Multi-Label Learning with Incomplete Label Assignments
Multi-label learning deals with the classification problems where each
instance can be assigned with multiple labels simultaneously. Conventional
multi-label learning approaches mainly focus on exploiting label correlations.
It is usually assumed, explicitly or implicitly, that the label sets for
training instances are fully labeled without any missing labels. However, in
many real-world multi-label datasets, the label assignments for training
instances can be incomplete. Some ground-truth labels can be missed by the
labeler from the label set. This problem is especially typical when the number
instances is very large, and the labeling cost is very high, which makes it
almost impossible to get a fully labeled training set. In this paper, we study
the problem of large-scale multi-label learning with incomplete label
assignments. We propose an approach, called MPU, based upon positive and
unlabeled stochastic gradient descent and stacked models. Unlike prior works,
our method can effectively and efficiently consider missing labels and label
correlations simultaneously, and is very scalable, that has linear time
complexities over the size of the data. Extensive experiments on two real-world
multi-label datasets show that our MPU model consistently outperform other
commonly-used baselines
ALEC: Active learning with ensemble of classifiers for clinical diagnosis of coronary artery disease
Invasive angiography is the reference standard for coronary artery disease (CAD) diagnosis but is expensive and
associated with certain risks. Machine learning (ML) using clinical and noninvasive imaging parameters can be
used for CAD diagnosis to avoid the side effects and cost of angiography. However, ML methods require labeled
samples for efficient training. The labeled data scarcity and high labeling costs can be mitigated by active
learning. This is achieved through selective query of challenging samples for labeling. To the best of our
knowledge, active learning has not been used for CAD diagnosis yet. An Active Learning with Ensemble of
Classifiers (ALEC) method is proposed for CAD diagnosis, consisting of four classifiers. Three of these classifiers
determine whether a patient’s three main coronary arteries are stenotic or not. The fourth classifier predicts
whether the patient has CAD or not. ALEC is first trained using labeled samples. For each unlabeled sample, if the
outputs of the classifiers are consistent, the sample along with its predicted label is added to the pool of labeled
samples. Inconsistent samples are manually labeled by medical experts before being added to the pool. The
training is performed once more using the samples labeled so far. The interleaved phases of labeling and training
are repeated until all samples are labeled. Compared with 19 other active learning algorithms, ALEC combined
with a support vector machine classifier attained superior performance with 97.01% accuracy. Our method is
justified mathematically as well. We also comprehensively analyze the CAD dataset used in this paper. As part of
dataset analysis, features pairwise correlation is computed. The top 15 features contributing to CAD and stenosis
of the three main coronary arteries are determined. The relationship between stenosis of the main arteries is
presented using conditional probabilities. The effect of considering the number of stenotic arteries on sample
discrimination is investigated. The discrimination power over dataset samples is visualized, assuming each of the
three main coronary arteries as a sample label and considering the two remaining arteries as sample features
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows to leverage weakly labeled data.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas are described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research
Multi-Instance Multi-Label Learning
In this paper, we propose the MIML (Multi-Instance Multi-Label learning)
framework where an example is described by multiple instances and associated
with multiple class labels. Compared to traditional learning frameworks, the
MIML framework is more convenient and natural for representing complicated
objects which have multiple semantic meanings. To learn from MIML examples, we
propose the MimlBoost and MimlSvm algorithms based on a simple degeneration
strategy, and experiments show that solving problems involving complicated
objects with multiple semantic meanings in the MIML framework can lead to good
performance. Considering that the degeneration process may lose information, we
propose the D-MimlSvm algorithm which tackles MIML problems directly in a
regularization framework. Moreover, we show that even when we do not have
access to the real objects and thus cannot capture more information from real
objects by using the MIML representation, MIML is still useful. We propose the
InsDif and SubCod algorithms. InsDif works by transforming single-instances
into the MIML representation for learning, while SubCod works by transforming
single-label examples into the MIML representation for learning. Experiments
show that in some tasks they are able to achieve better performance than
learning the single-instances or single-label examples directly.Comment: 64 pages, 10 figures; Artificial Intelligence, 201
Hierarchical Classification and its Application in University Search
Web search engines have been adopted by most universities for searching webpages in their own domains. Basically, a user sends keywords to the search engine and the search engine returns a flat ranked list of webpages. However, in university search, user queries are usually related to topics. Simple keyword queries are often insufficient to express topics as keywords. On the other hand, most E-commerce sites allow users to browse and search products in various hierarchies. It would be ideal if hierarchical browsing and keyword search can be seamlessly combined for university search engines. The main difficulty is to automatically classify and rank a massive number of webpages into the topic hierarchies for universities.
In this thesis, we use machine learning and data mining techniques to build a novel hybrid search engine with integrated hierarchies for universities, called SEEU (Search Engine with hiErarchy for Universities).
Firstly, we study the problem of effective hierarchical webpage classification. We develop a parallel webpage classification system based on Support Vector Machines. With extensive experiments on the well-known ODP (Open Directory Project) dataset, we empirically demonstrate that our hierarchical classification system is very effective and outperforms the traditional flat classification approaches significantly.
Secondly, we study the problem of integrating hierarchical classification into the ranking system of keywords-based search engines. We propose a novel ranking framework, called ERIC (Enhanced Ranking by hIerarchical Classification), for search engines with hierarchies. Experimental results on four large-scale TREC (Text REtrieval Conference) web search datasets show that our ranking system with hierarchical classification outperforms the traditional flat keywords-based search methods significantly.
Thirdly, we propose a novel active learning framework to improve the performance of hierarchical classification, which is important for ranking webpages in hierarchies. From our experiments on the benchmark text datasets, we find that our active learning framework can achieve good classification performance yet save a considerable number of labeling effort compared with the state-of-the-art active learning methods for hierarchical text classification.
Fourthly, based on the proposed classification and ranking methods, we present a novel hierarchical classification framework for mining academic topics from university webpages. We build an academic topic hierarchy based on the commonly accepted Wikipedia academic disciplines. Based on this hierarchy, we train a hierarchical classifier and apply it to mine academic topics. According to our comprehensive analysis, the academic topics mined by our method are reasonable and consistent with the real-world topic distribution in universities.
Finally, we combine all the proposed techniques together and implement the SEEU search engine. According to two usability studies conducted in the ECE and the CS departments at our university, SEEU is favored by the majority of participants.
To conclude, the main contribution of this thesis is a novel search engine, called SEEU, for universities. We discuss the challenges toward building SEEU and propose effective machine learning and data mining methods to tackle them. With extensive experiments on well-known benchmark datasets and real-world university webpage datasets, we demonstrate that our system is very effective. In addition, two usability studies of SEEU in our university show that SEEU has a great promise for university search
Active learning with gaussian processes for object categorization
Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) are powerful regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions defined based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. The uncertainty model provided by GPs offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We derive a novel active category learning method based on our probabilistic regression model, and show that a significant boost in classification performance is possible, especially when the amount of training data for a category is ultimately very small. 1
- …