Identifying Mislabeled Training Data
This paper presents a new approach to identifying and eliminating mislabeled
training instances for supervised learning. The goal of this approach is to
improve classification accuracies produced by learning algorithms by improving
the quality of the training data. Our approach uses a set of learning
algorithms to create classifiers that serve as noise filters for the training
data. We evaluate single algorithm, majority vote and consensus filters on five
datasets that are prone to labeling errors. Our experiments illustrate that
filtering significantly improves classification accuracy for noise levels up to
30 percent. An analytical and empirical evaluation of the precision of our
approach shows that consensus filters are conservative at throwing away good
data at the expense of retaining bad data and that majority filters are better
at detecting bad data at the expense of throwing away good data. This suggests
that for situations in which there is a paucity of data, consensus filters are
preferable, whereas majority vote filters are preferable for situations with an
abundance of data.
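The difference between the majority vote and consensus filters can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not name the learning algorithms, so the k-nearest-neighbour and nearest-centroid classifiers, the fold assignment, the function names, and the toy data below are all hypothetical choices made for the example.

```python
from collections import Counter

def knn_predict(k):
    """Return a k-nearest-neighbour predictor (assumed base learner)."""
    def predict(train_X, train_y, x):
        order = sorted(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
        )
        votes = Counter(train_y[i] for i in order[:k])
        return votes.most_common(1)[0][0]
    return predict

def centroid_predict(train_X, train_y, x):
    """Predict the class whose mean feature vector is closest to x."""
    sums, counts = {}, Counter(train_y)
    for xi, yi in zip(train_X, train_y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    def dist(label):
        return sum((s / counts[label] - b) ** 2 for s, b in zip(sums[label], x))
    return min(sums, key=dist)

def filter_votes(X, y, predictors, n_folds=4):
    """Count, per instance, how many predictors misclassify it when held out."""
    n = len(X)
    errors = [0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for predict in predictors:
            for i in held_out:
                if predict(train_X, train_y, X[i]) != y[i]:
                    errors[i] += 1
    return errors

def majority_filter(errors, n_predictors):
    """Flag instances misclassified by more than half of the predictors."""
    return [i for i, e in enumerate(errors) if e > n_predictors / 2]

def consensus_filter(errors, n_predictors):
    """Flag instances misclassified by every predictor."""
    return [i for i, e in enumerate(errors) if e == n_predictors]

# Toy data (hypothetical): two well-separated clusters, alternating by index.
X = [(0, 0), (5, 5), (0, 1), (5, 6), (1, 0), (6, 5),
     (1, 1), (6, 6), (0.5, 0.5), (5.5, 5.5), (0.2, 0.8), (5.2, 5.8)]
y = ["a", "b"] * 6
y[6] = "b"  # deliberately mislabel the point (1, 1), which belongs to cluster "a"

predictors = [knn_predict(1), knn_predict(3), centroid_predict]
errors = filter_votes(X, y, predictors)
print("votes per instance:", errors)
print("majority filter flags:", majority_filter(errors, len(predictors)))
print("consensus filter flags:", consensus_filter(errors, len(predictors)))
```

On this toy data both filters flag only the flipped instance (index 6). Index 4, a correctly labeled neighbour of the flipped point, receives a single vote from the 1-NN classifier, which shows how a single-algorithm filter can discard good data that a voting filter retains.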
Structural evolution drives diversification of the large LRR-RLK gene family
Cells are continuously exposed to chemical signals that they must discriminate between and respond to appropriately. In embryophytes, the leucine-rich repeat receptor-like kinases (LRR-RLKs) are signal receptors critical in development and defense. LRR-RLKs have diversified to hundreds of genes in many plant genomes. Although intensively studied, a well-resolved LRR-RLK gene tree has remained elusive. To resolve the LRR-RLK gene tree, we developed an improved gene discovery method based on iterative hidden Markov model searching and phylogenetic inference. We used this method to infer complete gene trees for each of the LRR-RLK subclades and reconstructed the deepest nodes of the full gene family. We discovered that the LRR-RLK gene family is even larger than previously thought, and that protein domain gains and losses are prevalent. These structural modifications, some of which likely predate embryophyte diversification, led to misclassification of some LRR-RLK variants as members of other gene families. Our work corrects this misclassification. Our results reveal ongoing structural evolution generating novel LRR-RLK genes. These new genes are raw material for the diversification of signaling in development and defense. Our methods also enable phylogenetic reconstruction in any large gene family.
Filling Knowledge Gaps in a Broad-Coverage Machine Translation System
Knowledge-based machine translation (KBMT) techniques yield high quality in
domains with detailed semantic models, limited vocabulary, and controlled input
grammar. Scaling up along these dimensions means acquiring large knowledge
resources. It also means behaving reasonably when definitive knowledge is not
yet available. This paper describes how we can fill various KBMT knowledge
gaps, often using robust statistical techniques. We describe quantitative and
qualitative results from JAPANGLOSS, a broad-coverage Japanese-English MT
system.
Comment: 7 pages, Compressed and uuencoded postscript. To appear: IJCAI-9
Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata
Many social Web sites allow users to annotate the content with descriptive
metadata, such as tags, and more recently to organize content hierarchically.
These types of structured metadata provide valuable evidence for learning how a
community organizes knowledge. For instance, we can aggregate many personal
hierarchies into a common taxonomy, also known as a folksonomy, that helps
users visualize and browse social content and organize their own content.
However, learning from social metadata presents
several challenges, since it is sparse, shallow, ambiguous, noisy, and
inconsistent. We describe an approach to folksonomy learning based on
relational clustering, which exploits structured metadata contained in personal
hierarchies. Our approach clusters similar hierarchies using their structure
and tag statistics, then incrementally weaves them into a deeper, bushier tree.
We study folksonomy learning using social metadata extracted from the
photo-sharing site Flickr, and demonstrate that the proposed approach addresses
the challenges. Moreover, compared to previous work, the approach produces
larger, more accurate folksonomies and scales better.
Comment: 10 pages, To appear in the Proceedings of ACM SIGKDD Conference on
Knowledge Discovery and Data Mining (KDD) 201