Transfer and Multi-Task Learning for Noun-Noun Compound Interpretation
In this paper, we empirically evaluate the utility of transfer and multi-task
learning on a challenging semantic classification task: semantic interpretation
of noun--noun compounds. Through a comprehensive series of experiments and
in-depth error analysis, we show that transfer learning via parameter
initialization and multi-task learning via parameter sharing can help a neural
classification model generalize over a highly skewed distribution of relations.
Further, we demonstrate how dual annotation with two distinct sets of relations
over the same set of compounds can be exploited to improve the overall accuracy
of a neural classifier and its F1 scores on the less frequent, but more
difficult relations.
Comment: EMNLP 2018: Conference on Empirical Methods in Natural Language
Processing (EMNLP
New instances classification framework on Quran ontology applied to question answering system
Classifying instances with a small dataset is a current research problem in Quran ontology development. The existing classification approach uses a machine learning method, the Backpropagation Neural Network. However, this method has a drawback: if the training set is small, classifier accuracy can decline. Unfortunately, the Holy Quran is a small corpus. To address this problem, our study formulates a new instance classification framework for small training corpora, applied to a semantic question answering system. The resulting framework consists of several essential components: pre-processing, morphology analysis, semantic analysis, feature extraction, instance classification with the Radial Basis Function Networks algorithm, and the transformation module. This algorithm was chosen because it is robust to noisy data and handles small datasets well. Furthermore, the document processing module of the question answering system is used to access the instance classification results in the Quran ontology.
Efficient Machine Teaching Frameworks for Natural Language Processing
The past decade has seen tremendous growth in potential applications of language technologies in our daily lives due to increasing data, computational resources, and user interfaces. An important step to support emerging applications is the development of algorithms for processing the rich variety of human-generated text and extracting relevant information. Machine learning, especially deep learning, has seen increasing success on various text benchmarks. However, while standard benchmarks have static tasks with expensive human-labeled data, real-world applications are characterized by dynamic task specifications and limited resources for data labeling, thus making it challenging to transfer the success of supervised machine learning to the real world. To deploy language technologies at scale, it is crucial to develop alternative techniques for teaching machines beyond data labeling.
In this dissertation, we address this data labeling bottleneck by studying and presenting resource-efficient frameworks for teaching machine learning models to solve language tasks across diverse domains and languages. Our goal is to (i) support emerging real-world problems without the expensive requirement of large-scale manual data labeling; and (ii) assist humans in teaching machines via more flexible types of interaction. Towards this goal, we describe our collaborations with experts across domains (including public health, earth sciences, news, and e-commerce) to integrate weakly-supervised neural networks into operational systems, and we present efficient machine teaching frameworks that leverage flexible forms of declarative knowledge as supervision: coarse labels, large hierarchical taxonomies, seed words, bilingual word translations, and general labeling rules.
First, we present two neural network architectures that we designed to leverage weak supervision in the form of coarse labels and hierarchical taxonomies, respectively, and highlight their successful integration into operational systems. Our Hierarchical Sigmoid Attention Network (HSAN) learns to highlight important sentences of potentially long documents without sentence-level supervision by, instead, using coarse-grained supervision at the document level. HSAN improves over previous weakly supervised learning approaches across sentiment classification benchmarks and has been deployed to help inspections in health departments for the discovery of foodborne illness outbreaks. We also present TXtract, a neural network that extracts attributes for e-commerce products from thousands of diverse categories without using manually labeled data for each category, by instead considering category relationships in a hierarchical taxonomy. TXtract is a core component of Amazon’s AutoKnow, a system that collects knowledge facts for over 10K product categories, and serves such information to Amazon search and product detail pages.
Second, we present architecture-agnostic machine teaching frameworks that we applied across domains, languages, and tasks. Our weakly-supervised co-training framework can train any type of text classifier using just a small number of class-indicative seed words and unlabeled data. In contrast to previous work that uses seed words to initialize embedding layers, our iterative seed word distillation (ISWD) method leverages the predictive power of seed words as supervision signals and shows strong performance improvements for aspect detection in reviews across domains and languages. We further demonstrate the cross-lingual transfer abilities of our co-training approach via cross-lingual teacher-student (CLTS), a method for training document classifiers across diverse languages using labeled documents only in English and a limited budget for bilingual translations. Not all classification tasks, however, can be effectively addressed using human supervision in the form of seed words. To capture a broader variety of tasks, we present weakly-supervised self-training (ASTRA), a weakly-supervised learning framework for training a classifier using more general labeling rules in addition to labeled and unlabeled data. As a complete set of accurate rules may be hard to obtain all in one shot, we further present an interactive framework that assists human annotators by automatically suggesting candidate labeling rules.
In conclusion, this thesis demonstrates the benefits of teaching machines with different types of interaction than the standard data labeling paradigm and shows promising results for new applications across domains and languages. To facilitate future research, we publish our code implementations and design new challenging benchmarks with various types of supervision. We believe that our proposed frameworks and experimental findings will influence research and will enable new applications of language technologies without the costly requirement of large manually labeled datasets
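As a rough illustration of how class-indicative seed words can act as a supervision signal, the sketch below turns seed-word matches into a soft label for an unlabeled document. It is a minimal example under stated assumptions, not the ISWD method of the dissertation: the seed words, the `weak_label` function, and the count-based scoring are all hypothetical.

```python
from collections import Counter

# Hypothetical seed words for two review aspects (illustrative only,
# not taken from the dissertation).
SEED_WORDS = {
    "price": {"cheap", "expensive", "price"},
    "quality": {"sturdy", "broken", "quality"},
}


def weak_label(text, seed_words=SEED_WORDS):
    """Return a soft label distribution from seed-word counts.

    Returns None when no seed word occurs, so the document stays unlabeled.
    """
    tokens = text.lower().split()
    counts = Counter()
    for label, seeds in seed_words.items():
        # Count how many tokens are seed words of this class.
        counts[label] = sum(1 for tok in tokens if tok in seeds)
    total = sum(counts.values())
    if total == 0:
        return None
    # Normalize counts into a distribution over classes.
    return {label: counts[label] / total for label in seed_words}
```

In a teacher-student setup, soft labels of this kind could serve as training targets for a student classifier on unlabeled text; documents that return `None` would simply be skipped by the teacher.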
Hierarchical Metadata-Aware Document Categorization under Weak Supervision
Categorizing documents into a given label hierarchy is intuitively appealing
due to the ubiquity of hierarchical topic structures in massive text corpora.
Although related studies have achieved satisfying performance in fully
supervised hierarchical document classification, they usually require massive
human-annotated training data and only utilize text information. However, in
many domains, (1) annotations are quite expensive where very few training
samples can be acquired; (2) documents are accompanied by metadata information.
Hence, this paper studies how to integrate the label hierarchy, metadata, and
text signals for document categorization under weak supervision. We develop
HiMeCat, an embedding-based generative framework for our task. Specifically, we
propose a novel joint representation learning module that allows simultaneous
modeling of category dependencies, metadata information and textual semantics,
and we introduce a data augmentation module that hierarchically synthesizes
training documents to complement the original, small-scale training set. Our
experiments demonstrate a consistent improvement of HiMeCat over competitive
baselines and validate the contribution of our representation learning and data
augmentation modules.
Comment: 9 pages; Accepted to WSDM 202
Multi-dimensional mining of unstructured data with limited supervision
As one of the most important data forms, unstructured text data plays a crucial role in data-driven decision making in domains ranging from social networking and information retrieval to healthcare and scientific research. In many emerging applications, people's information needs from text data are becoming multi-dimensional---they demand useful insights for multiple aspects from the given text corpus. However, turning massive text data into multi-dimensional knowledge remains a challenge that cannot be readily addressed by existing data mining techniques.
In this thesis, we propose algorithms that turn unstructured text data into multi-dimensional knowledge with limited supervision. We investigate two core questions:
1. How to identify task-relevant data with declarative queries in multiple dimensions?
2. How to distill knowledge from data in a multi-dimensional space?
To address the above questions, we propose an integrated cube construction and exploitation framework. First, we develop a cube construction module that organizes unstructured data into a cube structure, by discovering latent multi-dimensional and multi-granular structure from the unstructured text corpus and allocating documents into the structure. Second, we develop a cube exploitation module that models multiple dimensions in the cube space, thereby distilling multi-dimensional knowledge from data to provide insights along multiple dimensions. Together, these two modules constitute an integrated pipeline: leveraging the cube structure, users can perform multi-dimensional, multi-granular data selection with declarative queries; and with cube exploitation algorithms, users can make accurate cross-dimension predictions or extract multi-dimensional patterns for decision making.
The proposed framework has two distinctive advantages when turning text data into multi-dimensional knowledge: flexibility and label-efficiency. First, it enables acquiring multi-dimensional knowledge flexibly, as the cube structure allows users to easily identify task-relevant data along multiple dimensions at varied granularities and further distill multi-dimensional knowledge. Second, the algorithms for cube construction and exploitation require little supervision; this makes the framework appealing for many applications where labeled data are expensive to obtain
Semantic knowledge integration for learning from semantically imprecise data
Low availability of labeled training data often poses a fundamental limit to the accuracy of computer vision applications using machine learning methods. While these methods are improved continuously, e.g., through better neural network architectures, there cannot be a single methodical change that increases the accuracy on all possible tasks. This statement, known as the no free lunch theorem, suggests that we should consider aspects of machine learning other than learning algorithms for opportunities to escape the limits set by the available training data. In this thesis, we focus on two main aspects, namely the nature of the training data, where we introduce structure into the label set using concept hierarchies, and the learning paradigm, which we change in accordance with requirements of real-world applications as opposed to more academic setups.
Concept hierarchies represent semantic relations, which are sets of statements such as "a bird is an animal." We propose a hierarchical classifier to integrate this domain knowledge in a pre-existing task, thereby increasing the information the classifier has access to. While the hierarchy's leaf nodes correspond to the original set of classes, the inner nodes are "new" concepts that do not exist in the original training data. However, we posit that such "imprecise" labels are valuable and should occur naturally, e.g., as an annotator's way of expressing their uncertainty. Furthermore, the increased number of concepts leads to more possible search terms when assembling a web-crawled dataset or using an image search. We propose CHILLAX, a method that learns from semantically imprecise training data, while still offering precise predictions to integrate seamlessly into a pre-existing application.
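One simple way to see how an imprecise label such as "bird" can still supervise a classifier that predicts only leaf classes is to score the label by the total probability of the leaves beneath it. The sketch below is an illustration under that assumption and is not the CHILLAX method itself; the toy hierarchy, the function names, and the marginalization rule are hypothetical.

```python
import math

# Hypothetical toy hierarchy, stored as child -> parent (illustrative only).
PARENT = {"sparrow": "bird", "eagle": "bird", "bird": "animal", "dog": "animal"}


def is_descendant_or_self(node, ancestor):
    """Walk up the hierarchy from node and report whether ancestor is reached."""
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def concept_probability(leaf_probs, concept):
    """Probability of a concept as the sum over the leaf classes it covers."""
    return sum(
        p for leaf, p in leaf_probs.items() if is_descendant_or_self(leaf, concept)
    )


def imprecise_nll(leaf_probs, label):
    """Negative log-likelihood of a label that may be a leaf or an inner node."""
    return -math.log(concept_probability(leaf_probs, label))
```

Under this rule, a label at the root carries no information (its loss is zero for any prediction), while a precise leaf label reduces to the usual cross-entropy term; predictions at test time can still be made over the leaf classes only.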
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due to both intrinsic limitations of databases (in terms of coverage and feature granularity) and under-utilization of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.