Fidelity-Weighted Learning
Training deep neural networks requires many training samples, but in practice
training labels are expensive to obtain and may be of varying quality, as some
may be from trusted expert labelers while others might be from heuristics or
other sources of weak supervision such as crowd-sourcing. This creates a
fundamental quality-versus-quantity trade-off in the learning process. Do we
learn from the small amount of high-quality data or the potentially large
amount of weakly-labeled data? We argue that if the learner could somehow know
and take the label-quality into account when learning the data representation,
we could get the best of both worlds. To this end, we propose
"fidelity-weighted learning" (FWL), a semi-supervised student-teacher approach
for training deep neural networks using weakly-labeled data. FWL modulates the
parameter updates to a student network (trained on the task we care about) on a
per-sample basis according to the posterior confidence of its label-quality
estimated by a teacher (who has access to the high-quality labels). Both
student and teacher are learned from the data. We evaluate FWL on two tasks in
information retrieval and natural language processing where we outperform
state-of-the-art alternative semi-supervised methods, indicating that our
approach makes better use of strong and weak labels, and leads to better
task-dependent data representations.
Comment: Published as a conference paper at ICLR 2018.
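The core mechanism, modulating each sample's contribution to the student's update by the teacher's confidence in that sample's label, can be sketched as a confidence-weighted gradient step. The function and variable names below are illustrative, not the paper's implementation:

```python
import numpy as np

def fidelity_weighted_update(grad_per_sample, teacher_confidence, lr=0.01):
    """Scale each sample's gradient by the teacher's posterior confidence
    in that sample's label, then average into one update (illustrative)."""
    weights = np.asarray(teacher_confidence)   # in [0, 1], one per sample
    grads = np.asarray(grad_per_sample)        # shape: (n_samples, n_params)
    weighted = grads * weights[:, None]        # down-weight low-fidelity labels
    return -lr * weighted.mean(axis=0)         # averaged gradient-descent step

# Two samples: one from a trusted labeler (conf 1.0), one weakly labeled (conf 0.2)
step = fidelity_weighted_update([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.2], lr=0.1)
```

A sample the teacher distrusts thus moves the student's parameters proportionally less than a sample with a high-quality label.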
Evaluation and development of conceptual document similarity metrics with content-based recommender applications
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010.
ENGLISH ABSTRACT: The World Wide Web brought with it an unprecedented level of information overload.
Computers are very effective at processing and clustering numerical and binary data;
however, the conceptual clustering of natural-language data is considerably harder
to automate. Most past techniques rely on simple keyword matching or probabilistic
methods to measure semantic relatedness. However, these approaches do not always
accurately capture conceptual relatedness as measured by humans.
In this thesis we propose and evaluate the use of novel Spreading Activation (SA)
techniques for computing semantic relatedness, by modelling the article hyperlink structure
of Wikipedia as an associative network structure for knowledge representation. The
SA technique is adapted and several problems are addressed for it to function over the
Wikipedia hyperlink structure. Inter-concept and inter-document similarity metrics are
developed which make use of SA to compute the conceptual similarity between two concepts
and between two natural-language documents. We evaluate these approaches over
two document similarity datasets and achieve results which compare favourably with the
state of the art.
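Spreading Activation over a hyperlink graph can be sketched as follows: activation starts at seed articles and is propagated to linked articles, attenuated by a decay factor at each hop. This is a generic SA sketch on a toy graph, not the thesis's adapted algorithm:

```python
def spread_activation(graph, seeds, decay=0.5, iterations=2):
    """Propagate activation from seed nodes along hyperlinks, splitting
    each node's energy over its out-links and decaying it per hop."""
    activation = dict(seeds)                    # node -> initial activation
    for _ in range(iterations):
        new = dict(activation)
        for node, energy in activation.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = decay * energy / len(neighbours)
            for nb in neighbours:
                new[nb] = new.get(nb, 0.0) + share
        activation = new
    return activation

# Toy article hyperlink structure (hypothetical)
links = {"Jaguar": ["Cat", "Car"], "Cat": ["Animal"], "Car": ["Vehicle"]}
act = spread_activation(links, {"Jaguar": 1.0}, decay=0.5, iterations=2)
```

Inter-concept relatedness can then be read off as the activation reaching one concept when the other is used as a seed, and inter-document similarity as the overlap of the activation patterns seeded by each document's concepts.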
Furthermore, document preprocessing techniques are evaluated in terms of the performance
gain these techniques can have on the well-known cosine document similarity metric
and the Normalised Compression Distance (NCD) metric. Results indicate that a near
two-fold increase in accuracy can be achieved for NCD by applying simple preprocessing
techniques. Nonetheless, the cosine similarity metric still significantly outperforms NCD.
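The NCD metric compared above is defined from compressed sizes: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. A minimal sketch using zlib as the compressor (the thesis does not specify this particular compressor):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the cat sat on the mat"
d_same = ncd(a, a)   # near 0: a document compresses well against itself
d_diff = ncd(a, b"completely unrelated text here")
```

Similar documents share redundancy, so their concatenation compresses nearly as well as either alone, yielding a small distance; dissimilar documents do not.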
Finally, we show that using our Wikipedia-based method to augment the cosine vector
space model provides superior results to either in isolation. Combining the two methods
leads to an increased correlation of Pearson r = 0.72 on the Lee (2005) document similarity
dataset, which matches the reported result for the state-of-the-art Explicit Semantic
Analysis (ESA) technique, while requiring less than 10% of the Wikipedia database
that ESA requires.
As a use case for document similarity techniques, a purely content-based news-article
recommender system is designed and implemented for a large online media company.
This system is used to gather additional human-generated relevance ratings which we
use to evaluate the performance of three state-of-the-art document similarity metrics for
providing content-based document recommendations.
Training neural word embeddings for transfer learning and translation
Thesis (DPhil)--Stellenbosch University, 2016.
ENGLISH ABSTRACT: In contrast to only a decade ago, it is now easy to collect large text corpora from the Web on any topic imaginable. However, in order for information processing systems to perform a useful task, such as answering a user's queries on the content of the text, the raw text first needs to be parsed into the appropriate linguistic structures, like parts of speech, named entities or semantic entities. Contemporary natural language processing systems rely predominantly on supervised machine learning techniques for performing this task. However, the supervision required to train these models is expensive to come by, since human annotators need to mark up relevant pieces of text with the required labels of interest. Furthermore, machine learning practitioners need to manually engineer a set of task-specific features, which represents a wasteful duplication of effort for each new task.
An alternative approach is to attempt to automatically learn representations from raw text that are useful for predicting a wide variety of linguistic structures. In this dissertation, we hypothesise that neural word embeddings, i.e. representations that use continuous values to represent words in a learned vector space of meaning, are a suitable and efficient approach for learning representations of natural languages that are useful for predicting various aspects related to their meaning. We show experimental results which support this hypothesis, and present several contributions which make inducing word representations faster and applicable for monolingual and various cross-lingual prediction tasks.
The first contribution to this end is SimTree, an efficient algorithm for jointly clustering words into semantic classes while training a neural network language model with the hierarchical softmax output layer. The second is an efficient subsampling training technique for speeding up learning while increasing accuracy of word embeddings induced using the hierarchical softmax. The third is BilBOWA, a bilingual word embedding model that can
efficiently learn to embed words across multiple languages using only a limited sample of
parallel raw text, and unlimited amounts of monolingual raw text. The fourth is Barista, a bilingual word embedding model that efficiently uses additional semantic information about how words map into equivalence classes, such as parts of speech or word translations, and includes this information during the embedding process. In addition, this dissertation provides an in-depth overview of the different neural language model architectures, and a detailed, tutorial-style overview of the available popular techniques for training these models.
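For context on the subsampling idea, the widely used word2vec frequency-subsampling heuristic keeps a word occurrence with probability sqrt(t/f), where f is the word's corpus frequency and t a small threshold. This standard heuristic is shown only as a point of reference; the dissertation's technique targets the hierarchical softmax specifically and differs in detail:

```python
import math

def keep_probability(word_freq: float, t: float = 1e-5) -> float:
    """Probability of keeping one occurrence of a word under the
    standard word2vec subsampling heuristic: sqrt(t / f), capped at 1."""
    return min(1.0, math.sqrt(t / word_freq))

p_frequent = keep_probability(0.05)   # very frequent word: mostly discarded
p_rare = keep_probability(1e-6)       # rare word: always kept
```

Discarding most occurrences of very frequent words both speeds up training and improves the embeddings of the remaining words, since frequent function words carry little semantic signal per occurrence.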
Unsupervised mining of lexical variants from noisy text
The amount of data produced in user-generated content continues to grow at a staggering rate. However, the text found in these media can deviate wildly from the standard rules of orthography, syntax and even semantics, and presents significant problems to downstream applications which make use of all this noisy data. In this paper we present a novel unsupervised method for extracting domain-specific lexical variants given a large volume of text. We demonstrate the utility of this method by applying it to normalize text messages found in the online social media service, Twitter, into their most likely standard English versions. Our method yields a 20% reduction in word error rate over an existing state-of-the-art approach.
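The word error rate used as the evaluation measure is word-level edit distance divided by reference length. A self-contained sketch of the standard computation (not the paper's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[-1][-1] / len(ref)

# Normalizing "c u tomorrow" against the reference "see you tomorrow"
wer = word_error_rate("see you tomorrow", "c u tomorrow")
```

Here two of the three reference words differ, so the WER is 2/3; a normalizer that maps "c" to "see" and "u" to "you" would drive it to zero.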