28,708 research outputs found
Fidelity-Weighted Learning
Training deep neural networks requires many training samples, but in practice
training labels are expensive to obtain and may be of varying quality, as some
may be from trusted expert labelers while others might be from heuristics or
other sources of weak supervision such as crowd-sourcing. This creates a
fundamental quality-versus-quantity trade-off in the learning process. Do we
learn from the small amount of high-quality data or the potentially large
amount of weakly-labeled data? We argue that if the learner could somehow know
and take the label-quality into account when learning the data representation,
we could get the best of both worlds. To this end, we propose
"fidelity-weighted learning" (FWL), a semi-supervised student-teacher approach
for training deep neural networks using weakly-labeled data. FWL modulates the
parameter updates to a student network (trained on the task we care about) on a
per-sample basis according to the posterior confidence of its label-quality
estimated by a teacher (who has access to the high-quality labels). Both
student and teacher are learned from the data. We evaluate FWL on two tasks in
information retrieval and natural language processing where we outperform
state-of-the-art alternative semi-supervised methods, indicating that our
approach makes better use of strong and weak labels, and leads to better
task-dependent data representations. Comment: Published as a conference paper at ICLR 201
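The per-sample modulation described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the student here is a one-dimensional linear regressor rather than a deep network, and the confidence values are hand-set, hypothetical stand-ins for what a teacher would estimate.

```python
def weighted_sgd_epoch(w, b, samples, confidences, lr=0.1):
    """One SGD pass on squared error; each update is scaled by a confidence in [0, 1]."""
    for (x, y), c in zip(samples, confidences):
        pred = w * x + b
        grad = pred - y        # derivative of 0.5 * (pred - y)**2 with respect to pred
        step = lr * c          # the confidence modulates the per-sample step size
        w -= step * grad * x
        b -= step * grad
    return w, b

# Toy weak labels for y = 2x; the third label is noisy.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, -5.0)]
# Hypothetical teacher confidences: the noisy label is given zero weight.
confidences = [1.0, 1.0, 0.0]

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = weighted_sgd_epoch(w, b, samples, confidences)
```

Because the noisy sample contributes no update, the student converges towards the line through the two trusted points. Intermediate confidences would shrink a sample's influence rather than remove it.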
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
separate from the strength of their semantic dependence. E.g. "red tape" might
be overall less frequent than "tape measure" in some corpus, but this does not
mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
Linked Data Supported Information Retrieval
Search engines have become indispensable for finding content on the World Wide Web. Semantic Web and Linked Data technologies enable a more detailed and unambiguous structuring of content and allow entirely new approaches to solving information retrieval problems. This thesis examines how information retrieval applications can benefit from incorporating Linked Data. New methods for computer-assisted semantic text analysis, semantic search, information prioritization, and information visualization are presented and comprehensively evaluated. Linked Data resources and their relationships are integrated into these methods to increase their effectiveness or their usability. First, an introduction to the fundamentals of information retrieval and Linked Data is given. Next, new manual and automated methods for semantically annotating documents by linking them to Linked Data resources (entity linking) are presented. The methods are comprehensively evaluated, and the underlying evaluation system is extensively improved. Building on the annotation methods, two new retrieval models for semantic search are presented and evaluated. The models are based on the generalized vector space model and incorporate semantic similarity, derived from taxonomy-based relationships between the Linked Data resources in documents and queries, into the computation of the result ranking. To further refine the computation of semantic similarity, a method for prioritizing Linked Data resources is presented and evaluated. Building on this, visualization techniques are presented that aim to improve explorability and navigability within a semantically annotated document corpus.
Two applications are presented for this purpose: a Linked Data based exploratory extension that complements a traditional keyword-based search engine, and a Linked Data based recommender system.
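As a rough illustration of how a generalized vector space model can use taxonomy-based similarity in ranking, the sketch below scores a document by summing over all pairs of query and document entities. The taxonomy, the similarity values, and the function names are hypothetical and are not taken from the thesis; normalization is omitted.

```python
def gvsm_score(query, doc, sim):
    """Generalized vector-space score: sum of q_i * d_j * sim(i, j) over all entity pairs."""
    return sum(
        q_weight * d_weight * sim(q_entity, d_entity)
        for q_entity, q_weight in query.items()
        for d_entity, d_weight in doc.items()
    )

# Hypothetical taxonomy: each entity maps to its parent class.
parent = {"Berlin": "City", "Hamburg": "City", "Rhine": "River"}

def taxonomy_sim(a, b):
    """Toy similarity: 1.0 for identical entities, 0.5 for siblings, 0.0 otherwise."""
    if a == b:
        return 1.0
    siblings = parent.get(a) is not None and parent.get(a) == parent.get(b)
    return 0.5 if siblings else 0.0

query = {"Berlin": 1.0}
doc_city = {"Hamburg": 1.0}
doc_river = {"Rhine": 1.0}
```

A classic vector space model would score both documents at zero because neither mentions the query entity. With the toy similarity, `doc_city` receives partial credit for containing a sibling entity.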
Human information processing based information retrieval
This work investigates how the concept of relevance in Information Retrieval can be validated. It is motivated by the persistent difficulty of defining the concept and by advances in the field of cognitive science.
Analytical and empirical investigations are carried out with the aim of devising a principled approach to the validation of the concept. The foundation for this work was set by
interpreting relevance as a phenomenon occurring within the context of two systems:
an IR system and the cognitive processing system of the user. In light of the cognitive interpretation of relevance, an analysis of the lessons learnt in cognitive science
with regard to the validation of cognitive phenomena was conducted. It found that construct validity is the dominant approach to the validation of constructs in
cognitive science. Construct validity is a proposal for conducting validation in scenarios where no direct observation of a phenomenon is possible. Given
the limitations on direct observation of a construct (i.e. a postulated theoretic concept), it bases validation on the evaluation of its relations to other constructs.
Based on the interpretation of relevance as a product of cognitive processing, it was concluded that the limitations with regard to direct observation apply to its investigation.
The evaluation of the applicability of construct validity to an IR context focused on the exploration of the nomological network methodology. A nomological network constitutes an analytically constructed set of constructs and their relations. The construction of such a network
forms the basis for establishing construct validity through investigation of the relations between constructs. An analysis focused on contemporary insights into the nomological
network methodology identified two important aspects with regard to its application in IR. The first aspect is given by a choice of context and the identification of a pool of
candidate constructs for the inclusion in the network. The second consists of identifying criteria for the selection of a set of constructs from the candidate pool. The
identification of the pertinent constructs for the network was based on a review of the principles
of cognitive exploration, and an analysis of the state of the art in text based discourse processing and reasoning. On that basis, a listing of known sub-processes contributing
to the pertinent cognitive processing was presented. Based on the identification of a large number of potential candidates, the next step consisted of the inference of criteria for the selection of an initial set of constructs for the network. The investigation of these
criteria focused on the consideration of pragmatic and meta-theoretical aspects. Based on a survey of experimental means in cognitive science and IR, five pragmatic criteria for the selection of constructs were presented. Considering meta-theoretically motivated criteria required investigating the specific challenges posed by the
validation of highly abstract constructs. This question was explored based on the underlying considerations of the Information Processing paradigm and Newell’s (1994)
cognitive bands. This led to the identification of a set of three meta-theoretical criteria for the selection of constructs. Based on the criteria and the demarcated candidate pool, an IR focused nomological network was defined. The network consists of the constructs of relevance and type and grade of word relatedness.
A necessary prerequisite for making inferences based on a nomological network is the availability of validated measurement instruments for the constructs. To that end, two validation studies targeting the measurement of the type and grade of relations between words were conducted. Establishing the validity
of the measurement instruments enabled the application of the nomological network. A first step of the application consisted of testing whether the constructs in the network are
related to each other. Based on the alignment of measurements of relevance and the word-related constructs, it was concluded that they are. The relation between the constructs was characterized by varying the word-related constructs over a large parameter space and observing the effect of this variation on relevance. Three hypotheses relating to different aspects of the relations between the word-related constructs and relevance were investigated empirically. It was
concluded that the conclusive confirmation of the hypotheses requires an extension of the experimental means underlying the study. Based on converging observations from
the empirical investigation of the three hypotheses, it was concluded that semantic and associative relations distinctly differ with regard to their impact on relevance estimation.
The underpinnings of a composite measure for automatic term extraction: The case of SRC
The corpus-based identification of those lexical units which serve to describe a given specialized domain usually becomes a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most of the research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded on the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric. Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grants FFI2011-29798-C02-01 and FFI2014-53788-C3-1-P. Periñán Pascual, JC. (2015). The underpinnings of a composite measure for automatic term extraction: The case of SRC. Terminology. 21(2):151-179. doi:10.1075/term.21.2.02per
A Hybrid Model for Document Retrieval Systems.
A methodology for the design of document retrieval systems is presented. First, a composite index term weighting model is developed based on term frequency statistics, including document frequency, relative frequency within document and relative frequency within collection, which can be adjusted by selecting various coefficients to fit different indexing environments. Then, a composite retrieval model is proposed to process a user's information request, expressed as a weighted Phrase-Oriented Fixed-Level Expression (POFLE) that may apply more than Boolean operators, in two phases. In the first phase, documents that are topically relevant to the information request are retrieved by means of a descriptor matching mechanism, which incorporates a partial matching facility based on a structurally restricted relationship imposed by the indexing model and is more general than the matching functions of the traditional Boolean and vector space models. In the second phase, these topically relevant documents are ranked, by means of two types of heuristic-based selection rules and a knowledge-based evaluation function, in descending order of a preference score that predicts the combined effect of user preference for quality, recency, fitness and reachability of documents.
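The abstract does not give the weighting formula, so the following is only a hypothetical sketch of what a coefficient-adjustable composite term weight over the three named statistics could look like. The linear form, the logarithmic treatment of document frequency, and the coefficient names `a`, `b`, `c` are all assumptions.

```python
import math

def composite_weight(term, doc, collection, a=1.0, b=1.0, c=1.0):
    """Hypothetical linear blend of three frequency statistics for `term` in `doc`.

    `doc` is a non-empty list of tokens and `collection` is a list of such documents.
    """
    # Relative frequency of the term within the document.
    rf_doc = doc.count(term) / len(doc)
    # Relative frequency of the term within the whole collection.
    total_tokens = sum(len(d) for d in collection)
    rf_coll = sum(d.count(term) for d in collection) / total_tokens
    # Document frequency, turned into an inverse score so rarer terms rank higher.
    df = sum(1 for d in collection if term in d)
    idf = math.log(len(collection) / df) if df else 0.0
    return a * rf_doc + b * idf - c * rf_coll

docs = [
    ["hybrid", "retrieval", "model", "retrieval"],
    ["boolean", "model"],
    ["vector", "space", "model"],
]
```

Setting a coefficient to zero switches off its component, which is one simple way to adapt such a weight to different indexing environments.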
Beyond Discourse: Computational Text Analysis and Material Historical Processes
This dissertation proposes a general methodological framework for the application of computational text analysis to the study of long duration material processes of transformation, beyond their traditional application to the study of discourse and rhetorical action. Over a thin theory of the linguistic nature of social facts, the proposed methodology revolves around the compilation of term co-occurrence matrices and their projection into different representations of an hypothetical semantic space. These representations offer solutions to two problems inherent to social scientific research: that of "mapping" features in a given representation to theoretical entities and that of "alignment" of the features seen in models built from different sources in order to enable their comparison.
The data requirements of the exercise are discussed through the introduction of the notion of a "narrative horizon", the extent to which a given source incorporates a narrative account in its rendering of the context that produces it. Useful primary data will consist of text with short narrative horizons, such that the ideal source will correspond to a continuous archive of institutional, ideally bureaucratic text produced as mere documentation of a definite population of more or less stable and comparable social facts across a couple of centuries. Such a primary source is available in the Proceedings of the Old Bailey (POB), a collection of transcriptions of 197,752 criminal trials seen by the Old Bailey and the Central Criminal Court of London and Middlesex between 1674 and 1913 that includes verbatim transcriptions of witness testimony. The POB is used to demonstrate the proposed framework, starting with the analysis of the evolution of an historical corpus to illustrate the procedure by which provenance data is used to construct longitudinal and cross-sectional comparisons of different corpus segments.
The co-occurrence matrices obtained from the POB corpus are used to demonstrate two different projections: semantic networks that model different notions of similarity between the terms in a corpus' lexicon as an adjacency matrix describing a graph and semantic vector spaces that approximate a lower-dimensional representation of an hypothetical semantic space from its empirical effects on the co-occurrence matrix.
Semantic networks are presented as discrete mathematical objects that offer a solution to the mapping problem through operations that allow for the construction of sets of terms over which an order can be induced using any measure of significance of the strength of association between a term set and its elements. Alignment can then be solved through different similarity measures computed over the intersection and union of the sets under comparison.
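One familiar similarity measure computed over the intersection and union of two sets is the Jaccard index. The sketch below uses it only as an example of that family; the dissertation may rely on other measures, and the term sets are invented.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0

# Invented term sets, standing in for sets extracted from two corpus segments.
terms_early = {"theft", "coin", "pocket", "watch"}
terms_late = {"theft", "watch", "shop", "cloth"}

similarity = jaccard(terms_early, terms_late)  # 2 shared terms out of 6 distinct terms
```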
Semantic vector spaces are presented as continuous mathematical objects that offer a solution to the mapping problem in the linear structures contained in them. These include, in all cases, a meaningful metric that makes it possible to define neighbourhoods and regions in the semantic space and, in some cases, a meaningful orientation that makes it possible to trace dimensions across them. Alignment can then proceed endogenously in the case of oriented vector spaces for relative comparisons, or through the construction of common basis sets for non-oriented semantic spaces for absolute comparisons.
The dissertation concludes with the proposition of a general research program for the systematic compilation of text distributional patterns in order to facilitate a much needed process of calibration required by the techniques discussed in the previous chapters. Two specific avenues for further research are identified. First, the development of incremental methods of projection that allow a semantic model to be updated as new observations come along, an area that has received considerable attention from the field of electronic finance and the pervasive use of Gentleman's algorithm for matrix factorisation. Second, the development of additively decomposable models that may be combined or disaggregated to obtain a similar result to the one that would have been obtained had the model been computed from the union or difference of their inputs. This is established to be dependent on whether the functions that actualise a given model are associative under addition or not.
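Raw co-occurrence counts are a simple case of an additively decomposable model. In the sketch below, which uses invented toy documents and document-level co-occurrence as assumptions, counts from two disjoint corpus segments are combined and disaggregated with `collections.Counter` arithmetic.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count unordered pairs of distinct terms that appear in the same document."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts

# Invented toy segments; "union" here means concatenating disjoint segments.
segment_a = [["prisoner", "stole", "watch"], ["prisoner", "guilty"]]
segment_b = [["witness", "saw", "prisoner"], ["stole", "watch"]]

# Combining two models by addition.
merged = cooccurrence(segment_a) + cooccurrence(segment_b)
# Disaggregating: removing one segment's counts from the joint model.
recovered_a = cooccurrence(segment_a + segment_b) - cooccurrence(segment_b)
```

Here `merged` matches the counts computed directly from both segments, and `recovered_a` matches those of `segment_a` alone. Projections derived from the counts, such as a factorisation, would not generally share this property.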
Learning Analogies and Semantic Relations
We present an algorithm for learning from unlabeled text, based on the
Vector Space Model (VSM) of information retrieval, that can solve verbal
analogy questions of the kind found in the Scholastic Aptitude Test (SAT).
A verbal analogy has the form A:B::C:D, meaning "A is to B as C is to D";
for example, mason:stone::carpenter:wood. SAT analogy questions provide
a word pair, A:B, and the problem is to select the most analogous word
pair, C:D, from a set of five choices. The VSM algorithm correctly
answers 47% of a collection of 374 college-level analogy questions
(random guessing would yield 20% correct). We motivate this research by
relating it to work in cognitive science and linguistics, and by applying
it to a difficult problem in natural language processing, determining
semantic relations in noun-modifier pairs. The problem is to classify a
noun-modifier pair, such as "laser printer", according to the semantic
relation between the noun (printer) and the modifier (laser). We use a
supervised nearest-neighbour algorithm that assigns a class to a given
noun-modifier pair by finding the most analogous noun-modifier pair in
the training data. With 30 classes of semantic relations, on a collection
of 600 labeled noun-modifier pairs, the learning algorithm attains an F
value of 26.5% (random guessing: 3.3%). With 5 classes of semantic
relations, the F value is 43.2% (random: 20%). The performance is
state-of-the-art for these challenging problems.
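The selection step can be sketched as a cosine comparison between a vector for the stem pair and a vector for each candidate pair. The feature vectors below are invented for illustration (imagined counts of joining phrases such as "X cuts Y" or "X works with Y") and are not derived from any corpus; how the paper builds its vectors is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_analogous(stem_vec, choices):
    """Return the label of the choice whose vector has the highest cosine with the stem."""
    return max(choices, key=lambda label: cosine(stem_vec, choices[label]))

# Invented pattern-frequency vectors for the stem pair and three candidate pairs.
stem = [8, 5, 0]  # mason:stone
choices = {
    "carpenter:wood": [6, 4, 1],
    "teacher:chalk": [0, 3, 2],
    "bird:wing": [0, 0, 9],
}

best = most_analogous(stem, choices)
```

The same nearest-neighbour idea carries over to the noun-modifier task: a test pair would take the class of the labeled training pair with the highest cosine.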