Supervised and unsupervised methods for learning representations of linguistic units
Word representations, also called word embeddings, are generic representations, often high-dimensional vectors. They map the discrete space of words into a continuous vector space, which allows us to handle rare or even unseen events, e.g. by considering their nearest neighbors. Many Natural Language Processing tasks can be improved by word representations if we augment the task-specific training data with the general knowledge incorporated in the word representations.
The first publication investigates a supervised, graph-based method to create word representations. This method leads to a graph-theoretic similarity measure, CoSimRank, with equivalent formalizations that show CoSimRank’s close relationship to Personalized PageRank and SimRank. The new formalization is efficient because it can use the graph-based word representations to compute a single node similarity without having to compute the similarities of the entire graph. We also show how we can take advantage of fast matrix multiplication algorithms.
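To make the single-pair computation concrete, here is a minimal Python sketch of a CoSimRank-style iteration: the similarity of two nodes is accumulated as a damped sum of dot products of their k-step Personalized-PageRank-like distributions. The matrix convention, the damping constant `c`, and the absence of any final normalization are our assumptions, not the publication's exact formulation.

```python
import numpy as np

def cosimrank(A, i, j, c=0.8, iters=20):
    """Hedged sketch of a CoSimRank-style node similarity.

    A : row-stochastic adjacency matrix (rows sum to 1).
    The similarity of nodes i and j is a damped sum of dot products of
    their k-step random-walk distributions, which is why a single pair
    can be scored without computing the full node-similarity matrix.
    """
    p = np.zeros(A.shape[0]); p[i] = 1.0   # walk started at node i
    q = np.zeros(A.shape[0]); q[j] = 1.0   # walk started at node j
    sim = p @ q                            # k = 0 term
    for k in range(1, iters):
        p = p @ A                          # one step of the walk from i's side
        q = q @ A                          # one step of the walk from j's side
        sim += (c ** k) * (p @ q)          # damped overlap of the two walks
    return sim
```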
In the second publication, we use existing unsupervised methods for word representation learning and combine these with semantic resources by learning representations for non-word objects like synsets and entities. We also investigate improved word representations which incorporate the semantic information from the resource. The method is flexible in that it can take any word representations as input and does not need an additional training corpus. A sparse tensor formalization guarantees efficiency and parallelizability.
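The idea of deriving representations for non-word objects from existing word representations can be illustrated with a simplified sketch: each synset vector below is just the uniform average of its member word vectors, written as one sparse matrix product. This is only an illustration; the publication's sparse tensor formalization and its learned (rather than uniform) weights are not reproduced here.

```python
import numpy as np
from scipy.sparse import csr_matrix

def synset_vectors(word_vecs, synset_members):
    """Simplified sketch: derive synset embeddings from word embeddings.

    word_vecs      : (V, d) matrix of pre-trained word vectors.
    synset_members : list of lists; synset_members[s] holds the word ids
                     belonging to synset s (e.g. taken from a resource
                     such as WordNet).
    Each synset vector is the uniform average of its member word vectors,
    expressed as a single sparse matrix product so it stays cheap and
    easy to parallelize.
    """
    V, _ = word_vecs.shape
    rows, cols, vals = [], [], []
    for s, members in enumerate(synset_members):
        for w in members:
            rows.append(s); cols.append(w); vals.append(1.0 / len(members))
    M = csr_matrix((vals, (rows, cols)), shape=(len(synset_members), V))
    return M @ word_vecs   # (num_synsets, d)
```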
In the third publication, we introduce a method that learns an orthogonal transformation of the word representation space that focuses the information relevant for a task in an ultradense subspace of a dimensionality that is smaller by a factor of 100 than the original space. We use ultradense representations for a Lexicon Creation task in which words are annotated with three types of lexical information – sentiment, concreteness and frequency.
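Once such an orthogonal transformation has been learned, using the ultradense subspace is straightforward; the sketch below shows only this read-out step (the training objective is not shown), and the matrix `Q` and the choice of the first coordinates as the task subspace are assumptions for illustration.

```python
import numpy as np

def ultradense_scores(word_vecs, Q, dims=1):
    """Project embeddings into an ultradense task subspace.

    word_vecs : (V, d) original word embeddings.
    Q         : (d, d) orthogonal matrix assumed to concentrate the
                task-relevant information (e.g. sentiment) in the first
                `dims` coordinates of the rotated space.
    Returns a (V, dims) matrix; with dims == 1 this is one score per word,
    which can be sorted to create a lexicon.
    """
    assert np.allclose(Q @ Q.T, np.eye(Q.shape[0]), atol=1e-5), "Q must be orthogonal"
    return (word_vecs @ Q.T)[:, :dims]
```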
The final publication introduces a new calculus for the interpretable ultradense subspaces, including polarity, concreteness, frequency and part-of-speech (POS). The calculus supports operations like “−1 × hate = love” and “give me a neutral word for greasy” (i.e., oleaginous) and extends existing analogy computations like “king − man + woman = queen”.
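As a hedged illustration of how an operation such as “−1 × hate = love” could be realized in an interpretable subspace: flip the sign of the polarity coordinate and return the nearest vocabulary word. The dimension index, the cosine search, and all names below are assumptions for illustration, not the calculus defined in the publication.

```python
import numpy as np

def negate_polarity(word, vocab, dense_vecs, polarity_dim=0):
    """Toy illustration of a "-1 x hate = love" style operation.

    vocab       : dict mapping word -> row index.
    dense_vecs  : (V, k) embeddings already rotated into the interpretable
                  ultradense space (see the sketch above).
    polarity_dim: coordinate assumed to encode polarity.
    Flips the polarity coordinate of `word`, keeps all other coordinates,
    and returns the nearest vocabulary word by cosine similarity.
    """
    q = dense_vecs[vocab[word]].copy()
    q[polarity_dim] *= -1.0                      # -1 x polarity
    sims = dense_vecs @ q / (np.linalg.norm(dense_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    sims[vocab[word]] = -np.inf                  # exclude the query word itself
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(sims))]
```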
Semantic Concept Co-Occurrence Patterns for Image Annotation and Retrieval.
Describing visual image contents by semantic concepts is an effective and straightforward way to facilitate various high-level applications. Inferring semantic concepts from low-level pictorial feature analysis is challenging due to the semantic gap problem, while manually labeling concepts is impractical given the large number of images in both online and offline collections. In this paper, we present a novel approach to automatically generate intermediate image descriptors by exploiting concept co-occurrence patterns in the pre-labeled training set, which makes it possible to depict complex scene images semantically. Our work is motivated by the fact that multiple concepts that frequently co-occur across images form patterns which can provide contextual cues for individual concept inference. We discover the co-occurrence patterns as hierarchical communities by graph modularity maximization in a network whose nodes and edges represent concepts and co-occurrence relationships, respectively. A random walk process working on the inferred concept probabilities with the discovered co-occurrence patterns is applied to acquire the refined concept signature representation. Through experiments in automatic image annotation and semantic image retrieval on several challenging datasets, we demonstrate the effectiveness of the proposed concept co-occurrence patterns as well as the concept signature representation in comparison with state-of-the-art approaches.
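The two stages described above can be sketched in a few lines of Python: a concept graph whose communities are found by greedy modularity maximization (using networkx, which may differ from the paper's hierarchical community algorithm), followed by a restart-style random walk that refines the initial concept probabilities. Function names, the restart weight `alpha`, and the update rule are assumptions for illustration.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cooccurrence_patterns(cooc_counts, concept_names):
    """Discover concept co-occurrence patterns as graph communities.

    cooc_counts : (C, C) symmetric matrix of how often two concepts are
                  labeled together in the training images.
    Builds a weighted concept graph and greedily maximizes modularity.
    """
    G = nx.Graph()
    C = len(concept_names)
    for i in range(C):
        for j in range(i + 1, C):
            if cooc_counts[i, j] > 0:
                G.add_edge(concept_names[i], concept_names[j],
                           weight=float(cooc_counts[i, j]))
    return [set(c) for c in greedy_modularity_communities(G, weight="weight")]

def refine_by_random_walk(p0, W, alpha=0.85, iters=30):
    """One assumed form of the random-walk refinement of concept scores.

    p0 : initial per-concept probabilities from the base classifiers.
    W  : row-normalized concept transition matrix built from co-occurrence.
    """
    p = p0.copy()
    for _ in range(iters):
        p = (1 - alpha) * p0 + alpha * (p @ W)   # walk with restart to p0
    return p / p.sum()
```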
Breadth analysis of Online Social Networks
This thesis is motivated by the analysis, understanding, and prediction of human behaviour through the study of digital fingerprints. Unlike a classical PhD thesis, in which a single topic is chosen and analyzed in depth, we carried out a breadth analysis of complex networks, such as those that humans create through their relationships and interactions. These digital communities where humans interact and form relationships are commonly called Online Social Networks. (i) We collected user interactions, i.e. the text messages they exchange, in order to analyze the sentiment and topic of those messages. We applied state-of-the-art Natural Language Processing techniques, largely developed and tested on English texts, to a collection of Spanish tweets and compared the results. (ii) We then focused on Topic Detection, building our own classifier and applying it to the same tweet dataset. The contributions are twofold: the classifier relies on text-graphs built from the input text, and it reaches 70% accuracy, outperforming previous results. (iii) We next analyzed the network structure (topology) and the associated data values to detect outliers. We hypothesize that in social networks a large mass of users behaves similarly, while a small set of users behaves differently; especially within this last group, we try to separate users with high activity, low activity, or any other parameter/feature that places them in a different kind of outlier set. We aim to detect influential users within these outlier sets. We propose a new unsupervised method, Massive Unsupervised Outlier Detection (MUOD), which labels each detected outlier as a shape, magnitude, or amplitude outlier, or a combination of those. We applied this method to a subset of roughly 400 million Google+ users, automatically identifying and discriminating sets of outlier users. (iv) Finally, we address the monitoring of real complex networks. We created a framework that dynamically adapts the temporality of large-scale dynamic networks, reducing compute overhead by at least 76%, data volume by 60%, and overall cloud costs by at least 54%, while always maintaining accuracy above 88%.
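As a loose sketch of the MUOD idea (our paraphrase of the shape, magnitude and amplitude indices; the exact definitions and the outlier cutoff used in the thesis may differ): compare each user's "curve" against every other curve through correlation and simple linear regression, and measure how far the averaged quantities deviate from the values a typical curve would have.

```python
import numpy as np

def muod_indices(X):
    """Loose sketch of MUOD-style shape/magnitude/amplitude indices.

    X : (n, T) array, one row per user curve (e.g. activity over time).
    For every pair (i, j) we consider the regression x_i ~ a + b * x_j and
    the correlation rho_ij; each index measures the deviation of the
    average quantity from its typical-curve value (rho = 1, b = 1, a = 0).
    """
    n, T = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    std = X.std(axis=1) + 1e-12
    corr = (Xc @ Xc.T) / (T * np.outer(std, std))          # rho_ij
    beta = (Xc @ Xc.T) / (T * (std ** 2)[None, :])         # slope of x_i on x_j
    alpha = X.mean(axis=1)[:, None] - beta * X.mean(axis=1)[None, :]
    np.fill_diagonal(corr, np.nan)
    np.fill_diagonal(beta, np.nan)
    np.fill_diagonal(alpha, np.nan)
    shape = np.abs(np.nanmean(corr, axis=1) - 1.0)
    amplitude = np.abs(np.nanmean(beta, axis=1) - 1.0)
    magnitude = np.abs(np.nanmean(alpha, axis=1))
    return shape, magnitude, amplitude
```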
Analysis and Applications of the T-complexity
T-codes are variable-length self-synchronizing codes introduced by Titchener in 1984. T-code codewords are constructed recursively from a finite alphabet using an algorithm called T-augmentation, resulting in excellent self-synchronization properties. An algorithm called T-decomposition parses a given sequence into a series of T-prefixes and finds a T-code set in which the sequence is encoded as a longest codeword. There are similarities and differences between T-decomposition and the conventional LZ78 incremental parsing. The LZ78 incremental parsing algorithm sequentially parses a given sequence into consecutive distinct subsequences (words) such that each word consists of the longest matching previously parsed word and a literal symbol; the LZ-complexity is then defined as the number of words. By contrast, T-decomposition parses a given sequence into a series of T-prefixes, each of which consists of the recursive concatenation of the longest matching previously parsed T-prefix and a literal symbol, and it has to access the whole sequence every time it determines a T-prefix. Like the LZ-complexity, the T-complexity of a sequence is defined as the number of T-prefixes; however, the T-complexity of a particular sequence generally tends to be smaller than its LZ-complexity.
In the first part of the thesis, we present our contributions to the theory of T-codes. In order to realize a sequential determination of T-prefixes, we devise a new T-decomposition algorithm using forward parsing. Both the T-complexity profile obtained from the forward T-decomposition and the LZ-complexity profile can be derived in a unified way using a differential equation method that focuses on the increase of the average codeword length of a code tree; the obtained formulas are confirmed to coincide with those of previous studies. The magnitude of the T-complexity of a given sequence generally indicates its degree of randomness. However, there exist interesting sequences that have much larger T-complexities than any random sequence. We investigate the maximum T-complexity sequences and the maximum LZ-complexity sequences using various techniques, including those of the test suite released by the National Institute of Standards and Technology (NIST) of the U.S. government, and find that the maximum T-complexity sequences are less random than the maximum LZ-complexity sequences.
In the second part of the thesis, we present our achievements in terms of applications. We consider two applications: data compression and randomness testing. First, we propose a new data compression scheme based on T-codes, using a dictionary method such that all phrases added to the dictionary have a recursive structure similar to T-codes. Our scheme can compress any of the files in the Calgary Corpus more efficiently than previous schemes based on T-codes and than UNIX compress, a variant of LZ78 (LZW). Next, we introduce a randomness test based on the T-complexity. Recently, the Lempel-Ziv (LZ) randomness test based on the LZ-complexity was officially excluded from the NIST test suite, because the distribution of P-values for random sequences of length 10^6, the most commonly used length, is strictly discrete in the case of the LZ-complexity. Our test solves this problem because the T-complexity features an almost ideal uniform continuous distribution of P-values for random sequences of length 10^6.
The proposed test outperforms the NIST LZ test, a modified LZ test proposed by Doganaksoy and Göloglu, and all other tests included in the NIST test suite, in terms of the detection of undesirable pseudorandom sequences generated by a multiplicative congruential generator (MCG) and of non-random byte sequences Y = Y_0, Y_1, Y_2, ..., where Y_{3i} and Y_{3i+1} are random but Y_{3i+2} is given by (Y_{3i} + Y_{3i+1}) mod 2^8.
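Since the abstract contrasts T-decomposition with LZ78 incremental parsing, a short sketch of the standard LZ78 parse (and thus of the LZ-complexity, the number of words it produces) may be helpful; T-decomposition itself is not reproduced here.

```python
def lz78_parse(s):
    """Standard LZ78 incremental parsing.

    Splits s into consecutive distinct words, each being the longest
    previously seen word extended by one literal symbol. The LZ-complexity
    of s is the number of words produced.
    """
    dictionary = {"": 0}          # word -> phrase index
    words, current = [], ""
    for ch in s:
        if current + ch in dictionary:
            current += ch         # keep extending the longest known prefix
        else:
            words.append(current + ch)
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                   # trailing word (may repeat an earlier one)
        words.append(current)
    return words

# Example: LZ-complexity of a short binary sequence.
parse = lz78_parse("ababbbababaa")
print(parse, "LZ-complexity:", len(parse))
```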
CAPE: Corrective Actions from Precondition Errors using Large Language Models
Extracting commonsense knowledge from a large language model (LLM) offers a path to designing intelligent robots. Existing approaches that leverage LLMs for planning are unable to recover when an action fails and often resort to retrying failed actions without resolving the error's underlying cause. We propose a novel approach (CAPE) that proposes corrective actions to resolve precondition errors during planning. CAPE improves the quality of generated plans by leveraging few-shot reasoning from action preconditions. Our approach enables embodied agents to execute more tasks than baseline methods while ensuring semantic correctness and minimizing re-prompting. In VirtualHome, CAPE generates executable plans while improving a human-annotated plan correctness metric from 28.89% to 49.63% over SayCan. Our improvements transfer to a Boston Dynamics Spot robot initialized with a set of skills (specified in language) and associated preconditions, where CAPE improves the correctness metric of the executed task plans by 76.49% compared to SayCan. Our approach enables the robot to follow natural language commands and robustly recover from failures, which baseline approaches largely cannot resolve or can address only inefficiently.
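A minimal sketch of the re-prompting idea follows, assuming a planner that asks an LLM for one action at a time and, when an action fails a precondition check, feeds the error message back to the model so it can propose a corrective action. All function names and the prompt format are hypothetical; this is not the authors' implementation.

```python
from typing import Callable, List, Tuple

def plan_with_corrections(goal: str,
                          llm: Callable[[str], str],
                          execute_step: Callable[[str], Tuple[bool, str]],
                          max_corrections: int = 3) -> List[str]:
    """Hedged sketch of a CAPE-style correction loop.

    llm          : maps a prompt to the next generated action ("done" ends it).
    execute_step : tries an action and returns (success, precondition_error).
    On a precondition error, the error message is fed back into the prompt
    so the model can propose a corrective action (e.g. "open the fridge"
    before "grab the milk"), instead of blindly retrying the failed action.
    """
    prompt = f"Task: {goal}\nPlan step by step.\nNext action:"
    executed: List[str] = []
    while True:
        action = llm(prompt).strip()
        if action.lower() == "done":
            return executed
        ok, error = execute_step(action)
        corrections = 0
        while not ok and corrections < max_corrections:
            # Re-prompt with the unmet precondition rather than retrying verbatim.
            fix_prompt = (prompt + f" {action}\n"
                          f"That failed because: {error}\n"
                          f"Corrective action:")
            action = llm(fix_prompt).strip()
            ok, error = execute_step(action)
            corrections += 1
        if not ok:
            return executed            # give up on this task
        executed.append(action)
        prompt += f" {action}\nNext action:"
```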