301 research outputs found
NLP Analysis of Email Interactions to find automation opportunities
Finding automatization opportunities for email interactions can have positive effects for several industries, especially in tasks such as reading, receiving, writing and responding emails, categorizing emails or even to prevent loss of productivity and financial loses by dealing with spam, or improve users' satisfaction; even improving automatic categorization systems can mitigate negative impacts on personal and organization performance. Furthermore, people who work in companies spend around 28 % of their time reading and answering emails. In this project we proposed a methodology based on NLP and Unsupervised Machine Learning to look for opportunities of automation arising from recurrent email patterns found in email texts. We intent to facilitate the linguistic analysis in order to retrieve interaction patterns that can trigger automation actions. We proposed CRISP-DM methodology that lays the groundwork for detection of automatization opportunities in tasks relates. We compared the unsupervised machine learning methods K-Means, DBSCAN, and HDBSCAN with four clustering metrics applied to the Enron e-mails dataset transformed into paragraph vectors and performed several experiments with Word Mover's Distance, Euclidean Distance, L2-Norm and Cosine Similarity. Although our process yielded limited results in the detection of email interactions, we found that DBSCAN combined with Euclidean Distance was the best method among all scores. This project also contributes to the parameterization literature of said clustering algorithms as well as showing which methods, distances and scores settings are relevant for unsupervised email mining
V-Measure: A conditional entropy-based external cluster evaluation
We present V-measure, an external entropy-based cluster evaluation measure. Vmeasure provides an elegant solution to many problems that affect previously defined cluster evaluation measures including 1) dependence on clustering algorithm or data set, 2) the “problem of matching”, where the clustering of only a portion of data points are evaluated and 3) accurate evaluation and combination of two desirable aspects of clustering, homogeneity and completeness. We compare V-measure to a number of popular cluster evaluation measures and demonstrate that it satisfies several desirable properties of clustering solutions, using simulated clustering results. Finally, we use V-measure to evaluate two clustering tasks: document clustering and pitch accent type clustering
Visual SLAM with RGB-D cameras based on pose graph optimization
En este trabajo abordamos el problema de localización y mapeo simultáneo (SLAM) utilizando únicamente información obtenida mediante una cámara RGB-D. El objetivo principal es desarrollar un sistema SLAM capaz de estimar la trayectoria completa del sensor y generar una
representación 3D consistente del entorno en tiempo real. Para lograr este objetivo, el sistema se basa en un método de estimación del movimiento del sensor a partir de información de profundidad densa y en técnicas de reconocimiento de lugares a partir de características visuales. A partir de estos algoritmos, se extraen restricciones espaciales entre fotogramas
cuidadosamente seleccionados. Con estas restricciones espaciales se construye
un grafo de poses, empleado para inferir la trayectoria más verosímil. El sistema se ha diseñado para ejecutarse en dos hilos paralelos: uno para el seguimiento y el otro para la construcción de la representación consistente. El sistema se evalúa en conjuntos de datos públicamente accesible, alcanzando una precisión comparable a sistemas de SLAM del estado del arte. Además, el hilo de seguimiento se ejecuta a una frecuencia de 60 Hz en un ordenador portátil de prestaciones modestas. También se realizan pruebas en situaciones más realistas, procesando observaciones adquiridas mientras se movía el sensor por dos entornos de interiores distintos
Using Generic Summarization to Improve Music Information Retrieval Tasks
In order to satisfy processing time constraints, many MIR tasks process only
a segment of the whole music signal. This practice may lead to decreasing
performance, since the most important information for the tasks may not be in
those processed segments. In this paper, we leverage generic summarization
algorithms, previously applied to text and speech summarization, to summarize
items in music datasets. These algorithms build summaries, that are both
concise and diverse, by selecting appropriate segments from the input signal
which makes them good candidates to summarize music as well. We evaluate the
summarization process on binary and multiclass music genre classification
tasks, by comparing the performance obtained using summarized datasets against
the performances obtained using continuous segments (which is the traditional
method used for addressing the previously mentioned time constraints) and full
songs of the same original dataset. We show that GRASSHOPPER, LexRank, LSA,
MMR, and a Support Sets-based Centrality model improve classification
performance when compared to selected 30-second baselines. We also show that
summarized datasets lead to a classification performance whose difference is
not statistically significant from using full songs. Furthermore, we make an
argument stating the advantages of sharing summarized datasets for future MIR
research.Comment: 24 pages, 10 tables; Submitted to IEEE/ACM Transactions on Audio,
Speech and Language Processin
Intelligent Information Access to Linked Data - Weaving the Cultural Heritage Web
The subject of the dissertation is an information alignment experiment of two cultural heritage information systems (ALAP): The Perseus Digital Library and Arachne. In modern societies, information integration is gaining importance for many tasks such as business decision making or even catastrophe management. It is beyond doubt that the information available in digital form can offer users new ways of interaction. Also, in the humanities and cultural heritage communities, more and more information is being published online. But in many situations the way that information has been made publicly available is disruptive to the research process due to its heterogeneity and distribution. Therefore integrated information will be a key factor to pursue successful research, and the need for information alignment is widely recognized.
ALAP is an attempt to integrate information from Perseus and Arachne, not only on a schema level, but to also perform entity resolution. To that end, technical peculiarities and philosophical implications of the concepts of identity and co-reference are discussed. Multiple approaches to information integration and entity resolution are discussed and evaluated. The methodology that is used to implement ALAP is mainly rooted in the fields of information retrieval and knowledge discovery.
First, an exploratory analysis was performed on both information systems to get a first impression of the data. After that, (semi-)structured information from both systems was extracted and normalized. Then, a clustering algorithm was used to reduce the number of needed entity comparisons. Finally, a thorough matching was performed on the different clusters. ALAP helped with identifying challenges and highlighted the opportunities that arise during the attempt to align cultural heritage information systems
A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models
Word representation has always been an important research area in the history
of natural language processing (NLP). Understanding such complex text data is
imperative, given that it is rich in information and can be used widely across
various applications. In this survey, we explore different word representation
models and its power of expression, from the classical to modern-day
state-of-the-art word representation language models (LMS). We describe a
variety of text representation methods, and model designs have blossomed in the
context of NLP, including SOTA LMs. These models can transform large volumes of
text into effective vector representations capturing the same semantic
information. Further, such representations can be utilized by various machine
learning (ML) algorithms for a variety of NLP related tasks. In the end, this
survey briefly discusses the commonly used ML and DL based classifiers,
evaluation metrics and the applications of these word embeddings in different
NLP tasks
- …