11 research outputs found

    Analyzing Social Media Contents

    Ph.D. thesis (Doctor of Philosophy)

    Principled Approaches to Automatic Text Summarization

    Automatic text summarization is a particularly challenging Natural Language Processing (NLP) task involving natural language understanding, content selection and natural language generation. In this thesis, we concentrate on the content selection aspect, the inherent problem of summarization which is governed by the notion of information Importance. We present a simple and intuitive formulation of the summarization task as two components: a summary scoring function θ measuring how good a text is as a summary of the given sources, and an optimization technique O extracting a summary with a high score according to θ. This perspective offers interesting insights into previous summarization efforts and allows us to pinpoint promising research directions. In particular, we observe that previous works heavily constrained the summary scoring function in order to obtain convenient optimization problems (e.g., Integer Linear Programming). We question this assumption and demonstrate that General Purpose Optimization (GPO) techniques like genetic algorithms are practical. These GPOs do not require mathematical properties from the objective function, so the summary scoring function can be relieved of its previously imposed constraints. Additionally, the summary scoring function can be evaluated on its own, based on its ability to correlate with human judgments. This offers a principled way of examining the inner workings of summarization systems and complements the traditional evaluations of the extracted summaries. In fact, evaluation metrics are also summary scoring functions which should correlate well with humans. Thus, the two main challenges of summarization, evaluation and the development of summarizers, are unified within the same setup: discovering strong summary scoring functions. Hence, we investigated ways of uncovering such functions. First, we conducted an empirical study of learning the summary scoring function from data. The results show that an unconstrained summary scoring function correlates better with humans. Furthermore, an unconstrained summary scoring function optimized approximately with GPO extracts better summaries than a constrained summary scoring function optimized exactly with, e.g., ILP. Along the way, we proposed techniques to leverage the small and biased human judgment datasets. Additionally, we released a new evaluation metric explicitly trained to maximize its correlation with humans. Second, we developed a theoretical formulation of the notion of Importance. In a framework rooted in information theory, we defined the quantities Redundancy, Relevance and Informativeness. Importance arises as the notion unifying these concepts. More generally, Importance is the measure that guides which choices to make when information must be discarded. Finally, evaluation remains an open problem with a massive impact on summarization progress. Thus, we conducted experiments on available human judgment datasets commonly used to compare evaluation metrics. We discovered that these datasets do not cover the high-quality range in which summarization systems and evaluation metrics operate. This motivates efforts to collect human judgments for high-scoring summaries, as this would be necessary to settle the debate over which metric to use. It would also be greatly beneficial for improving summarization systems and metrics alike.
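    The two-component view above lends itself to a compact illustration. The sketch below pairs a toy summary scoring function θ (coverage of frequent source words minus within-summary redundancy; a stand-in, not the learned function from the thesis) with a small genetic algorithm as the general-purpose optimizer O. All names, parameters and data are illustrative assumptions.

```python
import random
from collections import Counter

def theta(summary_sents, source_freq):
    # Toy scoring function: reward coverage of frequent source words,
    # penalize repeated words within the summary.
    words = [w for s in summary_sents for w in s.lower().split()]
    coverage = sum(source_freq[w] for w in set(words))
    redundancy = len(words) - len(set(words))
    return coverage - redundancy

def gpo_summarize(sentences, k=2, pop_size=30, generations=100, seed=0):
    # General-purpose optimizer O: a small genetic algorithm over
    # k-sentence subsets. It requires no mathematical properties of theta.
    rng = random.Random(seed)
    n = len(sentences)
    freq = Counter(w for s in sentences for w in s.lower().split())

    def score(ind):
        return theta([sentences[i] for i in ind], freq)

    def crossover(a, b):
        pool = list(set(a) | set(b))
        rng.shuffle(pool)
        child = pool[:k]
        while len(child) < k:  # top up if the parents overlapped heavily
            c = rng.randrange(n)
            if c not in child:
                child.append(c)
        return child

    def mutate(ind):
        if rng.random() < 0.3:  # occasionally swap in an unused sentence
            c = rng.randrange(n)
            if c not in ind:
                ind[rng.randrange(k)] = c
        return ind

    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[: pop_size // 2]
        pop = elite + [mutate(crossover(*rng.sample(elite, 2)))
                       for _ in range(pop_size - len(elite))]
    best = max(pop, key=score)
    return [sentences[i] for i in sorted(best)]

sents = ["Summarization selects important content.",
         "Genetic algorithms evolve candidate solutions.",
         "The cat sat on the mat.",
         "Content selection is guided by information importance."]
print(gpo_summarize(sents, k=2))
```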

    Extracting Causal Relations between News Topics from Distributed Sources

    The overwhelming amount of online news presents a challenge known as news information overload. To mitigate this challenge, we propose a system to generate a causal network of news topics. To extract this information from distributed news sources, a system called Forest was developed. Forest retrieves documents that potentially contain causal information regarding a news topic. The documents are processed at the sentence level to extract causal relations and news topic references, the phrases used to refer to a news topic. Forest uses a machine learning approach to classify causal sentences and then extracts the potential cause and effect of each sentence. The potential cause and effect are then classified as news topic references, such as “The World Cup” or “The Financial Meltdown”. Both classifiers use an algorithm developed within our working group, which performs better than several well-known classification algorithms on the aforementioned tasks. In our evaluations we found that participants consider causal information useful for understanding the news, and that while we cannot extract causal information for all news topics, it is highly likely that we can extract causal relations for the most popular news topics. To evaluate the accuracy of the extractions made by Forest, we conducted a user survey. We found that by providing the top-ranked results, we obtained high accuracy in extracting causal relations between news topics.
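    As a rough illustration of the causal-sentence classification step, the sketch below trains a generic bag-of-words logistic regression to flag causal sentences. It is a stand-in: the thesis uses a classifier developed within the authors' working group, and the tiny training set here is purely illustrative.

```python
# Generic causal-sentence classifier sketch (not the thesis's in-house
# algorithm). Toy labels: 1 = causal sentence, 0 = non-causal.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = [
    "The financial meltdown caused a spike in unemployment",
    "Heavy rains led to flooding across the region",
    "The match ended in a draw",
    "The new album was released on Friday",
]
labels = [1, 1, 0, 0]  # illustrative annotations only

# Unigrams and bigrams capture cue phrases like "led to" and "caused a".
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_sentences, labels)

print(clf.predict(["The World Cup boosted tourism revenue"]))
```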

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
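    To make the feature-extraction stage concrete, here is a minimal sketch computing MFCC features, one of the most common speech representations. The librosa library and the synthetic test signal are assumptions for illustration; the book does not prescribe a particular toolkit.

```python
# Minimal MFCC extraction sketch. A synthetic tone stands in for a
# real recorded utterance so the example runs without an audio file.
import numpy as np
import librosa

sr = 16000  # 16 kHz sample rate, typical for speech
t = np.linspace(0, 1.0, sr, endpoint=False)
signal = np.sin(2 * np.pi * 220 * t).astype(np.float32)

# 13 MFCCs per analysis frame: a compact spectral envelope descriptor.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, n_frames)
```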

    Learning facet-specific entity embeddings

    An entity embedding is a vector space representation of entities in which similar entities have similar representations. However, similarity is a multi-faceted notion; for example, a person may be similar to one group of people because they graduated from the same university, and similar to another group through having the same nationality or playing the same sport. Our hypothesis in this thesis is that learning a single entity embedding is a sub-optimal way to faithfully capture these different facets of similarity. Therefore, this thesis aims to learn facet-specific entity embeddings that capture different facets of similarity, taking inspiration from the conceptual spaces framework, which is widely known in cognitive science. Conceptual spaces [48] are vector space models designed to represent entities of a given kind (e.g. movies), together with their associated properties (e.g. scary) and concepts (e.g. thrillers). As such, they are similar in spirit to the vector space models that have been proposed in natural language processing, but there are also notable differences. First, the dimensions of conceptual spaces, referred to as quality dimensions, are interpretable, as they correspond to semantically meaningful features. Second, conceptual spaces are organized into sets of semantic domains or facets (e.g. genre, language), which are formed by grouping the quality dimensions. Each facet is associated with its own low-dimensional vector space, which intuitively captures similarity with respect to the corresponding facet. For instance, the vector space for the budget facet would only capture whether two movies had similar budgets. From an application point of view, the fact that conceptual spaces are structured into facets is appealing because it allows us to model the different facets of similarity in a more flexible and cognitively more plausible way. Based on this, we hypothesize that learning facet-specific entity embeddings that are similar in spirit to conceptual spaces will allow us to predict the properties and categories of entities more reliably than from standard single-space representations. Learning data-driven conceptual spaces, especially in an unsupervised way, has received very limited attention to date. Therefore, in this thesis, we learn facet-specific entity embeddings that are similar in spirit to conceptual spaces. This includes learning quality dimensions and then grouping them into facets. In particular, we propose three unsupervised models to learn this type of vector space representation for a set of entities using their textual descriptions. In two of these models, we convert traditional vector space embeddings into facet-specific entity embeddings, using features akin to quality dimensions. In these cases, we rely on an existing method to learn these features. In our first proposed model, we structure the vector space representations implicitly into meaningful facets by identifying the quality dimensions in a two-level hierarchy: the first level corresponds to the facets, and the second level corresponds to the facet-specific features. In our second model, using the quality dimensions and pre-trained word embeddings, we decompose the vector space representations into low-dimensional facets in an incremental way. In both of these models, we depend on clustering algorithms to find facet-specific features.
    In contrast, our third proposed model uses a mixture-of-experts formulation to find the features that describe each facet, and it simultaneously learns the facet-specific embeddings directly from the bag-of-words representation. We evaluate our models on several datasets, each of which contains a set of entities with their textual descriptions and a number of classification tasks, using a range of different classifiers. The experimental results support our hypothesis that, by capturing different facets of similarity, facet-specific vector space representations improve a model’s ability to predict the categories and properties of entities.
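    A minimal sketch of the grouping step shared by the first two models: cluster candidate quality dimensions (interpretable directions in the entity embedding space) into facets, then project entities onto each facet's directions to obtain facet-specific embeddings. The random placeholder data and all parameters are assumptions; the thesis learns the directions from textual descriptions.

```python
# Grouping quality dimensions into facets via clustering, then
# projecting entities into per-facet subspaces. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
entities = rng.normal(size=(100, 50))    # 100 entities, 50-dim embedding
directions = rng.normal(size=(40, 50))   # 40 candidate quality dimensions
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Cluster the direction vectors: each cluster plays the role of a facet.
n_facets = 5
facet_of = KMeans(n_clusters=n_facets, n_init=10,
                  random_state=0).fit_predict(directions)

# Facet-specific embedding: entity coordinates along that facet's directions.
facet_embeddings = [entities @ directions[facet_of == f].T
                    for f in range(n_facets)]
print([e.shape for e in facet_embeddings])
```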

    Identifying reusable knowledge in developer instant messaging communication.

    Context and background: Software engineering is a complex and knowledge-intensive activity. Required knowledge (e.g., about technologies, frameworks, and design decisions) changes fast, and the knowledge needs of those who design, code, test and maintain software constantly evolve. On the other hand, software developers use a wide range of processes, practices and tools through which they explicitly and implicitly “produce” and capture different types of knowledge. Problem: Software developers use instant messaging tools (e.g., Slack, Microsoft Teams and Gitter) to discuss development-related problems, share experiences and collaborate in projects. This communication takes place in chat rooms that accumulate potentially relevant knowledge to be reused by other developers. Therefore, in this research we analyze whether there is reusable knowledge in developer instant messaging communication by exploring (a) which instant messaging platforms can be a source of reusable knowledge, and (b) the software engineering themes that represent the main discussions of developers in instant messaging communication. We also analyze how this reusable knowledge can be identified with the use of topic modeling (a natural language processing technique to discover abstract topics in text) by (c) surveying the literature on how topic modeling has been applied in software engineering research, and (d) evaluating how topic models perform with developer instant messages. Method: First, we conducted a Field Study through an exploratory case study and a reflexive thematic analysis to check whether there is reusable knowledge in developer instant messaging communication, and if so, what this knowledge (the main themes discussed) is. Then, we conducted a Sample Study to explore how reusable knowledge in developer instant messaging communication can be identified. In this study, we applied a literature survey and software repository mining (i.e., short-text topic modeling). Findings and contributions: We (a) developed a comparison framework for instant messaging tools, (b) identified a map of the main themes discussed in chat rooms of an instant messaging tool (Gitter, a platform used by software developers), (c) provided a comprehensive literature review that offers insights and references on the use of topic modeling in software engineering, and (d) provided an evaluation of the performance of topic models applied to developer instant messages, based on topic coherence metrics and human judgment of topic quality.
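    As a rough sketch of the repository-mining step, the example below fits a topic model to a handful of developer chat messages using gensim's standard LDA. This is a simplified stand-in: the thesis evaluates short-text topic models, which are better suited to chat-length messages, and the messages here are invented for illustration.

```python
# Topic modeling over developer chat messages (illustrative data).
from gensim import corpora, models

messages = [
    "how do I configure the webpack dev server proxy",
    "getting a null pointer exception in the service layer",
    "webpack build fails after upgrading node",
    "anyone seen this exception when the bean is not injected",
]
tokens = [m.lower().split() for m in messages]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]

# Two topics for this tiny corpus; real chat data needs more and
# benefits from short-text-specific models.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```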

    Topical relevance models

    An inherent characteristic of information retrieval (IR) is that the query expressing a user's information need is often multi-faceted; that is, it encapsulates more than one specific potential sub-information need. This multi-facetedness of queries manifests itself as a topic distribution in the retrieved set of documents, where each document can be considered as a mixture of topics, one or more of which may correspond to the sub-information needs expressed in the query. In some specific domains of IR, such as patent prior art search, where the queries are full patent articles and the objective is to (in)validate the claims contained therein, the queries themselves are multi-topical in addition to the retrieved set of documents. The overall objective of the research described in this thesis is to investigate techniques to recognize and exploit these multi-topical characteristics of the retrieved documents and the queries, both in retrieval and in relevance feedback. First, we hypothesize that segments of documents in close proximity to the query terms are topically related to those terms. An intuitive choice for the unit of such segments is the sentence, which characteristically represents a collection of semantically related terms. This way of utilizing term proximity through the use of sentences is empirically shown to select potentially relevant topics from among those present in a retrieved document set and thus improve relevance feedback in IR. Secondly, to handle the very long queries of patent prior art search, which are essentially multi-topical in nature, we hypothesize that segmenting these queries into topically focused segments and then using these segments as separate queries for retrieval can retrieve potentially relevant documents for each segment. The results for each of these segments then need to be merged to obtain a final retrieval result set for the whole query. These two conceptual approaches for utilizing the topical relatedness of terms in both the retrieved documents and the queries are then integrated more formally within a single statistical generative model, called the topical relevance model (TRLM). This model utilizes the underlying multi-topical nature of both the retrieved documents and the query. Moreover, the model is used as the basis for the construction of a novel search interface, called TopicVis, which lets the user visualize the topic distributions in the retrieved set of documents and the query. This visualization of the topics benefits the user in the following ways. Firstly, through visualization of the ranked retrieval list, TopicVis allows the user to choose one or more facets of interest from the query in a feedback step, after which it retrieves documents primarily composed of the selected facets at top ranks. Secondly, the system provides an access link to the first segment within a document focusing on the selected topic, and also supports navigation links to subsequent segments on the same topic in other documents. The methods proposed in this thesis are evaluated on datasets from the TREC IR benchmarking workshop series and on the CLEF-IP 2010 dataset for patent prior art search. Experimental results show that relevance feedback using sentences and segmented retrieval for patent prior art search queries significantly improve IR effectiveness for the standard ad-hoc IR and patent prior art search tasks.
    Moreover, the topical relevance model (TRLM), designed to encapsulate these two complementary approaches within a single framework, significantly improves IR effectiveness for both standard ad-hoc IR and patent prior art search. Furthermore, a task-based user study shows that the novel features of topic visualization, topic-based feedback and topic-based navigation, implemented in the TopicVis interface, lead to effective and efficient task completion with good user satisfaction.
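    A minimal sketch of the first idea, sentence-based relevance feedback: rank sentences from top retrieved documents by overlap with the query terms (a crude proximity proxy) and use terms from the best sentences to expand the query. All data, names and parameters below are illustrative assumptions; this is not the TRLM itself.

```python
# Sentence-level relevance feedback sketch: pick expansion terms from
# sentences that overlap most with the query.
from collections import Counter

def sentence_feedback_terms(query, top_docs, n_sents=3, n_terms=5):
    q = set(query.lower().split())
    sents = [s.strip() for d in top_docs for s in d.split(".") if s.strip()]
    # Rank sentences by query-term overlap (a proxy for proximity).
    ranked = sorted(sents,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)[:n_sents]
    # Collect frequent non-query terms from the best sentences.
    counts = Counter(w for s in ranked
                     for w in s.lower().split() if w not in q)
    return [w for w, _ in counts.most_common(n_terms)]

docs = ["Heavy rains caused flooding. The river burst its banks.",
        "The election results were announced. Flooding disrupted voting."]
print(sentence_feedback_terms("flooding rains", docs))
```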