46 research outputs found

    Collaborative Summarization: When Collaborative Filtering Meets Document Summarization

    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    A survey on opinion summarization techniques for social media

    The volume of data on social media is huge and keeps increasing. The need for efficient processing of this extensive information has resulted in growing research interest in knowledge engineering tasks such as Opinion Summarization. This survey presents the current opinion summarization challenges for social media, then the necessary pre-summarization steps such as preprocessing, feature extraction, noise elimination, and handling of synonym features. Next, it covers the various approaches used in opinion summarization, such as Visualization, Abstractive, Aspect-based, Query-focused, Real-Time, and Update Summarization, and highlights other opinion summarization approaches such as Contrastive, Concept-based, Community Detection, Domain-Specific, Bilingual, Social Bookmarking, and Social Media Sampling. It also covers the different datasets used in opinion summarization and the future work suggested for each technique. Finally, it describes different ways of evaluating opinion summarization.

    A Review On Automatic Text Summarization Approaches

    It has been more than 50 years since the initial investigation of automatic text summarization began. Various techniques have been successfully used to extract the important contents from a text document to represent its summary. In this study, we review some of the studies that have been conducted in this still-developing research area. The review covers the basics of text summarization, the types of summarization, the methods that have been used, and some areas in which text summarization has been applied. Furthermore, this paper also reviews the significant efforts that have been put into studies concerning sentence extraction, domain-specific summarization, and multi-document summarization, and provides the theoretical explanation and fundamental concepts related to them. In addition, the advantages and limitations of the approaches commonly used for text summarization are also highlighted in this study.

    Towards Personalized and Human-in-the-Loop Document Summarization

    The ubiquitous availability of computing devices and the widespread use of the internet continuously generate large amounts of data. As a result, the amount of available information on any given topic far exceeds humans' capacity to process it properly, causing what is known as information overload. To cope efficiently with large amounts of information and generate content of significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges in alleviating information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. The thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by: (i) enabling automatic intelligent feature engineering, (ii) enabling flexible and interactive summarisation, and (iii) utilising intelligent and personalised summarisation approaches. The experimental results demonstrate the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data. (PhD thesis)

    Temporal models for mining, ranking and recommendation in the Web

    Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets, i.e., the Web, collaborative knowledge bases and social networks, have emerged as gold mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, ranging from user intent understanding and document ranking to advanced recommendations. There are two semantically close and important constituents when modelling along the time dimension: entities and events. Time crucially serves as the context for changes driven by happenings and phenomena (events) that relate to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words, resolving the uncertainty confounded by temporal changes, is a compelling task for supporting consistent user satisfaction. In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events to serve the end tasks. Specifically, we make the following contributions: (1) Query recommendation and document ranking in the Web - we address the issues of suggesting entity-centric queries and of ranking effectiveness around the happening time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former, and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter. (2) Entity relatedness in Wikipedia - we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events. Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, the link-based graph and content) and uses the collective attention from user navigation as supervision. (3) Graph-based ranking and temporal anchor-text mining in Web Archives - we tackle the problem of discovering important documents along the time span of Web Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework. The model accounts for the incomplete link structure and the natural time lag in Web Archives when mining temporal authority. (4) Methods for enhancing predictive models at an early stage in the social media and clinical domains - we investigate several methods to control model instability and enrich the contexts of predictive models in the "cold-start" period. We demonstrate their effectiveness for rumor detection and blood glucose prediction respectively. Overall, the findings presented in this thesis demonstrate the importance of tracking the temporal dynamics surrounding salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations, and boost ranking and recommendation effectiveness over time.
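    The thesis abstract stays at a high level and does not give its ranking formulas. Purely as a hypothetical illustration of blending topical relevance with a temporal signal around an event's happening time, the sketch below combines a relevance score with an exponential recency decay; the function name, weights and half-life are assumptions for illustration, not the thesis's actual model.

```python
import math
from datetime import datetime

def temporal_score(relevance, doc_time, event_time, half_life_days=30.0, alpha=0.7):
    """Blend a topical relevance score with a recency prior that decays
    exponentially with the distance from the event's happening time.
    (Illustrative weighting only; alpha trades relevance against recency.)"""
    delta_days = abs((doc_time - event_time).days)
    recency = math.exp(-math.log(2) * delta_days / half_life_days)
    return alpha * relevance + (1 - alpha) * recency

# Hypothetical example: rank two documents for an event on 1 June 2015.
event = datetime(2015, 6, 1)
docs = [
    ("d1", 0.80, datetime(2015, 6, 3)),   # relevant and close to the event
    ("d2", 0.90, datetime(2014, 1, 15)),  # slightly more relevant but far in time
]
ranked = sorted(docs, key=lambda d: temporal_score(d[1], d[2], event), reverse=True)
print([doc_id for doc_id, _, _ in ranked])  # d1 outranks d2 once recency is factored in
```

    Swapping the decay function or the weighting scheme is exactly the kind of choice a multi-criteria combination framework would tune rather than hard-code.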

    Computing the Importance of Schema Elements Taking into Account the Whole Schema

    Conceptual schemas are one of the most important artifacts in the development cycle of information systems. Understanding the conceptual schema is essential to getting involved with the information system it describes. As the information system grows in size and complexity, the corresponding conceptual schema grows in the same proportion, making it difficult to understand the main concepts of that schema/information system. This thesis investigates the influence of the whole schema on computing the relevance of schema elements. It includes research on and implementation of algorithms from the literature for scoring elements, a study of the different results obtained once they are applied to a few example conceptual schemas, an extension of those algorithms that includes new components in the computation process such as derivation rules, constraints and the behavioural subschema specification, and an in-depth comparison between the initial algorithms and the extended ones, studying the results in order to choose the algorithms that give the most valuable output.
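    The abstract does not name the scoring algorithms from the literature that it implements and extends. As a minimal sketch of one common family, link-analysis measures over the schema's entity types and their associations, the hypothetical example below runs a plain PageRank-style computation on a toy schema; the schema, damping factor and iteration count are illustrative assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank over an adjacency dict {node: [neighbours]}.
    Higher scores suggest entity types that are more central to the schema."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        score = {
            n: (1 - damping) / len(nodes)
               + damping * sum(score[m] / len(graph[m]) for m in nodes if n in graph[m])
            for n in nodes
        }
    return score

# Toy conceptual schema: entity types linked through their associations (illustrative).
schema = {
    "Customer": ["Order"],
    "Order": ["Customer", "Product", "Invoice"],
    "Product": ["Order"],
    "Invoice": ["Order"],
}
for entity, s in sorted(pagerank(schema).items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {s:.3f}")  # "Order" comes out as the most important element
```

    An extension along the lines the abstract describes would add further edges or weights for derivation rules, constraints and behavioural elements before running the same kind of propagation.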

    Viral Topic Prediction and Description in Microblog Social Networks

    Ph.D. (Doctor of Philosophy)

    Detecting Dissimilar Classes of Source Code Defects

    Software maintenance accounts for most of the software development cost and effort, with its major activities focused on the detection, location, analysis and removal of defects present in the software. Although software defects can originate, and be present, at any phase of the software development life-cycle, the implementation (i.e., source code) contains more than three-fourths of the total defects. Due to the diverse nature of the defects, their detection and analysis activities have to be carried out by equally diverse tools, often necessitating the application of multiple tools for reasonable defect coverage, which directly increases maintenance overhead. Unified detection tools are known to combine different specialized techniques into a single and massive core, resulting in operational difficulty and increased maintenance cost. The objective of this research was to search for a technique that can detect dissimilar defects using a simplified model and a single methodology, both of which should contribute to creating an easy-to-acquire solution. Following this goal, a ‘Supervised Automation Framework’ named FlexTax was developed for semi-automatic defect mapping and taxonomy generation, which was then applied to a large-scale real-world defect dataset to generate a comprehensive Defect Taxonomy that was verified using machine learning classifiers and manual verification. This Taxonomy, along with an extensive literature survey, was used for comprehension of the properties of different classes of defects, and for developing Defect Similarity Metrics. The Taxonomy and the Similarity Metrics were then used to develop a defect detection model and associated techniques, collectively named Symbolic Range Tuple Analysis, or SRTA. SRTA relies on Symbolic Analysis, Path Summarization and Range Propagation to detect dissimilar classes of defects using a simplified set of operations. To verify the effectiveness of the technique, SRTA was evaluated by processing multiple real-world open-source systems, by direct comparison with three state-of-the-art tools, by a controlled experiment, by using an established Benchmark, by comparison with other tools through secondary data, and by a large-scale fault-injection experiment conducted using a Mutation-Injection Framework, which relied on the taxonomy developed earlier for the definition of mutation rules. Experimental results confirmed SRTA's practicality, generality, scalability and accuracy, and demonstrated SRTA's applicability as a new defect detection technique.
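    SRTA itself is not reproduced here; the abstract describes it only at a high level. As a minimal, hypothetical sketch of the range-propagation idea it builds on, the snippet below tracks an integer interval through a couple of straight-line statements and flags a potential out-of-bounds index when the propagated range can exceed an array bound. The interval representation and the checked defect class are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: int
    hi: int

    def add(self, k: int) -> "Interval":
        # Shifting a range by a constant shifts both bounds.
        return Interval(self.lo + k, self.hi + k)

def check_index(idx: Interval, array_len: int) -> str:
    """Report whether the propagated index range can fall outside [0, array_len)."""
    if 0 <= idx.lo and idx.hi < array_len:
        return "safe"
    if idx.lo >= array_len or idx.hi < 0:
        return "definite out-of-bounds defect"
    return "possible out-of-bounds defect"

# Hypothetical program fragment:
#   i is read from input, known to lie in [0, 9]
#   i = i + 5
#   buf[i]   where len(buf) == 10
i = Interval(0, 9)
i = i.add(5)                          # propagated range is now [5, 14]
print(check_index(i, array_len=10))   # -> "possible out-of-bounds defect"
```

    A full analysis in this style would compute such ranges per execution path (path summarization) over symbolic values, which is the combination the abstract attributes to SRTA.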

    Search beyond traditional probabilistic information retrieval

    "This thesis focuses on search beyond probabilistic information retrieval. Three ap- proached are proposed beyond the traditional probabilistic modelling. First, term associ- ation is deeply examined. Term association considers the term dependency using a factor analysis based model, instead of treating each term independently. Latent factors, con- sidered the same as the hidden variables of ""eliteness"" introduced by Robertson et al. to gain understanding of the relation among term occurrences and relevance, are measured by the dependencies and occurrences of term sequences and subsequences. Second, an entity-based ranking approach is proposed in an entity system named ""EntityCube"" which has been released by Microsoft for public use. A summarization page is given to summarize the entity information over multiple documents such that the truly relevant entities can be highly possibly searched from multiple documents through integrating the local relevance contributed by proximity and the global enhancer by topic model. Third, multi-source fusion sets up a meta-search engine to combine the ""knowledge"" from different sources. Meta-features, distilled as high-level categories, are deployed to diversify the baselines. Three modified fusion methods are employed, which are re- ciprocal, CombMNZ and CombSUM with three expanded versions. Through extensive experiments on the standard large-scale TREC Genomics data sets, the TREC HARD data sets and the Microsoft EntityCube Web collections, the proposed extended models beyond probabilistic information retrieval show their effectiveness and superiority.