55 research outputs found

    Temporal models for mining, ranking and recommendation in the Web

    Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets, i.e., the Web, collaborative knowledge bases and social networks, have emerged as gold mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, ranging from user intent understanding and document ranking to advanced recommendations. There are two semantically close and important constituents when modeling along the time dimension, i.e., entity and event. Time crucially serves as the context for changes driven by happenings and phenomena (events) that relate to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words, resolving the uncertainty confounded by temporal changes, is a compelling task to support consistent user satisfaction. In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events to serve the end tasks. Specifically, we make the following contributions in this thesis: (1) Query recommendation and document ranking in the Web - we address the issues of suggesting entity-centric queries and of ranking effectiveness around the time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter. (2) Entity relatedness in Wikipedia - we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events.
Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, link-based graph and content) and uses the collective attention from user navigation as the supervision. (3) Graph-based ranking and temporal anchor-text mining in Web Archives - we tackle the problem of discovering important documents along the time-span of Web Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework. The model accounts for the incomplete link structure and natural time lagging in Web Archives in mining the temporal authority. (4) Methods for enhancing predictive models at an early stage in the social media and clinical domains - we investigate several methods to control model instability and enrich contexts of predictive models at the “cold-start” period. We demonstrate their effectiveness for the rumor detection and blood glucose prediction cases respectively. Overall, the findings presented in this thesis demonstrate the importance of tracking these temporal dynamics surrounding salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations, and boost ranking and recommendation effectiveness over time

    Data mining techniques for complex application domains

    The emergence of advanced communication techniques has increased the availability of large collections of data in electronic form in a number of application domains including healthcare, e-business, and e-learning. Every day a large number of records are stored electronically. However, finding useful information in such large data collections is a challenging issue. Data mining technology aims at automatically extracting hidden knowledge from large data repositories by exploiting sophisticated algorithms. The hidden knowledge in the electronic data may potentially be utilized to improve the procedures, productivity, and reliability of several application domains. The PhD activity has focused on novel and effective data mining approaches to tackle the complex data coming from two main application domains: healthcare data analysis and textual data analysis. The research activity, in the context of healthcare data, addressed the application of different data mining techniques to discover valuable knowledge from real exam-log data of patients. In particular, efforts have been devoted to the extraction of medical pathways, which can be exploited to analyze the actual treatments followed by patients. The derived knowledge not only provides useful information to deal with the treatment procedures but may also play an important role in future predictions of potential patient risks associated with medical treatments. The research effort in textual data analysis is twofold. On the one hand, a novel approach to the discovery of succinct summaries of large document collections has been proposed. On the other hand, the suitability of an established descriptive data mining technique to support domain experts in making decisions has been investigated.
Both research activities focus on broadly adopting exploratory data mining techniques for textual data analysis, which requires overcoming intrinsic limitations of traditional algorithms in handling textual documents efficiently and effectively

    Identifying experts and authoritative documents in social bookmarking systems

    Social bookmarking systems allow people to create pointers to Web resources in a shared, Web-based environment. These services allow users to add free-text labels, or “tags”, to their bookmarks as a way to organize resources for later recall. Ease-of-use, low cognitive barriers, and a lack of controlled vocabulary have allowed social bookmarking systems to grow exponentially over time. However, these same characteristics also raise concerns. Tags lack the formality of traditional classificatory metadata and suffer from the same vocabulary problems as full-text search engines. It is unclear how many valuable resources are untagged or tagged with noisy, irrelevant tags. With few restrictions to entry, annotation spamming adds noise to public social bookmarking systems. Furthermore, many algorithms for discovering semantic relations among tags do not scale to the Web. Recognizing these problems, we develop a novel graph-based Expert and Authoritative Resource Location (EARL) algorithm to find the most authoritative documents and expert users on a given topic in a social bookmarking system. In EARL’s first phase, we reduce noise in a Delicious dataset by isolating a smaller sub-network of “candidate experts”, users whose tagging behavior shows potential domain and classification expertise. In the second phase, a HITS-based graph analysis is performed on the candidate experts’ data to rank the top experts and authoritative documents by topic. To identify topics of interest in Delicious, we develop a distributed method to find subsets of frequently co-occurring tags shared by many candidate experts. We evaluated EARL’s ability to locate authoritative resources and domain experts in Delicious by conducting two independent experiments. The first experiment relies on human judges’ n-point scale ratings of resources suggested by three graph-based algorithms and Google.
The second experiment evaluated the proposed approach’s ability to identify classification expertise through human judges’ n-point scale ratings of classification terms versus expert-generated data
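
The HITS-based second phase mentioned in this abstract can be illustrated with a minimal sketch. This is not the EARL algorithm itself: the abstract does not give its details, so the code below only shows plain HITS on a bipartite user-document graph, where users act as hubs (candidate experts) and bookmarked documents act as authorities. The function name `hits`, the `(user, document)` pair format, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def hits(bookmarks, iterations=50):
    """Plain HITS on a bipartite graph of (user, document) bookmark pairs.

    Users are treated as hubs and documents as authorities.
    Returns two dicts: hub score per user, authority score per document.
    """
    edges = list(set(bookmarks))  # drop duplicate bookmarks
    users = {u for u, _ in edges}
    docs = {d for _, d in edges}
    hub = {u: 1.0 for u in users}
    auth = {d: 1.0 for d in docs}

    for _ in range(iterations):
        # Authority update: sum the hub scores of users who bookmarked the document.
        new_auth = defaultdict(float)
        for u, d in edges:
            new_auth[d] += hub[u]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {d: new_auth[d] / norm for d in docs}

        # Hub update: sum the authority scores of documents the user bookmarked.
        new_hub = defaultdict(float)
        for u, d in edges:
            new_hub[u] += auth[d]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {u: new_hub[u] / norm for u in users}

    return hub, auth


# Hypothetical bookmarks on one topic.
sample = [
    ("alice", "d1"), ("alice", "d2"),
    ("bob", "d1"),
    ("carol", "d1"), ("carol", "d2"),
    ("dave", "d3"),
]
hub_scores, auth_scores = hits(sample)
top_docs = sorted(auth_scores, key=auth_scores.get, reverse=True)
```

In this toy graph, the document bookmarked by the most well-connected users ranks first, and users who bookmark several well-regarded documents receive higher hub scores than users who bookmark only one.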

    TV in the Age of the Internet: Information Quality of Science Fiction TV Fansites

    Thesis (Ph.D.) - Indiana University, Information Science, 2011. Communally created Web 2.0 content on the Internet has begun to compete with information provided by traditional gatekeeper institutions, such as academic journals, medical professionals, and large corporations. On the one hand, such gatekeepers need to understand the nature of this competition, as well as to try to ensure that the general public are not endangered by poor quality information. On the other hand, advocates of free and universal access to basic social services have argued that communal efforts can provide as good or better-quality versions of commonly needed resources. This dissertation arises from these needs to understand the nature and quality of information being produced on such websites. Website-oriented information quality (IQ) literature spans at least 15 different academic fields, a survey of which identified two types of IQ: perceptual and artifactual fitness-related, and representational accuracy and completeness-related. The current project studied websites in terms of all of these, except perceptual fitness. This study may be the only one of its kind to have targeted fansites: websites made by fans of a mass media franchise. Despite the Internet's becoming a primary means by which millions of people consume and co-produce their entertainment, little academic attention has been paid to the IQ of sites about the mass media. For this study, the four central non-studio-affiliated sites about a highly popular and fan-engaging science fiction television franchise, Stargate, were chosen, and their IQ examined across sites having different sizes as well as editorial and business models. Samples as exhaustive as possible were collected from each site. Based on 21 relevant variables from the IQ literature, four qualitative and 17 exploratory statistical analyses were conducted.
Key findings include: five possibly new IQ criteria; smaller sites concerned more with pleasing connoisseuring fans than the general public; larger sites being targeted towards older users; professional editors serving their own interests more than users'; wikis' greater user freedom attracting more invested and balanced writers; for-profit sites being more imposing upon, and less protecting of, users than non-profit sites; and the emergence of common writing styles, themes, data fields, advertisement types, linking strategies, and page types

    Graph neural networks for network analysis

    With an increasing number of applications where data can be represented as graphs, graph neural networks (GNNs) are a useful tool to apply deep learning to graph data. Signed and directed networks are important forms of networks that are linked to many real-world problems, such as ranking from pairwise comparisons and angular synchronization. In this report, we propose two spatial GNN methods for node clustering in signed and directed networks, a spectral GNN method for signed directed networks on both node clustering and link prediction, and two GNN methods for specific applications in ranking as well as angular synchronization. The methods are end-to-end in combining embedding generation and prediction without an intermediate step. Experimental results on various data sets, including several synthetic stochastic block models, random graph outlier models, and real-world data sets at different scales, demonstrate that our proposed methods can achieve satisfactory performance for a wide range of noise and sparsity levels. The introduced models also complement existing methods through the possibility of including exogenous information, in the form of node-level features or labels. Their contributions not only aid the analysis of data which are represented by networks, but also form a body of work which presents novel architectures and task-driven loss functions for GNNs to be used in network analysis

    WiFi-Based Human Activity Recognition Using Attention-Based BiLSTM

    Recently, significant efforts have been made to explore human activity recognition (HAR) techniques that use information gathered by existing indoor wireless infrastructures through WiFi signals without requiring the monitored subject to carry a dedicated device. The key intuition is that different activities introduce different multi-paths in WiFi signals and generate different patterns in the time series of channel state information (CSI). In this paper, we propose and evaluate a full pipeline for a CSI-based human activity recognition framework for 12 activities in three different spatial environments using two deep learning models: ABiLSTM and CNN-ABiLSTM. Evaluation experiments have demonstrated that the proposed models outperform state-of-the-art models. Also, the experiments show that the proposed models can be applied to other environments with different configurations, albeit with some caveats. The proposed ABiLSTM model achieves an overall accuracy of 94.03%, 91.96%, and 92.59% across the three target environments, while the proposed CNN-ABiLSTM model reaches an accuracy of 98.54%, 94.25% and 95.09% across those same environments
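
The attention step in an attention-based BiLSTM can be sketched in a few lines. The abstract does not specify the paper's exact attention mechanism, so this is only a generic illustration: per-timestep hidden states (which a BiLSTM would produce from the CSI time series) are combined into one context vector using softmax weights. The function name `attention_pool` and the inputs are hypothetical; in a trained model the scores would come from learned parameters rather than being passed in by hand.

```python
import math


def attention_pool(hidden_states, scores):
    """Combine per-timestep hidden states into one context vector.

    hidden_states: list of equal-length float vectors, one per timestep
                   (assumed here to stand in for BiLSTM outputs).
    scores:        one unnormalized relevance score per timestep
                   (assumed given; a real model would learn these).
    Returns (weights, context).
    """
    # Softmax over timesteps, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, context


# Two timesteps with equal scores are averaged equally.
states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_pool(states, [0.0, 0.0])
```

A classifier layer would then map the context vector to one of the activity classes; timesteps with higher scores contribute more to that decision.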