    Advances in Processing, Mining, and Learning Complex Data: From Foundations to Real-World Applications

    Processing, mining, and learning complex data refer to an advanced study area of data mining and knowledge discovery concerning the development and analysis of approaches for discovering patterns and learning models from data with a complex structure (e.g., multirelational data, XML data, text data, image data, time series, sequences, graphs, streaming data, and trees) [1–5]. These kinds of data are commonly encountered in many social, economic, scientific, and engineering applications. Complex data pose new challenges for current research in data mining and knowledge discovery as they require new methods for processing, mining, and learning them. Traditional data analysis methods often require the data to be represented as vectors [6]. However, many data objects in real-world applications, such as chemical compounds in biopharmacy, brain regions in brain health data, users in business networks, and time-series information in medical data, contain rich structure information (e.g., relationships between data and temporal structures). Such a simple feature-vector representation inherently loses the structure information of the objects. In reality, objects may have complicated characteristics, depending on how the objects are assessed and characterized. Meanwhile, the data may come from heterogeneous domains [7], such as traditional tabular-based data, sequential patterns, graphs, time-series information, and semistructured data. Novel data analytics methods are desired to discover meaningful knowledge in advanced applications from data objects with complex characteristics. This special issue contributes to the fundamental research in processing, mining, and learning complex data, focusing on the analysis of complex data sources

    Employing Topological Data Analysis On Social Networks Data To Improve Information Diffusion

    Over the past decade, the number of users on social networks has grown tremendously, from thousands in 2004 to billions by the end of 2015. On social networks, users create and propagate billions of pieces of information every day. The data can take many forms (such as text, images, or videos). Due to the massive usage of social networks and the availability of data, the field of social network analysis and mining has attracted many researchers from academia and industry to analyze social network data and explore various research opportunities (including information diffusion and influence measurement). Information diffusion is defined as the way that information is spread on social networks; this can occur due to social influence. Influence is the ability to affect others without direct commands. Influence on social networks can be observed through social interactions between users (such as a retweet on Twitter, a like on Instagram, or a favorite on Flickr). In order to improve information diffusion, we measure the influence of users on social networks to predict influential users. The ability to predict the popularity of posts can improve information diffusion as well; posts become popular when they diffuse on social networks. However, measuring influence and predicting post popularity can be challenging because the data are unstructured, large, and noisy. Therefore, social network mining and analysis techniques are essential for extracting meaningful information about influential users and popular posts. For measuring the influence of users, we proposed a novel influence measurement that integrates both users’ structural locations and characteristics on social networks, which can then be used to predict influential users on social networks. Centrality analysis techniques are adapted to identify the users’ structural locations. Centrality is used to identify the most important nodes within a graph; social networks can be represented as graphs (where nodes represent users and edges represent interactions between users), so centrality analysis can be adopted. The second part of the work focuses on predicting the popularity of images on social networks over time. The effect of social context, image content, and early popularity on image popularity is analyzed using machine learning algorithms. A new approach to representing image content, called the keyword vector, is developed to capture the semantics of an image from its captions. This approach is based on Word2vec (an unsupervised two-layer neural network that generates distributed numerical vectors to represent words in the vector space to detect similarity) and k-means (a popular clustering algorithm). However, machine learning algorithms do not address issues arising from the nature of social network data, namely noise and high dimensionality. Therefore, topological data analysis is adopted. It is a novel approach to extracting meaningful information from high-dimensional data and is robust to noise. It is based on topology, which aims to study the geometric shape of data. In this thesis, we explore the feasibility of topological data analysis for mining social network data by addressing the problem of image popularity. The proposed techniques are applied to datasets crawled from real-world social networks to examine the performance of each approach. The proposed measurement for predicting influential users outperforms existing measurements in terms of correlation. As for predicting the popularity of images on social networks, the results indicate that the proposed features are promising and exceed related work in terms of accuracy. Further exploration of these research topics can be used for a variety of real-world applications (including improving viral marketing, public awareness, political standings and charity work)
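
The graph view of a social network described in this abstract (users as nodes, interactions as edges) can be illustrated with a minimal sketch. It computes normalized degree centrality only, which is one common centrality measure and not necessarily the one used in the thesis; the function name `degree_centrality` and the toy interaction list are assumptions made for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected interaction graph.

    `edges` is an iterable of (user, user) pairs. This is an illustrative
    sketch, not the influence measurement proposed in the abstract.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors[u].add(v)
        neighbors[v].add(u)

    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide by the maximum possible number of neighbors (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical toy data: each pair is one observed interaction.
interactions = [("ana", "bo"), ("ana", "cy"), ("ana", "di"), ("bo", "cy")]
scores = degree_centrality(interactions)
most_central = max(scores, key=scores.get)
```

In this toy graph, `ana` interacts with every other user and therefore receives the highest score. Structural scores like this would still have to be combined with user characteristics to approximate the kind of measurement the abstract describes.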

    Complex graph stream mining

    University of Technology Sydney. Faculty of Engineering and Information Technology. Recent years have witnessed a dramatic increase in information due to the continual development of modern technologies. The large scale of information makes data analysis, particularly data mining and knowledge discovery tasks, unprecedentedly challenging. First, data is becoming more and more interconnected. In a variety of domains such as social networks, chemical compounds, and XML documents, data is no longer represented by a flat table in instance-feature format, but exhibits complex structures indicating dependency relationships. Second, data is evolving more and more dynamically. Emerging applications such as social networks continuously generate information over time. Third, the learning tasks in many real-life applications are becoming more and more complicated, in that there are various constraints on the number of labelled data, class distributions, misclassification costs, the number of learning tasks, and so on. Considering the above challenges, this research aims to investigate theoretical foundations and to study new algorithm designs and system frameworks that enable the mining of complex graph streams from three aspects: (1) Correlated Graph Stream Mining, (2) Graph Stream Classification, and (3) Complex Task Graph Classification. In particular, correlated graph stream mining intends to carry out structured pattern search and support the query of similar graphs from a graph stream. Due to the dynamically changing nature of streaming data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. Therefore, we proposed a novel algorithm, CGStream, to identify correlated graphs from a data stream by using a sliding window that covers a number of consecutive batches of stream data records. Experimental results demonstrate that the proposed algorithm is several times, or even an order of magnitude, more efficient than the straightforward algorithms. Graph stream classification aims to build effective and efficient classification models for graph streams with continuously growing volumes and dynamic changes. We proposed two methods for complex graph stream classification. Due to the inherent complexity of graph structure, labelling graph data is very expensive. To solve this problem, we proposed the gLSU algorithm, which aims to select discriminative subgraph features with minimum redundancy by using both labelled and unlabelled graphs for graph streams. The second approach handles graph streams with imbalanced class distributions and noise. Both frameworks use an instance weighting scheme to capture the underlying concept drifts of graph streams and achieve significant performance gains on benchmark graph streams. Complex task graph classification aims to address graph classification problems with complex constraints. We studied two complex task graph classification problems: cost-sensitive classification of large-scale graphs and multi-task graph classification. Because in medical diagnosis the misclassification cost/risk for different classes is inherently different, and large-scale graph classification is in high demand in real-life applications, we proposed the CogBoost algorithm for cost-sensitive classification of large-scale graphs. To overcome the limitation of insufficient labelled graphs for a specific learning task, we further proposed effective algorithms that leverage multiple graph learning tasks to select subgraph features and regularize multiple tasks to achieve better generalization performance for all learning tasks
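
The sliding window mentioned in this abstract, covering a number of consecutive batches of stream records, can be sketched as follows. This shows only the windowing idea, not CGStream itself; the class name `SlidingWindow`, the parameter `max_batches`, and the record identifiers are hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Keep only the most recent `max_batches` batches of a stream.

    Illustrative sketch: the abstract does not specify how CGStream
    stores or queries the graphs inside its window.
    """

    def __init__(self, max_batches):
        if max_batches < 1:
            raise ValueError("max_batches must be at least 1")
        # A bounded deque drops the oldest batch automatically.
        self._batches = deque(maxlen=max_batches)

    def push(self, batch):
        """Add the newest batch; the oldest one expires if the window is full."""
        self._batches.append(list(batch))

    def records(self):
        """Return all records currently covered by the window, oldest first."""
        return [record for batch in self._batches for record in batch]


# Hypothetical stream of three batches of graph identifiers.
window = SlidingWindow(max_batches=2)
for batch in (["g1", "g2"], ["g3"], ["g4", "g5"]):
    window.push(batch)
current = window.records()
```

After the third batch arrives, the first batch has expired, so a query runs only over the records of the last two batches rather than the whole stream.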

    Extraction and Analysis of Facebook Friendship Relations

    Online Social Networks (OSNs) are a unique Web and social phenomenon, affecting the tastes and behaviors of their users and helping them to maintain and create friendships. It is interesting to analyze the growth and evolution of Online Social Networks both from the point of view of marketing and of new services and from a scientific viewpoint, since their structure and evolution may share similarities with real-life social networks. In the social sciences, several techniques for analyzing (online) social networks have been developed to evaluate quantitative properties (e.g., defining metrics and measures of structural characteristics of the networks) or qualitative aspects (e.g., studying the attachment model for the network evolution, the binary trust relationships, and the link prediction problem). However, OSN analysis poses novel challenges to both computer scientists and social scientists. We present our long-term research effort in analyzing Facebook, the largest and arguably most successful OSN today: it gathers more than 500 million users. Access to data about Facebook users and their friendship relations is restricted; thus, we acquired the necessary information directly from the front-end of the Web site, in order to reconstruct a sub-graph representing anonymous interconnections among a significant subset of users. We describe our ad-hoc, privacy-compliant crawler for Facebook data extraction. To minimize bias, we adopt two different graph mining techniques: breadth-first search (BFS) and rejection sampling. To analyze the structural properties of samples consisting of millions of nodes, we developed a specific tool for analyzing quantitative and qualitative properties of social networks, adopting and improving existing Social Network Analysis (SNA) techniques and algorithms
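
The breadth-first search (BFS) sampling named in this abstract can be sketched on an in-memory toy graph. This is not the authors' crawler: `bfs_sample`, the `neighbors_of` callback, and the `friends` dictionary are assumptions for illustration, and rejection sampling is not shown.

```python
from collections import deque


def bfs_sample(neighbors_of, seed, max_nodes):
    """Collect up to `max_nodes` users by breadth-first search from `seed`.

    `neighbors_of(user)` must return an iterable of that user's friends.
    Returns the visit order and the set of undirected friendship edges
    whose endpoints were both sampled.
    """
    seen = {seed}
    order = [seed]
    edges = set()
    queue = deque([seed])

    while queue:
        user = queue.popleft()
        for friend in neighbors_of(user):
            if friend == user:
                continue  # skip self-links
            if friend not in seen:
                if len(seen) >= max_nodes:
                    continue  # node budget exhausted: admit no new users
                seen.add(friend)
                order.append(friend)
                queue.append(friend)
            # Both endpoints are in the sample, so keep the edge once.
            edges.add(frozenset((user, friend)))
    return order, edges


# Hypothetical friendship lists standing in for crawled pages.
friends = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
order, edges = bfs_sample(lambda u: friends.get(u, []), "a", max_nodes=4)
```

With a budget of four nodes, the sample stops before reaching `e`, so the edge between `d` and `e` is left out. BFS samples of this kind tend to over-represent well-connected users, which is one reason to compare them against a second sampling scheme, as the abstract does.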

    From Relational Data to Graphs: Inferring Significant Links using Generalized Hypergeometric Ensembles

    The inference of network topologies from relational data is an important problem in data analysis. Exemplary applications include the reconstruction of social ties from data on human interactions, the inference of gene co-expression networks from DNA microarray data, and the learning of semantic relationships based on co-occurrences of words in documents. Solving these problems requires techniques to infer significant links in noisy relational data. In this short paper, we propose a new statistical modeling framework to address this challenge. It builds on generalized hypergeometric ensembles, a class of generative stochastic models that give rise to analytically tractable probability spaces of directed, multi-edge graphs. We show how this framework can be used to assess the significance of links in noisy relational data. We illustrate our method on two data sets capturing spatio-temporal proximity relations between actors in a social system. The results show that our analytical framework provides a new approach to inferring significant links from relational data, with interesting perspectives for the mining of data on social systems. Comment: 10 pages, 8 figures, accepted at SocInfo201

    Crawling Facebook for Social Network Analysis Purposes

    We describe our work on the collection and analysis of massive data describing the connections between participants in online social networks. Alternative approaches to social network data collection are defined and evaluated in practice against the popular Facebook Web site. Thanks to our ad-hoc, privacy-compliant crawlers, two large samples, comprising millions of connections, have been collected; the data is anonymous and organized as an undirected graph. We describe a set of tools that we developed to analyze specific properties of such social-network graphs, including degree distribution, centrality measures, scaling laws, and distribution of friendship.