    arules - A Computational Environment for Mining Association Rules and Frequent Item Sets

    Mining frequent itemsets and association rules is a popular and well-researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
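
    To make the notions of frequent itemsets and association rules concrete, the following is a minimal pure-Python sketch of level-wise (Apriori-style) mining. It is not the arules interface and not Borgelt's C implementation; the function names `apriori` and `association_rules`, the toy transactions, and the thresholds are all assumptions chosen for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Illustrative sketch only; support is the fraction of transactions
    that contain the itemset.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current for sub in combinations(cand, k - 1)):
                s = support(cand)
                if s >= min_support:
                    next_level[cand] = s
        current = next_level
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) triples derived from frequent itemsets."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                # lhs is frequent too, because every subset of a frequent
                # itemset is frequent (downward closure).
                confidence = supp / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Toy data (hypothetical).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.9)
```

    On this toy data, five itemsets reach the 0.5 support threshold, and the only rule with confidence of at least 0.9 is {butter} => {bread}. Maximal and closed frequent itemsets, which the abstract also mentions, would be obtained by filtering this result and are not shown here.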

    Questionnaire integration system based on question classification and short text semantic textual similarity, A

    2018 Fall. Includes bibliographical references. Semantic integration from heterogeneous sources involves a series of NLP tasks. Existing research has focused mainly on measuring the similarity of two paired sentences. However, when searching for possibly identical texts between two datasets, the sentences are not paired in advance. To avoid exhaustive pair-wise comparison, this thesis proposes a semantic similarity measuring system equipped with a precategorization module. It applies a hybrid question classification module, which subdivides all texts into coarse categories. The sentences are then paired within these subcategories. The core task is to detect whether two sentences are identical in meaning, which relates to the semantic textual similarity task in the NLP field. We built a short text semantic textual similarity measuring module. It combines conventional NLP techniques, including both semantic and syntactic features, with a Recurrent Convolutional Neural Network to form an ensemble model. We also conducted a set of empirical evaluations. The results show that our system possesses a degree of generalization ability and performs well on heterogeneous sources.

    Identification of Online Users' Social Status via Mining User-Generated Data

    With the burst of available online user-generated data, identifying online users’ social status by mining such data can play a significant role in commercial applications, research and policy-making in many domains. Social status is an abstract concept that refers to the position of a person in relation to others within a society. Its concrete definition depends on the indicator used to measure it. For example, opinion leadership measures individual social status in terms of influence and expertise in an online society, while socioeconomic status characterizes personal real-life social status based on social and economic factors. As an alternative to traditional survey methods, which are time-consuming, expensive and sometimes difficult to carry out, some efforts have been made to identify a specific social status of users from specific user-generated data using classic machine learning methods. However, each combination of social status and data source comes with its own challenges, which the classic machine learning methods in existing work fail to address, leading to low identification accuracy. Given the importance of improving identification accuracy, this thesis studies three specific cases of identifying online and offline social status. For each case, this thesis proposes a novel and effective identification method that addresses the specific challenges in order to improve accuracy. The first work aims at identifying users’ online social status in terms of topic-sensitive influence and knowledge authority in social community question answering sites, namely identifying topical opinion leaders who are both influential and expert. 
Social community question answering (SCQA) sites, an innovative kind of community question answering platform, not only offer traditional question answering (QA) services but also integrate an online social network where users can follow each other. Identifying topical opinion leaders in SCQA has become an important research area due to the significant role these leaders play. However, most previous related work either focuses on using knowledge expertise to find experts for improving the quality of answers, or aims at measuring user influence to identify influential users. In order to identify the true topical opinion leaders, we propose a topical opinion leader identification framework called QALeaderRank, which takes into account both topic-sensitive influence and topical knowledge expertise. In the proposed framework, to measure the topic-sensitive influence of each user, we design a novel influence measure algorithm that exploits both the social and QA features of SCQA, taking into account social network structure, topical similarity and knowledge authority. In addition, we propose three topic-relevant metrics to infer the topical expertise of each user. Extensive experiments along with an online user study show that the proposed QALeaderRank achieves significant improvement compared with state-of-the-art methods. Furthermore, we analyze how users’ topic interests change over time and examine the predictability of user topic interest through experiments. The second work focuses on predicting individual socioeconomic status from mobile phone data. Socioeconomic status (SES) is an important social and economic attribute of wide concern. Assessing individual SES can assist related organizations in making a variety of policy decisions. The traditional approach suffers from the extremely high cost of collecting large-scale SES-related survey data. 
With the ubiquity of smart phones, mobile phone data has become a novel, low-cost data source for predicting individual SES. However, predicting individual SES from mobile phone data also poses new challenges, including sparse individual records, scarce explicit relationships and limited labeled samples, which were not addressed in prior work restricted to regional or household-oriented SES prediction. To address these issues, we propose a semi-supervised Hypergraph based Factor Graph Model (HyperFGM) for individual SES prediction. HyperFGM is able to efficiently capture the associations between SES and individual mobile phone records to handle the individual record sparsity. For the scarce explicit relationships, HyperFGM models implicit high-order relationships among users on the hypergraph structure. In addition, HyperFGM makes use of both the limited labeled data and the unlabeled data in a semi-supervised way. Experimental results on a set of anonymized real mobile phone data show that HyperFGM greatly outperforms the baseline methods on individual SES prediction. The third work predicts social media users’ socioeconomic status based on their social media content, which is useful for related organizations and companies in a range of applications, such as economic and social policy-making. Previous work leverages manually defined textual features and platform-based user level attributes from social media content and feeds them into a machine learning based classifier for SES prediction. However, it ignores some important information in social media content, including the order and the hierarchical structure of social media text as well as the relationships among user level attributes. 
To this end, we propose a novel coupled social media content representation model for individual SES prediction, which not only utilizes a hierarchical neural network to incorporate the order and the hierarchical structure of social media text but also employs a coupled attribute representation method to take into account intra-coupled and inter-coupled interaction relationships among user level attributes. The experimental results show that the proposed model significantly outperforms other state-of-the-art models on a real dataset, which validates the efficiency and robustness of the proposed model.

    Similarity processing in multi-observation data

    Many real-world application domains such as sensor-monitoring systems for environmental research or medical diagnostic systems are dealing with data that is represented by multiple observations. In contrast to single-observation data, where each object is assigned to exactly one occurrence, multi-observation data is based on several occurrences that are subject to two key properties: temporal variability and uncertainty. When defining similarity between data objects, these properties play a significant role. In general, methods designed for single-observation data hardly apply for multi-observation data, as they are either not supported by the data models or do not provide sufficiently efficient or effective solutions. Prominent directions incorporating the key properties are the fields of time series, where data is created by temporally successive observations, and uncertain data, where observations are mutually exclusive. This thesis provides research contributions for similarity processing - similarity search and data mining - on time series and uncertain data. The first part of this thesis focuses on similarity processing in time series databases. A variety of similarity measures have recently been proposed that support similarity processing w.r.t. various aspects. In particular, this part deals with time series that consist of periodic occurrences of patterns. Examining an application scenario from the medical domain, a solution for activity recognition is presented. Finally, the extraction of feature vectors allows the application of spatial index structures, which support the acceleration of search and mining tasks resulting in a significant efficiency gain. As feature vectors are potentially of high dimensionality, this part introduces indexing approaches for the high-dimensional space for the full-dimensional case as well as for arbitrary subspaces. The second part of this thesis focuses on similarity processing in probabilistic databases. 
The presence of uncertainty is inherent in many applications dealing with data collected by sensing devices. Often, the collected information is noisy or incomplete due to measurement or transmission errors. Furthermore, data may be rendered uncertain due to privacy-preserving issues with the presence of confidential information. This creates a number of challenges in terms of effectively and efficiently querying and mining uncertain data. Existing work in this field either neglects the presence of dependencies or provides only approximate results while applying methods designed for certain data. Other approaches dealing with uncertain data are not able to provide efficient solutions. This part presents query processing approaches that outperform existing solutions of probabilistic similarity ranking. This part finally leads to the application of the introduced techniques to data mining tasks, such as the prominent problem of probabilistic frequent itemset mining.
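
    As an illustration of what probabilistic frequent itemset mining has to compute, the sketch below uses one common formulation, which is an assumption here and not necessarily the thesis's exact model: each transaction contains a given itemset with a known probability, independently of the other transactions. The support of the itemset is then a random variable, and the question is how likely it is to reach a minimum support count. The function names and the example probabilities are hypothetical.

```python
def expected_support(probs):
    """Expected number of transactions containing the itemset."""
    return sum(probs)


def frequentness_probability(probs, min_sup):
    """P(support >= min_sup) for independent containment probabilities.

    probs[i] is the (assumed) probability that transaction i contains the
    itemset. Dynamic programming over transactions, O(len(probs) * min_sup).
    """
    if min_sup <= 0:
        return 1.0
    # dist[j] for j < min_sup: probability that exactly j of the transactions
    # seen so far contain the itemset.
    # dist[min_sup]: probability that at least min_sup of them do.
    dist = [1.0] + [0.0] * min_sup
    for p in probs:
        new = [0.0] * (min_sup + 1)
        for j, mass in enumerate(dist):
            if j == min_sup:
                new[j] += mass  # threshold already reached, stays reached
            else:
                new[j] += mass * (1.0 - p)  # transaction lacks the itemset
                new[j + 1] += mass * p      # transaction contains the itemset
        dist = new
    return dist[min_sup]


# Hypothetical containment probabilities for four uncertain transactions.
probs = [0.9, 0.5, 0.5, 0.1]
likelihood = frequentness_probability(probs, min_sup=2)
```

    Using the expected support alone discards the spread of the distribution. The dynamic program keeps the full probability that the threshold is reached while avoiding enumeration of all possible worlds.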

    Symbolic Data Analysis for the Assessment of User Satisfaction: An Application to Reading Rooms Services.

    Special edition of the European Scientific Journal (ESJ): Conference proceedings: 1st Annual International Interdisciplinary Conference AIIC 2013, 24-26 April, Azores Islands, Portugal. This paper re-examines and deepens the study of a portion of the data collected within the context of a wider 2007 research project conducted in the Autonomous Region of Azores. The 2007 study aimed to understand users’ habits, attitudes and cultural practices, concerning reading and utilization of different library services, archives and museums. Based upon knowledge that only data analysis of a representative sample can supply, the study aimed to identify the aspects that should be prioritized in a process of restructuring the cultural services of leisure and reading to be implemented. This paper, utilizing data from the 2007 study, presents some results from the Ascendant Hierarchical Cluster Analysis (AHCA) of symbolic objects, according to the treatment to which they were submitted. These objects are described by different symbolic attributes pertaining to the latent variable ‘Degree of Satisfaction’. This variable was evaluated according to different dimensions of on-the-spot reading and consultation services. The aggregation criteria used in this study belong to a parametric family of methods and the similarity measure used is the weighted generalized affinity coefficient, for symbolic data. The validation of the clustering results is based on some validation measures.

    Machine Learning Methods for Depression Detection Using SMRI and RS-FMRI Images

    Major Depressive Disorder (MDD) is a common disease throughout the world that negatively influences people’s lives. Early diagnosis of MDD is beneficial, so detecting practical biomarkers would aid clinicians in the diagnosis of MDD. An automated method for finding MDD biomarkers would be helpful, although developing one is difficult. The main aim of this research is to develop a method for detecting discriminative features for MDD diagnosis based on Magnetic Resonance Imaging (MRI) data. In this research, representational similarity analysis provides a framework to compare distributed patterns and obtain the similarity/dissimilarity of brain regions. Regions are obtained by either data-driven or model-driven methods, such as cubes and atlases, respectively. For structural MRI (sMRI), the similarity of the voxels of spatial cubes (data-driven) is explored. For resting-state fMRI (rs-fMRI) images, the similarity of the time series of both cubes (data-driven) and atlases (model-driven) is examined. Moreover, a similarity measure based on the inverse of the Minimum Covariance Determinant estimate is applied, which excludes outliers from patterns and finds regions that are conditionally independent given the rest of the regions. Next, a statistical test that is robust to outliers identifies discriminative similarity features between the two groups of MDD patients and controls. The key contribution is therefore the way discriminative features are obtained: computing the similarity of voxels’ cubes/time series using the inverse of a robust covariance, followed by the statistical test. The experimental results show that these features, combined with the Bernoulli Naïve Bayes classifier, achieve superior performance compared with other methods. The performance of our method is verified by applying it to three imbalanced datasets. Moreover, the similarity-based methods are compared with deep learning and regional-based approaches for detecting MDD using either sMRI or rs-fMRI. 
Given that depression is known to be a connectivity disorder, investigating the similarity of the brain’s regions is valuable for understanding the behavior of the brain. Combinations of structural and functional brain similarities are explored to investigate the brain’s structural and functional properties together. Moreover, the combination of data-driven (cube) and model-driven (atlas) similarities of rs-fMRI is examined to evaluate how it affects the performance of the classifier. In addition, discriminative similarities are visualized for both sMRI and rs-fMRI. To measure the informativeness of a cube, the relationship of atlas regions with overlapping cubes and vice versa (cubes with overlapping regions) is explored and visualized. Furthermore, the relationship between brain structure and function is probed through common similarities between structural and resting-state functional networks.

    A Probabilistic Evaluation Framework for Preference Aggregation Reflecting Group Homogeneity

    Groups differ in the homogeneity of their members' preferences. Reflecting this, we propose a probabilistic criterion for evaluating and comparing the adequacy of preference aggregation procedures that takes into account information on the considered group's homogeneity structure. Further, we discuss two approaches for approximating our criterion if information is only imperfectly given and show how to estimate these approximations from data. As a preparation, we elaborate some general minimal requirements for measuring homogeneity and discuss a specific proposal for a homogeneity measure. Finally, we investigate our framework by comparing aggregation rules in a simulation study.
