47 research outputs found

    Multivariate equi-width data swapping for private data publication

    Also published in: Advances in knowledge discovery and data mining: 14th Pacific-Asia Conference, PAKDD 2010, Hyderabad, India, June 21-24, 2010: Proceedings, Part I / Mohammed J. Zaki, Jeffrey Xu Yu, B. Ravindran and Vikram Pudi (eds.), pp. 208-215
    In many privacy-preserving applications, specific variables must be disturbed simultaneously in order to preserve the correlations among them. Multivariate Equi-Depth Swapping (MEDS) is a natural solution in such cases, since it provides uniform privacy protection for each data tuple. However, this approach performs poorly not only in computational complexity (basically O(n^3) for n data tuples), but also in data utility for distance-based data analysis. This paper discusses the use of Multivariate Equi-Width Swapping (MEWS) to improve utility preservation in such cases. With extensive theoretical analysis and experimental results, we show that MEWS can achieve privacy preservation similar to that of MEDS with only O(n) computational complexity.
    Yidong Li and Hong She
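The equi-width idea in this abstract can be illustrated with a minimal Python sketch, under stated assumptions: the range of each swapped variable is cut into `num_bins` equal-width intervals, records that fall into the same multivariate cell exchange their whole value tuples, and any unswapped attributes stay at the record position. The function name `equi_width_swap`, the parameters `num_bins` and `seed`, and the uniform random permutation inside each cell are illustrative choices, not details taken from the paper.

```python
import random
from collections import defaultdict


def equi_width_swap(rows, num_bins, seed=0):
    """Permute value tuples among records sharing an equi-width cell.

    rows: list of equal-length numeric tuples (the variables to swap).
    Returns a new list aligned with the original record positions.
    Hypothetical helper for illustration only.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    dims = len(rows[0])
    lows = [min(r[d] for r in rows) for d in range(dims)]
    highs = [max(r[d] for r in rows) for d in range(dims)]
    widths = [(highs[d] - lows[d]) / num_bins for d in range(dims)]

    def cell_of(row):
        # Map each coordinate to its interval index; the maximum value
        # is clamped into the last interval.
        key = []
        for d in range(dims):
            if widths[d] == 0:
                key.append(0)
            else:
                idx = int((row[d] - lows[d]) / widths[d])
                key.append(min(idx, num_bins - 1))
        return tuple(key)

    # One pass groups record positions by cell.
    cells = defaultdict(list)
    for i, row in enumerate(rows):
        cells[cell_of(row)].append(i)

    # Shuffle positions inside each cell and move whole tuples, so the
    # relationships among the swapped variables stay intact.
    out = list(rows)
    for positions in cells.values():
        targets = positions[:]
        rng.shuffle(targets)
        for src, dst in zip(positions, targets):
            out[dst] = rows[src]
    return out


records = [(0.0, 0.0), (1.0, 2.0), (4.0, 1.0), (6.0, 9.0), (10.0, 10.0), (7.0, 8.0)]
swapped = equi_width_swap(records, num_bins=2, seed=1)
```

In this sketch every value moves by at most one cell width per variable, and the work is one binning pass plus one shuffle per cell, which is in line with the linear cost the abstract reports for a fixed number of variables.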

    Truncated Affinity Maximization: One-class Homophily Modeling for Graph Anomaly Detection

    One prevalent property we find empirically in real-world graph anomaly detection (GAD) datasets is a one-class homophily, i.e., normal nodes tend to have strong connection/affinity with each other, while the homophily in abnormal nodes is significantly weaker than normal nodes. However, this anomaly-discriminative property is ignored by existing GAD methods that are typically built using a conventional anomaly detection objective, such as data reconstruction. In this work, we explore this property to introduce a novel unsupervised anomaly scoring measure for GAD -- local node affinity -- that assigns a larger anomaly score to nodes that are less affiliated with their neighbors, with the affinity defined as similarity on node attributes/representations. We further propose Truncated Affinity Maximization (TAM) that learns tailored node representations for our anomaly measure by maximizing the local affinity of nodes to their neighbors. Optimizing on the original graph structure can be biased by non-homophily edges (i.e., edges connecting normal and abnormal nodes). Thus, TAM is instead optimized on truncated graphs where non-homophily edges are removed iteratively to mitigate this bias. The learned representations result in significantly stronger local affinity for normal nodes than abnormal nodes. Extensive empirical results on six real-world GAD datasets show that TAM substantially outperforms seven competing models, achieving over 10% increase in AUROC/AUPRC compared to the best contenders on challenging datasets. Our code will be made available at https://github.com/mala-lab/TAM-master/.
    Comment: 19 pages, 9 figures
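The local node affinity measure described in this abstract can be sketched in a few lines of Python. This is only an illustration under assumptions: affinity is taken as the mean cosine similarity between a node's raw attribute vector and those of its neighbors, and the anomaly score is the negated affinity. The names `cosine` and `local_affinity_scores` are hypothetical, and the sketch omits the representation learning and the iterative graph truncation that TAM adds on top.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_affinity_scores(features, adjacency):
    """Return {node: anomaly score}; a higher score means weaker affinity.

    features: {node: attribute vector}; adjacency: {node: list of neighbors}.
    Assumption: score = -(mean cosine similarity to neighbors).
    """
    scores = {}
    for node, neighbors in adjacency.items():
        if neighbors:
            affinity = sum(
                cosine(features[node], features[n]) for n in neighbors
            ) / len(neighbors)
        else:
            affinity = 0.0  # isolated node: no neighbors to agree with
        scores[node] = -affinity
    return scores


# Toy graph: a, b, c have similar attributes; x is attached to a but differs.
features = {"a": (1.0, 0.0), "b": (1.0, 0.1), "c": (0.9, 0.0), "x": (0.0, 1.0)}
adjacency = {"a": ["b", "c", "x"], "b": ["a", "c"], "c": ["a", "b"], "x": ["a"]}
scores = local_affinity_scores(features, adjacency)
most_anomalous = max(scores, key=scores.get)
```

Note that node a's own affinity is pulled down by its edge to x, which is the kind of bias from non-homophily edges that the abstract says truncation is meant to reduce.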

    Course Recommendation based on Sequences: An Evolutionary Search of Emerging Sequential Patterns

    Providing a good study plan is key to avoiding student failure. Academic advising based on students’ preferences, the complexity of the semester, or even background knowledge is usually considered to reduce the dropout rate. This article aims to provide a good course index to recommend courses to students based on the sequence of courses already taken by each student. Hence, unlike existing long-term course planning methods, it is based on graduate students to model the course and not on external factors that might introduce some bias in the process. The proposal includes a novel sequential pattern mining algorithm, called (ES)²P (Evolutionary Search of Emerging Sequential Patterns), that properly identifies paths followed by good students and not followed by not so good students, as a long-term course planning approach. A major feature of the proposed (ES)²P algorithm is its ability to extract the best k solutions, that is, those with the best recommendation index score, instead of returning the whole set of solutions above a predefined threshold. A real case study is performed, including more than 13,000 students belonging to 13 faculties, to demonstrate the usefulness of the proposal not only to recommend study plans but also to give advice at different stages of the students’ learning process.

    Course Recommendation based on Sequences: An Evolutionary Search of Emerging Sequential Patterns

    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was supported by the Spanish Ministry of Science and Innovation, project PID2020-115832GBI00, and the University of Cordoba, project UCO-FEDER 18 REF.1263116 MOD.A. Both projects were also supported by the European Fund of Regional Development.
    Providing a good study plan is key to avoiding student failure. Academic advising based on students’ preferences, the complexity of the semester, or even background knowledge is usually considered to reduce the dropout rate. This article aims to provide a good course index to recommend courses to students based on the sequence of courses already taken by each student. Hence, unlike existing long-term course planning methods, it is based on graduate students to model the course and not on external factors that might introduce some bias in the process. The proposal includes a novel sequential pattern mining algorithm, called (ES)²P (Evolutionary Search of Emerging Sequential Patterns), that properly identifies paths followed by good students and not followed by not so good students, as a long-term course planning approach. A major feature of the proposed (ES)²P algorithm is its ability to extract the best k solutions, that is, those with the best recommendation index score, instead of returning the whole set of solutions above a predefined threshold. A real case study is performed, including more than 13,000 students belonging to 13 faculties, to demonstrate the usefulness of the proposal not only to recommend study plans but also to give advice at different stages of the students’ learning process.

    Menetelmiä mielenkiintoisten solmujen löytämiseen verkostoista (Methods for finding interesting nodes in networks)

    With the increasing amount of graph-structured data available, finding interesting objects, i.e., nodes in graphs, becomes more and more important. In this thesis we focus on finding interesting nodes and sets of nodes in graphs or networks. We propose several definitions of node interestingness as well as different methods to find such nodes. Specifically, we propose to consider nodes as interesting based on their relevance and non-redundancy or representativeness w.r.t. the graph topology, as well as based on their characterisation for a class, such as a given node attribute value. Identifying nodes that are relevant, but non-redundant to each other is motivated by the need to get an overview of different pieces of information related to a set of given nodes. Finding representative nodes is of interest, e.g. when the user needs or wants to select a few nodes that abstract the large set of nodes. Discovering nodes characteristic for a class helps to understand the causes behind that class. Next, four methods are proposed to find a representative set of interesting nodes. The first one incrementally picks one interesting node after another. The second iteratively changes the set of nodes to improve its overall interestingness. The third method clusters nodes and picks a medoid node as a representative for each cluster. Finally, the fourth method contrasts diverse sets of nodes in order to select nodes characteristic for their class, even if the classes are not identical across the selected nodes. The first three methods are relatively simple and are based on the graph topology and a similarity or distance function for nodes. For the second and third, the user needs to specify one parameter, either an initial set of k nodes or k, the size of the set. 
The fourth method assumes attributes and class attributes for each node, a class-related interestingness measure, and possible sets of nodes which the user wants to contrast, such as sets of nodes that represent different time points. All four methods are flexible and generic. They can, in principle, be applied on any weighted graph or network regardless of what nodes, edges, weights, or attributes represent. Application areas for the methods developed in this thesis include word co-occurrence networks, biological networks, social networks, data traffic networks, and the World Wide Web. As an illustrative example, consider a word co-occurrence network. There, finding terms (nodes in the graph) that are relevant to some given nodes, e.g. branch and root, may help to identify different, shared contexts such as botany, mathematics, and linguistics. A real-life application lies in biology, where finding nodes (biological entities, e.g. biological processes or pathways) that are relevant to other, given nodes (e.g. some genes or proteins) may help in identifying biological mechanisms that are possibly shared by both the genes and proteins.

    The thesis deals with methods for network mining. Its aim is to find interesting information in weighted graphs. Text corpora, biological data, connections between people, or the Internet, for example, can be viewed as weighted graphs. In such graphs, nodes represent concepts (e.g. words, genes, people, or web pages) and edges represent the relationships between them (e.g. two words occur in the same sentence, a gene codes for a protein, friendships between people, or a web page links to another web page). Edge weights can correspond, for example, to the strength or reliability of a connection. The thesis presents various definitions of node interestingness, based on the graph structure or on node attributes, as well as several methods for finding interesting nodes.
    Interestingness can be defined, for example, as relevance with respect to some given nodes and, on the other hand, as the mutual dissimilarity of the interesting nodes. For example, a so-called greedy method can find mutually dissimilar nodes one at a time. The results of the thesis can be applied, for example, to a word network obtained by processing text material, in which the connection between two words is stronger the more often they tend to occur together in the same sentences. Different usage contexts and even meanings of words can then be found automatically. If the target word is, say, "juuri" (Finnish for "root"), then words that are related to it but unrelated to each other are "puu" ("tree"; biological sense: the root of a plant), "yhtälö" ("equation"; mathematical sense: the solution, i.e. root, of an equation), and "indoeurooppalainen" ("Indo-European"; linguistic sense: the stem, i.e. root, of a word). Such methods can be applied, for example, in a search engine: the results for a query on the word "juuri" are made to include results from usage contexts that are as different as possible, so that the meaning intended by the user is more likely to be covered. An important application area for the methods of the thesis is biological networks, in which nodes represent biological concepts (e.g. genes, proteins, or diseases) and edges the relationships between them (e.g. a gene codes for a protein, or a protein is active in a certain disease). The methods can be used, for example, to search for biological mechanisms that affect diseases by locating a representative set of other nodes that are connected in the network to the disease and to genes possibly related to it. These can help biologists understand possible links between genes and the disease and thus focus their further research on the most promising genes, proteins, etc.
    The definitions of node interestingness presented in the thesis, and the methods proposed for finding such nodes, are generic and can in principle be applied to any graph regardless of what the nodes, edges, or weights represent. Experiments on different graphs show that the methods find interesting nodes.
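The incremental method mentioned in this abstract, which picks one interesting node after another so that the chosen nodes are relevant to the given nodes but non-redundant with each other, can be illustrated with a small Python sketch. The scoring rule used here (relevance to the query minus the highest similarity to any node already chosen) is a maximal-marginal-relevance style stand-in, not the thesis's own definition. The function `greedy_select`, the parameter `redundancy_weight`, and the toy similarity values are all hypothetical.

```python
def greedy_select(candidates, query, similarity, k, redundancy_weight=1.0):
    """Greedily pick up to k nodes relevant to `query` but unlike each other.

    similarity: function (node, node) -> float, assumed symmetric.
    Assumption: gain = relevance - redundancy_weight * max similarity
    to the nodes already chosen.
    """
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        def gain(node):
            relevance = similarity(node, query)
            redundancy = max(
                (similarity(node, c) for c in chosen), default=0.0
            )
            return relevance - redundancy_weight * redundancy

        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy word network around "root"; the weights are made up for illustration.
weights = {
    frozenset(("root", "tree")): 0.9,
    frozenset(("root", "leaf")): 0.8,
    frozenset(("root", "equation")): 0.7,
    frozenset(("root", "etymology")): 0.6,
    frozenset(("tree", "leaf")): 0.9,
}


def word_similarity(a, b):
    # Unlisted pairs get a small default similarity.
    return weights.get(frozenset((a, b)), 0.1)


picked = greedy_select(
    ["tree", "leaf", "equation", "etymology"], "root", word_similarity, k=3
)
# picked == ["tree", "equation", "etymology"]
```

In the toy run, "leaf" is skipped despite its high relevance because it is nearly redundant with "tree", so the three selected words cover the botanical, mathematical, and linguistic contexts of "root".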