3 research outputs found

    Speeding-Up Hierarchical Agglomerative Clustering in Presence of Expensive Metrics

    Abstract. In several contexts and domains, hierarchical agglomerative clustering (HAC) offers the best-quality results, but at the price of a high complexity that limits the size of the datasets it can handle. In some contexts, in particular, computing distances between objects is the most expensive task. In this paper we propose a pruning heuristic aimed at improving performance in these cases, which is well integrated into all phases of the HAC process and can be applied to two HAC variants: single-linkage and complete-linkage. After describing the method, we provide some theoretical evidence of its pruning power, followed by an empirical study of its effectiveness over different data domains, with a special focus on dimensionality issues.

    Symmetrical classification of service system objects

    Two original procedures for clustering spatially uncertain data are proposed in this dissertation. The first procedure clusters spatial data based on bisector division of space, using comparison and cluster pruning. It significantly reduces the number of expected-distance calculations and speeds up cluster pruning in comparison to existing procedures. The second procedure divides the data set area using spatial relations among objects to increase the possibility of parallel processing, which enables parallel execution of the clustering process. It requires no additional investment in equipment, because it can run on a multi-core computer. The presented procedures are used to create a model for predicting the behaviour of a service-oriented system by clustering existing data about the objects of that system. This model can predict the requirements placed on the service-oriented system and prepare according to those requirements. Costs are reduced, and the number of tasks that the service-oriented system can perform is increased.

    Privacy by Design in Data Mining

    Privacy is an ever-growing concern in our society: the lack of reliable privacy safeguards in many current services and devices is one reason their diffusion is often more limited than expected. Moreover, people feel reluctant to provide true personal data unless it is absolutely necessary. Thus, privacy is becoming a fundamental aspect to take into account when one wants to use, publish and analyze data involving sensitive information. Many recent research works have focused on the study of privacy protection: some of these studies aim at individual privacy, i.e., the protection of sensitive individual data, while others aim at corporate privacy, i.e., the protection of strategic information at the organization level. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store and analyze complex data which describe human activities in great detail and resolution. As a result, anonymization simply cannot be accomplished by de-identification. In the last few years, several techniques for creating anonymous or obfuscated versions of data sets have been proposed, which essentially aim to find an acceptable trade-off between data privacy on the one hand and data utility on the other. So far, the common result obtained is that no general method exists which is capable of both dealing with “generic personal data” and preserving “generic analytical results”. In this thesis we propose the design of technological frameworks to counter the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of data mining technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start. Therefore, we propose the privacy-by-design paradigm, which sheds new light on the study of privacy protection: once specific assumptions are made about the sensitive data and the target mining queries that are to be answered with the data, it is conceivable to design a framework to: a) transform the source data into an anonymous version with a quantifiable privacy guarantee, and b) guarantee that the target mining queries can be answered correctly using the transformed data instead of the original ones. This thesis investigates two new research issues which arise in modern Data Mining and Data Privacy: individual privacy protection in data publishing while preserving specific data mining analysis, and corporate privacy protection in data mining outsourcing.