2,741 research outputs found

    How emergent self organizing maps can help counter domestic violence.

    Topographic maps are an appealing exploratory instrument for discovering new knowledge from databases. In recent years, new types of Self Organizing Maps (SOM) have been introduced in the literature, including the Emergent SOM (ESOM). The ESOM is used to study a large set of police reports describing a wide range of violent incidents that occurred during 2007 in the police region Amsterdam-Amstelland (the Netherlands). It is demonstrated that the ESOM provides an exploratory search instrument for examining unstructured text in police reports. First, it is shown how the ESOM was used to discover a range of new features that better distinguish domestic from non-domestic violence cases. Then, it is demonstrated how this resulted in a significant improvement in classification accuracy. Finally, the ESOM is showcased as a powerful instrument for the domain expert interested in an in-depth investigation of the nature and scope of domestic violence.

    Instance reduction approach to machine learning and multi-database mining

    The paper proposes a heuristic instance reduction algorithm as an approach to machine learning and knowledge discovery in centralized and distributed databases. The proposed algorithm is based on an original method for selecting reference instances and creates a reduced training dataset. The reduced training set, consisting of the selected instances, can be used as input for the machine learning algorithms used for data mining tasks. For each instance in the data set, the algorithm calculates the value of its similarity coefficient. Values of the coefficient are used to group instances into clusters. The number of clusters depends on the value of the so-called representation level set by the user. Out of each cluster, only a limited number of instances is selected to form the reduced training set. The proposed algorithm uses a population learning algorithm for the selection of instances. The paper includes a description of the proposed approach and the results of the validating experiment.
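The grouping-and-selection idea in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define the similarity coefficient, so `similarity_coefficient` below is a hypothetical stand-in (a dot product of the instance with per-class column sums), instances are clustered only when their coefficients are exactly equal, the representation level is not modelled, and plain random sampling replaces the population learning algorithm. The names `reduce_instances` and `per_cluster` are likewise assumptions made for illustration.

```python
import random
from collections import defaultdict


def similarity_coefficient(row, column_sums):
    # Hypothetical coefficient (an assumption, not taken from the paper):
    # dot product of the instance with the per-class column sums.
    return sum(x * s for x, s in zip(row, column_sums))


def reduce_instances(X, y, per_cluster=1, seed=0):
    """Return a reduced list of (instance, label) pairs."""
    rng = random.Random(seed)

    # Process each class separately so every class stays represented.
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    reduced = []
    for label, rows in by_class.items():
        column_sums = [sum(col) for col in zip(*rows)]

        # Instances with the same coefficient value form one cluster.
        clusters = defaultdict(list)
        for row in rows:
            clusters[similarity_coefficient(row, column_sums)].append(row)

        # Keep at most `per_cluster` instances from each cluster.
        # Random sampling stands in for the population learning step.
        for coeff in sorted(clusters):
            members = clusters[coeff]
            k = min(per_cluster, len(members))
            for row in rng.sample(members, k):
                reduced.append((row, label))
    return reduced


if __name__ == "__main__":
    X = [[1, 0], [1, 0], [0, 1], [2, 2]]
    y = [0, 0, 0, 1]
    print(reduce_instances(X, y))
```

In this toy input the two identical instances of class 0 fall into one cluster, so only one of them survives the reduction.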

    Some Pattern Recognition Challenges in Data-Intensive Astronomy

    We review some of the recent developments and challenges posed by data analysis in modern digital sky surveys, which are representative of information-rich astronomy in the context of the Virtual Observatory. Illustrative examples include the problems of automated star-galaxy classification in complex and heterogeneous panoramic imaging data sets, and automated, iterative, dynamical classification of transient events detected in synoptic sky surveys. These problems offer good opportunities for productive collaborations between astronomers and applied computer scientists and statisticians, and are representative of the kind of challenges now present in all data-intensive fields. We briefly discuss some emergent types of scalable scientific data analysis systems with broad applicability. Comment: 8 pages, compressed PDF file, figures downgraded in quality in order to match the arXiv size limi

    Parallel and Distributed Data Mining

    Role based behavior analysis

    Master's thesis, Information Security, Universidade de Lisboa, Faculdade de Ciências, 2009.
    Today, the success of a corporation hinges on its agility and its ability to adapt to fast-changing conditions. Proactive workers and an agile IT/IS infrastructure that can support them are requirements for this success. Unfortunately, this is not always the case. Users' network requirements may not be fully understood, which slows down relocation and reorganization. Also, if there is no grasp of the real requirements, the IT/IS infrastructure may not be used efficiently, with waste in some areas and deficiencies in others. Finally, enabling proactivity does not mean full unrestricted access, since this may leave the systems vulnerable to outsider and insider threats. The purpose of the work described in this thesis is to develop a system that can characterize user network behavior. We propose a modular system architecture to extract information from tagged network flows. The process begins by creating user profiles from their network flow information. Then, similar profiles are automatically grouped into clusters, creating role profiles. Finally, the individual profiles are compared against the roles, and the ones that differ significantly are flagged as anomalies for further inspection. Considering this architecture, we propose a model to describe user and role network behavior. We also propose visualization methods to quickly inspect all the information contained in the model. The system and model were evaluated using a real dataset from a large telecommunications operator.
    The results confirm that the roles accurately map similar behavior. The anomaly results were also as expected, considering the underlying population. With the knowledge that the system can extract from the raw data, users' network needs can be better fulfilled and anomalous users can be flagged for inspection, giving an edge in agility to any company that uses it.
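The final step of the pipeline described in this abstract, comparing individual profiles against role profiles and flagging large deviations, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's model: profiles are assumed to be plain numeric vectors, role assignments are assumed to be already available (the clustering step is omitted), a role profile is taken to be the mean of its members, and the Euclidean distance and the `threshold` parameter are hypothetical choices.

```python
import math
from collections import defaultdict


def role_centroids(profiles, roles):
    """Average the member profiles of each role into one role profile."""
    grouped = defaultdict(list)
    for user, vector in profiles.items():
        grouped[roles[user]].append(vector)
    return {
        role: [sum(column) / len(vectors) for column in zip(*vectors)]
        for role, vectors in grouped.items()
    }


def flag_anomalies(profiles, roles, threshold):
    """Return users whose profile lies farther than `threshold`
    (Euclidean distance, an assumed metric) from their role profile."""
    centroids = role_centroids(profiles, roles)
    flagged = [
        user
        for user, vector in profiles.items()
        if math.dist(vector, centroids[roles[user]]) > threshold
    ]
    return sorted(flagged)


if __name__ == "__main__":
    # Toy profiles: two hypothetical per-user traffic features.
    profiles = {
        "a": [1.0, 0.0],
        "b": [1.0, 0.0],
        "c": [1.0, 0.0],
        "d": [10.0, 0.0],
    }
    roles = {user: "r1" for user in profiles}
    print(flag_anomalies(profiles, roles, threshold=5.0))
```

Because the mean is itself pulled toward outliers, a more robust role profile (for example a median) may be preferable on real data; the mean is used here only to keep the sketch short.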

    Decomposable Naive Bayes Classifier for Partitioned Data

    Most learning algorithms are designed to work on a single dataset. However, with the growth of networks, data is increasingly distributed over many databases at many different geographical sites. These databases cannot be moved to other network sites due to security, size, privacy, or data ownership considerations. In this paper, we propose two decomposable versions of the Naive Bayes Classifier, for horizontally and vertically partitioned data. The goal of our algorithms is to achieve the learning objectives for any data distribution encountered across the network by exchanging minimal local summaries among the participating sites.
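For the horizontally partitioned case, the idea of exchanging local summaries can be illustrated with a minimal sketch. It relies on a general property of Naive Bayes with categorical features: the model depends only on class counts and feature-value-per-class counts, and these counts simply add up across sites that hold different rows. The code is not the paper's algorithm; the function names, the Laplace smoothing parameter `alpha`, and the `n_values` argument (number of distinct values per feature) are assumptions made for illustration.

```python
import math
from collections import Counter


def local_summary(rows, labels):
    """Computed at each site: counts only, no raw rows leave the site."""
    class_counts = Counter(labels)
    feature_counts = Counter()
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[(j, value, label)] += 1
    return class_counts, feature_counts


def merge_summaries(summaries):
    """Computed at the coordinator: counts from all sites are added."""
    total_class, total_feature = Counter(), Counter()
    for class_counts, feature_counts in summaries:
        total_class.update(class_counts)
        total_feature.update(feature_counts)
    return total_class, total_feature


def predict(row, class_counts, feature_counts, n_values, alpha=1.0):
    """Pick the class with the highest smoothed log-posterior."""
    n = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n)
        for j, value in enumerate(row):
            numerator = feature_counts[(j, value, label)] + alpha
            denominator = count + alpha * n_values[j]
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    site_a = ([["sunny", "hot"], ["rainy", "cool"]], ["no", "yes"])
    site_b = ([["sunny", "hot"], ["rainy", "hot"]], ["no", "yes"])
    merged = merge_summaries([local_summary(*site_a), local_summary(*site_b)])
    print(predict(["sunny", "hot"], *merged, n_values=[2, 2]))
```

The merged counts are identical to the counts that would be obtained from the pooled data, so the resulting classifier matches a centrally trained one while only the summaries cross the network.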

    Exploring Patterns of Epigenetic Information With Data Mining Techniques

    Data mining, a part of the Knowledge Discovery in Databases (KDD) process, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, thus generating large amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information that are mitotically and/or meiotically heritable and that determine gene expression and cellular differentiation, as well as cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their life and accumulate with ageing. Both defects, either together or individually, can result in loss of control over cell growth and, thus, cause cancer development. Data mining techniques could then be used to extract these patterns. This work reviews some of the most important applications of data mining to epigenetics. Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo; 209RT-0366. Galicia. Consellería de Economía e Industria; 10SIN105004PR. Instituto de Salud Carlos III; RD07/0067/000