
    DAME: A distributed data mining and exploration framework within the virtual observatory

    Nowadays, many scientific areas share the same broad requirement: the ability to deal with massive and distributed datasets while, where possible, integrating with external services and applications. To close the growing gap between the incremental generation of data and our understanding of it, we need to know how to access, retrieve, analyze, mine and integrate data from disparate sources. A fundamental aspect of any new-generation data mining tool or package that aims to become a service for the community is that it can be used within complex workflows, which each user can fine-tune to match the specific demands of their scientific goals. These workflows often need to access different resources (data providers, computing facilities and packages) and require strict interoperability on (at least) the client side. The DAME (DAta Mining & Exploration) project arises from these requirements by providing a distributed Web-based data mining infrastructure specialized in the exploration of massive data sets with soft computing methods. Originally designed for astrophysical use cases, where the first scientific applications have demonstrated its effectiveness, the DAME Suite is a multi-disciplinary, platform-independent tool fully compliant with modern KDD (Knowledge Discovery in Databases) requirements and Information & Communication Technology trends.

    Evaluate Various Techniques of Data Warehouse and Data Mining with Web Based Tool

    Every enterprise must operate proficiently and productively to maintain its survival in the market and increase its share of profits. This challenge becomes more complicated with advances in information technology and the increasing volume and complexity of information. Today, the success of an enterprise is not just the result of the efforts of its resources but also depends upon its ability to mine the stored information. Data warehousing is a collection of decision-making procedures for integrating and managing large, varied data efficiently and scientifically. Data mining helps organizations scrutinize their data more effectively and proficiently to obtain valuable information that can support intelligent and strategic decision making. Data mining comprises several techniques and mathematical algorithms used to mine large data sets to improve organizational performance and strategic decision-making. Clustering is a powerful and widely accepted data mining method used to segregate large data sets into groups of similar objects, giving the end user a clearer view of the database. This study discusses the basic concept of clustering, its meaning and its applications, especially in business for the division and selection of target markets. The technique is useful on the marketing or sales side, for example to send a promotion for a product or service to the right target audience. Association is another well-known data mining technique: a pattern is inferred from an affiliation between items in the same business transaction, which is why it is also referred to as the relation technique. Large enterprises depend on this technique to research customers' buying preferences. For instance, by tracking buying behaviour, retailers might observe that customers who buy dal usually also buy sambar onions, and therefore suggest onions the next time a customer buys dal.
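The dal-and-onion example is an instance of association-rule mining, where rules are scored by support (how often items co-occur) and confidence (how often the consequent appears given the antecedent). A minimal sketch in Python, using made-up transaction data for illustration:

```python
from itertools import combinations
from collections import Counter

def pair_support(transactions):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def confidence(pair_counts, item_counts, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

# Hypothetical shopping baskets echoing the dal/onion example.
transactions = [
    ["dal", "onion", "rice"],
    ["dal", "onion"],
    ["dal", "ghee"],
    ["onion", "rice"],
]
item_counts = Counter(item for basket in transactions for item in set(basket))
print(confidence(pair_support(transactions), item_counts, "dal", "onion"))
# 2 of the 3 dal baskets also contain onion -> 0.666...
```

A real system (e.g. Apriori) would also prune candidate itemsets below a minimum support threshold before computing confidence.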
Classification is a data mining concept that differs from the techniques above in that it is based on machine learning and makes use of mathematical techniques such as linear programming, decision trees and neural networks. In classification, enterprises try to build tools that can learn how to assign data items to groups. For instance, a company can define a classification task in an application as: "given all records of employees who offered to resign from the company, predict the number of individuals who are likely to resign in the future." Under such a scenario, the company can classify employee records into two groups, namely "separate" and "retain", and use its data mining software to assign employees to the groups created earlier. Fuzzy logic closely resembles human reasoning in its handling of imperfect information and can be used as a flexible tool to soften the boundaries of a classification so that it suits real problems more closely. The present study discusses the meaning of fuzzy logic, its applications and its distinguishing features. A tool is to be built to examine data mining algorithms and the models behind them, applying clustering as a sample method in the tool to select training data from a large database and thereby reduce complexity and computation time. The k-nearest neighbour method can be used in many applications, from general to specific, to find the requested data within huge data sets. A decision tree is a structure consisting of a root node, branches and leaf nodes: each interior node represents a test on an attribute, each branch denotes an outcome of the test, and each leaf node represents a class label. The topmost node in the tree is the root node. Within a decision tree, we start with a simple question that has multiple answers.
Each answer leads to a further question that helps classify or identify the data so that it can be categorized, or so that a prediction can be made from each answer. Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used to estimate the likelihood of a specific variable, given the presence of other variables. The outlier detection technique refers to the observation of data items in a dataset that do not match an expected pattern or behaviour. It can be used in a variety of domains, such as intrusion detection and fraud or fault detection, and is also called outlier analysis or outlier mining. The sequential patterns technique helps to find similar patterns or trends in transaction data over a definite period.
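The classification and k-nearest-neighbour ideas surveyed above can be sketched in a few lines. Here is a minimal k-NN classifier applied to the hypothetical "separate"/"retain" employee scenario from the text; the coordinates stand in for arbitrary numeric employee features and are invented for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labelled points."""
    nearest = sorted(train, key=lambda point: math.dist(point[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical employee records: two numeric features each, labelled with
# the "separate"/"retain" classes from the scenario in the text.
train = [
    ((1.0, 1.0), "retain"),
    ((1.2, 0.9), "retain"),
    ((5.0, 5.0), "separate"),
    ((5.2, 4.8), "separate"),
    ((4.9, 5.1), "separate"),
]
print(knn_predict(train, (5.0, 4.9)))  # -> separate
```

In practice the features would be scaled first, since k-NN's distance computation is sensitive to units.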

    ScalaParBiBit: Scaling the Binary Biclustering in Distributed-Memory Systems

    Biclustering is a data mining technique that allows us to find groups of rows and columns that are highly correlated in a 2D dataset. Although several software applications exist to perform biclustering, most of them suffer from a high computational complexity that prevents their use on large datasets. In this work we present ScalaParBiBit, a parallel tool to find biclusters in binary data, which are quite common in many research fields such as text mining, marketing or bioinformatics. ScalaParBiBit takes advantage of the special characteristics of these binary datasets, as well as of an efficient parallel implementation and algorithm, to accelerate the biclustering procedure in distributed-memory systems. The experimental evaluation proves that our tool is significantly faster and more scalable than the state-of-the-art tool ParBiBit on a cluster with 32 nodes and 768 cores. Our tool and its reference manual are freely available at https://github.com/fraguela/ScalaParBiBit. This research was supported by the Ministry of Science and Innovation of Spain (TIN2016-75845-P and PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/501100011033), and by the Xunta de Galicia, co-funded by the European Regional Development Fund (ERDF), under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04). We also acknowledge the support of the Centro Singular de Investigación de Galicia "CITIC", funded by Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014-2020 Program) through grant ED431G 2019/01, and the Centro de Supercomputación de Galicia (CESGA) for the use of their resources.
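The core idea behind BiBit-style binary biclustering can be sketched briefly: encode each row as a bit pattern, AND pairs of rows to obtain candidate column patterns, then collect every row that contains each pattern. The following is only an illustrative serial sketch with a made-up matrix, not the authors' code, and it omits the parallelisation and optimisations that ScalaParBiBit provides:

```python
from itertools import combinations

def bibit_biclusters(matrix, min_rows=2, min_cols=2):
    """BiBit-style search on a 0/1 matrix: AND two rows to form a candidate
    column pattern, then gather every row that contains that pattern."""
    n_cols = len(matrix[0])
    as_bits = [int("".join(map(str, row)), 2) for row in matrix]
    seen, result = set(), []
    for i, j in combinations(range(len(matrix)), 2):
        pattern = as_bits[i] & as_bits[j]
        if pattern == 0 or pattern in seen:
            continue  # empty or already-explored column pattern
        seen.add(pattern)
        cols = [c for c in range(n_cols) if pattern >> (n_cols - 1 - c) & 1]
        rows = [r for r, bits in enumerate(as_bits) if bits & pattern == pattern]
        if len(rows) >= min_rows and len(cols) >= min_cols:
            result.append((rows, cols))
    return result

# Made-up binary dataset: rows 0, 1 and 3 share 1s in columns 0 and 1.
data = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
print(bibit_biclusters(data))  # includes ([0, 1, 3], [0, 1])
```

The bit-level AND is what makes the binary case cheap: a candidate bicluster over thousands of columns is tested with a handful of machine-word operations.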

    Novel Algorithm Development for ‘Next-Generation’ Sequencing Data Analysis

    In recent years, the decreasing cost of ‘next-generation’ sequencing (NGS) has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for the analysis of the new types of data generated by these technologies has struggled to keep up. As a result, the large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation. This thesis focuses on the development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics: computational prioritisation/identification of disease gene variants, and identification of RNA N6-adenosine methylation from sequencing data. The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology, along with its current applications and perspectives. Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that uses data mining of tissue-specific gene expression profiles (Chapter 3). An alternative approach to candidate variant prioritisation, leveraging functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies, is then investigated (Chapter 4). Chapter 5 discusses N6-adenosine methylation, a recently re-discovered post-transcriptional modification of RNA.
    The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case-study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyl-adenosine RNA-binding protein and its possible roles in the progression of viral infection.