787 research outputs found

    Computational Generalization in Taxonomies Applied to: (1) Analyze Tendencies of Research and (2) Extend User Audiences

    We define a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a domain taxonomy. This generalization lifts the set to its “head subject” node in the higher ranks of the taxonomy tree. The head subject is supposed to “tightly” cover the query set, possibly bringing in some errors referred to as “gaps” and “offshoots”. Our method, ParGenFS, globally minimizes a penalty function that combines the numbers of head subjects, gaps, and offshoots, each weighted differently. Two applications are considered: (1) analysis of tendencies of research in Data Science; (2) audience extension for programmatic targeted advertising online. The former involves a taxonomy of Data Science derived from the celebrated ACM Computing Classification System 2012. Based on a collection of research papers published by Springer 1998–2017, and applying in-house methods for text analysis and fuzzy clustering, we derive fuzzy clusters of leaf topics in learning, retrieval and clustering. The head subjects of these clusters inform us of some general tendencies of the research. The latter involves the publicly available IAB Tech Lab Content Taxonomy. Each of about 25 million users is assigned a fuzzy profile within this taxonomy, which is generalized offline using ParGenFS. Our experiments show that these head subjects at least double the size of targeted audiences without losing quality.
    D.F. and B.M. acknowledge continuing support by the Academic Fund Program at the NRU HSE (grant-19-04-019 in 2018–2019) and by the DECAN Lab NRU HSE, in the framework of a subsidy granted to the HSE by the Government of the Russian Federation for the implementation of the Russian Academic Excellence Project “5-100”. S.N. acknowledges the support by FCT/MCTES, NOVA LINCS (UID/CEC/04516/2019).
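The penalty trade-off described in this abstract can be illustrated with a small sketch. This is not the authors' ParGenFS implementation. It assumes a crisp (non-fuzzy) query set, a head-subject weight of 1, and hypothetical names (`generalize`, `Result`, `lam`, `gamma`, the toy `tree`). A gap is taken to be a maximal subtree containing no query leaf, and an offshoot a query leaf left uncovered by any head subject.

```python
from typing import Dict, List, NamedTuple, Set


class Result(NamedTuple):
    penalty: float           # minimum penalty for this subtree
    heads: List[str]         # chosen head subjects
    charged_gaps: List[str]  # gaps lying under a chosen head subject
    all_gaps: List[str]      # every maximal query-free node in the subtree
    offshoots: List[str]     # query leaves not covered by any head subject
    relevant: bool           # True if the subtree contains a query leaf


def generalize(tree: Dict[str, List[str]], node: str, query: Set[str],
               lam: float, gamma: float) -> Result:
    """Bottom-up choice between 'head subject here' and 'decide lower down'.

    Assumed penalty: 1 per head subject, lam per gap, gamma per offshoot.
    """
    children = tree.get(node, [])
    if not children:  # leaf
        if node in query:
            return Result(gamma, [], [], [], [node], True)
        return Result(0.0, [], [], [node], [], False)

    results = [generalize(tree, c, query, lam, gamma) for c in children]
    if not any(r.relevant for r in results):
        # No query leaf below: the node itself is the maximal gap candidate.
        return Result(0.0, [], [], [node], [], False)

    all_gaps = [g for r in results for g in r.all_gaps]
    head_cost = 1.0 + lam * len(all_gaps)         # lift the query set to this node
    split_cost = sum(r.penalty for r in results)  # keep the children's decisions
    if head_cost < split_cost:
        return Result(head_cost, [node], list(all_gaps), all_gaps, [], True)
    return Result(
        split_cost,
        [h for r in results for h in r.heads],
        [g for r in results for g in r.charged_gaps],
        all_gaps,
        [o for r in results for o in r.offshoots],
        True,
    )


# Toy taxonomy (hypothetical): two topics with three and four leaves.
tree = {"root": ["A", "B"],
        "A": ["a1", "a2", "a3"],
        "B": ["b1", "b2", "b3", "b4"]}
best = generalize(tree, "root", {"a1", "a2", "b1"}, lam=0.4, gamma=0.9)
print(best.heads, best.charged_gaps, best.offshoots)
```

With these weights, covering two of the three leaves under `A` is cheaper as one head subject with a single gap, while the lone query leaf under `B` is cheaper to leave as an offshoot than to lift to `B` at the cost of three gaps.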

    Unsupervised Algorithms for Microarray Sample Stratification

    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. To discover hidden patterns, a wide range of analytical techniques has been proposed. Here, we describe the basic methodologies for analyzing microarray datasets, focusing on the task of (sub)group discovery. Peer reviewed.

    Preparing Low Cost Solution Based On Customized Process Of Parallel Clustering Solution

    Big Data analysis is the field of data processing that deals with collections of data sets so large and complex that traditional approaches and technologies struggle to process them, and for which no unified scientific solution exists. Handling large volumes of data and preparing them for deep analysis, so that the mining process receives the information it requires, is the most complex and sometimes the costliest task in real-time settings. There are many solutions for the data mining process, such as clustering, special mining, and k-means mining, to name a few. The real challenge, however, is choosing the correct solution or algorithm for the input data and tuning the processing steps so that the entire mining process is cost-effective. Some solutions mine efficiently but are costly to operate, and for others the reverse holds. Hence there is an ever-increasing demand for a solution that is both cost-effective and efficient. This paper investigates how to implement a concept called parallel clustering, which offers gains in cost and time in data mining without compromising the efficiency and accuracy of the expected result. It discusses one such custom algorithm and compares its performance with that of other solutions.

    Hierarchical community structure in networks

    Modular and hierarchical structures are pervasive in real-world complex systems. A great deal of effort has gone into trying to detect and study these structures. Important theoretical advances in the detection of modular, or "community", structures have included identifying fundamental limits of detectability by formally defining community structure using probabilistic generative models. Detecting hierarchical community structure introduces additional challenges alongside those inherited from community detection. Here we present a theoretical study on hierarchical community structure in networks, which has thus far not received the same rigorous attention. We address the following questions: 1) How should we define a valid hierarchy of communities? 2) How should we determine if a hierarchical structure exists in a network? and 3) How can we detect hierarchical structure efficiently? We approach these questions by introducing a definition of hierarchy based on the concept of stochastic externally equitable partitions and their relation to probabilistic models, such as the popular stochastic block model. We enumerate the challenges involved in detecting hierarchies and, by studying the spectral properties of hierarchical structure, present an efficient and principled method for detecting them. Comment: 22 pages, 12 figures

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

    Machine learning for multivariate time series with the R package mlmts

    Time series data are ubiquitous nowadays. Whereas most of the literature on the topic deals with univariate time series, multivariate time series have typically received much less attention. However, the development of machine learning algorithms for the latter objects has substantially increased in recent years. The R package mlmts attempts to provide a set of widespread data mining techniques for multivariate series. Several functions allowing the execution of clustering, classification, outlier detection and forecasting methods, among others, are included in the package. mlmts also incorporates a collection of multivariate time series datasets often used to test the performance of new classification algorithms. The main characteristics of the package are described and its use is illustrated through various examples. Practitioners from a wide variety of fields could benefit from the general framework provided by mlmts.
    This research has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia, “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received funding for open access charge by University of A Coruña/CISUG.

    Improving water network management by efficient division into supply clusters

    Water is a scarce resource and, as such, must be managed efficiently. One aim of that management should therefore be to reduce water losses and improve the operation of the supply. This requires a working framework built on in-depth knowledge of the distribution network. In real cases, gaining this knowledge is a complex task, because these systems can consist of thousands of consumption nodes, interconnected by thousands of pipes and their corresponding feed elements. Most of the time, these networks are not the product of a single design process but the consequence of years of history, responding to water demands that have grown continuously over time. Dividing the network into what we will call supply clusters makes it possible to obtain the hydraulic knowledge needed to plan and operate the management tasks that guarantee supply to the end consumer. This partition divides distribution networks into small sub-networks that are virtually independent and are fed by a prefixed number of sources. This thesis proposes a suitable framework for establishing efficient ways both to divide the supply network into sectors and to develop new management activities that take advantage of this divided structure. Each of these tasks is approached using kernel methods and multi-agent systems. Spectral clustering and semi-supervised learning are shown to behave well in the paradigm of finding a sectorized network that requires the minimum number of shut-off valves. However, their algorithms become slow (sometimes infeasible) when dividing a large supply network.
    Herrera Fernández, AM. (2011). Improving water network management by efficient division into supply clusters [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/11233

    Parkinson's disease subtypes based on short time series and multi-view clustering (Podtipi parkinsonove bolezni na podlagi kratkih časovnih vrst in gručenja z več pogledi)

    Parkinson's disease (PD) is a progressive brain disorder characterized by movement problems such as tremor, stiffness, slowness of movement and dizziness, as well as non-motor symptoms, which include sleep disorders, constipation, problems concentrating, depression and emotional changes. Due to the clinical heterogeneity of PD, the existence of subtypes of PD patients has been addressed in many clinical and research studies and may contribute to a more personalized treatment and improved quality of life. We apply a methodology for discovering PD patient subtypes to patient data from the Fox Insight study (FI). The data sets are composed of questionnaires containing patient symptoms and medication data collected through routine study visits. Dividing patients into subtypes can be translated into a problem of clustering time series data. We address this problem by using single-view clustering with the k-means algorithm and multi-view spectral clustering. We describe the obtained subtypes with decision rules. Understanding decision making is crucial in medicine, and we use decision trees as simple, explainable tools for describing subtypes. An important part of managing the disease is understanding its progression. By observing the patient's subtype changes between consecutive visits with skip-grams, we analyze the disease progression.
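The skip-gram analysis of subtype changes mentioned in this abstract can be sketched roughly as follows. This is an illustration under assumptions, not the thesis's code. It assumes each patient has an ordered list of subtype labels (one per visit) and takes a skip-gram to be an ordered pair of labels with at most `max_skip` visits between them. The function names and the toy `patients` data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def skip_bigrams(sequence: List[str], max_skip: int) -> List[Tuple[str, str]]:
    """Ordered label pairs with at most `max_skip` visits skipped in between."""
    pairs = []
    for i, first in enumerate(sequence):
        # j runs over the next max_skip + 1 visits, bounded by the sequence end.
        for j in range(i + 1, min(i + max_skip + 2, len(sequence))):
            pairs.append((first, sequence[j]))
    return pairs


def transition_counts(patients: Dict[str, List[str]],
                      max_skip: int = 1) -> Counter:
    """Aggregate subtype-to-subtype skip-gram counts over all patients."""
    counts: Counter = Counter()
    for visits in patients.values():
        counts.update(skip_bigrams(visits, max_skip))
    return counts


# Hypothetical subtype labels per visit for two patients.
patients = {"p1": ["A", "A", "B"], "p2": ["A", "B", "B"]}
for pair, n in transition_counts(patients, max_skip=1).most_common():
    print(pair, n)
```

Setting `max_skip=0` reduces this to counting plain transitions between consecutive visits. Larger values also count changes that span a skipped visit, which may help when a patient's subtype fluctuates from one visit to the next.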