608 research outputs found

    An overview of emerging pattern mining in supervised descriptive rule discovery: taxonomy, empirical study, trends, and prospects

    Emerging pattern mining is a data mining task that aims to discover discriminative patterns that describe emerging behavior with respect to a property of interest. In recent years, the description of datasets has become an active field because experts can readily acquire knowledge from such descriptions. In this review, we focus on the descriptive point of view of the task. We collect the existing approaches that have been proposed in the literature and group them in a taxonomy in order to obtain a general view of the task. A complete empirical study demonstrates the suitability of the approaches presented. This review also presents future trends and emerging prospects within pattern mining and the benefits of knowledge extracted from emerging patterns. Funding: Spanish Ministry of Economy and Competitiveness under the project TIN2015-68454-R (FEDER Funds).

    Supervised Descriptive Rule Discovery: A Survey of the State-of-the-Art

    The supervised descriptive rule discovery concept groups a set of data mining techniques whose objective is to describe data with respect to a property of interest. Among the techniques within this concept are subgroup discovery, emerging patterns and contrast sets. This contribution presents the supervised descriptive rule discovery concept within the data mining literature. Specifically, it is important to note the main difference with respect to other existing techniques within classification or description. In addition, a survey of the state of the art of the different techniques within supervised descriptive rule discovery throughout the literature is provided. The paper allows experts to analyse the compatibilities between terms and heuristics of the different data mining tasks within this concept

    11th German Conference on Chemoinformatics (GCC 2015) : Fulda, Germany. 8-10 November 2015.


    Maturases and Group II Introns in the Mitochondrial Genomes of the Deepest Jakobid Branch

    Ophirinina is a recently described suborder of jakobid protists (Excavata) with only one described species to date, Ophirina amphinema. Despite the acquisition and analysis of massive transcriptomic and mitogenomic sequence data from O. amphinema, its phylogenetic position among excavates remained inconclusive, branching as sister group either to all Jakobida or to all Discoba. From a morphological perspective, it has not only several typical jakobid features but also unusual traits for this group, including the morphology of mitochondrial cristae (sac-shaped to flattened-curved cristae) and the presence of two flagellar vanes. In this study, we have isolated, morphologically characterized, and sequenced genome and transcriptome data of two new Ophirinina species: Ophirina chinija sp. nov. and Agogonia voluta gen. et sp. nov. Ophirina chinija differs from O. amphinema in having rounded cell ends, subapically emerging flagella and a posterior cell protrusion. The much more distantly related A. voluta has several unique ultrastructural characteristics, including sac-shaped mitochondrial cristae and a complex “B” fiber. Phylogenomic analyses with a large conserved-marker dataset supported the monophyly of Ophirina and Agogonia within the Ophirinina and, more importantly, resolved the conflicting position of ophirinids as the sister clade to all other jakobids. The characterization of the mitochondrial genomes showed that Agogonia differs from all known gene-rich jakobid mitogenomes by the presence of two group II introns and their corresponding maturase protein genes. A phylogenetic analysis of the diversity of known maturases confirmed that the Agogonia proteins are highly divergent from each other and define distant families among the prokaryotic and eukaryotic maturases. 
This opens the intriguing possibility that, compared to other jakobids, Ophirinina may have retained additional mitochondrial elements that may help to understand the early diversification of eukaryotes and the evolution of mitochondria.
Acknowledgments: We thank Dr. Sergey A. Karpov (Zoological Institute RAS) and Dr. Denis V. Tikhonenkov (IBIW RAS) for their helpful discussions on the ultrastructural observations, and Dr. Hwan Su Yoon (Sungkyunkwan University) for the mitochondrial marker alignments. This work was supported by the European Research Council (ERC) Advanced Grants “Protistworld” and “Plast-Evol” (322669 and 787904, respectively) and the Horizon 2020 research and innovation program under the Marie Skłodowska-Curie ITN project SINGEK H2020-MSCA-ITN-2015-675752 (http://www.singek.eu/). G.T. was supported by the 2019 BP 00208 Beatriu de Pinos-3 Postdoctoral Program (BP3; 801370).
Peer Reviewed. Postprint (published version).

    JIDOKA. Integration of Human and AI within Industry 4.0 Cyber Physical Manufacturing Systems

    This book is about JIDOKA, a Japanese management technique coined by Toyota that consists of imbuing machines with human intelligence. The purpose of this compilation of research articles is to show industrial leaders innovative cases of digitization of value creation processes that have allowed them to improve their performance in a sustainable way. This book shows several applications of JIDOKA in the quest towards an integration of human and AI within Industry 4.0 Cyber Physical Manufacturing Systems. From the use of artificial intelligence to advanced mathematical models or quantum computing, all paths are valid to advance in the process of human–machine integration

    The Evolution of Diversity

    Since the beginning of time, the pre-biological and the biological world have seen a steady increase in complexity of form and function based on a process of combination and re-combination. The current modern synthesis of evolution known as the neo-Darwinian theory emphasises population genetics and does not explain satisfactorily all other occurrences of evolutionary novelty. The authors suggest that symbiosis and hybridisation and the more obscure processes such as polyploidy, chimerism and lateral transfer are mostly overlooked and not featured sufficiently within evolutionary theory. They suggest, therefore, a revision of the existing theory including its language, to accommodate the scientific findings of recent decades

    Unravelling Organelle Genome Transcription Using Publicly Available RNA-Sequencing Data

    The study of organelles helped forge theories of genome evolution because of their unconventional genomes and gene expression regimes. The organelle genomics field (~35 years old) has seen the development of next generation sequencing (NGS) techniques and the consequent skyrocketing of genomic and transcriptomic data. However, these data are being underused in studies of organelle genome transcription. My thesis investigates how NGS has affected the field of organelle genomics at both the DNA and RNA levels. First, I demonstrate that although organelle genomes are being sequenced as never before, they remain uncharacterized because they are published mostly as “organelle genome reports”. Then, I show that publicly available RNA-sequencing data represent an untapped data source to study organelle genome transcription. I uncover the widespread pervasive transcription of organelle genomes across eukaryotes and speculate that this mechanism might have influenced the evolution of land plant terrestrialization and trophic mode determination in mixotrophs

    Scalable processing of aggregate functions for data streams in resource-constrained environments

    The fast evolution of data analytics platforms has resulted in an increasing demand for real-time data stream processing. From Internet of Things applications to the monitoring of telemetry generated in large datacenters, a common demand in currently emerging scenarios is the need to process vast amounts of data with low latencies, generally performing the analysis as close to the data source as possible. Devices and sensors generate streams of data across a diversity of locations and protocols. That data usually reaches a central platform that is used to store and process the streams. Processing can be done in real time, with transformations and enrichment happening on the fly, but it can also happen after data is stored and organized in repositories. In the former case, stream processing technologies are required to operate on the data; in the latter, batch analytics and queries are in common use. Stream processing platforms are required to be malleable and to absorb spikes generated by fluctuations in data generation rates. Data is usually produced as time series that have to be aggregated using multiple operators, with sliding windows being one of the most common abstractions used to process data in real time. To satisfy the above-mentioned demands, efficient stream processing techniques that aggregate data with minimal computational cost need to be developed. However, data analytics might require aggregating extensive windows of data. Approximate computing has been a central paradigm in data analytics for decades, used to improve performance and reduce the resources needed, such as memory, computation time, bandwidth or energy. In exchange for these improvements, the aggregated results suffer from a level of inaccuracy that in some cases can be predicted and constrained. 
This doctoral thesis aims to demonstrate that it is possible to have constant-time and memory-efficient aggregation functions with approximate computing mechanisms for constrained environments. In order to achieve this goal, the work has been structured in three research challenges. First, we introduce a runtime to dynamically construct data stream processing topologies based on user-supplied code. These dynamic topologies are built on the fly using a data subscription model defined by the applications that consume data. The subscription-based programming model enables multiple users to deploy their own data-processing services. On top of this runtime, we present the Amortized Monoid Tree Aggregator general sliding window aggregation framework, which seamlessly combines the following features: amortized O(1) time complexity and a worst-case of O(log n) between insertions; it provides both a window aggregation mechanism and a window slide policy that are user programmable; the enforcement of the window sliding policy exhibits amortized O(1) computational cost for single evictions and supports bulk evictions with cost O(log n); and it requires a local memory space of O(log n). The framework can compute aggregations over multiple data dimensions, and has been designed to support decoupling computation and data storage through the use of distributed Key-Value Stores to keep window elements and partial aggregations. Motivated especially by edge computing scenarios, we contribute the Approximate and Amortized Monoid Tree Aggregator (A2MTA). It is, to our knowledge, the first general-purpose programmable sliding window framework that combines constant-time aggregations with error-bounded approximate computing techniques. 
A2MTA uses statistical analysis of the stream data in order to perform approximate aggregations, providing a critical reduction of the resources needed for massive stream data aggregation, and an improvement in performance.
The rapid evolution of data analytics platforms has resulted in increased demand for real-time processing of continuous data streams. From the Internet of Things to the monitoring of telemetry generated in large servers, a recurring demand in emerging scenarios is the need to process large amounts of data with very low latencies, generally processing the data as close to its sources as possible. The data are generated as continuous streams by devices that use a variety of locations and protocols. This data processing can be done in real time, with transformations carried out on the fly, in which case stream processing platforms are necessary. Stream processing platforms must absorb spikes in data rates. The data are generated as time series that are aggregated using multiple operators, with windows being the most common abstraction. To satisfy the required low latencies and malleability, the operators need to have minimal computational cost, even with extensive windows of data to aggregate. Approximate computing has for decades been a relevant paradigm in data analysis where there is a need to improve the performance of various algorithms and reduce their computation time, required memory, bandwidth or energy consumption. In exchange for these improvements, the results may suffer from a lack of exactness that can be estimated and controlled. This doctoral thesis aims to demonstrate that it is possible to have aggregation functions for stream processing that have constant time cost, are memory efficient, and make use of approximate computing. 
To achieve these objectives, this thesis is divided into three challenges. First, we present an environment for the dynamic construction of data stream computation topologies using user code. These topologies are built using a stream subscription model in which data-consuming applications extend the topologies while they are running. This environment allows multiple entities to extend the same topology. On top of this environment, we present a general-purpose framework for aggregating data windows called AMTA (Amortized Monoid Tree Aggregator). This framework combines amortized constant time for all operations, with a logarithmic worst case, and programmability in terms of both aggregation and eviction of elements from the window. Bulk eviction of window elements is considered an atomic operation, with amortized constant cost, and the framework requires local memory space for O(log n) window elements. The framework can compute aggregations over multiple data dimensions and has been designed to decouple computation from data storage, so that the window contents can be distributed across different machines. Motivated by edge computing, we contribute A2MTA (Approximate and Amortized Monoid Tree Aggregator). To our knowledge, it is the first general-purpose framework for window computation that combines constant cost for all its operations with approximate computing techniques with error control. A2MTA uses statistical analysis to perform aggregations with bounded error, critically reducing the resources needed to compute over large amounts of data
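
The abstract above does not describe the internals of AMTA, so the following is not that framework. As a rough illustration of what amortized O(1) sliding window aggregation over a monoid can look like, here is a minimal sketch of the simpler, well-known two-stack technique. The class name `TwoStackWindow` and its methods are hypothetical names chosen for this example. Unlike the bounds quoted for AMTA, a single eviction here can cost O(n) in the worst case, and the window is kept entirely in local memory.

```python
from typing import Callable, Generic, List, Tuple, TypeVar

T = TypeVar("T")


class TwoStackWindow(Generic[T]):
    """FIFO window over a monoid (associative combine plus identity).

    Illustrative two-stack scheme, NOT the AMTA algorithm: insert and
    query are O(1); evict is amortized O(1) but O(n) in the worst case.
    """

    def __init__(self, combine: Callable[[T, T], T], identity: T) -> None:
        self.combine = combine
        self.identity = identity
        # Entries are (value, partial aggregate).
        # front: oldest element on top; aggregate covers the value and
        #        every newer element still stored in `front`.
        # back:  newest element on top; aggregate covers every element
        #        in `back` from the bottom up to this value.
        self.front: List[Tuple[T, T]] = []
        self.back: List[Tuple[T, T]] = []

    def insert(self, value: T) -> None:
        """Append the newest element to the window."""
        agg = self.combine(self.back[-1][1], value) if self.back else value
        self.back.append((value, agg))

    def evict(self) -> T:
        """Remove and return the oldest element of the window."""
        if not self.front:
            # Move everything from back to front, rebuilding aggregates
            # so that element order (old to new) is preserved.
            while self.back:
                value, _ = self.back.pop()
                agg = self.combine(value, self.front[-1][1]) if self.front else value
                self.front.append((value, agg))
        if not self.front:
            raise IndexError("evict from an empty window")
        return self.front.pop()[0]

    def query(self) -> T:
        """Aggregate of the whole window, oldest to newest."""
        older = self.front[-1][1] if self.front else self.identity
        newer = self.back[-1][1] if self.back else self.identity
        return self.combine(older, newer)


# Usage: a window over string concatenation, a non-commutative monoid,
# which checks that element order is respected.
window = TwoStackWindow(lambda a, b: a + b, "")
for ch in "abc":
    window.insert(ch)
first_out = window.evict()
window.insert("d")
```

Because only associativity and an identity are required, the same class works for sums, maxima, or concatenation without modification; that generality is the usual motivation for phrasing window aggregation in terms of monoids.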

    Estimating Movement from Mobile Telephony Data

    Mobile enabled devices are ubiquitous in modern society. The information gathered by their normal service operations has become one of the primary data sources used in the understanding of human mobility, social connection and information transfer. This thesis investigates techniques that can extract useful information from anonymised call detail records (CDR). CDR consist of mobile subscriber data related to people in connection with the network operators, the nature of their communication activity (voice, SMS, data, etc.), duration of the activity and starting time of the activity and servicing cell identification numbers of both the sender and the receiver when available. The main contributions of the research are a methodology for distance measurements which enables the identification of mobile subscriber travel paths and a methodology for population density estimation based on significant mobile subscriber regions of interest. In addition, insights are given into how a mobile network operator may use geographically located subscriber data to create new revenue streams and improved network performance. A range of novel algorithms and techniques underpin the development of these methodologies. These include, among others, techniques for CDR feature extraction, data visualisation and CDR data cleansing. The primary data source used in this body of work was the CDR of Meteor, a mobile network operator in the Republic of Ireland. The Meteor network under investigation has just over 1 million customers, which represents approximately a quarter of the country’s 4.6 million inhabitants, and operates using both 2G and 3G cellular telephony technologies. Results show that the steady state vector analysis of modified Markov chain mobility models can return population density estimates comparable to population estimates obtained through a census. 
Evaluated using a test dataset, results of travel path identification showed that developed distance measurements achieved greater accuracy when classifying the routes CDR journey trajectories took compared to traditional trajectory distance measurements. Results from subscriber segmentation indicate that subscribers who have perceived similar relationships to geographical features can be grouped based on weighted steady state mobility vectors. Overall, this thesis proposes novel algorithms and techniques for the estimation of movement from mobile telephony data addressing practical issues related to sampling, privacy and spatial uncertainty
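
The abstract mentions steady state vector analysis of "modified" Markov chain mobility models without giving details, so the sketch below shows only the textbook building block: the stationary distribution of a plain row-stochastic transition matrix, found by power iteration. The function name `steady_state`, the three-cell matrix, and the subscriber total are all hypothetical values made up for illustration; they are not taken from the thesis.

```python
from typing import List


def steady_state(P: List[List[float]], tol: float = 1e-12,
                 max_iter: int = 10_000) -> List[float]:
    """Approximate pi with pi = pi P for a row-stochastic matrix P.

    Plain power iteration from the uniform distribution; assumes the
    chain is irreducible and aperiodic so that the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical transition probabilities between three cell regions:
# P[i][j] is the chance a subscriber seen in region i is next seen in j.
P = [
    [0.80, 0.15, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.30, 0.60],
]
pi = steady_state(P)

# Scaling the long-run shares by an assumed subscriber total gives a
# rough per-region occupancy estimate.
total_subscribers = 1_000_000  # assumption for illustration only
estimate = [share * total_subscribers for share in pi]
```

Under these assumptions, each entry of `pi` is the long-run fraction of time a subscriber spends in the corresponding region, which is the quantity a density estimate of this kind would compare against census figures.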

    Modelos descriptivos basados en aprendizaje supervisado para el tratamiento de big data y flujos continuos de datos

    This thesis analyses in depth the tasks of subgroup discovery and emerging pattern mining as applied to complex problems such as big data and data streams, among others. In addition, various open problems in this area are highlighted. In particular, for subgroup discovery it presents an analysis of the influence of data noise on the main evolutionary fuzzy systems developed; a software package for the R platform with the main algorithms based on evolutionary fuzzy systems; and an analysis of the behaviour of the main approaches on multi-instance problems, by means of adaptations of those approaches. With respect to emerging pattern mining, it presents a review of the main approaches developed for the task from the descriptive point of view and three proposals based on evolutionary fuzzy systems: one focused on improving the quality of the extracted knowledge from the descriptive point of view; another focused on performing this extraction in the big data setting; and a final method focused on the context of data stream mining. The results obtained show that the proposed methods yield quality knowledge capable of supporting expert decision-making in complex problems.
    In this thesis the subgroup discovery and emerging pattern mining tasks for the resolution of complex problems, such as big data and data stream mining, among others, are analysed in depth. Different methods and tools are proposed in order to extract descriptive knowledge from these types of environments. In addition, different open problems in this area are highlighted. 
In particular, for subgroup discovery an analysis of the influence of data noise on the main evolutionary fuzzy systems developed is presented; a software package for the R platform with the main algorithms based on evolutionary fuzzy systems is proposed; and an initial analysis of the behaviour of the main approaches adapted to multi-instance problems, a complex problem on the rise, is shown. With respect to emerging pattern mining, a review of the main approaches developed in the task from a descriptive point of view is presented, together with three developments based on evolutionary fuzzy systems: one focused on improving the quality of the extracted knowledge from a descriptive point of view; another focused on performing this extraction in the big data domain; and a last method focused on the context of data stream mining.
Thesis, Univ. Jaén, Departamento de Informática. Defended on 28 April 2020.