
    Multivariate time series classification with temporal abstractions

    The increase in the number of complex temporal datasets collected today has prompted the development of methods that extend classical machine learning and data mining methods to time-series data. This work focuses on methods for multivariate time-series classification. Time-series classification is a challenging problem, mostly because the number of temporal features that describe the data and are potentially useful for classification is enormous. We study and develop a temporal abstraction framework for generating multivariate time-series features suitable for classification tasks. We propose the STF-Mine algorithm, which automatically mines discriminative temporal abstraction patterns from the time-series data and uses them to learn a classification model. Our experimental evaluations, carried out on both synthetic and real-world medical data, demonstrate the benefit of our approach in learning accurate classifiers for time-series datasets. Copyright © 2009, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
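
As a rough illustration of what a temporal abstraction can look like, the sketch below turns a numeric series into intervals labelled `low`, `normal` or `high`. It is not the STF-Mine algorithm from the abstract; the function name `value_abstraction`, the thresholds and the sample readings are all hypothetical.

```python
from itertools import groupby


def value_abstraction(series, low, high):
    """Return (state, start_index, end_index) intervals for a numeric series.

    Hypothetical helper: values below `low` map to "low", values above
    `high` map to "high", everything else maps to "normal".
    """
    def state(value):
        if value < low:
            return "low"
        if value > high:
            return "high"
        return "normal"

    intervals = []
    index = 0
    # groupby merges consecutive samples that share the same abstract state.
    for label, run in groupby(series, key=state):
        length = len(list(run))
        intervals.append((label, index, index + length - 1))
        index += length
    return intervals


# Made-up readings for one variable of a multivariate series.
readings = [1, 1, 5, 5, 5, 9]
print(value_abstraction(readings, low=3, high=7))
# [('low', 0, 1), ('normal', 2, 4), ('high', 5, 5)]
```

Interval sequences of this kind, one per variable, are the sort of symbolic representation from which temporal patterns could then be mined as classification features.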

    Multiple Ontologies for Integrating Complex Phenotype Datasets

    There has been an emergence of multiple large scale phenotyping projects in the rat model organism community as well as renewed interest in the ongoing phenotype data generated by thousands of researchers using hundreds of rat strains worldwide. Unfortunately, this data is scattered and is neither described nor formatted in a standardized manner. A system to integrate complex phenotype data from multiple sources and facilitate data mining and analysis is being developed using multiple ontologies.

*Introduction*
The potential value of integrating phenotype data from multiple sources (different laboratories, varying techniques to measure similar phenotypes, multiple strains) is enormous. Presented here is a data integration system for complex phenotype data from both large-scale and individual experiments and the taxonomy and ontologies that provide the backbone of this format. RGD, along with Mouse Genome Informatics (MGI) (Blake et al., 2009) and the Animal QTL Database (Hu and Reecy, 2007), is developing a Vertebrate Trait Ontology to represent morphological states and physiological processes to be used to annotate quantitative trait loci (QTL) and other data. RGD has also used the Mammalian Phenotype Ontology (Smith et al., 2005) for several years to indicate the relationship of genomic elements to abnormal phenotypes. The Vertebrate Trait Ontology represents what is being assessed, and the Mammalian Phenotype Ontology represents the conclusion that was made. The system presented here represents what was done to measure the trait in order to reach the conclusion. Because of the close relationship among these ontologies, care is being taken to ensure compatibility and similarity in structure using the phenotype properties in the Phenotypic Quality Ontology (PATO) for guidance (http://www.bioontology.org/wiki/index.php/PATO:Main_Page).

*Data Format and Ontologies*
Standardized data types and relationships that define the phenotype experiment and its resulting data are being developed, together with the ontologies used to standardize descriptive fields. For phenotype data, the major informational components include Researcher, Study, Experiment, Sample, Experimental Conditions and Clinical Measurement. A Rat Strain Taxonomy has been developed to standardize strain information and provide the relationships among strains, allowing investigators to retrieve and analyze phenotype data for strains that are related genetically. Two important aspects of a phenotype measurement are 1) what was measured and 2) how it was measured. The Clinical Measurement Ontology and the Measurement Method Ontology are being developed to standardize this information. In addition, an Experimental Conditions ontology is under construction to allow integration of data measured under various conditions.

*Pilot Study Results*
Cardiovascular and biochemistry phenotype data from two major datasets have been integrated using the Rat Strain Taxonomy and the three phenotype-related ontologies. A prototype data mining tool (http://rgd.mcw.edu/rgdweb/) has also been developed that provides the user with options to begin a search with strains or any of the ontologies and make subsequent filter choices from the other ontologies. Choices presented to the user are restricted to those for which data is available, and query tracking functions are provided to alert the user to the number of results being returned and the query choices made.

*References*
Blake JA, Bult CJ, Eppig JT, Kadin JA, Richardson JE; Mouse Genome Database Group. _Nucleic Acids Res_. 2009 Jan;37:D712-9.

Hu ZL, Reecy JM. Animal QTLdb: beyond a repository. A public platform for QTL comparisons and integration with diverse types of structural genomic information. _Mamm Genome_. 2007 Jan;18(1):1-4.

Smith CL, Goldsmith CA, Eppig JT. The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. _Genome Biol_. 2005;6(1):R7.

    Fusing data mining, machine learning and traditional statistics to detect biomarkers associated with depression

    BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. METHODS: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9, and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. RESULTS: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and between total bilirubin and current smokers (p<0.001). CONCLUSION: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
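
The abstract does not say how estimates from the 20 imputed data sets were combined. A common choice is Rubin's rules, sketched below for a single coefficient; this is an assumption about the general technique, not the authors' code, it ignores the survey weighting, and the function name `pool_rubin` and the input numbers are hypothetical.

```python
import math
from statistics import fmean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets using Rubin's rules.

    estimates: per-imputation coefficient estimates (e.g. log odds ratios)
    variances: per-imputation squared standard errors
    Returns (pooled_estimate, total_variance). Needs at least two imputations.
    """
    m = len(estimates)
    pooled = fmean(estimates)        # average of the point estimates
    within = fmean(variances)        # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up log odds ratios and variances from three imputed data sets.
log_or, total_var = pool_rubin([0.12, 0.15, 0.14], [0.004, 0.004, 0.005])
print("pooled OR:", math.exp(log_or), "pooled SE:", math.sqrt(total_var))
```

Exponentiating the pooled log odds ratio gives an odds ratio on the same scale as those reported in the abstract.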

    Computational framework to analyze agrometeorological, climate and remote sensing data: challenges and perspectives.

    In the past few years, improvements in data acquisition technology have decreased the time interval of data gathering. Consequently, institutions have stored huge amounts of data such as climate time series and remote sensing images. Computational models to filter, transform, merge and analyze data from many different areas are complex and challenging. The complexity increases even more when combining several knowledge domains. Examples are research in climatic changes, biofuel production and environmental problems. A possible solution to the problem is the association of several computational techniques. Accordingly, this paper presents a framework to analyze, monitor and visualize climate and remote sensing data by employing methods based on fractal theory, data mining and visualization techniques. Initial experiments showed that the information and knowledge discovered from this framework can be employed to monitor sugar cane crops, helping agricultural entrepreneurs to make decisions in order to become more productive. Sugar cane is the main source of ethanol production in Brazil and is strategically important both for the country's economy and for guaranteeing Brazilian self-sufficiency in this important, renewable source of energy. CSBC 2009

    A platform for discovering and sharing confidential ballistic crime data.

    Criminal investigations generate large volumes of complex data that detectives have to analyse and understand. This data tends to be "siloed" within individual jurisdictions, and re-using it in other investigations can be difficult. Investigations into trans-national crimes are hampered by the problem of discovering relevant data held by agencies in other countries and of sharing those data. Gun crimes are one major type of incident that showcases this: guns are easily moved across borders and used in multiple crimes, but finding that a weapon was used elsewhere in Europe is difficult. In this paper we report on the Odyssey Project, an EU-funded initiative to mine, manipulate and share data about weapons and crimes. The project demonstrates the automatic combining of data from disparate repositories for cross-correlation and automated analysis. The data arrive from different cultural/domains with multiple reference models using real-time data feeds and historical databases.

    The potential of text mining in data integration and network biology for plant research : a case study on Arabidopsis

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present an extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for use in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, thereby stimulating the application of text mining data in future plant biology studies.

    Improving efficiency of information measurement system of coal mine air gas protection

    Purpose. To develop scientific approaches to creating high-precision, high-speed optoelectronic measurement systems within the air gas safety complex of coal mines, using proposed and implemented methods and means of improving measurement system efficiency that account for and compensate the effect of destabilizing factors. Methods. Experimental studies were carried out under mine production conditions and in laboratories on physical models of information measurement systems using metrologically certified measuring instruments. Findings. It is proposed to determine the efficiency of the developed information measurement systems as the arithmetic mean over n groups of the geometric mean, computed for each group separately, of the information data rates of the m meters measuring mine atmosphere parameters in coal mines. The use of the developed information system measuring methane and dust concentration within the UTSSC was found to increase the data rate of the mine air gas protection system by 16.5 bits/s. Originality. For the first time, a logical design of an information measurement system for methane and dust concentration has been proposed and implemented which, in contrast to existing ones, is based on increasing the accuracy and response speed of the measuring channels for methane and dust concentration; this increased the probability of detecting explosive situations from 0.90 to 0.98 and enhanced mine air gas protection. Practical implications. The developed methods and techniques made it possible to implement a number of projects for the mining industry: a high-speed system measuring methane concentration for the mine dispatcher telephone communication and notification complex “SAT” (private company “Deyta Express”, Ukraine); and a system measuring polydisperse dust concentration for the unified telecommunication system of supervisory control and automated management of mining machines and technological complexes “UTSSC” (State Enterprise “Petrovsky Plant of Mining Machinery”, Ukraine).
    This work would have been impossible without the financial support of the Ministry of Education and Science of Ukraine during the execution of project No 0115U002655 “Research and development of an experimental sample of optical meter of methane concentration for coal mines”. Additional financial support was provided during the implementation of the Inter-Regional Programme of the European Neighbourhood and Partnership Instrument Tempus VI on the project 544010 – TEMPUS – 1 – 2013 – 1 – DE – TEMPUS – JPHES “TATU: Trainings in Automation Technologies for Ukraine”. The authors express gratitude to the employees of the State Enterprise “Petrovsky Plant of Mining Machinery” and the private company “Deyta Express” for participating in the creation of research sample meters of methane and dust concentration for coal mine conditions, as well as for their support in conducting research in industrial conditions.
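
The efficiency measure described in the Findings (an arithmetic mean over groups of per-group geometric means of meter data rates) can be sketched as follows. This is one reading of the abstract's wording rather than the authors' implementation; the function name `system_efficiency` and the sample rates are hypothetical.

```python
from statistics import fmean, geometric_mean


def system_efficiency(groups):
    """Arithmetic mean over groups of the geometric mean of each group's rates.

    groups: list of n lists, each holding the data rates (bits/s) of the
    meters in one group. All rates must be positive.
    """
    per_group = [geometric_mean(rates) for rates in groups]
    return fmean(per_group)


# Made-up data rates for two groups of two meters each.
rates = [[1.0, 4.0], [9.0, 1.0]]
print(system_efficiency(rates))  # geometric means are about 2 and 3, so about 2.5
```

Using a geometric mean within each group means that a single slow meter pulls its group's score down more than it would under a plain average.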

    Ontology of core data mining entities

    In this article, we present OntoDM-core, an ontology of core data mining entities. OntoDM-core defines the most essential data mining entities in a three-layered ontological structure comprising a specification, an implementation and an application layer. It provides a representational framework for the description of mining structured data, and in addition provides taxonomies of datasets, data mining tasks, generalizations, data mining algorithms and constraints, based on the type of data. OntoDM-core is designed to support a wide range of applications/use cases, such as semantic annotation of data mining algorithms, datasets and results; annotation of QSAR studies in the context of drug discovery investigations; and disambiguation of terms in text mining. The ontology has been thoroughly assessed following the practices in ontology engineering, is fully interoperable with many domain resources and is easy to extend.