
    NEW METHODS FOR MINING SEQUENTIAL AND TIME SERIES DATA

    Data mining is the process of extracting knowledge from large amounts of data. It covers a variety of techniques aimed at discovering diverse types of patterns according to the requirements of the domain, including association rule mining, classification, cluster analysis and outlier detection. The availability of applications that produce massive amounts of spatial, spatio-temporal (ST) and time series data (TSD) is the rationale for developing specialized techniques to mine such data. In spatial data mining, the spatial co-location rule problem differs from the association rule problem because there is no natural notion of transactions in spatial datasets embedded in continuous geographic space. We have therefore proposed an efficient algorithm (GridClique) to mine interesting spatial co-location patterns (maximal cliques). These patterns serve as the raw transactions for an association rule mining technique that discovers complex co-location rules. Our proposal includes certain types of complex relationships, especially negative relationships, in the patterns. These relationships can be obtained only from the maximal clique patterns, which, to our knowledge, have not previously been used for this purpose. We apply our approach to a well-known astronomy dataset obtained from the Sloan Digital Sky Survey (SDSS). ST data is continuously collected and made accessible in the public domain. We present an approach to mining and querying large ST data with the aim of finding interesting patterns and understanding the underlying process of data generation. An important class of queries is based on the flock pattern: a flock is a large subset of objects moving along paths close to each other for a predefined time.
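The co-location step described above, finding groups of objects that are all mutually within a neighborhood distance, can be sketched generically as building a neighbor graph and enumerating its maximal cliques. This is an illustrative stand-in, not the GridClique algorithm itself (which exploits a grid partition for efficiency); the point set, distance threshold, and function names are hypothetical:

```python
from itertools import combinations

def neighbor_graph(points, d):
    # Connect two points iff their Euclidean distance is <= d.
    g = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= d * d:
            g[i].add(j)
            g[j].add(i)
    return g

def maximal_cliques(g):
    # Basic Bron-Kerbosch enumeration of maximal cliques:
    # r = current clique, p = candidates, x = already-processed vertices.
    out = []
    def bk(r, p, x):
        if not p and not x:
            out.append(frozenset(r))
            return
        for v in list(p):
            bk(r | {v}, p & g[v], x & g[v])
            p.remove(v)
            x.add(v)
    bk(set(), set(g), set())
    return out
```

For example, three points within distance 1.5 of each other plus one isolated point yield two maximal cliques: the triangle and the singleton.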
One approach to processing a “flock query” is to map ST data into high-dimensional space and reduce the query to a sequence of standard range queries that can be answered using a spatial indexing structure; however, the performance of spatial indexing structures deteriorates rapidly in high-dimensional space. This thesis sets out a preprocessing strategy that uses a random projection to reduce the dimensionality of the transformed space. We use probabilistic arguments to prove the accuracy of the projection, and present experimental results showing that the curse of dimensionality can be managed in an ST setting by combining random projections with traditional data structures. In time series data mining, we devised a new space-efficient algorithm (SparseDTW) to compute the dynamic time warping (DTW) distance between two time series that always yields the optimal result, in contrast to other approaches, which typically sacrifice optimality to attain space efficiency. The main idea behind our approach is to dynamically exploit the similarity and/or correlation between the time series: the greater the similarity between the time series, the less space is required to compute the DTW between them. Other techniques for speeding up DTW impose a priori constraints and do not exploit similarity characteristics that may be present in the data. Our experiments demonstrate that SparseDTW outperforms these approaches. By applying the SparseDTW algorithm to a large stock-market dataset of daily index prices from the Australian Stock Exchange (ASX) from 1980 to 2002, we discover an interesting “pairs trading” pattern.
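SparseDTW returns the same optimum as the classic dynamic-programming DTW recurrence; what it avoids is materializing the full cost matrix. A minimal full-matrix version of that recurrence (the O(nm)-space baseline, not the SparseDTW algorithm itself) looks like:

```python
def dtw(a, b):
    # Classic O(n*m) dynamic time warping with absolute-difference
    # local cost; SparseDTW computes the same optimal distance while
    # only materializing cells near similar regions of the two series.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Identical series have distance 0, and shifted series warp onto each other at the cost of their boundary mismatches, e.g. `dtw([1, 2, 3], [2, 3, 4])` is 2.0.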

    Interactive Feature Selection and Visualization for Large Observational Data

    Data can create enormous value in both scientific and industrial fields, especially by providing access to new knowledge and inspiring innovation. With the massive increases in computing power, data storage capacity, and the capability to generate and collect data, the scientific research community is confronting a transformation in how large-scale, complex, high-resolution data sets are exploited for situation awareness and decision-making. Comprehensively analyzing big data problems requires analyses that address multiple aspects, including the effective selection of static and time-varying feature patterns that match the interests of domain users. To fully exploit the ever-growing size of data and computing power in real applications, we propose a general feature analysis pipeline and an integrated system that is general, scalable, and reliable for interactive feature selection and visualization of large observational data for situation awareness. The central challenge tackled in this dissertation is how to effectively identify and select meaningful features in a complex feature space. Our research efforts covered three aspects: 1. enabling domain users to better define their analysis interests; 2. accelerating the process of feature selection; 3. comprehensively presenting intermediate and final analysis results visually. For static feature selection, we developed a series of quantitative metrics that relate user interest to the spatio-temporal characteristics of features. For time-varying feature selection, we proposed the concept of a generalized feature set and used a generalized time-varying feature to describe the selection interest. Additionally, we provide a scalable system framework that manages both data processing and interactive visualization, and effectively exploits the computation and analysis resources.
The methods and the system design together enabled interactive feature selection on two representative large observational data sets, with high spatial and high temporal resolution respectively. The final results support efforts in big data analysis applications that combine statistical methods with high-performance computing techniques to visualize real events interactively.
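As a toy illustration of relating user interest to feature characteristics, one hypothetical metric is the cosine similarity between a feature's characteristic vector and a user-supplied interest vector; the dissertation's actual metrics are spatio-temporal and more elaborate, so the names and data layout here are assumptions:

```python
import math

def rank_features(features, interest):
    # Hypothetical interest metric: cosine similarity between each
    # feature's characteristic vector ("vec") and the user's interest
    # vector; features are returned best match first.
    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = (math.sqrt(sum(a * a for a in u))
               * math.sqrt(sum(b * b for b in v)))
        return num / den if den else 0.0
    return sorted(features, key=lambda f: cosine(f["vec"], interest),
                  reverse=True)
```

A user whose interest vector points along the first characteristic axis would see features aligned with that axis ranked first.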

    Wnt evolution and function shuffling in liberal and conservative chordate genomes

    Background: What impact gene loss has on the evolution of developmental processes, and how function shuffling has affected retained genes driving essential biological processes, remain open questions in the fields of genome evolution and EvoDevo. To investigate these problems, we have analyzed the evolution of the Wnt ligand repertoire in the chordate phylum as a case study. Results: We conduct an exhaustive survey of Wnt genes in genomic databases, identifying 156 Wnt genes in 13 non-vertebrate chordates. This represents the most complete Wnt gene catalog of the chordate subphyla and has allowed us to resolve previous ambiguities about the orthology of many Wnt genes, including the identification of WntA for the first time in chordates. Moreover, we create the first complete expression atlas for the Wnt family during amphioxus development, providing a useful resource for investigating the evolution of Wnt expression throughout the radiation of chordates. Conclusions: Our data underscore extraordinary genomic stasis in cephalochordates, which contrasts with the liberal and dynamic evolutionary patterns of gene loss and duplication in urochordate genomes. Our analysis has allowed us to infer ancestral Wnt functions shared among all chordates, several cases of function shuffling among Wnt paralogs, and unique expression domains for Wnt genes that likely reflect functional innovations in each chordate lineage. Finally, we propose a potential relationship between the evolution of WntA and the evolution of the mouth in chordates.

    Oyster Aquaculture Site Selection Using Landsat 8-derived Sea Surface Temperature, Turbidity, and Chlorophyll a.

    Remote sensing data are useful for the selection of aquaculture sites because they can provide water-quality products mapped at no cost to users. However, the spatial resolution of most ocean color satellites is too coarse to provide usable data within many estuaries. The more recently launched Landsat 8 satellite has both the spatial resolution and the signal-to-noise ratio necessary to provide temperature, as well as ocean-color-derived products, along complex coastlines. The state of Maine (USA) has an abundance of estuarine indentations (~3,500 miles of tidal shoreline within 220 miles of coast) and an expanding aquaculture industry, which makes it a prime case study for using Landsat 8 data to provide products suitable for aquaculture site selection. We collected Landsat 8 scenes over coastal Maine, flagged clouds, atmospherically corrected the top-of-atmosphere radiances, and derived time-varying fields (the repeat time of Landsat 8 is 16 days) of temperature (100 m resolution), turbidity (30 m resolution), and chlorophyll a (30 m resolution). We validated the remote-sensing-based products at several in situ locations along the Maine coast where monitoring buoys and programs are in place. Initial analysis of the validated fields revealed promising areas for oyster aquaculture. The approach used and the data collected to date show potential for other applications in marine coastal environments, including water-quality monitoring and ecosystem management.
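The buoy-based validation step can be sketched as a simple matchup comparison between satellite-derived and in-situ values, summarized by bias and root-mean-square error. This is a generic sketch under assumed data shapes; the study's actual matchup criteria (spatial windows, time tolerances) are not specified here:

```python
import math

def matchup_stats(pairs):
    # pairs: (satellite_value, in_situ_value) tuples from co-located,
    # near-simultaneous observations; returns (bias, rmse).
    n = len(pairs)
    bias = sum(s - m for s, m in pairs) / n
    rmse = math.sqrt(sum((s - m) ** 2 for s, m in pairs) / n)
    return bias, rmse
```

A positive bias would indicate the satellite product systematically overestimates the buoy measurements, while RMSE captures the overall scatter.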

    Engineering Approaches for Improving Cortical Interfacing and Algorithms for the Evaluation of Treatment Resistant Epilepsy

    Epilepsy is a group of disorders that cause seizures in approximately 2.2 million people in the United States. Over 30% of these patients have epilepsies that do not respond to treatment with anti-epileptic drugs. For this population, focal resection surgery could offer long-term seizure freedom. Surgery candidates undergo a myriad of tests and monitoring to determine where and when seizures occur. The “gold standard” method for focus identification involves the placement of electrocorticography (ECoG) grids in the subdural space, followed by continual monitoring and visual inspection of the patient’s cortical activity. This process, however, is highly subjective and uses dated technology. Multiple studies were performed to investigate how the evaluation process could benefit from an algorithmic adjunct using current ECoG technology, and how the use of new microECoG technology could further improve the process. Computational algorithms can quickly and objectively find signal characteristics that may not be detectable by visual inspection, but many assume the data are stationary and/or linear, which biological data are not. An empirical mode decomposition (EMD) based algorithm, which is data driven and requires neither linearity nor stationarity, was developed to detect potential seizures and tested on data collected from eight patients undergoing monitoring for focal resection surgery. The results suggest that a data-driven algorithm suited to biological signals could serve as a useful tool to objectively identify changes in cortical activity associated with seizures. Next, the use of microECoG technology was investigated. Though both ECoG and microECoG grids are composed of electrodes resting on the surface of the cortex, changing the diameter of the electrodes creates non-trivial changes in the physics of the electrode-tissue interface that need to be accounted for.
Experimenting with different recording configurations showed that proper grounding, referencing, and amplification are critical for obtaining high-quality neural signals from microECoG grids. Finally, the relationship between data collected from the cortical surface with micro- and macroelectrodes was studied. Simultaneous recordings with the two electrode types showed differences in power spectra, suggesting that macroelectrodes capture activity, possibly from deep structures, that is not accessible to microelectrodes.
    Doctoral Dissertation, Bioengineering, 201
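As a generic illustration of objective, algorithmic screening of cortical recordings (this is a common line-length baseline, plainly not the thesis's EMD-based method), one can flag windows whose sample-to-sample variation exceeds a threshold:

```python
def line_length(x, win):
    # Sliding-window "line length" feature: the sum of absolute
    # sample-to-sample differences inside each window of width `win`.
    diffs = [abs(x[i] - x[i - 1]) for i in range(1, len(x))]
    return [sum(diffs[i:i + win]) for i in range(len(diffs) - win + 1)]

def flag_windows(x, win, thresh):
    # Indices of windows whose line length exceeds the threshold,
    # i.e. where the signal is unusually active.
    return [i for i, v in enumerate(line_length(x, win)) if v > thresh]
```

On a flat signal followed by an oscillating burst, only the windows overlapping the burst are flagged; a detector like this is objective and fast, but, unlike the EMD approach, it still assumes a fixed feature and threshold.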