
    Efficient Algorithms for Fast Integration on Large Data Sets from Multiple Sources

    Background: Recent large-scale deployments of health information technology have created opportunities to integrate patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual across multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literature that are adept at integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular, more than two) datasets efficiently.

    Methods: Hierarchical-clustering-based solutions are used to integrate multiple (in particular, more than two) datasets. Edit distance is used as the basic distance calculation, and distance calculations for common input errors are also studied. Several techniques have been applied to improve the algorithms in both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the levels above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which predicts whether the distance is within the threshold using upper bounds on the edit distance; and 4) a pre-processing blocking phase that limits dynamic computation to within each block.

    Results: We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.

    Conclusions: In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. Accuracies of 97.7% and 98.1% were achieved for the constant and proportional thresholds, respectively, on a real dataset of 1,083,878 records.
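
    A minimal sketch of the ideas described above (thresholded edit-distance matching, a cheap bound to skip full distance computations, and a blocking phase), assuming made-up record fields and a first-letter block key; it is an illustration, not the paper's implementation, and uses a simple length-difference bound in place of the paper's upper-bound prediction (FCED):

        from collections import defaultdict

        def edit_distance(a, b):
            # Standard dynamic-programming Levenshtein distance.
            m, n = len(a), len(b)
            prev = list(range(n + 1))
            for i in range(1, m + 1):
                cur = [i] + [0] * n
                for j in range(1, n + 1):
                    cur[j] = min(prev[j] + 1,                           # deletion
                                 cur[j - 1] + 1,                        # insertion
                                 prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
                prev = cur
            return prev[n]

        def within_threshold(a, b, t):
            # The length difference never exceeds the edit distance, so pairs
            # failing this cheap test are rejected without the full computation.
            if abs(len(a) - len(b)) > t:
                return False
            return edit_distance(a, b) <= t

        def link_records(records, threshold=2):
            # Blocking phase: only records sharing a crude key (here the first
            # letter of the name) are compared, limiting the quadratic cost.
            blocks = defaultdict(list)
            for rec in records:
                blocks[rec["name"][:1].lower()].append(rec)
            matches = []
            for block in blocks.values():
                for i in range(len(block)):
                    for j in range(i + 1, len(block)):
                        if within_threshold(block[i]["name"], block[j]["name"], threshold):
                            matches.append((block[i]["id"], block[j]["id"]))
            return matches

        data = [{"id": 1, "name": "Jonathan Smith"},
                {"id": 2, "name": "Jonathon Smith"},
                {"id": 3, "name": "Maria Lopez"}]
        print(link_records(data))   # [(1, 2)]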

    Potential of multisensor data and strategies for data acquisition and analysis

    Registration and simultaneous analysis of multisensor images are useful because the multiple data sets can be compressed through image processing techniques to facilitate interpretation; this also allows the integration of other spatial data sets. Techniques being developed to analyze multisensor images involve comparing image data with a library of attributes based on physical properties measured by each sensor. This makes it possible to characterize geologic units by their similarity to the library attributes, as well as to discriminate among them. Several studies can provide information on ways to optimize multisensor remote sensing. Continued analyses of the Death Valley and San Rafael Swell data sets can provide insight into tradeoffs in the spectral and spatial resolutions of the various sensors used to obtain the coregistered data sets. These include imagery from LANDSAT, SEASAT, HCMM, SIR-A, 11-channel VIS-NIR, thermal inertia images, and aircraft L- and X-band radar.
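
    As a rough illustration of the library-matching idea sketched above, the following assumes a hypothetical attribute library of per-sensor responses and classifies a multisensor pixel by its nearest library entry; the unit names and values are invented:

        import numpy as np

        # Hypothetical attribute library: mean responses of a few geologic units
        # across four coregistered sensor channels (values invented for illustration).
        library = {
            "basalt":    np.array([0.12, 0.30, 0.55, 0.20]),
            "limestone": np.array([0.45, 0.50, 0.35, 0.60]),
            "alluvium":  np.array([0.30, 0.40, 0.40, 0.35]),
        }

        def classify_pixel(measurements, library):
            # Assign a multisensor pixel to the unit whose attribute vector it most
            # closely resembles (Euclidean distance used here for simplicity).
            names = list(library)
            dists = [np.linalg.norm(measurements - library[n]) for n in names]
            return names[int(np.argmin(dists))]

        pixel = np.array([0.14, 0.29, 0.52, 0.22])
        print(classify_pixel(pixel, library))   # basalt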

    Combining Test Statistics and Information Criteria for High Dimensional Data Integration

    This research is focused on high-dimensional data integration by combining test statistics or information criteria. Our research contains four projects. Firstly, an integration method is developed to perform hypothesis testing and biomarker selection based on multi-platform data sets observed from normal and diseased populations. Secondly, a non-parametric method is developed to cluster continuous data mixed with categorical data, where modified Chi-squared tests are used to detect cluster patterns on the product space. Thirdly, a weighted integrative AIC criterion is developed for model selection across multiple data sets. Finally, Linhart's and Shimodaira's test statistics are extended to the composite likelihood function to perform model comparison tests for correlated data.
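
    As a sketch of the third project's general idea (not the criterion derived in this work), a weighted combination of per-data-set AIC values could look like the following; the sample-size weights and log-likelihood values are assumptions of this sketch:

        import numpy as np

        def aic(loglik, n_params):
            # Akaike information criterion for one fitted model on one data set.
            return 2 * n_params - 2 * loglik

        def weighted_integrative_aic(logliks, n_params, weights):
            # Illustrative weighted combination across data sets; weights here are
            # proportional to sample size, which is an assumption of this sketch.
            return sum(w * aic(ll, n_params) for ll, w in zip(logliks, weights))

        sample_sizes = np.array([120, 80, 200])
        weights = sample_sizes / sample_sizes.sum()

        # Hypothetical log-likelihoods of two candidate models on three data sets.
        model_a = weighted_integrative_aic([-310.2, -205.7, -520.9], n_params=4, weights=weights)
        model_b = weighted_integrative_aic([-305.1, -204.0, -515.3], n_params=6, weights=weights)
        print("prefer A" if model_a < model_b else "prefer B")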

    Interactive 3-D Visualization: A tool for seafloor navigation, exploration, and engineering

    Recent years have seen remarkable advances in sonar technology, positioning capabilities, and computer processing power that have revolutionized the way we image the seafloor. The massive amounts of data produced by these systems present many challenges but also offer tremendous opportunities in terms of visualization and analysis. We have developed a suite of interactive 3-D visualization and exploration tools specifically designed to facilitate the interpretation and analysis of very large (10's to 100's of megabytes), complex, multi-component spatial data sets. If properly georeferenced and treated, these complex data sets can be presented in a natural and intuitive manner that allows the integration of multiple components, each at its inherent level of resolution and without compromising the quantitative nature of the data. Artificial sun-illumination, shading, and 3-D rendering can be used with digital bathymetric data (DTMs) to form natural-looking and easily interpretable, yet quantitative, landscapes. Color can be used to represent depth or other parameters (such as backscatter or sediment properties) draped over the DTM, or high-resolution imagery can be texture-mapped onto the bathymetric data. When combined with interactive analytical tools, this environment has facilitated the use of multibeam sonar and other data sets in a range of geologic, environmental, fisheries, and engineering applications.
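
    The sun-illumination and colour-draping step can be sketched with off-the-shelf tools; the bathymetric grid below is synthetic and the rendering choices (azimuth, exaggeration, colour map) are assumptions, not those of the toolkit described above:

        import numpy as np
        import matplotlib.pyplot as plt
        from matplotlib.colors import LightSource

        # Synthetic bathymetric grid standing in for a DTM (depths in metres).
        y, x = np.mgrid[0:200, 0:200]
        depth = -2000 + 300 * np.sin(x / 25.0) * np.cos(y / 30.0)

        # Artificial sun-illumination from the north-west, with a depth-based
        # colour map draped over the shaded relief.
        ls = LightSource(azdeg=315, altdeg=45)
        rgb = ls.shade(depth, cmap=plt.cm.viridis, vert_exag=0.1, blend_mode="overlay")

        plt.imshow(rgb)
        plt.title("Shaded-relief rendering of a synthetic DTM")
        plt.axis("off")
        plt.show()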

    An integrated approach to the prediction of domain-domain interactions

    BACKGROUND: The development of high-throughput technologies has produced several large-scale protein interaction data sets for multiple species, and significant efforts have been made to analyze these data sets in order to understand protein activities. Considering that the basic units of protein interactions are domain interactions, it is crucial to understand protein interactions at the level of the domains. The availability of many diverse biological data sets provides an opportunity to discover the underlying domain interactions within protein interactions through an integration of these biological data sets.

    RESULTS: We combine protein interaction data sets from multiple species, molecular sequences, and gene ontology to construct a set of high-confidence domain-domain interactions. First, we propose a new measure, the expected number of interactions for each pair of domains, to score domain interactions based on protein interaction data in one species, and show that it performs similarly to the E-value defined by Riley et al. [1]. Our new measure is applied to the protein interaction data sets from yeast, worm, fruitfly, and human. Second, information on pairs of domains that coexist in known proteins and on pairs of domains with the same gene ontology function annotations is incorporated to construct a high-confidence set of domain-domain interactions using a Bayesian approach. Finally, we evaluate the set of domain-domain interactions by comparing predicted domain interactions with those defined in the iPfam database [2,3], which were derived from protein structures. The accuracy of the predicted domain interactions is also confirmed by comparison with experimentally obtained domain interactions from H. pylori [4]. As a result, a total of 2,391 high-confidence domain interactions are obtained, and these domain interactions are used to unravel detailed protein and domain interactions in several protein complexes.

    CONCLUSION: Our study shows that integration of multiple biological data sets based on the Bayesian approach provides a reliable framework to predict domain interactions. By integrating multiple data sources, the coverage and accuracy of predicted domain interactions can be significantly increased.
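
    The counting behind an observed-versus-expected score for domain pairs can be illustrated on toy data; the proteins, domains, and interactions below are invented, and the expectation used is a simple uniform-interaction baseline rather than the measure defined in the paper:

        from itertools import combinations
        from collections import Counter

        # Toy protein-domain annotations and protein interactions.
        domains = {"P1": {"A", "B"}, "P2": {"B", "C"}, "P3": {"A"}, "P4": {"C"}}
        interactions = {frozenset(("P1", "P2")), frozenset(("P3", "P4"))}

        def domain_pair_counts(domains, interactions):
            observed = Counter()   # interacting protein pairs containing the domain pair
            possible = Counter()   # all protein pairs containing the domain pair
            for p, q in combinations(domains, 2):
                pairs = {frozenset((d1, d2)) for d1 in domains[p] for d2 in domains[q]}
                for dp in pairs:
                    possible[dp] += 1
                    if frozenset((p, q)) in interactions:
                        observed[dp] += 1
            return observed, possible

        observed, possible = domain_pair_counts(domains, interactions)
        n_protein_pairs = len(domains) * (len(domains) - 1) // 2
        p_interact = len(interactions) / n_protein_pairs
        for dp in sorted(possible, key=sorted):
            expected = p_interact * possible[dp]
            print(sorted(dp), "observed:", observed[dp], "expected: %.2f" % expected)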

    A sparse PLS for variable selection when integrating omics data

    Recent biotechnology advances allow multiple types of omics data, such as transcriptomic, proteomic, or metabolomic data sets, to be integrated. The problem of feature selection has been addressed several times in the context of classification, but needs to be handled in a specific manner when integrating data. In this study, we focus on the integration of two-block data measured on the same samples. Our goal is to combine integration and simultaneous variable selection of the two data sets in a one-step procedure using a Partial Least Squares regression (PLS) variant, so as to facilitate the biologists' interpretation. A novel computational methodology called "sparse PLS" is introduced for predictive analysis to address these problems. The sparsity of our approach is achieved with a Lasso penalization of the PLS loading vectors when computing the Singular Value Decomposition. Sparse PLS is shown to be effective and biologically meaningful. Comparisons with classical PLS are performed on a simulated data set and on real data sets. On one data set, a thorough biological interpretation of the obtained results is provided. We show that sparse PLS provides a valuable variable selection tool for highly dimensional data sets. Copyright ©2008 The Berkeley Electronic Press. All rights reserved.
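
    A minimal sketch of the sparsity mechanism described above, namely Lasso-style soft-thresholding of the loading vectors inside a rank-one SVD of the cross-product matrix; the penalty values and random two-block data are assumptions, and this illustrates the idea rather than reproducing the authors' algorithm:

        import numpy as np

        def soft_threshold(v, lam):
            # Lasso-style shrinkage applied to a loading vector.
            return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

        def sparse_pls_component(X, Y, lam_u=1.0, lam_v=1.0, n_iter=100):
            # One sparse PLS component sketched as a penalised rank-one SVD of the
            # cross-product matrix M = X'Y: alternate between the two loading
            # vectors, soft-thresholding each update.
            M = X.T @ Y
            u = np.random.default_rng(0).standard_normal(M.shape[0])
            u /= np.linalg.norm(u)
            v = np.zeros(M.shape[1])
            for _ in range(n_iter):
                v = soft_threshold(M.T @ u, lam_v)
                if np.linalg.norm(v) > 0:
                    v /= np.linalg.norm(v)
                u = soft_threshold(M @ v, lam_u)
                if np.linalg.norm(u) > 0:
                    u /= np.linalg.norm(u)
            return u, v

        rng = np.random.default_rng(1)
        X = rng.standard_normal((50, 20))   # e.g. a transcriptomic block
        Y = rng.standard_normal((50, 8))    # e.g. a metabolomic block
        u, v = sparse_pls_component(X, Y, lam_u=3.0, lam_v=3.0)
        print("selected X variables:", np.flatnonzero(u))
        print("selected Y variables:", np.flatnonzero(v))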

    Smoothing Hazard Functions and Time-Varying Effects in Discrete Duration and Competing Risks Models

    State space or dynamic approaches to discrete or grouped duration data with competing risks or multiple terminating events allow simultaneous modelling and smooth estimation of hazard functions and time-varying effects in a flexible way. Full Bayesian or posterior mean estimation, using numerical integration techniques or Monte Carlo methods, can become computationally rather demanding or even infeasible for higher dimensions and larger data sets. Therefore, based on previous work on filtering and smoothing for multicategorical time series and longitudinal data, our approach uses posterior mode estimation. We thus have to maximize posterior densities or, equivalently, a penalized likelihood, which enforces smoothness of hazard functions and time-varying effects by a roughness penalty. Dropping the Bayesian smoothness prior and adopting a nonparametric viewpoint, one might also start directly from maximizing this penalized likelihood. We show how Fisher scoring smoothing iterations can be carried out efficiently by iteratively applying linear Kalman filtering and smoothing to a working model. This algorithm can be combined with an EM-type procedure to estimate unknown smoothing parameters or hyperparameters. The methods are applied to a larger set of unemployment duration data with one and, in a further analysis, multiple terminating events from the German socio-economic panel GSOEP.
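
    A small sketch of the penalized-likelihood idea for a single discrete-time hazard: a roughness penalty on second differences of the log-odds of the hazard, maximised directly by Fisher scoring (the paper carries out equivalent iterations efficiently via linear Kalman filtering and smoothing; the data and penalty weight below are invented):

        import numpy as np

        def penalized_discrete_hazard(events, at_risk, lam=10.0, n_iter=50):
            # Discrete-time hazard p_t = logistic(beta_t); a roughness penalty on
            # second differences of beta enforces smoothness across intervals.
            T = len(events)
            D = np.diff(np.eye(T), n=2, axis=0)     # second-difference operator
            P = lam * D.T @ D                       # roughness penalty matrix
            beta = np.zeros(T)
            for _ in range(n_iter):
                p = 1.0 / (1.0 + np.exp(-beta))
                score = events - at_risk * p - P @ beta        # penalized score
                info = np.diag(at_risk * p * (1 - p)) + P      # expected information
                beta = beta + np.linalg.solve(info, score)     # Fisher scoring step
            return 1.0 / (1.0 + np.exp(-beta))

        # Hypothetical grouped duration data: event counts and risk-set sizes per interval.
        events = np.array([5, 8, 6, 9, 12, 10, 7, 4, 3, 2])
        at_risk = np.array([200, 180, 160, 140, 120, 100, 85, 70, 60, 50])
        print(np.round(penalized_discrete_hazard(events, at_risk), 3))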

    Integration in the European Retail Banking Sector : Evidence from Deposit and Lending Rates

    This paper investigates the degree of integration in the retail-banking sector for 15 European Union member states over the period 1991 to March 2008. With a view to consolidating and creating a single market in financial services, the EU has launched and implemented major initiatives over the past years. The wholesale banking sector has been studied extensively, but the retail market to a much lesser extent. The difficulty in analysing the integration process in the banking market is linked to the heterogeneity that exists across the European countries with regard to factors such as risk attitudes, cultural differences, and home bias. As a result, it is argued that any convergence process in the banking sector, if present, should rather be perceived as a long-run relationship. Consequently, cointegration analysis, a technique used to capture such long-term relationships between sets of variables, is used to analyse the integration process in the EU retail-banking sector. The starting point in the empirical analysis is multiple structural break analysis: given the significant milestones in the history of the European single market during the period under investigation, the deposit and lending rates corresponding to this period are likely to exhibit structural change, and the timing and pattern of structural breaks should also act as an indicator of retail banking integration. The next steps in the empirical analysis apply stationarity tests to both the time series and panel data, using series that are also individually demeaned to account for structural breaks. Finally, bivariate time-series cointegration analysis is performed between each EU country and a weighted European average rate, on both level and demeaned data.
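
    A brief sketch of the bivariate step, an Engle-Granger cointegration test between a national rate and a weighted European average; the series below are simulated stand-ins, not the deposit and lending rates used in the paper:

        import numpy as np
        from statsmodels.tsa.stattools import coint

        # Simulated monthly rates sharing a common stochastic trend, so that the
        # national series is cointegrated with the weighted European average.
        rng = np.random.default_rng(0)
        n = 207                                                  # roughly Jan 1991 to Mar 2008
        eu_average = 8.0 + np.cumsum(rng.normal(0, 0.08, n))
        national = 0.5 + 0.95 * eu_average + rng.normal(0, 0.15, n)

        # Engle-Granger two-step test (OLS of one rate on the other, then a unit-root
        # test on the residuals), as in the bivariate analysis described above.
        t_stat, p_value, _ = coint(national, eu_average)
        print("Engle-Granger t-statistic: %.2f, p-value: %.3f" % (t_stat, p_value))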