336 research outputs found
An analysis of extensible modelling for functional genomics data
BACKGROUND: Several data formats have been developed for large scale biological experiments, using a variety of methodologies. Most data formats contain a mechanism for allowing extensions to encode unanticipated data types. Extensions to data formats are important because the experimental methodologies tend to be fairly diverse and rapidly evolving, which hinders the creation of formats that will be stable over time. RESULTS: In this paper we review the data formats that exist in functional genomics, some of which have become de facto or de jure standards, with a particular focus on how each domain has been modelled, and how each format allows extensions. We describe the tasks that are frequently performed over data formats and analyse how well each task is supported by a particular modelling structure. CONCLUSION: From our analysis, we make recommendations as to the types of modelling structure that are most suitable for particular types of experimental annotation. There are several standards currently under development that we believe could benefit from systematically following a set of guidelines
Deep Clustering for Data Cleaning and Integration
Deep Learning (DL) techniques now constitute the state-of-the-art for
important problems in areas such as text and image processing, and there have
been impactful results that deploy DL in several data management tasks. Deep
Clustering (DC) has recently emerged as a sub-discipline of DL, in which data
representations are learned in tandem with clustering, with a view to
automatically identifying the features of the data that lead to improved
clustering results. While DC has been used to good effect in several domains,
particularly in image processing, the impact of DC on mainstream data
management tasks remains unexplored. In this paper, we address this gap by
investigating the impact of DC in data cleaning and integration tasks,
specifically schema inference, entity resolution, and domain discovery, tasks
that represent clustering from the perspective of tables, rows, and columns,
respectively. In this setting, we compare and contrast several DC and non-DC
clustering algorithms using standard benchmarks. The results show, among other
things, that the most effective DC algorithms consistently outperform non-DC
clustering algorithms for data integration tasks. However, we observed a
significant correlation between the DC method and embedding approaches for
rows, columns, and tables, highlighting that a suitable combination can
enhance the efficiency of DC methods.
Comment: The following enhancements have been carried out in the updated
version of the manuscript: evaluated each data integration problem on
additional datasets; added more DC and SC methods to the evaluation;
discussed algorithm-specific observations.
A Critical and Integrated View of the Yeast Interactome
Global studies of protein–protein interactions are crucial to both elucidating gene
function and producing an integrated view of the workings of living cells. High-throughput
studies of the yeast interactome have been performed using both genetic
and biochemical screens. Despite their size, the overlap between these experimental
datasets is very limited. This could be due to each approach sampling only a small
fraction of the total interactome. Alternatively, a large proportion of the data from
these screens may represent false-positive interactions. We have used the Genome
Information Management System (GIMS) to integrate interactome datasets with
transcriptome and protein annotation data and have found significant evidence that
the proportion of false-positive results is high. Not all high-throughput datasets are
similarly contaminated, and the tandem affinity purification (TAP) approach appears
to yield a high proportion of reliable interactions for which corroborating evidence
is available. From our integrative analyses, we have generated a set of verified
interactome data for yeast
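
The limited overlap between screens mentioned above can be illustrated with a minimal sketch. It is not the GIMS analysis itself: it simply assumes each dataset is a list of unordered protein pairs and measures their Jaccard overlap. The protein names (`P1`, `P2`, ...) and the two screens are hypothetical placeholders, not data from the paper.

```python
def normalise(pairs):
    """Treat each interaction as an unordered pair, so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def jaccard_overlap(screen_a, screen_b):
    """Fraction of all reported interactions that both screens report."""
    a, b = normalise(screen_a), normalise(screen_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical interaction lists (assumed for illustration only).
screen_a = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
screen_b = [("P2", "P1"), ("P5", "P6")]

overlap = jaccard_overlap(screen_a, screen_b)
print(f"Jaccard overlap: {overlap:.2f}")
```

A low value on its own does not distinguish between sparse sampling and false positives, which is why the abstract brings in corroborating transcriptome and annotation data.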
Deep Clustering for Data Cleaning and Integration
Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the potential of DC for data management tasks remains unexplored. In this paper, we address this gap by investigating the suitability of DC for data cleaning and integration tasks, specifically schema inference, entity resolution and domain discovery, from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. Experiments also show consistently strong performance compared with state-of-the-art bespoke algorithms for each of the data integration tasks
Dynamic decision making for situational awareness using drones: Requirements, identification and comparison of decision support methods
Decision makers increasingly operate in real-time information-rich environments, where limited time is available for interpreting data to inform decisions. These environments are driven by static or mobile sensing devices that can provide numerous dynamic data points. A prominent approach in this space is to utilise drones, which can be deployed to gather targeted information. However, deciding how best to deploy available drones is nontrivial, and stands to benefit from decision support aids that plan routes. Such a system must operate under time constraints created by the changing attributes of routes as the situation unfolds. This study describes a dynamic decision support system (DSS) for situational awareness with drones. The system applies Multi-Criteria Decision Making (MCDM) methods within a dynamic genetic algorithm to provide a continuously revised ranking of routes. Five desiderata for dynamic decision support are presented. It is shown how a dynamic DSS can be equipped with declarative specification of preferences (Desideratum 1), dynamic revision of recommendations (Desideratum 2), and high diversity of options (Desideratum 3). The study then compares four MCDM methods, namely the Weighted Product Model (WPM), the Analytic Hierarchy Process (AHP), the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), and the Preference Ranking Organization METHod for Enrichment Evaluation (PROMETHEE), with regard to how consistently they trade off between criteria (Desideratum 4) and the stability of results under small changes to criteria values (Desideratum 5). To evaluate the trade-offs between criteria we analyse the smoothness of change in criteria outcomes as criteria weightings increase for each algorithm. The outcomes are calculated by automating the selection of routes in a case study that applies drones to the task of harbour management. The stability of results for the different MCDM methods is compared.
Perturbations were applied to sets of routes ranked by each algorithm, then each algorithm was reapplied and the magnitude of the changes in ranking was assessed. Overall, TOPSIS was found to be the algorithm which made the most consistent trade-offs between criteria, only under-performing another algorithm with respect to a single criterion. AHP and WPM were the next most consistent algorithms and PROMETHEE was the least consistent algorithm. TOPSIS was also found to be the most stable method under small changes to criteria values. AHP was the second most stable, followed by PROMETHEE and WPM respectively. The results show that TOPSIS achieves the best result for both Desiderata 4 and 5 and consequently the study finds TOPSIS to be an appropriate MCDM method for dynamic decision support.
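
For readers unfamiliar with TOPSIS, the ranking step named in the abstract above can be sketched as follows. This is a textbook formulation (vector normalisation, weighting, distance to ideal and anti-ideal points), not the study's implementation, and it omits the dynamic genetic algorithm entirely. The route names, criteria (distance, coverage, risk), values and weights are all hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Score alternatives (rows) by relative closeness to the ideal solution.

    matrix  -- one row per alternative, one column per criterion
    weights -- relative importance of each criterion (normalised internally)
    benefit -- True where higher values are better, False where lower is better
    """
    n_crit = len(weights)
    total = sum(weights)
    w = [x / total for x in weights]

    # Vector-normalise each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[w[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]

    # Ideal and anti-ideal points, taking the criterion direction into account.
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]

    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal point
        d_neg = math.dist(row, anti)   # distance to the anti-ideal point
        denom = d_pos + d_neg
        scores.append(d_neg / denom if denom else 0.5)
    return scores


# Hypothetical routes: [distance_km (cost), coverage (benefit), risk (cost)].
route_names = ["A", "B", "C"]
criteria = [[10.0, 0.9, 0.2], [20.0, 0.5, 0.6], [12.0, 0.7, 0.3]]

scores = topsis(criteria, weights=[1.0, 2.0, 1.0], benefit=[False, True, False])
ranking = [name for _, name in sorted(zip(scores, route_names), reverse=True)]
print(ranking)  # route A is best on every criterion here, so it ranks first
```

A stability check in the spirit of Desideratum 5 could reuse `topsis` by perturbing `criteria` slightly and comparing the resulting rankings, though the perturbation scheme used in the study is not specified in the abstract.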
SBRML: A markup language for associating systems biology data with models
MOTIVATION: Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities to the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. RESULTS: We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results