An automated ETL for online datasets
While using online datasets for machine learning is commonplace today, the quality of these datasets affects the performance
of prediction algorithms. One method for improving the semantics of new data sources is to map these sources to a common
data model or ontology. While semantic and structural heterogeneities must still be resolved, this provides a well-established
approach to producing clean datasets suitable for machine learning and analysis. However, when online data must be used in
close to real time, a method for dynamic Extract-Transform-Load (ETL) of new source data must be developed.
In this work, we present a framework for integrating online and enterprise data sources, in close to real time, to provide
datasets for machine learning and predictive algorithms. An exhaustive evaluation compares a human-built data transformation
process with our system's machine-generated ETL process, with very favourable results, illustrating the value and impact of
an automated approach.
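
The idea of mapping a new source to a common data model can be sketched as follows. This is only an illustration of a mapping-driven Extract-Transform-Load step, not the framework described in the abstract: the field names, the `PRICE_FEED_MAPPING` table, and the unit conversion are all hypothetical, and in an automated system the mapping would be generated rather than written by hand.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical mapping for one source:
# source field -> (common-model field, value converter).
Mapping = Dict[str, Tuple[str, Callable[[Any], Any]]]

PRICE_FEED_MAPPING: Mapping = {
    "dt": ("date", str),
    "prc_eur_kg": ("price_eur_per_tonne", lambda v: float(v) * 1000.0),
    "ctry": ("country", lambda v: str(v).strip().upper()),
}


def extract(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Extract: the 'source' here is just an in-memory list of dicts."""
    return list(rows)


def transform(rows: List[Dict[str, Any]], mapping: Mapping) -> List[Dict[str, Any]]:
    """Transform: rename fields and convert values into the common model."""
    records = []
    for row in rows:
        record = {}
        for source_field, (target_field, convert) in mapping.items():
            if source_field in row:  # skip fields this row does not carry
                record[target_field] = convert(row[source_field])
        records.append(record)
    return records


def load(records: List[Dict[str, Any]], store: List[Dict[str, Any]]) -> int:
    """Load: append to a target store and report how many records were added."""
    store.extend(records)
    return len(records)


warehouse: List[Dict[str, Any]] = []
raw = [{"dt": "2019-03-01", "prc_eur_kg": "1.25", "ctry": " ie "}]
loaded = load(transform(extract(raw), PRICE_FEED_MAPPING), warehouse)
```

Keeping the mapping as data rather than code is the design choice that matters here: adding a new source then means supplying a new mapping table, while the extract, transform, and load functions stay unchanged.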
Semantic processing of EHR data for clinical research
There is a growing need to semantically process and integrate clinical data
from different sources for clinical research. This paper presents an approach
to integrate EHRs from heterogeneous resources and generate integrated data in
different data formats or semantics to support various clinical research
applications. The proposed approach builds semantic data virtualization layers
on top of data sources, which generate data in the requested semantics or
formats on demand. This approach avoids upfront dumping to and synchronizing of
the data with various representations. Data from different EHR systems are
first mapped to RDF data with source semantics, and then converted to
representations with harmonized domain semantics where domain ontologies and
terminologies are used to improve reusability. It is also possible to further
convert data to application semantics and store the converted results in
clinical research databases, e.g. i2b2, OMOP, to support different clinical
research settings. Semantic conversions between different representations are
explicitly expressed using N3 rules and executed by an N3 Reasoner (EYE), which
can also generate proofs of the conversion processes. The solution presented in
this paper has been applied to real-world applications that process large-scale
EHR data.
Comment: Accepted for publication in Journal of Biomedical Informatics, 2015,
preprint version
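
The two-stage conversion described in the abstract above (source semantics first, then harmonized domain semantics) can be illustrated with a small sketch. It is an assumption-laden stand-in: plain Python tuples play the role of RDF triples, and two lookup tables play the role of the N3 rules that the paper executes with the EYE reasoner. The `src:` and `dom:` prefixes, the field names, and both rule tables are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), standing in for RDF


def to_source_triples(record_id: str, record: Dict[str, str]) -> List[Triple]:
    """Stage 1: lift a flat EHR record into triples that keep the source's field names."""
    subject = f"src:patient/{record_id}"
    return [(subject, f"src:{field}", str(value)) for field, value in record.items()]


# Hypothetical conversion rules (a real system would express these as N3 rules).
PREDICATE_RULES: Dict[str, str] = {
    "src:sex": "dom:gender",
    "src:dob": "dom:birthDate",
}
VALUE_RULES: Dict[Tuple[str, str], str] = {
    ("dom:gender", "M"): "dom:Male",
    ("dom:gender", "F"): "dom:Female",
}


def harmonize(triples: List[Triple]) -> List[Triple]:
    """Stage 2: rewrite source triples into domain semantics; unmapped predicates are dropped."""
    result = []
    for subject, predicate, obj in triples:
        if predicate not in PREDICATE_RULES:
            continue
        domain_predicate = PREDICATE_RULES[predicate]
        # Map coded values to domain terms where a rule exists, else keep the literal.
        result.append((subject, domain_predicate, VALUE_RULES.get((domain_predicate, obj), obj)))
    return result


source = to_source_triples("42", {"sex": "M", "dob": "1980-01-01", "ward": "7B"})
domain = harmonize(source)
```

Because `harmonize` is a pure function of the source triples, it can be called when a consumer requests the data, which loosely mirrors the on-demand virtualization idea; unlike a reasoner, this sketch produces no proof of the conversion.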