SeLINA: a Self-Learning Insightful Network Analyzer
Understanding the behavior of a network from a large-scale traffic dataset is a challenging problem. Big data frameworks offer scalable algorithms to extract information from raw data, but often require sophisticated fine-tuning and a detailed knowledge of machine learning algorithms. To streamline this process, we propose SeLINA (Self-Learning Insightful Network Analyzer), a generic, self-tuning, simple tool to extract knowledge from network traffic measurements. SeLINA includes different data analytics techniques providing self-learning capabilities to state-of-the-art scalable approaches, jointly with parameter auto-selection to off-load the network expert from parameter tuning. We combine unsupervised and supervised approaches to mine data in a scalable way. SeLINA embeds mechanisms to check if the new data fits the model, to detect possible changes in the traffic, and to trigger model rebuilding, possibly automatically. The result is a system that offers human-readable models of the data with minimal user intervention, supporting domain experts in extracting actionable knowledge and highlighting possibly meaningful interpretations. SeLINA's current implementation runs on Apache Spark. We tested it on large collections of real-world passive network measurements from a nationwide ISP, investigating YouTube and P2P traffic. The experimental results confirmed the ability of SeLINA to provide insights and detect changes in the data that suggest further analyses.
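The abstract does not say how SeLINA checks whether new data fits the model. As a minimal sketch of the general idea only, assuming a model made of cluster centroids, the check below compares the average nearest-centroid distance of a new batch against a reference batch. The function names, the `tolerance` parameter and the toy points are all hypothetical and are not taken from SeLINA.

```python
import math

def nearest_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def mean_fit_error(batch, centroids):
    """Average nearest-centroid distance over a batch (lower means better fit)."""
    return sum(nearest_distance(p, centroids) for p in batch) / len(batch)

def needs_rebuild(reference, new_batch, centroids, tolerance=2.0):
    """Flag a rebuild when the new batch fits more than `tolerance` times
    worse than the reference batch (hypothetical rule, not SeLINA's)."""
    baseline = mean_fit_error(reference, centroids)
    return mean_fit_error(new_batch, centroids) > tolerance * baseline

# Toy data: two centroids and batches of 2-D feature vectors (made up).
centroids = [(0.0, 0.0), (10.0, 10.0)]
reference = [(0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (9.0, 10.0)]
similar = [(0.0, -1.0), (11.0, 10.0)]   # same distribution as the reference
shifted = [(5.0, 5.0), (5.0, 6.0)]      # traffic that no longer fits

print(needs_rebuild(reference, similar, centroids))
print(needs_rebuild(reference, shifted, centroids))
```

A ratio against a reference batch is used here so that the threshold does not depend on the scale of the features; a production system would likely need a more robust statistic.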
A Fuzzy Classification Framework to Identify Equivalent Atoms in Complex Materials and Molecules
The nature of an atom in a bonded structure -- such as in molecules, in
nanoparticles or solids, at surfaces or interfaces -- depends on its local
atomic environment. In atomic-scale modeling and simulation, identifying groups
of atoms with equivalent environments is a frequent task, to gain an
understanding of the material function, to interpret experimental results or to
simply restrict demanding first-principles calculations. While routine, this
task can often be challenging for complex molecules or non-ideal materials with
breaks of symmetry or long-range order. To automate this task, we here
present a general machine-learning framework to identify groups of (nearly)
equivalent atoms. The initial classification rests on the representation of the
local atomic environment through a high-dimensional smooth overlap of atomic
positions (SOAP) vector. Recognizing that not least thermal vibrations may lead
to deviations from ideal positions, we then achieve a fuzzy classification by
mean-shift clustering within a low-dimensional embedded representation of the
SOAP points as obtained through multidimensional scaling. The performance of
this classification framework is demonstrated for simple aromatic molecules and
crystalline Pd surface examples.
Comment: Accepted manuscript in Journal of Chemical Physics. Repositories of
the package (DECAF): DOI:10.17617/3.U7VKBM or
https://gitlab.mpcdf.mpg.de/klai/deca
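The clustering step described in this abstract can be illustrated with a small flat-kernel mean-shift sketch. It assumes the points are already low-dimensional embeddings (the SOAP and multidimensional-scaling steps are not reproduced), it assigns hard labels rather than the fuzzy classification of the paper, and it is not the DECAF implementation. The function names, the bandwidth and the toy coordinates are hypothetical.

```python
import math

def shift_to_mode(point, points, bandwidth, max_iter=100, tol=1e-6):
    """Move a point uphill by repeatedly replacing it with the mean of all
    data points within `bandwidth` of it (flat kernel)."""
    current = point
    for _ in range(max_iter):
        neighbors = [p for p in points if math.dist(p, current) <= bandwidth]
        if not neighbors:
            break
        mean = tuple(sum(coords) / len(neighbors) for coords in zip(*neighbors))
        if math.dist(mean, current) < tol:
            break
        current = mean
    return current

def mean_shift(points, bandwidth):
    """Label each point by the mode it converges to; modes closer than
    half a bandwidth are treated as the same cluster."""
    modes, labels = [], []
    for p in points:
        mode = shift_to_mode(p, points, bandwidth)
        for i, known in enumerate(modes):
            if math.dist(mode, known) < bandwidth / 2:
                labels.append(i)
                break
        else:
            modes.append(mode)
            labels.append(len(modes) - 1)
    return labels, modes

# Toy 2-D embedding with two groups of nearly equivalent points (made up).
embedded = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, modes = mean_shift(embedded, bandwidth=1.0)
print(labels)
```

Mean-shift does not need the number of clusters in advance, which suits the task of finding an unknown number of atom groups; the bandwidth is the parameter that has to be chosen.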
Implementation of an interactive pattern mining framework on electronic health record datasets
Large collections of electronic patient records contain a broad range of clinical information highly relevant for data analysis. However, they are maintained primarily for patient administration, and automated methods are required to extract valuable knowledge for predictive, preventive, personalized and participatory medicine. Sequential pattern mining is a fundamental task in data mining which can be used to find statistically relevant, non-trivial temporal dependencies of events such as disease comorbidities. The objective of this work is to use this mining technique to identify disease associations based on ICD-9-CM code data of the entire Taiwanese population obtained from Taiwan’s National Health Insurance Research Database.
This thesis reports the development and implementation of the Disease Pattern Miner, a pattern mining framework for the medical domain. The framework was designed as a Web application which can be used to run several state-of-the-art sequence mining algorithms on electronic health records, collect and filter the results to reduce the number of patterns to a meaningful size, and visualize the disease associations in a specific population group as an interactive model. This may be crucial to discover new disease associations and offer novel insights to explain disease pathogenesis. A structured evaluation of the data and models is required before medical data scientists can use this application as a tool for further research to get a better understanding of disease comorbidities.
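The abstract does not name the sequence mining algorithms used. As a minimal sketch of the underlying idea only, the example below counts ordered pairs of events (A occurring before B) across patient sequences and keeps those that reach a minimum support. It covers patterns of length two only, the function name is hypothetical, and the event labels are placeholders rather than real ICD-9-CM codes or real patient data.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Return ordered pairs (a, b), with a occurring before b, that appear
    in at least `min_support` sequences. Each sequence counts a pair once."""
    support = Counter()
    for seq in sequences:
        # combinations() keeps the original order, so a precedes b in seq.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        support.update(pairs)
    return {pair: n for pair, n in support.items() if n >= min_support}

# Toy diagnosis histories, one list per patient, in temporal order (made up).
patients = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "A", "C"],
    ["C", "A"],
]
patterns = frequent_ordered_pairs(patients, min_support=2)
print(patterns)
```

Raising `min_support` is the simplest way to reduce the number of reported patterns, which corresponds to the filtering step the abstract mentions.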