A Probabilistic Linear Genetic Programming with Stochastic Context-Free Grammar for solving Symbolic Regression problems
Traditional Linear Genetic Programming (LGP) algorithms rely only on the
selection mechanism to guide the search. Genetic operators combine or mutate
random portions of the individuals without knowing whether the result will
lead to a fitter individual. Probabilistic Model Building Genetic Programming
(PMB-GP) methods were proposed to overcome this issue through a probability
model that captures the structure of fit individuals and uses it to sample new
individuals. This work proposes the use of LGP with a Stochastic Context-Free
Grammar (SCFG) whose probability distribution is updated according to selected
individuals. We propose a method for adapting the grammar to the linear
representation of LGP. Tests performed with the proposed probabilistic method,
and with two hybrid approaches, on several symbolic regression benchmark
problems show that the results are statistically better than those obtained by
traditional LGP.
Comment: Genetic and Evolutionary Computation Conference (GECCO) 2017, Berlin,
Germany
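The core loop the abstract describes — sample programs from an SCFG, then shift its production probabilities toward the productions used by selected individuals — can be sketched as follows. The toy grammar, its symbols, and the Laplace-smoothed update rule are illustrative assumptions, not the paper's exact operators:

```python
import random
from collections import defaultdict

# Toy SCFG for arithmetic expressions. Productions and starting
# probabilities are illustrative, not taken from the paper.
GRAMMAR = {
    "EXPR": [(("EXPR", "OP", "EXPR"), 0.3), (("TERM",), 0.7)],
    "OP":   [(("+",), 0.5), (("*",), 0.5)],
    "TERM": [(("x",), 0.5), (("1",), 0.5)],
}

def sample(grammar, symbol, rng, used, depth=0, max_depth=6):
    """Expand `symbol` into terminals, recording each production used."""
    rules = grammar.get(symbol)
    if rules is None:                      # terminal symbol
        return [symbol]
    if depth >= max_depth:                 # force shortest rule: terminates
        rules = [min(rules, key=lambda r: len(r[0]))]
    prods, probs = zip(*rules)
    prod = rng.choices(prods, weights=probs)[0]
    used.append((symbol, prod))
    tokens = []
    for s in prod:
        tokens += sample(grammar, s, rng, used, depth + 1, max_depth)
    return tokens

def reestimate(grammar, selected_usages, alpha=1.0):
    """Update production probabilities from the productions used by the
    selected individuals (Laplace-smoothed maximum likelihood)."""
    counts = defaultdict(float)
    for used in selected_usages:
        for key in used:
            counts[key] += 1.0
    return {
        nt: [(p, (counts[(nt, p)] + alpha) /
                 sum(counts[(nt, q)] + alpha for q, _ in rules))
             for p, _ in rules]
        for nt, rules in grammar.items()
    }
```

Repeating sample-select-reestimate makes the grammar concentrate probability mass on the production choices found in fit individuals, which is the "probability model" role PMB-GP methods assign to the SCFG.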
The Minimum Description Length Principle for Pattern Mining: A Survey
This is about the Minimum Description Length (MDL) principle applied to
pattern mining. The length of this description is kept to the minimum.
Mining patterns is a core task in data analysis and, beyond issues of
efficient enumeration, the selection of patterns constitutes a major challenge.
The MDL principle, a model selection method grounded in information theory, has
been applied to pattern mining with the aim to obtain compact high-quality sets
of patterns. After giving an outline of relevant concepts from information
theory and coding, as well as of work on the theory behind the MDL and similar
principles, we review MDL-based methods for mining various types of data and
patterns. Finally, we open a discussion on some issues regarding these methods,
and highlight currently active related data analysis problems.
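The two-part MDL idea underlying these methods — choose the pattern set minimizing L(model) + L(data | model) — can be illustrated with a deliberately simplified sketch. The greedy cover and the crude model cost below are assumptions for illustration, not any specific published encoding (e.g. Krimp's):

```python
import math
from itertools import chain

def greedy_cover(transaction, patterns):
    """Cover a transaction with patterns (longest first), falling back
    to singleton items; returns the list of codes used."""
    remaining, used = set(transaction), []
    for p in sorted(patterns, key=len, reverse=True):
        if set(p) <= remaining:
            used.append(p)
            remaining -= set(p)
    used += [(i,) for i in sorted(remaining)]
    return used

def description_length(data, patterns):
    """Simplified two-part MDL score: L(model) + L(data | model).
    Code lengths are Shannon-optimal w.r.t. usage frequencies."""
    covers = [greedy_cover(t, patterns) for t in data]
    usage = {}
    for code in chain.from_iterable(covers):
        usage[code] = usage.get(code, 0) + 1
    total = sum(usage.values())
    # L(D | M): sum of code lengths over all covers
    l_data = sum(-math.log2(usage[c] / total) for cov in covers for c in cov)
    # L(M): crude model cost, one unit per item plus one per code
    l_model = sum(len(p) for p in patterns) + len(usage)
    return l_model + l_data
```

On a small dataset where "a" and "b" co-occur often, adding the pattern ("a", "b") lowers the total description length, which is exactly the signal MDL-based miners use to keep a pattern.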
Gene Regulatory Network Reconstruction Using Dynamic Bayesian Networks
High-content technologies such as DNA microarrays can provide a system-scale overview of how genes interact with each other in a network context. Various mathematical methods and computational approaches have been proposed to reconstruct gene regulatory networks (GRNs), including Boolean networks, information theory, differential equations, and Bayesian networks. GRN reconstruction faces huge intrinsic challenges on both experimental and theoretical fronts, because the inputs and outputs of the molecular processes are unclear and the underlying principles are unknown or too complex.
In this work, we focused on improving the accuracy and speed of GRN reconstruction with a Dynamic Bayesian Network (DBN)-based method. A commonly used structure-learning algorithm is based on REVEAL (Reverse Engineering Algorithm). However, this method has some limitations when used for reconstructing GRNs. For instance, the two-stage temporal Bayes network (2TBN) cannot be well recovered by application of REVEAL; it has low accuracy and speed for high-dimensional networks with more than a hundred nodes; and it cannot even accomplish the task of reconstructing a network with 400 nodes. We implemented an algorithm for DBN structure learning with Friedman's score function to replace REVEAL, tested it on the reconstruction of both synthetic networks and real yeast networks, and compared it with REVEAL in the absence or presence of a preprocessed network generated by Zou and Conzen's algorithm. The new score metric improved the precision and recall of GRN reconstruction. Networks of gene interactions were reconstructed using this DBN approach and analyzed to identify the mechanism of chemical-induced reversible neurotoxicity in earthworms, with tools curating relevant genes from the non-model organism's pathways to model-organism pathways.
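As a hedged illustration of score-based 2TBN structure learning of the kind the abstract contrasts with REVEAL, the sketch below scores candidate parent sets of a node with BIC (a score in the same decomposable family as Friedman's) on binary time series; the data layout, binary variables, and exhaustive search over small parent sets are simplifying assumptions:

```python
import math
from itertools import combinations

def bic(child, parents, series):
    """BIC score of P(child[t+1] | parents[t]) for binary time series.
    `series` maps a variable name to its list of 0/1 values."""
    T = len(series[child]) - 1
    counts = {}
    for t in range(T):
        key = tuple(series[p][t] for p in parents)
        c = counts.setdefault(key, [0, 0])
        c[series[child][t + 1]] += 1
    ll = 0.0
    for c0, c1 in counts.values():
        n = c0 + c1
        for c in (c0, c1):
            if c:
                ll += c * math.log(c / n)
    k = 2 ** len(parents)    # crude parameter count: one per parent config
    return ll - 0.5 * k * math.log(T)

def best_parents(child, candidates, series, max_parents=2):
    """Exhaustive search over small parent sets, keeping the best BIC."""
    best = ((), bic(child, (), series))
    for size in range(1, max_parents + 1):
        for ps in combinations(candidates, size):
            s = bic(child, ps, series)
            if s > best[1]:
                best = (ps, s)
    return best[0]
```

The complexity penalty is what lets the score reject spurious extra parents, which is one reason score-based learning can outperform a pure mutual-information threshold on larger networks.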
On Practical Machine Learning and Data Analysis
This thesis discusses and addresses some of the difficulties
associated with practical machine learning and data
analysis. Introducing data-driven methods in e.g. industrial and
business applications can lead to large gains in productivity and
efficiency, but the cost and complexity are often
overwhelming. Creating machine learning applications in practice often
involves a large amount of manual labour, which often needs to be
performed by an experienced analyst who may lack significant experience
with the application area. We will here discuss some of the hurdles
faced in a typical analysis project and suggest measures and methods
to simplify the process.
One of the most important issues when applying machine learning
methods to complex data, such as that arising in industrial
applications, is that the processes generating the data are modelled
in an appropriate way. Relevant aspects have to be formalised and
represented in a way that allows us to perform our calculations in an
efficient manner. We present a statistical modelling framework,
Hierarchical Graph Mixtures, based on a combination of graphical
models and mixture models. It allows us to create consistent,
expressive statistical models that simplify the modelling of complex
systems. Using a Bayesian approach, we allow for the encoding of prior
knowledge and make the models applicable in situations where
relatively little data are available.
Detecting structures in data, such as clusters and dependency
structure, is very important both for understanding an application
area and for specifying the structure of e.g. a hierarchical graph
mixture. We will discuss how this structure can be extracted for
sequential data. By using the inherent dependency structure of
sequential data, we construct an information-theoretic measure of
correlation that does not suffer from the problems most common
correlation measures have with this type of data.
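One way such an information-theoretic, sequence-aware correlation measure can be sketched is a normalized mutual information maximized over time lags, so that shifted dependencies are still detected. This is a generic construction under assumed definitions, not the thesis's exact measure:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Mutual information (bits) between two aligned discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def lagged_correlation(xs, ys, max_lag=3):
    """Correlation-like measure in [0, 1]: normalized MI maximized over
    time lags, so dependencies hidden by a shift are still found."""
    best = 0.0
    for lag in range(max_lag + 1):
        a, b = xs[:len(xs) - lag], ys[lag:]
        if len(a) < 2:
            continue
        hx, hy = entropy(a), entropy(b)
        if hx == 0 or hy == 0:
            continue
        best = max(best, mutual_information(a, b) / math.sqrt(hx * hy))
    return best
```

Unlike a linear correlation coefficient, this detects arbitrary deterministic dependencies and is insensitive to how the symbol values are encoded.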
In many diagnosis situations it is desirable to perform a
classification in an iterative and interactive manner. The matter is
often complicated by very limited amounts of knowledge and examples
when a new system to be diagnosed is initially brought into use. We
describe how to create an incremental classification system based on a
statistical model that is trained from empirical data, and show how
the limited available background information can still be used
initially for a functioning diagnosis system.
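A minimal sketch of an incrementally trained statistical classifier in this spirit is a categorical naive Bayes model whose Dirichlet-style pseudo-counts stand in for the limited background knowledge available before data arrives. The class and feature names are hypothetical, and the thesis's actual model is more elaborate:

```python
import math

class IncrementalNB:
    """Categorical naive Bayes updated one example at a time.
    The pseudo-count `alpha` acts as a prior, so the classifier gives
    sensible (if weak) answers before much data has been seen."""

    def __init__(self, classes, features, alpha=1.0):
        self.alpha = alpha
        self.classes = list(classes)
        self.features = list(features)
        self.class_counts = {c: 0 for c in self.classes}
        self.feat_counts = {c: {f: {} for f in self.features}
                            for c in self.classes}
        self.values = {f: set() for f in self.features}

    def update(self, x, y):
        """Incorporate one labelled example (dict feature -> value)."""
        self.class_counts[y] += 1
        for f, v in x.items():
            self.values[f].add(v)
            d = self.feat_counts[y][f]
            d[v] = d.get(v, 0) + 1

    def predict(self, x):
        """Return the class with the highest smoothed log-posterior."""
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for c in self.classes:
            n = self.class_counts[c]
            lp = math.log((n + self.alpha) /
                          (total + self.alpha * len(self.classes)))
            for f, v in x.items():
                k = max(len(self.values[f]), 1)
                d = self.feat_counts[c][f]
                lp += math.log((d.get(v, 0) + self.alpha) /
                               (n + self.alpha * k))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Because `update` only increments counts, each new diagnosis example refines the model immediately, which is the interactive behaviour the text calls for.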
To minimise the effort with which results are achieved within data
analysis projects, we need to address not only the models used, but
also the methodology and applications that can help simplify the
process. We present a methodology for data preparation and a software
library intended for rapid analysis, prototyping, and deployment.
Finally, we will study a few example applications, presenting tasks
within classification, prediction and anomaly detection. The examples
include demand prediction for supply chain management, approximating
complex simulators for increased speed in parameter optimisation, and
fraud detection and classification within a media-on-demand system.
Unveiling the frontiers of deep learning: innovations shaping diverse domains
Deep learning (DL) enables the development of computer models that are
capable of learning, visualizing, optimizing, refining, and predicting data. In
recent years, DL has been applied in a range of fields, including audio-visual
data processing, agriculture, transportation prediction, natural language,
biomedicine, disaster management, bioinformatics, drug design, genomics, face
recognition, and ecology. To explore the current state of deep learning, it is
necessary to investigate the latest developments and applications of deep
learning in these disciplines. However, the literature lacks a comprehensive
exploration of deep learning applications across all potential sectors. This paper thus
extensively investigates the potential applications of deep learning across all
major fields of study as well as the associated benefits and challenges. As
evidenced in the literature, DL exhibits high accuracy in prediction and
analysis, which makes it a powerful computational tool, and it has the
ability to refine and optimize itself, making it effective at processing
complex data. At the same time, deep learning requires massive amounts of
data for effective analysis and processing. To handle the challenge of
compiling huge amounts of medical, scientific, healthcare, and environmental
data for use in deep learning, gated architectures such as LSTMs and GRUs can
be utilized. For multimodal learning, shared neurons in the neural network
for all activities and specialized neurons for particular tasks are necessary.
Comment: 64 pages, 3 figures, 3 tables
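As a concrete look at the gated architectures mentioned above, here is a minimal GRU update with scalar state and input. The gate equations follow the standard GRU formulation; the weight names and values in the example are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h, x, p):
    """One GRU update with scalar state; `p` maps weight names to floats.
        z  = sigmoid(Wz*x + Uz*h + bz)     # update gate
        r  = sigmoid(Wr*x + Ur*h + br)     # reset gate
        ht = tanh(Wh*x + Uh*(r*h) + bh)    # candidate state
        h' = (1 - z)*h + z*ht              # gated interpolation
    The update gate lets the cell carry state across long spans of the
    sequence, which is why gated architectures cope with long inputs."""
    z = sigmoid(p["Wz"] * x + p["Uz"] * h + p["bz"])
    r = sigmoid(p["Wr"] * x + p["Ur"] * h + p["br"])
    ht = math.tanh(p["Wh"] * x + p["Uh"] * (r * h) + p["bh"])
    return (1 - z) * h + z * ht
```

Driving the update-gate bias strongly negative makes the cell keep its state untouched; driving it strongly positive overwrites the state with the candidate, which is the mechanism that mitigates vanishing gradients over long sequences.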
Data Enrichment for Data Mining Applied to Bioinformatics and Cheminformatics Domains
Increasingly more complex problems are being addressed in life sciences. Acquiring all the data that may be related to the problem in question is paramount. Equally important is to know how the data are related to each other and to the problem itself. On the other hand, there are large amounts of data and information available on the Web. Researchers are already using Data Mining and Machine Learning as valuable tools in their research, although the usual procedure is to look for information based on induction models.
So far, despite the great successes already achieved using Data Mining and Machine Learning, it is not easy to integrate this vast amount of available information in the inductive process with propositional algorithms. Our main motivation is to address the problem of integrating domain information into the inductive process of propositional Data Mining and Machine Learning techniques by enriching the training data to be used in inductive logic programming systems.
The algorithms of propositional machine learning are very dependent on data attributes. It is still hard to identify which attributes are most suitable for a particular research task. It is also hard to extract relevant information from the enormous quantity of data available. We will consolidate the available data and derive features that ILP algorithms can use to induce descriptions, solving the problems.
We are creating a web platform to obtain information relevant to Bioinformatics (particularly Genomics) and Cheminformatics problems. It fetches data from public repositories of genomic, protein, and chemical data. After the data enrichment, Prolog systems use inductive logic programming to induce rules and solve specific Bioinformatics and Cheminformatics case studies. To assess the impact of the data enrichment with ILP, we compare with the results obtained by solving the same cases using propositional algorithms.
SEMANTICALLY INTEGRATED E-LEARNING INTEROPERABILITY AGENT
Educational collaboration through e-learning is one of the fields that has been
worked on since the emergence of e-learning in educational systems. E-learning
standards (e.g. the learning object metadata standard) and e-learning system
architectures or frameworks, which support interoperation of correlated
e-learning systems, are the technologies proposed to support this
collaboration. However, these technologies have not been successful in
creating boundless educational collaboration through e-learning. In
particular, these technologies impose their own requirements or limitations
and demand challenging effort to apply them to an e-learning system. Thus, a
simpler technology increases the possibility of forging such collaboration.
This thesis explores a suite of techniques for creating an interoperability
tool model in the e-learning domain that can be applied on diverse e-learning
platforms. The proposed model is called the e-learning Interoperability Agent,
or eiA. The scope of eiA focuses on two aspects of e-learning: Learning
Objects (LOs) and the users of e-learning itself. Learning objects that are
accessible over the Web are valuable assets for sharing knowledge in teaching,
training, problem solving, and decision support. Meanwhile, there is still
tacit knowledge that is not documented through LOs but embedded in the form of
users' expertise and experience. Therefore, educational collaboration can be
formed by users of e-learning with a common interest in a specific problem
domain.
The eiA is a loosely coupled model designed as an extension of various
e-learning system platforms. The eiA utilizes XML (eXtensible Markup Language)
technology, which has been accepted as a knowledge representation syntax, to
bridge the heterogeneous platforms. In the end, the use of eiA as a
facilitator to mediate intercommunication between e-learning systems is
intended to drive the creation of a semantically Federated e-learning
Community (FeC). Eventually, maturity of the FeC is driven by users'
willingness to grow the community, by means of increasing the number of
e-learning systems that use eiA and adding new functionalities to eiA.
Parsimony-based genetic algorithm for haplotype resolution and block partitioning
This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. A multilocus LD measure (Normalized Entropy Difference) is used as the block-identification criterion. The proposed algorithm incorporates missing data as part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which represent measures of the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the proposed genetic algorithm provides haplotype-decomposition accuracy within the range of the same indicators shown by the other algorithms. The block structure output by our algorithm in general agrees with the block structure for the same data provided by the other algorithms. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing valuable new features such as scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from a given genotype sample two clusters with substantially different block structures and finds the haplotype resolution and block partitioning for each cluster.
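An entropy-based multilocus LD criterion of the kind named in the abstract can be sketched as the gap between the summed single-locus entropies and the joint haplotype entropy, normalized. This is one plausible reading of "Normalized Entropy Difference", not necessarily the dissertation's exact formula:

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (bits) of a list of discrete observations."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def normalized_entropy_difference(haplotypes, block):
    """Entropy-based multilocus LD over the loci indexed by `block`:
        (sum of single-locus entropies - joint entropy) / sum of entropies
    Values near the maximum indicate strong LD (few distinct haplotypes
    in the block); 0 indicates loci that look independent."""
    h_sum = sum(entropy([h[i] for h in haplotypes]) for i in block)
    if h_sum == 0.0:
        return 0.0
    h_joint = entropy([tuple(h[i] for i in block) for h in haplotypes])
    return (h_sum - h_joint) / h_sum
```

A block-partitioning GA can use such a score as (part of) its fitness: candidate block boundaries that enclose strongly linked loci score high, while boundaries that mix independent loci score near zero.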