Detection and Evaluation of Clusters within Sequential Data
Motivated by theoretical advancements in dimensionality reduction techniques,
we use a recent model, called Block Markov Chains, to conduct a practical study
of clustering in real-world sequential data. Clustering algorithms for Block
Markov Chains possess theoretical optimality guarantees and can be deployed in
sparse data regimes. Despite these favorable theoretical properties, a thorough
evaluation of these algorithms in realistic settings has been lacking.
We address this issue and investigate the suitability of these clustering
algorithms in exploratory data analysis of real-world sequential data. In
particular, our sequential data is derived from human DNA, written text, animal
movement data and financial markets. In order to evaluate the determined
clusters, and the associated Block Markov Chain model, we further develop a set
of evaluation tools. These tools include benchmarking, spectral noise analysis
and statistical model selection tools. An efficient implementation of the
clustering algorithm and the new evaluation tools is made available together
with this paper.
Practical challenges associated with real-world data are encountered and
discussed. It is ultimately found that the Block Markov Chain model assumption,
together with the tools developed here, can indeed produce meaningful insights
in exploratory data analyses despite the complexity and sparsity of real-world
data. Comment: 37 pages, 12 figures
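The clustering step for Block Markov Chains can be illustrated with a minimal spectral sketch. This is not the paper's implementation; the function name and the two-cluster restriction are illustrative assumptions. The idea: under a Block Markov Chain, rows of the transition matrix are nearly identical within a block, so a low-rank structure emerges, and for two blocks the sign of the second left singular vector separates the states.

```python
import numpy as np

def cluster_states(sequence, n_states=None):
    """Split the states of one observed trajectory into two clusters.

    A minimal spectral sketch, not the paper's algorithm: count observed
    transitions, normalize rows to empirical transition probabilities,
    and bisect the states by the sign of the second left singular vector.
    """
    seq = np.asarray(sequence)
    n = n_states or int(seq.max()) + 1
    counts = np.zeros((n, n))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1.0
    # Empirical transition probabilities (rows never visited stay zero).
    totals = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, totals, out=np.zeros_like(counts),
                      where=totals > 0)
    U, _, _ = np.linalg.svd(probs)
    # The second singular vector is (approximately) piecewise constant
    # over blocks, so its sign assigns each state to a cluster.
    return (U[:, 1] > 0).astype(int)
```

With more than two clusters, the same low-rank projection would be fed to k-means instead of a sign split.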
In Silico Methods for Drug Design and Discovery
Computer-aided drug design (CADD) methodologies play an ever-increasing role in drug discovery and are critical to the cost-effective identification of promising drug candidates. These computational methods are relevant in limiting the use of animal models in pharmacological research, in aiding the rational design of novel and safe drug candidates, and in repositioning marketed drugs, supporting medicinal chemists and pharmacologists throughout the drug discovery trajectory. Within this field of research, we launched a Research Topic in Frontiers in Chemistry in March 2019 entitled "In silico Methods for Drug Design and Discovery," which involved two sections of the journal: Medicinal and Pharmaceutical Chemistry and Theoretical and Computational Chemistry. For the reasons mentioned, this Research Topic attracted the attention of scientists and received a large number of submitted manuscripts. Among them, 27 Original Research articles, five Review articles, and two Perspective articles have been published within the Research Topic. The Original Research articles cover most of the topics in CADD, reporting advanced in silico methods in drug discovery, while the Review articles offer a point of view on some computer-driven techniques applied to drug research. Finally, the Perspective articles provide a vision of specific computational approaches with an outlook on the modern era of CADD.
Structured Prediction on Dirty Datasets
Many errors cannot be detected or repaired without taking into account the underlying structure and dependencies in the dataset. One way of modeling the structure of the data is through graphical models, which combine probability theory and graph theory in order to address one of the key objectives in designing and fitting probabilistic models: capturing dependencies among relevant random variables. Representing structure helps in understanding the side effects of errors and reveals the correct interrelationships between data points. Hence, a principled representation of structure in prediction and cleaning tasks on dirty data is essential for the quality of downstream analytical results. Existing structured prediction research considers limited structures and configurations, with little attention to performance limitations and to how well the problem can be solved in more general settings where the structure is complex and rich.
In this dissertation, I present the following thesis: by leveraging the underlying dependencies and structure in machine learning models, we can effectively detect and clean errors via pragmatic structured prediction techniques. To highlight the main contributions: I investigate prediction algorithms and systems on dirty data with more realistic structure and dependencies, to help deploy this type of learning in more pragmatic settings. Specifically, I introduce a few-shot learning framework for error detection that uses structure-based features of the data, such as denial constraint violations and Bayesian networks, as co-occurrence features. I study the problem of recovering the latent ground-truth labeling of a structured instance. Then, I consider the problem of mining integrity constraints from data, specifically using sampling methods to extract approximate denial constraints. Finally, I introduce an ML framework that uses solitary and structured data features to solve the problem of record fusion.
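The use of denial-constraint violations as features for error detection can be illustrated with a toy sketch. All names and the example constraint below are hypothetical, not the dissertation's code: for each tuple, we count how many constraint-violating pairs it participates in, and those counts can then feed a downstream detector.

```python
from itertools import combinations

def violation_features(rows, constraints):
    """Per-row counts of denial-constraint violations.

    Toy sketch: each constraint is a predicate over a pair of rows
    that returns True when the pair *violates* the constraint.
    """
    counts = [0] * len(rows)
    for i, j in combinations(range(len(rows)), 2):
        for violates in constraints:
            if violates(rows[i], rows[j]):
                counts[i] += 1
                counts[j] += 1
    return counts

# Hypothetical example constraint: two tuples may not share a ZIP code
# while disagreeing on the state.
same_zip_diff_state = lambda t, u: (t["zip"] == u["zip"]
                                    and t["state"] != u["state"])

rows = [
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "CA"},  # likely erroneous tuple
    {"zip": "90210", "state": "CA"},
]
feats = violation_features(rows, [same_zip_diff_state])
```

Here the second tuple stands out because it participates in a violation, while the third does not; in a real pipeline these counts would be one feature among many.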
Learning discrete word embeddings to achieve better interpretability and processing efficiency
The ubiquitous use of word embeddings in Natural Language Processing is proof of their usefulness and adaptivity to a multitude of tasks. However, their continuous nature is prohibitive in terms of computation, storage and interpretation. In this work, we propose a method of learning discrete word embeddings directly. The model is an adaptation of a novel database searching method using state-of-the-art natural language processing techniques such as Transformers and LSTMs. On top of obtaining embeddings requiring a fraction of the resources to store and process, our experiments strongly suggest that our representations learn basic units of meaning in latent space akin to lexical morphemes. We call these units sememes, i.e., semantic morphemes.
We demonstrate that our model has great generalization potential and outputs representations showing strong semantic and conceptual relations between related words.
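The idea of storing a word as a handful of discrete, morpheme-like codes can be illustrated with a toy product-quantization sketch. This is an illustrative analogy, not the thesis model: each embedding is split into groups, and each group is replaced by the index of its nearest codeword, so every word becomes a short tuple of small integers instead of a vector of floats.

```python
import numpy as np

def discretize(embeddings, codebooks):
    """Toy product-quantization sketch (not the thesis model).

    Splits each embedding into len(codebooks) equal-width groups and
    replaces each group by the index of its nearest codeword, yielding
    one short code tuple per word.
    """
    codes = []
    # Requires the embedding width to divide evenly into the groups.
    for sub, cb in zip(np.split(embeddings, len(codebooks), axis=1),
                       codebooks):
        d = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes.append(d.argmin(axis=1))
    return np.stack(codes, axis=1)

# Hypothetical 4-d embeddings quantized with two 2-d codebooks.
codebooks = [np.array([[0.0, 0.0], [1.0, 1.0]]),
             np.array([[0.0, 0.0], [2.0, 2.0]])]
emb = np.array([[0.1, 0.0, 1.9, 2.1],
                [1.0, 0.9, 0.1, 0.0]])
codes = discretize(emb, codebooks)
```

Each row of `codes` is the word's discrete representation; interpretability comes from inspecting which words share a codeword in a given group.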
Business Risk in Changing Dynamics of Global Village 2
The monograph is prepared based on the presentations and discussions at the II International Conference "Business Risk in Changing Dynamics of Global Village" (BRCDGV 2019), November 7th-8th, 2019, in Ternopil, Ukraine. The aim of this international scientific conference is to provide a platform for professional debate, with the participation of experts from around the globe, in order to identify and analyze risks and opportunities in today's global business, and specifically in Ukraine. The conference provides a framework for researchers, business elites and decision makers to strengthen business ties and minimize risk in order to create a better world and a better Ukraine. The conference is designed to bring together experts from around the globe, from different sectors of practice that are affected by globalization and are watching changes in Europe as well as in Ukraine. It is an excellent platform for interaction and communication between academics, corporate representatives, policy makers, representatives of organizations and communities, as well as individuals who are part of this globalized world.
The 1st edition of this conference was held at the University of Applied Sciences in Nysa, Poland (2017); the 2nd edition took place at Ternopil Ivan Puluj National Technical University, Ukraine (2019); the 3rd edition will be organized at Patna University, India (2020), in cooperation with the Indo-European Education Foundation (IEEF, Poland) and its partner universities from Poland, India, Europe and other parts of the world. Under modern conditions of globalization, economic activity is undergoing changes. Innovative technologies, new forms of business and the dynamic changes taking place in the world today make it necessary to minimize risks in order to maximize benefits.
Cooperation among experts from different fields, including policymakers, scientists, university representatives and business elites, with the aim of ensuring sustainable growth is essential nowadays. To bring them together and discuss the main issues of today's global world, this conference took place in Ternopil, Ukraine. As Ukraine is now passing through a dynamic period of change, recommendations arising from such discussions can be very beneficial for building a stronger society and meeting the risks that globalization brings.
This monograph provides a useful review of economic, financial and policy issues in the context of globalization processes and has proven extremely popular with practitioners and industry advisors. This edition responds to the continued high demand and interest from experts from different areas working on diminishing business risks who wish to keep abreast of current thinking on this subject.
According to many experts, the process of managing risks is currently one of the most relevant business technologies and, at the same time, a complex process which requires grounded knowledge in the research field and practical experience. The popularity of business risk management is due to objective reasons such as the dynamics of society, the interconnections and interdependence between different players in society, and the increasing role of human capital in a country's sustainable development.
Toward coherent accounting of uncertainty in hydrometeorological modeling
A proper consideration of the different sources of uncertainty is a key point in hydrometeorological forecasting. Ensemble forecasts are an attractive alternative to traditional deterministic forecasts because they provide information about the likelihood of the outcomes. Moreover, ensembles can be generated wherever a source of uncertainty arises in the hydrometeorological modeling chain. The global objective of this thesis is to identify a system that is able to handle the three main sources of uncertainty in modeling, i.e. the model structure, the hydrological model initial conditions and the meteorological forcing uncertainty, in order to provide accurate, reliable and valuable forecasts. The different uncertainties should be quantified and reduced in a coherent way: they should be addressed explicitly, with a cohesive approach that ensures each of them is handled adequately and without redundancy in the action of the different tools that compose the system. This motivated several sub-objectives, the first of which focuses on the multimodel approach, to identify its benefits in an operational framework. Secondly, the implementation and the features of the Ensemble Kalman Filter (EnKF) are put under scrutiny to identify an optimal implementation.
The next step unites the knowledge gained from the first two goals by merging their strengths and adding meteorological ensembles to build a framework that issues accurate and reliable forecasts. This system is expected to handle the main sources of uncertainty in a coherent way and to provide a framework for studying the contribution of the different tools and their interactions. Finally, the focus is set on the forecast economic value, and an attempt is made to relate the different systems that have been built to economic value and forecast quality. It is found that the combination of the EnKF, the multimodel approach and ensemble forcing issues forecasts that are accurate and nearly reliable. The combination of the three tools outperforms any of them used separately, and the uncertainties that were considered are handled thanks to their complementary actions. The 20 structurally dissimilar models that compose the multimodel ensemble are able to minimize the uncertainty related to the model structure, thanks to the particular roles they play within the ensemble. This approach can even outperform a more complex semi-distributed model used operationally. An optimal EnKF implementation for dealing with initial condition uncertainty may be difficult to reach because of the unintuitive specification of hyper-parameters, the selection of the state variables to update, and its varying compatibility with the hydrological models. Nonetheless, the filter is a powerful tool to reduce initial condition uncertainty and contributes largely to the spread of the predictive ensemble. However, it needs to be supported by the multimodel approach and ensemble meteorological forcing to maintain adequate ensemble dispersion for longer lead times. Finally, it is shown that systems that exhibit better accuracy and reliability generally have higher economic value, even if this relation is loosely defined.
The different uncertainties inherent to the forecasting process cannot be entirely eliminated, but by explicitly accounting for them with dedicated and suitable tools, an accurate, reliable and valuable predictive ensemble can be issued.
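The EnKF analysis step discussed above can be sketched in a few lines. This is a generic stochastic EnKF with perturbed observations on a one-dimensional state, not the thesis configuration, and the function names and scalar setting are illustrative: the Kalman gain is estimated from ensemble covariances, and each member is nudged toward a perturbed copy of the observation.

```python
import numpy as np

def enkf_update(ensemble, obs, obs_err_std, h, rng):
    """One EnKF analysis step on a 1-D state ensemble.

    Minimal sketch (stochastic EnKF with perturbed observations):
    `h` maps a state to its predicted observation; the Kalman gain
    is estimated from ensemble covariances.
    """
    ens = np.asarray(ensemble, dtype=float)
    hx = np.array([h(x) for x in ens])
    cov_xy = np.cov(ens, hx)[0, 1]            # state-observation covariance
    var_y = hx.var(ddof=1) + obs_err_std ** 2  # innovation variance
    gain = cov_xy / var_y
    # Perturbing the observation preserves the analysis ensemble spread.
    perturbed = obs + rng.normal(0.0, obs_err_std, size=ens.shape)
    return ens + gain * (perturbed - hx)
```

In a hydrological setting, the state would be a model storage (e.g. soil moisture), `h` the model's mapping from state to simulated streamflow, and the update would be applied at each assimilation time before issuing the forecast.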