
    Dating Phylogenies with Hybrid Local Molecular Clocks

    BACKGROUND: Because rates of evolution and species divergence times cannot be estimated directly from molecular data, all current dating methods require that specific assumptions be made before inferring any divergence time. These assumptions typically bear either on rates of molecular evolution (molecular clock hypothesis, local clock models) or on both rates and times (penalized likelihood, Bayesian methods). However, most of these assumptions can affect estimated dates, often because they underestimate large amounts of rate change. PRINCIPAL FINDINGS: A significant modification to a recently proposed ad hoc rate-smoothing algorithm is described, in which local molecular clocks are automatically placed on a phylogeny. This modification makes use of hybrid approaches that borrow from recent theoretical developments in microarray data analysis. An ad hoc integration of phylogenetic uncertainty under these local clock models is also described. The performance and accuracy of the new methods are evaluated by reanalyzing three published data sets. CONCLUSIONS: It is shown that the new maximum likelihood hybrid methods can perform better than penalized likelihood and almost as well as uncorrelated Bayesian models. However, the new methods still tend to underestimate the actual amount of rate change. This work demonstrates the difficulty of estimating divergence times using local molecular clocks.

    Indexing and knowledge discovery of gaussian mixture models and multiple-instance learning

    Due to the increasing quantity and variety of generated and stored data, manual and automatic analysis is becoming an increasingly challenging task in many modern applications, such as biometric identification and content-based image retrieval. In this thesis, we consider two very typical, related inherent structures of objects: Multiple-Instance (MI) objects and Gaussian Mixture Models (GMM). In both approaches, each object is represented by a set. For MI, each object is a set of vectors from a multi-dimensional space. For GMM, each object is a set of multivariate Gaussian distribution functions, providing the ability to approximate arbitrary distributions in a concise way. Both approaches are powerful and natural because they can express (1) that an object is additively composed of several components or (2) that an object may have several different, alternative kinds of behavior. Thus we can model, e.g., an image that depicts a set of different things (1). Likewise, we can model a sports player who has performed differently in different games (2). We can use GMM to approximate MI objects and vice versa. Both ways of approximation can be appealing because GMM are more concise, whereas for MI objects the single components are less complex. A similarity measure quantifies how alike two objects are. On this basis, indexing and similarity search play essential roles in data mining, providing efficient and often indispensable support for a variety of algorithms such as classification and clustering. This thesis aims to solve challenges in the indexing and knowledge discovery of complex data using MI objects and GMM. For the indexing of GMM, several techniques are available, including universal index structures and GMM-specific methods. However, the well-known approaches either suffer from poor performance or have too many limitations. 
To exploit the parametric form of GMM and tackle the problem of potentially unequal numbers of components, we propose the Gaussian Components based Index (GCI) for efficient queries on GMM. GCI decomposes GMM into their components and stores the n-lets of Gaussian combinations, which have parameter vectors of uniform length, in traditional index structures. We introduce an efficient pruning strategy to filter out unqualified GMM, using the so-called Matching Probability (MP) as the similarity measure. MP sums up the joint probabilities of two objects over the whole space. GCI achieves better performance than its competitors on both synthetic and real-world data. To further increase its efficiency, we propose a strategy to store GMM components in a normalized way. This strategy improves the ability to filter out unqualified GMM. Based on the normalized transformation, we derive a set of novel similarity measures for GMM. Since MP is not a metric (i.e., a symmetric, positive definite distance function guaranteeing the triangle inequality), which would be essential for the application of various analysis techniques, we introduce the Infinite Euclidean Distance (IED) for probability distribution functions, a metric with a closed-form expression for GMM. IED allows us to store GMM in well-known metric trees such as the Vantage-Point tree or the M-tree, which facilitate similarity search in sublinear time by exploiting the triangle inequality. Moreover, analysis techniques that require the properties of a metric (e.g. Multidimensional Scaling) can be applied to GMM with IED. For MI objects that are not well approximated by GMM, we introduce the potential densities of instances for the representation of MI objects. Based on that, two joint Gaussian based measures are proposed for MI objects, and we extend GCI to MI objects for efficient queries as well. 
To sum up, we propose in this thesis a number of novel similarity measures and novel indexing techniques for GMM and MI objects, enabling efficient queries and knowledge discovery on complex data. In a thorough theoretical analysis as well as extensive experiments, we demonstrate the superiority of our approaches over the state of the art with respect to run-time efficiency and the quality of the results.
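The abstract above does not reproduce the formulas for MP or IED, so the following is only a minimal sketch under stated assumptions. It assumes MP is the integral of the product of two mixture densities, which has a closed form because the integral of a product of two Gaussians is itself a Gaussian evaluated at the difference of the means. It also uses the ordinary L2 distance between densities as a stand-in for a metric built from MP; whether this coincides with the thesis's IED is an assumption. Diagonal covariances and the function names are illustrative choices, not taken from the thesis.

```python
import math

def gaussian_overlap(mu1, var1, mu2, var2):
    """Integral over R^d of N(x; mu1, diag(var1)) * N(x; mu2, diag(var2)).

    Closed form: N(mu1; mu2, diag(var1 + var2)), computed per dimension.
    """
    result = 1.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = v1 + v2
        result *= math.exp(-0.5 * (m1 - m2) ** 2 / s) / math.sqrt(2.0 * math.pi * s)
    return result

def matching_probability(gmm_a, gmm_b):
    """Weighted sum of pairwise component overlaps.

    Each GMM is a list of (weight, mean, variances) tuples (assumed layout).
    """
    return sum(
        wa * wb * gaussian_overlap(ma, va, mb, vb)
        for wa, ma, va in gmm_a
        for wb, mb, vb in gmm_b
    )

def l2_distance(gmm_a, gmm_b):
    """L2 distance between the two mixture densities, expressed through MP."""
    squared = (
        matching_probability(gmm_a, gmm_a)
        - 2.0 * matching_probability(gmm_a, gmm_b)
        + matching_probability(gmm_b, gmm_b)
    )
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors

# Hypothetical one-dimensional example mixtures.
a = [(1.0, [0.0], [1.0])]
b = [(0.5, [0.0], [1.0]), (0.5, [3.0], [0.5])]
print(matching_probability(a, b), l2_distance(a, b))
```

Because the L2 distance satisfies the triangle inequality, a distance of this kind could in principle be used with metric trees, which is the property the abstract attributes to IED.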

    Macro- and Microevolution of Languages: Exploring Linguistic Divergence with Approaches from Evolutionary Biology

    There are more than 7000 languages in the world, and many of these have emerged through linguistic divergence. While questions related to the drivers of linguistic diversity have been studied before, including studies with quantitative methods, there is no consensus as to which factors drive linguistic divergence, and how. In the thesis, I have studied linguistic divergence with a multidisciplinary approach, applying the framework and quantitative methods of evolutionary biology to language data. With quantitative methods, large datasets may be analyzed objectively, while approaches from evolutionary biology make it possible to revisit old questions (related to, for example, the shape of the phylogeny) with new methods, and adopt novel perspectives to pose novel questions. My chief focus was on the effects exerted on the speakers of a language by environmental and cultural factors. My approach was thus an ecological one, in the sense that I was interested in how the local environment affects humans and whether this human-environment connection plays a possible role in the divergence process. I studied this question in relation to the Uralic language family and to the dialects of Finnish, thus covering two different levels of divergence. However, as the Uralic languages have not previously been studied using quantitative phylogenetic methods, nor have population genetic methods been previously applied to any dialect data, I first evaluated the applicability of these biological methods to language data. I found the biological methodology to be applicable to language data, as my results were rather similar to traditional views as to both the shape of the Uralic phylogeny and the division of Finnish dialects. I also found environmental conditions, or changes in them, to be plausible inducers of linguistic divergence: whether in the first steps in the divergence process, i.e. dialect divergence, or on a large scale with the entire language family. 
My findings concerning Finnish dialects led me to conclude that the functional connection between linguistic divergence and environmental conditions may arise through human cultural adaptation to varying environmental conditions. This is also one possible explanation on the scale of the Uralic language family as a whole. The results of the thesis offer insights into several different issues in both a local and a global context. First, they shed light on the emergence of the Finnish dialects. If the approach used in the thesis is applied to the dialects of other languages, broader generalizations may be drawn as to the inducers of linguistic divergence. This in turn brings us closer to understanding the global patterns of linguistic diversity. Secondly, the quantitative phylogeny of the Uralic languages, with estimated times of language divergences, yields another hypothesis as to the shape and age of the language family tree. In addition, the Uralic languages can now be added to the growing list of language families studied with quantitative methods. This will allow broader inferences as to global patterns of language evolution, and more language families can be included in constructing the tree of the world’s languages. Studying history through language, however, is only one way to illuminate the human past. Therefore, thirdly, the findings of the thesis, when combined with studies of other language families and with work in, for example, genetics and archaeology, bring us closer to an understanding of human history.

    Comprehensive clustering approach for managing maintenance in large fleet of assets

    Proceedings of the 29th European Safety and Reliability Conference (ESREL), 22 – 26 September 2019, Hannover, Germany. Editors: Michael Beer and Enrico Zio. The maintenance management of large fleets of assets, which include several technical solutions operating in different operational contexts, has been a recurrent research topic in the literature. Current approaches to establishing fleet maintenance plans are primarily criticality-based, considering failure consequences and asset reliability; the reliability model is often supported by the idea of pooling data from similar pieces of equipment. Although data pooling can reduce the population, its criteria may still lead to quite a large number of segments. This results in an equally large number of maintenance plans, along with their inherent operational and administrative difficulties. The purpose of the paper is to introduce a novel and comprehensive approach that integrates statistical methods and clustering algorithms to produce a fleet segmentation that allows better customization of maintenance plans with less effort. The approach is summarized in a decision chart which collects the logic behind the use of every algorithm, tool and technique.

    Design and operation of energy systems under uncertainty: a comparison between deterministic and stochastic approach

    The aim of this thesis is to study the design and operation phases of an energy system under uncertainty. In particular, the results should explain whether modelling the uncertainty associated with global solar irradiance and air temperature helps improve design choices, such as component sizes. Introducing uncertainty matters for many reasons, such as climate change, unexpected market conditions, the evolution of energy demand, and interactive planning. Many studies highlight the advantages of uncertainty analysis over traditional deterministic approaches. For the design and operation of an energy system, climate conditions, the price of energy carriers and energy demand are the main uncertain parameters. In this work, only climate conditions are considered as a source of uncertainty, to see how much they can affect design solutions. To address the problem, a residential multi-energy system is considered: the idea is to be in the year 2010, trying to find the best solution for the “future”, the period 2010-2020, using historical data from the period 2005-2009. Deterministic and two-stage stochastic models of this system are developed to compare the optimised solutions with the reference one for the period 2010-2020. First, the relevance of air temperature in the clustering process is discussed: this parameter is rarely considered in the literature, but it improves the quality of the dataset representation. In fact, the clustering process with just global solar irradiance presents, on average, 10% fewer well-positioned elements than the process using both irradiance and air temperature. Then, attention turns to different methods for generating representative days as the optimisation period, to see which is the most suitable for the design phase of an energy system. Clustering techniques are compared with average seasonal and monthly profiles. The generation of seasonal clusters is also discussed. Average profiles prove to be the worst, with relative errors of up to 13% for the objective function with respect to the reference solution. Annual clusters are better than seasonal ones when the number of representative days is low, equal to 4 or 8, or high, equal to 28. Finally, an innovative two-step clustering procedure to generate scenarios for representative days is presented. The idea is to assign a set of scenarios to each representative day. However, the solutions obtained are too conservative, which is consistent with stochastic programming theory but entails higher total costs.
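The abstract does not say which clustering algorithm the thesis uses, so the following is only an illustrative sketch of how representative days might be selected from irradiance and temperature data. It assumes plain k-means with a deterministic initialisation, one feature vector per day (here just a daily mean irradiance and a daily mean temperature), and picks the real day closest to each centroid as the representative. All function names and the sample data are hypothetical.

```python
def normalise(days):
    """Scale each feature column to [0, 1] so irradiance and temperature weigh equally."""
    cols = list(zip(*days))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(day, lo, hi)]
        for day in days
    ]

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50):
    """Plain k-means; centroids start at k evenly spaced input points."""
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def representative_days(days, k):
    """Return (day_index, weight) pairs: the day nearest each centroid and its cluster share."""
    points = normalise(days)
    labels, centroids = kmeans(points, k)
    reps = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        nearest = min(members, key=lambda i: dist2(points[i], centroids[c]))
        reps.append((nearest, len(members) / len(days)))
    return reps

# Hypothetical data: [mean irradiance in W/m^2, mean air temperature in deg C] per day.
sample_days = [[100, 2], [110, 3], [90, 1], [500, 25], [520, 27], [480, 24]]
print(representative_days(sample_days, 2))
```

Dropping the temperature column from `sample_days` gives the irradiance-only variant that the abstract compares against. The weights indicate how many original days each representative day stands for in the optimisation period.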

    Observation-Based Multi-Agent Planning with Communication

    This research has been sponsored by SELEX ES. We thank Feng Wu for providing the source code of the MAOP-COMM planner.