318 research outputs found

    Task automation through email data analysis

    Many companies currently make no use of the information contained in their emails, even though email is a rich and potentially very useful data set. This thesis report focuses on email data analysis and task automation, particularly in the area of email-based process mining. The state-of-the-art section reviews existing research on extracting information from email content using techniques such as lexical analysis, language detection, semantic analysis and machine learning methods. It explores different areas of process mining, including process pattern discovery, anomaly discovery, and process extraction from texts. The objectives of this research are to assess the feasibility of extracting candidate processes from emails, to develop human-understandable metrics to classify processes, to propose a system to identify automation opportunities in email templates, and to explore possibilities for automation in email interactions. To do this, we carried out several steps: data preparation, chain detection, text representation, distance matrix calculation and grouping methods.

    Finding communities in networks in the strong and almost-strong sense

    Finding communities, or clusters or modules, in networks can be done by optimizing an objective function defined globally and/or by specifying conditions which must be satisfied by all communities. Radicchi et al. [Proc. Natl. Acad. Sci. USA 101, 2658 (2004)] define a subset of vertices of a network to be a community in the strong sense if each vertex of that subset has a larger inner degree than its outer degree. A partition in the strong sense has only strong communities. In this paper we first define an enumerative algorithm to list all partitions in the strong sense of a network of moderate size. The results of this algorithm are given for the Zachary karate club data set, which is solved by hand, as well as for several well-known real-world problems from the literature. Moreover, this algorithm is slightly modified in order to apply it to larger networks, keeping only partitions with the largest number of communities. It is shown that some of the partitions obtained are informative, although they often have only a few communities, while in other cases they consist of a single community and give no information. It appears that degree 2 vertices play a big role in forcing large inhomogeneous communities. Therefore, a weakening of the strong condition is proposed and explored: we define a partition in the almost-strong sense by replacing the strict inequality with a nonstrict one in the definition of strong community for all vertices of degree 2. Results for the same set of problems then give partitions with a larger number of communities that are more informative.
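    The two definitions in this abstract can be checked mechanically. The following is a minimal sketch, not the paper's enumerative algorithm: it only tests whether a given partition is strong or almost-strong. The adjacency-set graph representation and the function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Set

# Assumed representation: undirected graph as vertex -> set of neighbours.
Graph = Dict[Hashable, Set[Hashable]]


def is_strong_community(graph: Graph, community: Iterable[Hashable],
                        almost: bool = False) -> bool:
    """Return True if every vertex has inner degree > outer degree.

    With almost=True, vertices of degree 2 only need
    inner degree >= outer degree (the almost-strong condition).
    """
    members = set(community)
    for v in members:
        inner = len(graph[v] & members)   # neighbours inside the community
        outer = len(graph[v]) - inner     # neighbours outside the community
        if almost and len(graph[v]) == 2:
            if inner < outer:
                return False
        elif inner <= outer:
            return False
    return True


def is_strong_partition(graph: Graph, partition: Iterable[Iterable[Hashable]],
                        almost: bool = False) -> bool:
    """A partition is (almost-)strong if all of its communities are."""
    return all(is_strong_community(graph, c, almost) for c in partition)


# Path 0-1-2-3: the inner vertices have degree 2, with one neighbour
# inside and one outside their community.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
halves = [{0, 1}, {2, 3}]
print(is_strong_partition(path, halves))               # False
print(is_strong_partition(path, halves, almost=True))  # True
```

    On the path graph the split into two halves fails the strong condition only at the degree 2 vertices, which is the situation the almost-strong relaxation targets.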

    Identifying markers of cell identity from single-cell omics data

    Single-cell omics approaches are the current frontier of computational method development in molecular biology and genetics. A single single-cell experiment provides sparse, high-dimensional data on tens of thousands of genes or hundreds of thousands of regulatory regions (i.e. features) in tens of thousands of cells (i.e. samples). This data provides researchers with an unprecedented opportunity to identify those genes and regulatory regions that determine and coordinate cell identity acquisition and maintenance. The most common strategy for identifying cell identity markers consists of clustering the cells and then identifying differential features between these clusters, assuming that cells within a cluster share the same identity. This assumption is, however, not guaranteed to hold, particularly for developmental data where cells lie along a continuum and inferring cluster boundaries becomes non-trivial and potentially biologically arbitrary. In response, this thesis presents clustering-independent strategies for marker feature identification from single-cell omics data. The primary contribution of this thesis is a linear regression-based method for marker feature identification from single-cell omics data called SEMITONES. SEMITONES can identify markers or marker sets from diverse single-cell omics data types, identifies novel markers, and outperforms existing marker identification approaches. The thesis also describes how the identification of marker regulatory regions by SEMITONES enables the generation of novel hypotheses regarding gene regulation during cell identity acquisition. Lastly, the thesis describes the clustering-independent identification of novel marker genes for highly similar yet distinct neural progenitor cells in the Drosophila melanogaster central nervous system. Altogether, the thesis demonstrates how clustering-independent approaches aid the elucidation of as yet uncharacterised biological patterns from single-cell omics data.

    Covariance Selection Quality and Approximation Algorithms

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2018

    NP-hard networking problems : exact and approximate algorithms

    An important class of problems that occur in different fields of research, such as biology, linguistics or the design of wireless communication networks, deals with finding an interconnection of a given set of objects. Additionally, these networks should satisfy certain properties and minimize a certain cost function. In this thesis, we discuss such NP-hard networking problems in two parts. First, we mainly deal with the so-called Steiner minimum tree problem in Hamming metric. The computation of such trees has become a key tool for the reconstruction of the ancestral relationships of species. We give a new exact algorithm that clearly outperforms the branch-and-bound-based method of Hendy and Penny, which has been considered the fastest for the last 25 years. Further, we propose an extended model to cope with the case in which the ancestral relationships are best described by a non-tree structure. Finally, we deal with several problems occurring in the design of wireless ad-hoc networks: while minimizing the total power consumption of a wireless communication network, one wants to establish a messaging structure such that certain communication tasks can be performed. We show how approximate solutions can be found for these problems.

    Interactive, tree-based graph visualization

    We introduce an interactive graph visualization scheme that allows users to explore graphs by viewing them as a sequence of spanning trees, rather than the entire graph all at once. The user determines which spanning trees are displayed by selecting a vertex from the graph to be the root. Our main contributions are a graph drawing algorithm that generates meaningful representations of graphs using extracted spanning trees, and a graph animation algorithm for creating smooth, continuous transitions between graph drawings. We conduct experiments to measure how well our algorithms visualize graphs and compare them to another visualization scheme
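    The idea of letting a user-selected root determine a spanning tree can be illustrated with a short sketch. The abstract does not say how the trees are extracted, so breadth-first search is an assumption here, as are the adjacency-set representation and the function name.

```python
from collections import deque
from typing import Dict, Optional, Set

# Assumed representation: undirected graph as vertex -> set of neighbours.
Graph = Dict[int, Set[int]]


def bfs_spanning_tree(graph: Graph, root: int) -> Dict[int, Optional[int]]:
    """Return a child -> parent map for a BFS spanning tree rooted at `root`.

    Only the connected component containing `root` is covered.
    """
    parent: Dict[int, Optional[int]] = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        # Visit neighbours in sorted order so the result is deterministic.
        for v in sorted(graph[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


# A 4-cycle 0-1-2-3-0: choosing a different root yields a different tree.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(bfs_spanning_tree(cycle, 0))  # {0: None, 1: 0, 3: 0, 2: 1}
print(bfs_spanning_tree(cycle, 2))  # {2: None, 1: 2, 3: 2, 0: 1}
```

    Each choice of root drops a different cycle edge, which is why viewing the same graph from several roots can expose structure that a single drawing hides.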

    Privacy-preserving parallel computations for graph problems (Privaatsust säilitavad paralleelarvutused graafiülesannete jaoks)

    Constructing real-world privacy applications based on secure multiparty computation (SMC) is challenging due to the round complexity incurred by the computing parties of an SMC protocol. Due to the novelty of privacy-preserving technologies and the high computational costs associated with these problems, parallel privacy-preserving graph algorithms have not yet been studied. Graph algorithms are the backbone of many applications in computer science, such as navigation systems, community detection, supply chain networks, hyperspectral imaging, and sparse linear solvers. In order to expedite the processing of large private data sets for graph algorithms and meet high-end computational demands, privacy-preserving parallel algorithms are needed. Therefore, this thesis presents state-of-the-art protocols in privacy-preserving parallel computations for different graph problems: single-source shortest path (SSSP), all-pairs shortest path (APSP), minimum spanning tree (MST) and forest (MSF), and algebraic path computation. These new protocols have been constructed based on combinatorial and algebraic graph algorithms on top of the SMC protocols. Single-instruction-multiple-data (SIMD) operations are also used to build those protocols to reduce the round complexities efficiently. We have implemented the proposed protocols on the Sharemind SMC platform using various graphs and network environments. This thesis outlines novel parallel protocols with their related algorithms, the results, speed-up, evaluations, and extensive benchmarking. The results of the real implementations of the privacy-preserving single-source shortest path and minimum spanning tree protocols show an efficient method that reduced the running time hundreds of times compared with previous works. Furthermore, the evaluation and extensive benchmarking of privacy-preserving all-pairs shortest path protocols have no counterpart in previous work. Moreover, the privacy-preserving minimum spanning forest and algebraic path computation protocols have never been addressed before. https://www.ester.ee/record=b555865

    Elder-Rule-Staircodes for Augmented Metric Spaces

    An augmented metric space is a metric space $(X, d_X)$ equipped with a function $f_X: X \to \mathbb{R}$. This type of data arises commonly in practice, e.g., a point cloud $X$ in $\mathbb{R}^d$ where each point $x \in X$ has a density function value $f_X(x)$ associated to it. An augmented metric space $(X, d_X, f_X)$ naturally gives rise to a 2-parameter filtration $\mathcal{K}$. However, the resulting 2-parameter persistent homology $\mathrm{H}_{\bullet}(\mathcal{K})$ could still be of wild representation type, and may not have simple indecomposables. In this paper, motivated by the elder-rule for the zeroth homology of 1-parameter filtration, we propose a barcode-like summary, called the elder-rule-staircode, as a way to encode $\mathrm{H}_0(\mathcal{K})$. Specifically, if $n = |X|$, the elder-rule-staircode consists of $n$ staircase-like blocks in the plane. We show that if $\mathrm{H}_0(\mathcal{K})$ is interval decomposable, then the barcode of $\mathrm{H}_0(\mathcal{K})$ is equal to the elder-rule-staircode. Furthermore, regardless of the interval decomposability, the fibered barcode, the dimension function (a.k.a. the Hilbert function), and the graded Betti numbers of $\mathrm{H}_0(\mathcal{K})$ can all be efficiently computed once the elder-rule-staircode is given. Finally, we develop and implement an efficient algorithm to compute the elder-rule-staircode in $O(n^2\log n)$ time, which can be improved to $O(n^2\alpha(n))$ if $X$ is from a fixed dimensional Euclidean space $\mathbb{R}^d$, where $\alpha(n)$ is the inverse Ackermann function.
    Comment: A few important questions considered in the previous version have been settled; see Example 4.12 and Section 4.3 in particular. The paper has been reorganized. This is the full version of the paper in the Proceedings of the 36th International Symposium on Computational Geometry (SoCG 2020); 41 pages, 17 figures
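    For readers unfamiliar with the elder rule that motivates this paper, here is a minimal sketch of the classical 1-parameter case only; it does not compute the elder-rule-staircode. It assumes a filtration given as vertex birth times plus edges with appearance times (each edge appearing no earlier than its endpoints), and the function name and input layout are illustrative.

```python
import math
from typing import List, Tuple


def elder_rule_barcode(births: List[float],
                       edges: List[Tuple[float, int, int]]
                       ) -> List[Tuple[float, float]]:
    """Zeroth-homology barcode of a 1-parameter filtration via the elder rule.

    births[i] is the birth time of vertex i; edges are (time, u, v).
    When two components merge, the one with the younger (later-born)
    representative dies and the elder one survives.
    """
    parent = list(range(len(births)))

    def find(x: int) -> int:
        # Union-find lookup with path halving; the root of each
        # component is always its eldest vertex.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    bars: List[Tuple[float, float]] = []
    for t, u, v in sorted(edges):
        elder, younger = find(u), find(v)
        if elder == younger:
            continue  # edge closes a cycle; H0 is unchanged
        if births[elder] > births[younger]:
            elder, younger = younger, elder
        bars.append((births[younger], t))  # younger component dies at time t
        parent[younger] = elder
    # Components that never merge into an elder one live forever.
    roots = {find(i) for i in range(len(births))}
    bars.extend((births[r], math.inf) for r in roots)
    return sorted(bars)


print(elder_rule_barcode([0.0, 1.0, 2.0], [(3.0, 0, 1), (2.5, 1, 2)]))
# [(0.0, inf), (1.0, 3.0), (2.0, 2.5)]
```

    The union-find structure is also where an inverse Ackermann factor typically comes from in algorithms of this kind, although this sketch uses path halving without union by rank and makes no claim about the paper's implementation.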

    Algorithms for Hierarchical Clustering: An Overview, II

    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed, very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. This review adds to the earlier version, Murtagh and Contreras (2012).
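    As a point of reference for what "agglomerative" means here, the following is a deliberately naive single-linkage sketch. It is not one of the efficient implementations the survey discusses and not the linear-time algorithm it describes; it runs in roughly cubic time, and the function name and merge-list output are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def single_linkage(points: Sequence[T],
                   dist: Callable[[T, T], float]
                   ) -> List[Tuple[List[int], float]]:
    """Naive agglomerative clustering with single linkage.

    Starts with one cluster per point and repeatedly merges the two
    clusters whose closest members are nearest to each other.
    Returns the merge history as (member indices, merge distance).
    """
    clusters = {i: [i] for i in range(len(points))}
    merges: List[Tuple[List[int], float]] = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((sorted(clusters[a]), d))
    return merges


history = single_linkage([0.0, 1.0, 5.0, 6.5], lambda x, y: abs(x - y))
print(history)
# [([0, 1], 1.0), ([2, 3], 1.5), ([0, 1, 2, 3], 4.0)]
```

    The merge history is the information a dendrogram displays: cutting it at a chosen distance yields a flat clustering.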