16 research outputs found

    Space- and Time-Efficient Algorithm for Maintaining Dense Subgraphs on One-Pass Dynamic Streams

    While in many graph mining applications it is crucial to handle a stream of updates efficiently in terms of both time and space, not much was known about achieving such algorithms. In this paper we study this issue for a problem which lies at the core of many graph mining applications, called the densest subgraph problem. We develop an algorithm that achieves time- and space-efficiency for this problem simultaneously; to the best of our knowledge, it is one of the first of its kind for graph problems. In a graph G = (V, E), the "density" of the subgraph induced by a subset of nodes S ⊆ V is defined as |E(S)|/|S|, where E(S) is the set of edges in E with both endpoints in S. In the densest subgraph problem, the goal is to find a subset of nodes that maximizes the density of the corresponding induced subgraph. For any ε > 0, we present a dynamic algorithm that, with high probability, maintains a (4+ε)-approximation to the densest subgraph problem under a sequence of edge insertions and deletions in a graph with n nodes. It uses Õ(n) space, and has an amortized update time of Õ(1) and a query time of Õ(1). Here, Õ hides an O(poly log_{1+ε} n) term. The approximation ratio can be improved to (2+ε) at the cost of increasing the query time to Õ(n). The algorithm can be extended to a (2+ε)-approximation sublinear-time algorithm and to a distributed-streaming algorithm. Our algorithm is the first streaming algorithm that can maintain the densest subgraph in one pass. The previously best algorithm in this setting required O(log n) passes [Bahmani, Kumar and Vassilvitskii, VLDB'12]. The space required by our algorithm is tight up to a polylogarithmic factor.
    Comment: A preliminary version of this paper appeared in STOC 201
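For intuition about the objective, the density |E(S)|/|S| and the classic static greedy peeling heuristic (a well-known 2-approximation that repeatedly removes a minimum-degree node) can be sketched in a few lines. This is an illustrative static computation, not the paper's streaming data structure:

```python
# Sketch of the densest subgraph objective and greedy peeling.
# Not the paper's dynamic algorithm -- just the static baseline it improves on.
from collections import defaultdict

def density(edges, nodes):
    """|E(S)|/|S| for the subgraph induced by `nodes`."""
    s = set(nodes)
    if not s:
        return 0.0
    e = sum(1 for u, v in edges if u in s and v in s)
    return e / len(s)

def greedy_peel(edges, nodes):
    """Classic peeling: drop a minimum-degree node each round,
    return the densest node set seen along the way."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cur = set(nodes)
    best, best_d = set(cur), density(edges, cur)
    while len(cur) > 1:
        u = min(cur, key=lambda x: len(adj[x] & cur))  # min current degree
        cur.discard(u)
        d = density(edges, cur)
        if d > best_d:
            best, best_d = set(cur), d
    return best, best_d
```

On a 4-clique with a pendant node, peeling discards the pendant and recovers the clique as the densest subgraph.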

    Alternative Approaches for Analysis of Bin Packing and List Update Problems

    In this thesis we introduce and evaluate new algorithms and models for the analysis of online bin packing and list update problems. These are two classic online problems which are extensively studied in the literature and have many applications in the real world. As with other online problems, the framework of competitive analysis is often used to study list update and bin packing algorithms. Under this framework, the behavior of online algorithms is compared to an optimal offline algorithm on the worst possible input. This is aligned with the traditional algorithm theory built around the concept of worst-case analysis. However, the pessimistic nature of competitive analysis, along with unrealistic assumptions behind the proposed models for the problems, often results in situations where the existing theory is not quite useful in practice. The main goal of this thesis is to develop new approaches for studying online problems, in particular bin packing and list update, to guide the development of practical algorithms that perform well on real-world inputs. In doing so, we introduce new algorithms with good performance (not only under competitive analysis) as well as new models which are more realistic for certain applications of the studied problems. For many online problems, competitive analysis fails to provide a theoretical justification for observations made in practice. This is partially because, as a worst-case analysis method, competitive analysis does not necessarily reflect the typical behavior of algorithms. In the case of the bin packing problem, the Best Fit and First Fit algorithms are widely used in practice. There are, however, other algorithms with better competitive ratios which are rarely used in practice since they perform poorly on average. We show that it is possible to optimize for both cases.
In doing so, we introduce online bin packing algorithms which outperform Best Fit and First Fit in terms of competitive ratio while maintaining their good average-case performance. An alternative for the analysis of online problems is the advice model, which has received significant attention in the past few years. Under the advice model, an online algorithm receives a number of bits of advice about the unrevealed parts of the sequence. Generally, there is a trade-off between the size of the advice and the performance of online algorithms. The advice model generalizes the existing frameworks in which an online algorithm has partial knowledge about the input sequence, e.g., the access graph model for the paging problem. We study the list update and bin packing problems under the advice model and answer several relevant questions about the advice complexity of these problems. Online problems are usually studied under specific settings which are not necessarily valid for all applications of the problem. As an example, online bin packing algorithms are widely used for server consolidation to minimize the number of active servers in a data center. In some applications, e.g., tenant placement in the Cloud, a 'fault-tolerant' solution for server consolidation is often required. In this setting, the problem becomes different and the classic algorithms can no longer be used. We study a fault-tolerant model for the bin packing problem and analyze algorithms which fit this particular application of the problem. Similarly, the list update problem was initially proposed for maintaining self-adjusting linked lists. However, presently, the main application of the problem is in data compression. We show that the standard cost model is not suitable for compression purposes and study a compression cost model for the list update problem.
Our analysis justifies the advantage of compression schemes which are based on the Move-To-Front algorithm and might lead to improved compression algorithms.
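As a point of reference for the discussion above, the two practical heuristics can be sketched as follows. This is a minimal illustration with unit-capacity bins, not the thesis's new algorithms:

```python
# Classic online bin packing heuristics: items arrive one at a time
# and must be placed immediately; each bin has capacity `cap`.

def first_fit(items, cap=1.0):
    """Place each item into the first bin it fits in; open a new bin otherwise."""
    bins = []  # current load of each open bin
    for x in items:
        for b in range(len(bins)):
            if bins[b] + x <= cap:
                bins[b] += x
                break
        else:
            bins.append(x)
    return bins

def best_fit(items, cap=1.0):
    """Place each item into the feasible bin with the least remaining room."""
    bins = []
    for x in items:
        feasible = [b for b in range(len(bins)) if bins[b] + x <= cap]
        if feasible:
            b = max(feasible, key=lambda i: bins[i])  # fullest feasible bin
            bins[b] += x
        else:
            bins.append(x)
    return bins
```

Both heuristics are 1.7-competitive in the worst case yet perform well on average, which is the tension the thesis addresses.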

    Dynamic Data Structures for Parameterized String Problems

    We revisit classic string problems considered in the area of parameterized complexity, and study them through the lens of dynamic data structures. That is, instead of asking for a static algorithm that solves the given instance efficiently, our goal is to design a data structure that efficiently maintains a solution, or reports a lack thereof, upon updates in the instance. We first consider the Closest String problem, for which we design randomized dynamic data structures with amortized update times d^O(d) and |Σ|^O(d), respectively, where Σ is the alphabet and d is the assumed bound on the maximum distance. These are obtained by combining known static approaches to Closest String with color-coding. Next, we note that from a result of Frandsen et al. [J. ACM'97] one can easily infer a meta-theorem that provides dynamic data structures for parameterized string problems with worst-case update time of the form O(f(k) log log n), where k is the parameter in question and n is the length of the string. We showcase the utility of this meta-theorem by giving such data structures for the problems Disjoint Factors and Edit Distance. We also give explicit data structures for these problems, with worst-case update times O(k·2^k log log n) and O(k² log log n), respectively. Finally, we discuss how a lower bound methodology introduced by Amarilli et al. [ICALP'21] can be used to show that obtaining update time O(f(k)) for Disjoint Factors and Edit Distance is unlikely already for a constant value of the parameter k.
    Comment: 28 pages
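For contrast with the dynamic setting, here is the static textbook dynamic-programming solution to Edit Distance; the data structures above maintain such answers under updates to the strings instead of recomputing from scratch:

```python
# Textbook O(n*m) edit distance DP with a rolling row (O(m) space).
# Static baseline only -- not the paper's dynamic data structure.

def edit_distance(a, b):
    n, m = len(a), len(b)
    prev = list(range(m + 1))  # distance from "" to each prefix of b
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j] + 1,                             # delete a[i-1]
                cur[j - 1] + 1,                          # insert b[j-1]
                prev[j - 1] + (a[i - 1] != b[j - 1]),    # match/substitute
            )
        prev = cur
    return prev[m]
```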

    ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms

    Connected components is a fundamental kernel in graph applications due to its usefulness in measuring how well-connected a graph is, as well as its use as subroutines in many other graph algorithms. The fastest existing parallel multicore algorithms for connectivity are based on some form of edge sampling and/or linking and compressing trees. However, many combinations of these design choices have been left unexplored. In this paper, we design the ConnectIt framework, which provides different sampling strategies as well as various tree linking and compression schemes. ConnectIt enables us to obtain several hundred new variants of connectivity algorithms, most of which extend to computing spanning forest. In addition to static graphs, we also extend ConnectIt to support mixes of insertions and connectivity queries in the concurrent setting. We present an experimental evaluation of ConnectIt on a 72-core machine, which we believe is the most comprehensive evaluation of parallel connectivity algorithms to date. Compared to a collection of state-of-the-art static multicore algorithms, we obtain an average speedup of 37.4x (2.36x average speedup over the fastest existing implementation for each graph). Using ConnectIt, we are able to compute connectivity on the largest publicly-available graph (with over 3.5 billion vertices and 128 billion edges) in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the fastest existing connectivity result for this graph, in any computational setting. For our incremental algorithms, we show that our algorithms can ingest graph updates at up to several billion edges per second. Finally, to guide the user in selecting the best variants in ConnectIt for different situations, we provide a detailed analysis of the different strategies in terms of their work and locality
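The tree-linking and compression ingredients that ConnectIt varies can be illustrated with their textbook sequential counterparts; the sketch below is plain union-find with union-by-rank linking and path compression, not ConnectIt's concurrent implementations:

```python
# Sequential union-find: "linking" = union by rank, "compression" = full
# path compression. ConnectIt explores parallel variants of both choices.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:   # second pass: compress the path
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:  # link shorter tree under taller
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True
```

Incremental connectivity queries then reduce to comparing roots: two vertices are connected exactly when `find` returns the same representative.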

    Development of statistical methods for the analysis of single-cell RNA-seq data

    Single-cell RNA-sequencing profiles the transcriptome of cells from diverse populations. A popular intermediate data format is a large count matrix of genes × cells. This type of data brings several analytical challenges. Here, I present three projects that I worked on during my PhD that address particular aspects of working with such datasets:
    - The large number of cells in the count matrix is a challenge for fitting gamma-Poisson generalized linear models with existing tools. I developed a new R package called glmGamPoi to address this gap. I optimized the overdispersion estimation procedure to be quick and robust for datasets with many cells and small counts. I compared the performance against two popular tools (edgeR and DESeq2) and find that my inference is 6x to 13x faster and achieves a higher likelihood for a majority of the genes in four single-cell datasets.
    - The variance of single-cell RNA-seq counts depends on their mean, but many existing statistical tools have optimal performance when the variance is uniform. Accordingly, variance-stabilizing transformations are applied to unlock the large number of methods with such a requirement. I compared four approaches to variance-stabilize the data, based on the delta method, model residuals, an inferred latent expression state, or count factor analysis. I describe their theoretical strengths and weaknesses, and compare their empirical performance in a benchmark on simulated and real single-cell data. I find that none of the mathematically more sophisticated transformations consistently outperforms the simple log(y/s + 1) transformation.
    - Multi-condition single-cell data offers the opportunity to find differentially expressed genes for individual cell subpopulations. However, the prevalent approach to analyzing such data is to start by dividing the cells into discrete populations and then test for differential expression within each group. The results are interpretable but may miss interesting cases by (1) choosing the cluster size too small and lacking power to detect effects, or (2) choosing the cluster size too large and obscuring interesting effects apparent on a smaller scale. I developed a new statistical framework for the analysis of multi-condition single-cell data that avoids this premature discretization. The approach performs regression on the latent subspaces occupied by the cells in each condition. The method is implemented as an R package called lemur.
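The simple log(y/s + 1) transformation mentioned above can be sketched directly. The size-factor convention used here (each cell's total count scaled to the mean cell total) is one common choice and an assumption of this sketch, not necessarily the exact convention used in the thesis or in glmGamPoi:

```python
# Shifted-log variance-stabilizing transform for a small count matrix.
# Size factor convention (cell total / mean total) is an illustrative assumption.
import math

def shifted_log_transform(counts):
    """counts: list of cells, each a list of per-gene counts.
    Returns log(y/s + 1) for every count y, with s the cell's size factor."""
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    size_factors = [t / mean_total for t in totals]
    return [[math.log(y / s + 1) for y in cell]
            for cell, s in zip(counts, size_factors)]
```

Zero counts map to zero, and cells with larger totals are shrunk toward the common scale before the log is taken.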

    Identification of materials with low lattice thermal conductivity using machine learning and computational methods

    Lattice thermal conductivity is a key materials property in applications related to thermal functionality, such as thermal barrier coatings, thermal conductors in microelectronics, and solid-state waste-heat recovery devices. The lattice thermal conductivity governs the rate of heat energy transfer in thermoelectric materials, which are materials that can directly convert heat to electricity and vice versa. These materials become interesting in applications that require electricity generation or local cooling. Thermoelectric materials depend on a low lattice thermal conductivity to attain high heat-to-electricity conversion efficiency. The materials used in present thermoelectric generators are often based on toxic or scarce elements. New high-efficiency thermoelectric materials are therefore desired for sustainable and environmentally friendly energy harvesting. Two main research challenges are investigated in this thesis: 1) reducing the lattice thermal conductivity to enhance thermoelectric performance, and 2) identifying new compounds with low lattice thermal conductivity. Addressing these challenges experimentally is a daunting task -- especially for hundreds or thousands of compounds -- as experiments are costly, time-consuming, and require expert domain knowledge. This thesis, therefore, relies on lattice thermal conductivity from theoretical calculations based on quantum mechanical simulations. Addressing challenge 1), the lattice thermal conductivity of 122 half-Heusler compounds is calculated using density functional theory and the temperature-dependent effective potential method. Phonon scattering from partial sublattice substitutions and grain boundaries is included in the calculations, in an attempt to reduce the lattice thermal conductivity. We find that isovalent substitutions on the site hosting the heaviest atom should be performed to optimally reduce the lattice thermal conductivity in most half-Heuslers.
Compounds with large atomic mass differences can have a large drop in lattice thermal conductivity with substitutions. Examples of such compounds are AlSiLi and TiNiPb, which achieve a ~70% reduction of their lattice thermal conductivity when substituting Si by Ge and Pb by Sn at 10% concentration. The reduction from the additional scattering mechanisms enables a handful of half-Heuslers to attain a lattice thermal conductivity close to 2 W/Km at 300 K. Calculations for the full-Heusler AlVFe2 reveal that the introduction of 15% Ru substitutions on the Fe site and 100 nm grain boundaries can reduce the lattice thermal conductivity from 46 W/Km to 7 W/Km. Tackling challenge 2) is done by computational screening for low lattice thermal conductivity compounds. Coupling the calculations with machine learning accelerates the screening. When the machine learning model is trained on calculated lattice thermal conductivities, it learns to recognize descriptor patterns for compounds with low lattice thermal conductivity. The size of the training set is limited by the large computational cost of calculating lattice thermal conductivity. It is therefore challenging to obtain a diverse set of training compounds, especially so because low lattice thermal conductivity compounds tend to be rare. We find that including certain compounds in the training set can be crucial for identifying low lattice thermal conductivity compounds. Active sampling enables scouting of the compound space for compounds that should enter the training set. Principal component analysis and Gaussian process regression are used in the active sampling schemes. With Gaussian process regression we screen 1573 cubic compounds, of which 34 have predicted lattice thermal conductivity ≤ 1.3 W/Km at 300 K -- as well as electronic band gaps -- indicating that they could be potential thermoelectric compounds.
The findings in this thesis show that certain compounds could have a drastic reduction in the lattice thermal conductivity with sublattice substitutions. Thermoelectric compounds with favorable electronic properties -- but high lattice thermal conductivity -- can be investigated in future studies if there is a potential for a large drop in the lattice thermal conductivity with sublattice substitutions. The machine learning and active sampling schemes are scalable, and future work could expand upon this thesis by including different compound classes in training and screening. This would enlarge the search space for promising thermoelectric compounds, increasing the likelihood of encountering high-efficiency candidates. It is also possible to combine the two challenges faced in this thesis. A machine learning model can be trained to predict the lattice thermal conductivity of compounds with sublattice substitutions. This would further increase the pool of possible compounds where promising thermoelectric compounds could reside.
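The active-sampling idea described above can be sketched in toy form. Here a nearest-neighbor distance in descriptor space stands in for the Gaussian-process predictive variance used in the thesis, and the function names and seeding rule are illustrative assumptions:

```python
# Toy active sampling: repeatedly add the candidate compound whose descriptor
# vector is farthest from everything already sampled -- a crude stand-in for
# picking the point of highest Gaussian-process uncertainty.
import math

def nearest_dist(x, sampled):
    """Distance from descriptor x to its nearest already-sampled descriptor."""
    return min(math.dist(x, s) for s in sampled)

def active_sample(candidates, n_pick):
    """Select n_pick descriptor vectors that spread out over the space."""
    sampled = [candidates[0]]  # seed with an arbitrary first compound
    for _ in range(n_pick - 1):
        remaining = [c for c in candidates if c not in sampled]
        sampled.append(max(remaining, key=lambda c: nearest_dist(c, sampled)))
    return sampled
```

In a real screening loop, each selected compound would be fed to a first-principles lattice thermal conductivity calculation and the surrogate model retrained.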

    Dissimilarity-based learning for complex data

    Mokbel B. Dissimilarity-based learning for complex data. Bielefeld: Universität Bielefeld; 2016.
    Rapid advances of information technology have entailed an ever-increasing amount of digital data, which raises the demand for powerful data mining and machine learning tools. Due to modern methods for gathering, preprocessing, and storing information, the collected data become more and more complex: a simple vectorial representation and comparison in terms of the Euclidean distance is often no longer appropriate to capture relevant aspects of the data. Instead, problem-adapted similarity or dissimilarity measures refer directly to the given encoding scheme, making it possible to treat information constituents in a relational manner. This thesis addresses several challenges of complex data sets and their representation in the context of machine learning. The goal is to investigate possible remedies and propose corresponding improvements of established methods, accompanied by examples from various application domains. The main scientific contributions are the following: (I) Many well-established machine learning techniques are restricted to vectorial input data only. Therefore, we propose the extension of two popular prototype-based clustering and classification algorithms to non-negative symmetric dissimilarity matrices. (II) Some dissimilarity measures incorporate a fine-grained parameterization, which makes it possible to configure the comparison scheme with respect to the given data and the problem at hand. However, finding adequate parameters can be hard or even impossible for human users, due to the intricate effects of parameter changes and the lack of detailed prior knowledge. Therefore, we propose to integrate a metric learning scheme into a dissimilarity-based classifier, which can automatically adapt the parameters of a sequence alignment measure according to the given classification task. (III) A valuable instrument for making complex data sets accessible is dimensionality reduction, which can provide an approximate low-dimensional embedding of the given data set and, as a special case, a planar map to visualize the data's neighborhood structure. To assess the reliability of such an embedding, we propose the extension of a well-known quality measure to enable a fine-grained, tractable quantitative analysis, which can be integrated into a visualization. This tool can also help to compare different dissimilarity measures (and parameter settings) when ground truth is not available. (IV) All techniques are demonstrated on real-world examples from a variety of application domains, including bioinformatics, motion capturing, music, and education.
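The relational setting targeted in contribution (I), where only a symmetric dissimilarity matrix is available and no vector space is assumed, can be illustrated with plain k-medoids clustering. This is a generic textbook sketch with a naive deterministic initialization, not the thesis's specific prototype-based algorithms:

```python
# Medoid-based clustering that needs only a symmetric dissimilarity matrix D
# (D[i][j] = dissimilarity between objects i and j). No vector space required.

def k_medoids(D, k, iters=100):
    n = len(D)
    medoids = list(range(k))  # naive init: first k objects as prototypes
    clusters = {}
    for _ in range(iters):
        # assign every object to its closest medoid
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: D[i][m])].append(i)
        # re-pick each medoid as the member minimizing in-cluster dissimilarity
        new = sorted(min(c, key=lambda p: sum(D[p][q] for q in c))
                     for c in clusters.values() if c)
        if new == sorted(medoids):
            break  # converged
        medoids = new
    return medoids, clusters
```

Because the prototypes are data points themselves, the method stays well defined for any non-negative symmetric dissimilarity, e.g. alignment scores between sequences.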

    Histopathologic and proteogenomic heterogeneity reveals features of clear cell renal cell carcinoma aggressiveness

    Clear cell renal cell carcinomas (ccRCCs) represent ∼75% of RCC cases and account for most RCC-associated deaths. Inter- and intratumoral heterogeneity (ITH) results in varying prognosis and treatment outcomes. To obtain the most comprehensive profile of ccRCC, we perform integrative histopathologic, proteogenomic, and metabolomic analyses on 305 ccRCC tumor segments and 166 paired adjacent normal tissues from 213 cases. Combining histologic and molecular profiles reveals ITH in 90% of ccRCCs, with 50% demonstrating immune signature heterogeneity. High tumor grade, along with BAP1 mutation, genome instability, increased hypermethylation, and a specific protein glycosylation signature, defines a high-risk disease subset, where UCHL1 expression displays prognostic value. Single-nuclei RNA sequencing of the adverse sarcomatoid and rhabdoid phenotypes uncovers gene signatures and potential insights into tumor evolution. In vitro cell line studies confirm the potential of inhibiting identified phosphoproteome targets. This study molecularly stratifies aggressive histopathologic subtypes that may inform more effective treatment strategies.