Space- and Time-Efficient Algorithm for Maintaining Dense Subgraphs on One-Pass Dynamic Streams
While in many graph mining applications it is crucial to handle a stream of
updates efficiently in terms of {\em both} time and space, little was known
about how to achieve both guarantees simultaneously. In this paper we study this
issue for a problem that lies at the core of many graph mining applications,
the {\em densest subgraph problem}. We develop an algorithm that achieves time-
and space-efficiency for this problem simultaneously; to the best of our
knowledge, it is one of the first of its kind for graph problems.
In a graph $G=(V,E)$, the "density" of a subgraph induced by a subset of
nodes $S\subseteq V$ is defined as $\rho(S)=|E(S)|/|S|$, where $E(S)$ is the set of
edges in $E$ with both endpoints in $S$. In the densest subgraph problem, the
goal is to find a subset of nodes $S$ that maximizes the density of the
corresponding induced subgraph. For any $\epsilon>0$, we present a dynamic
algorithm that, with high probability, maintains a $(4+\epsilon)$-approximation
to the densest subgraph problem under a sequence of edge insertions and
deletions in a graph with $n$ nodes. It uses $\tilde{O}(n)$ space, and has an
amortized update time of $\tilde{O}(1)$ and a query time of $\tilde{O}(1)$. Here,
$\tilde{O}$ hides a $O(\mathrm{poly}\log_{1+\epsilon} n)$ term. The approximation ratio
can be improved to $(2+\epsilon)$ at the cost of increasing the query time to
$\tilde{O}(n)$. It can be extended to a $(2+\epsilon)$-approximation
sublinear-time algorithm and a distributed-streaming algorithm. Our algorithm
is the first streaming algorithm that can maintain the densest subgraph in {\em
one pass}. The previously best algorithm in this setting required $O(\log_{1+\epsilon} n)$
passes [Bahmani, Kumar and Vassilvitskii, VLDB'12]. The space required by our
algorithm is tight up to a polylogarithmic factor.
Comment: A preliminary version of this paper appeared in STOC 2015
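For intuition about the objective this abstract concerns, the classical static baseline is Charikar's greedy peeling, which repeatedly deletes a minimum-degree node and keeps the densest intermediate subgraph (a 1/2-approximation). The Python sketch below is this static routine only, not the paper's streaming algorithm:

```python
from collections import defaultdict
import heapq

def densest_subgraph_peel(edges):
    """Charikar's greedy peeling: repeatedly delete a minimum-degree node,
    returning the densest intermediate subgraph (a 1/2-approximation)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    m = sum(len(a) for a in adj.values()) // 2
    best_density, best_nodes = m / len(nodes), set(nodes)
    # lazy-deletion heap of (degree, node); stale entries are skipped on pop
    heap = [(len(adj[u]), u) for u in nodes]
    heapq.heapify(heap)
    while len(nodes) > 1:
        while True:
            deg, u = heapq.heappop(heap)
            if u in nodes and deg == len(adj[u]):
                break  # entry is current: u really has minimum degree
        nodes.remove(u)
        m -= len(adj[u])
        for w in adj[u]:
            adj[w].remove(u)
            heapq.heappush(heap, (len(adj[w]), w))
        adj[u].clear()
        density = m / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
    return best_density, best_nodes
```

On a 4-clique with a pendant path attached, peeling strips the low-degree tail first and recovers the clique as the densest subgraph.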
Alternative Approaches for Analysis of Bin Packing and List Update Problems
In this thesis we introduce and evaluate new algorithms and models for the analysis of online bin packing and list update problems. These are two classic online problems which are extensively studied in the literature and have many applications in the real world. Similar to other online problems, the framework of competitive analysis is often used to study the list update and bin packing algorithms. Under this framework, the behavior of online algorithms is compared to an optimal offline algorithm on the worst possible input. This is aligned with the traditional algorithm theory built around the concept of worst-case analysis. However, the pessimistic nature of the competitive analysis along with unrealistic assumptions behind the proposed models for the problems often result in situations where the existing theory is not quite useful in practice. The main goal of this thesis is to develop new approaches for studying online problems, and in particular bin packing and list update, to guide development of practical algorithms performing quite well on real-world inputs. In doing so, we introduce new algorithms with good performance (not only under the competitive analysis) as well as new models which are more realistic for certain applications of the studied problems.
For many online problems, competitive analysis fails to provide a theoretical justification for observations made in practice. This is partially because, as a worst-case analysis method, competitive analysis does not necessarily reflect the typical behavior of algorithms. In the case of the bin packing problem, the Best Fit and First Fit algorithms are widely used in practice. There are, however, other algorithms with better competitive ratios which are rarely used in practice since they perform poorly on average. We show that it is possible to optimize for both cases. In doing so, we introduce online bin packing algorithms which outperform Best Fit and First Fit in terms of competitive ratio while maintaining their good average-case performance.
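For reference, the two classic heuristics discussed above can be sketched in a few lines of Python (bin capacity normalized to 1; this is the textbook formulation, not code from the thesis):

```python
def first_fit(items, cap=1.0):
    """First Fit: place each item into the first open bin with room."""
    bins = []  # bins[b] = current load of bin b
    for x in items:
        for b in range(len(bins)):
            if bins[b] + x <= cap + 1e-12:
                bins[b] += x
                break
        else:
            bins.append(x)  # no open bin fits: open a new one
    return bins

def best_fit(items, cap=1.0):
    """Best Fit: place each item into the fullest bin that still has room."""
    bins = []
    for x in items:
        best = None
        for b in range(len(bins)):
            if bins[b] + x <= cap + 1e-12 and (best is None or bins[b] > bins[best]):
                best = b
        if best is None:
            bins.append(x)
        else:
            bins[best] += x
    return bins
```

On the sequence 0.4, 0.7, 0.2, 0.6 the two heuristics already diverge: Best Fit packs it into two bins while First Fit needs three.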
An alternative for analysis of online problems is the advice model which has received significant attention in the past few years. Under the advice model, an online algorithm receives a number of bits of advice about the unrevealed parts of the sequence. Generally, there is a trade-off between the size of the advice and the performance of online algorithms. The advice model generalizes the existing frameworks in which an online algorithm has partial knowledge about the input sequence, e.g., the access graph model for the paging problem. We study list update and bin packing problems under the advice model and answer several relevant questions about the advice complexity of these problems.
Online problems are usually studied under specific settings which are not necessarily valid for all applications of the problem. As an example, online bin packing algorithms are widely used for server consolidation to minimize the number of active servers in a data center. In some applications, e.g., tenant placement in the Cloud, often a `fault-tolerant' solution for server consolidation is required. In this setting, the problem becomes different and the classic algorithms can no longer be used. We study a fault-tolerant model for the bin packing problem and analyze algorithms which fit this particular application of the problem.
Similarly, the list update problem was initially proposed for maintaining self-adjusting linked lists. Presently, however, the main application of the problem is in the data compression realm.
We show that the standard cost model is not suitable for compression purposes and study a compression cost model for the list update problem. Our analysis justifies the advantage of compression schemes based on the Move-To-Front algorithm and might lead to improved compression algorithms.
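As an illustration of why Move-To-Front suits compression, here is a minimal Python sketch of the MTF encoding step: recently accessed symbols receive small indices, which a subsequent entropy coder can exploit. This is the textbook transform, not code from the thesis:

```python
def mtf_encode(data, alphabet):
    """Move-To-Front: emit each symbol's index in a self-adjusting list,
    then move that symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))  # self-adjust: move c to the front
    return out
```

Runs of a repeated symbol encode as runs of small indices, which is exactly the locality a compressor such as bzip2 exploits after the Burrows-Wheeler transform.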
Dynamic Data Structures for Parameterized String Problems
We revisit classic string problems considered in the area of parameterized
complexity, and study them through the lens of dynamic data structures. That
is, instead of asking for a static algorithm that solves the given instance
efficiently, our goal is to design a data structure that efficiently maintains
a solution, or reports a lack thereof, upon updates in the instance.
We first consider the Closest String problem, for which we design two
randomized dynamic data structures whose amortized update times are governed by
the alphabet $\Sigma$ and by the assumed bound $d$ on the maximum distance.
These are obtained by combining known static approaches to Closest String with
color-coding.
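For context, a classical static FPT routine for Closest String, parameterized by the distance bound $d$, is the bounded search tree of Gramm et al.: start from the first input string and, whenever some input string is too far away, branch on $d+1$ mismatch positions. The Python sketch below is this static algorithm only (strings assumed equal-length), not the paper's dynamic construction:

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def closest_string(strings, d):
    """Bounded-search-tree algorithm for Closest String with distance bound d.
    Returns a center string within distance d of every input, or None."""
    def search(cand, budget):
        for s in strings:
            dist = hamming(cand, s)
            if dist > d:
                if budget == 0 or dist > d + budget:
                    return None  # cannot be repaired with the remaining budget
                positions = [i for i in range(len(cand)) if cand[i] != s[i]]
                # branching on d+1 mismatch positions suffices for correctness
                for p in positions[:d + 1]:
                    fixed = cand[:p] + s[p] + cand[p + 1:]
                    result = search(fixed, budget - 1)
                    if result is not None:
                        return result
                return None
        return cand  # every input string is within distance d
    return search(strings[0], d)
```

The recursion depth is at most $d$ and each node branches $d+1$ ways, giving the classic $O((d+1)^d \cdot \text{poly})$ running time.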
Next, we note that from a result of Frandsen et al.~[J. ACM'97] one can
easily infer a meta-theorem that provides dynamic data structures for
parameterized string problems with worst-case update times that depend on the
parameter $k$ in question and only polylogarithmically on the length $n$ of the
string. We showcase the utility of this meta-theorem by giving such data
structures for the problems Disjoint Factors and Edit Distance. We also give
explicit data structures for these problems with improved worst-case update
times. Finally, we discuss how a lower bound methodology introduced by
Amarilli et al.~[ICALP'21] can be used to show that substantially faster update
times for Disjoint Factors and Edit Distance are unlikely already
for a constant value of the parameter $k$.
Comment: 28 pages
ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms
Computing connected components is a fundamental kernel in graph applications,
due to its usefulness in measuring how well-connected a graph is, as well as
its use as a subroutine in many other graph algorithms. The fastest existing parallel
multicore algorithms for connectivity are based on some form of edge sampling
and/or linking and compressing trees. However, many combinations of these
design choices have been left unexplored. In this paper, we design the
ConnectIt framework, which provides different sampling strategies as well as
various tree linking and compression schemes. ConnectIt enables us to obtain
several hundred new variants of connectivity algorithms, most of which extend
to computing spanning forest. In addition to static graphs, we also extend
ConnectIt to support mixes of insertions and connectivity queries in the
concurrent setting.
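The "linking and compressing trees" design space can be illustrated with a sequential union-find sketch. ConnectIt's actual implementations are concurrent and parallel; the Python version below shows just one choice of linking rule (lower-indexed root wins) and one compression scheme (path halving):

```python
class UnionFind:
    """Sequential sketch of one point in the link/compress design space:
    linking by index, with path halving on finds."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, u):
        # path halving: each visited node is pointed at its grandparent
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]
            u = self.parent[u]
        return u

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            if ru > rv:
                ru, rv = rv, ru
            self.parent[rv] = ru  # link the higher root under the lower one

    def connected(self, u, v):
        return self.find(u) == self.find(v)
```

Swapping in a different linking rule (e.g. by rank) or compression scheme (full compression, path splitting, none) yields the kind of algorithm variants the framework enumerates.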
We present an experimental evaluation of ConnectIt on a 72-core machine,
which we believe is the most comprehensive evaluation of parallel connectivity
algorithms to date. Compared to a collection of state-of-the-art static
multicore algorithms, we obtain an average speedup of 37.4x (2.36x average
speedup over the fastest existing implementation for each graph). Using
ConnectIt, we are able to compute connectivity on the largest
publicly-available graph (with over 3.5 billion vertices and 128 billion edges)
in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the
fastest existing connectivity result for this graph, in any computational
setting. For our incremental algorithms, we show that our algorithms can ingest
graph updates at up to several billion edges per second. Finally, to guide the
user in selecting the best variants in ConnectIt for different situations, we
provide a detailed analysis of the different strategies in terms of their work
and locality.
Development of statistical methods for the analysis of single-cell RNA-seq data
Single-cell RNA-sequencing profiles the transcriptome of cells from diverse populations. A popular intermediate data format is a large count matrix of genes x cells. This type of data brings several analytical challenges. Here, I present three projects that I worked on during my PhD that address particular aspects of working with such datasets:
- The large number of cells in the count matrix is a challenge for fitting gamma-Poisson generalized linear models with existing tools. I developed a new R package called glmGamPoi to address this gap. I optimized the overdispersion estimation procedure to be quick and robust for datasets with many cells and small counts. I compared the performance against two popular tools (edgeR and DESeq2) and found that my inference is 6x to 13x faster and achieves a higher likelihood for a majority of the genes in four single-cell datasets.
- The variance of single-cell RNA-seq counts depends on their mean, but many existing statistical tools have optimal performance when the variance is uniform. Accordingly, variance-stabilizing transformations are applied to unlock the large number of methods with such a requirement. I compared four approaches to variance-stabilize the data, based on the delta method, model residuals, an inferred latent expression state, or count factor analysis. I describe their theoretical strengths and weaknesses, and compare their empirical performance in a benchmark on simulated and real single-cell data. I find that none of the mathematically more sophisticated transformations consistently outperforms the simple log(y/s+1) transformation.
- Multi-condition single-cell data offers the opportunity to find differentially expressed genes for individual cell subpopulations. However, the prevalent approach to analyzing such data is to start by dividing the cells into discrete populations and then test for differential expression within each group. The results are interpretable but may miss interesting cases by (1) choosing the cluster size too small and lacking power to detect effects or (2) choosing the cluster size too large and obscuring interesting effects apparent on a smaller scale. I developed a new statistical framework for the analysis of multi-condition single-cell data that avoids this premature discretization. The approach performs regression on the latent subspaces occupied by the cells in each condition. The method is implemented as an R package called lemur.
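The simple log(y/s+1) baseline from the transformation comparison above can be sketched in a few lines of NumPy. The size-factor convention used here (total counts per cell, scaled to mean one) is a common choice and not necessarily the exact one used in the benchmark:

```python
import numpy as np

def shifted_log_transform(counts):
    """Shifted log transformation for a genes x cells count matrix:
    size-factor normalization followed by log(y/s + 1)."""
    # size factor per cell: total counts, scaled so the mean factor is 1
    totals = counts.sum(axis=0)
    size_factors = totals / totals.mean()
    return np.log1p(counts / size_factors)
```

Because the transform is elementwise after normalization, it scales trivially to the large sparse matrices typical of single-cell data.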
Identification of materials with low lattice thermal conductivity using machine learning and computational methods
Lattice thermal conductivity is a key materials property in applications related to thermal functionality, such as thermal barrier coatings, thermal conductors in microelectronics, and solid-state waste-heat recovery devices. The lattice thermal conductivity governs the rate of heat energy transfer in thermoelectric materials, which are materials that can directly convert heat to electricity and vice versa. These materials become interesting in applications that require electricity generation or local cooling. Thermoelectric materials depend on a low lattice thermal conductivity to attain high heat-to-electricity conversion efficiency. The materials used in present thermoelectric generators are often based on toxic or scarce elements. New high-efficiency thermoelectric materials are therefore desired for sustainable and environmentally friendly energy harvesting. Two main research challenges are investigated in this thesis: 1) reducing the lattice thermal conductivity to enhance thermoelectric performance, and 2) identifying new compounds with low lattice thermal conductivity. Addressing these challenges experimentally is a daunting task -- especially for 100s or 1000s of compounds -- as experiments are costly, time-consuming, and require expert domain knowledge. This thesis, therefore, relies on lattice thermal conductivity from theoretical calculations based on quantum mechanical simulations.
Addressing challenge 1), the lattice thermal conductivity of 122 half-Heusler compounds is calculated using density functional theory and the temperature-dependent effective potential method. Phonon scattering from partial sublattice substitutions and grain boundaries is included in the calculations, in an attempt to reduce the lattice thermal conductivity. We find that isovalent substitutions on the site hosting the heaviest atom should be performed to optimally reduce the lattice thermal conductivity in most half-Heuslers. Compounds with large atomic mass differences can have a large drop in lattice thermal conductivity with substitutions. Examples of such compounds are AlSiLi and TiNiPb, which achieve a ~70~\% reduction of their lattice thermal conductivity when substituting Si by Ge and Pb by Sn at 10~\% concentration. The reduction from additional scattering mechanisms enables a handful of half-Heuslers to attain a lattice thermal conductivity close to 2~W/Km at 300~K. Calculations for the full-Heusler AlVFe2 reveal that the introduction of 15~\% Ru substitutions on the Fe site and 100~nm grain boundaries can reduce the lattice thermal conductivity from 46~W/Km to 7~W/Km.
Tackling challenge 2) is done by computational screening for low lattice thermal conductivity compounds. Coupling calculations with machine learning accelerates the screening. When training the machine learning model on calculated lattice thermal conductivities, it learns to recognize descriptor patterns for compounds with low lattice thermal conductivity. The size of the training set is limited by the large computational cost of calculating lattice thermal conductivity. It is therefore challenging to obtain a diverse set of training compounds, especially so because low lattice thermal conductivity compounds tend to be rare. We find that including certain compounds in the training can be crucial for identifying low lattice thermal conductivity compounds. Active sampling enables scouting of the compound space for compounds that should enter the training set. Principal component analysis and Gaussian process regression are used in the active sampling schemes. With Gaussian process regression we screen 1573 cubic compounds, of which 34 have low predicted lattice thermal conductivity at 300 K -- as well as electronic band gaps -- indicating that they could be potential thermoelectric compounds.
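The active-sampling idea can be illustrated with a small NumPy sketch of Gaussian process regression with an RBF kernel, where the next compound to compute is the one with the largest predictive variance. This is an illustrative uncertainty-sampling criterion; the feature vectors and kernel settings below are placeholders, not the thesis's actual descriptors:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, length=1.0, noise=1e-6):
    """Gaussian process regression with an RBF kernel: posterior mean and
    pointwise variance at the test points."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_train, X_test)
    Kss = rbf(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)
    var = np.diag(Kss - Ks.T @ v)
    return mean, var

def next_to_compute(X_train, y_train, X_pool):
    """Active sampling step: pick the pool compound where the model is
    least certain, i.e. with the largest predictive variance."""
    _, var = gp_posterior(X_train, y_train, X_pool)
    return int(np.argmax(var))
```

Each selected compound is then computed with the expensive first-principles method and appended to the training set, steadily enlarging coverage of the descriptor space.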
The findings in this thesis show that certain compounds could have a drastic reduction in the lattice thermal conductivity with sublattice substitutions. Thermoelectric compounds with favorable electronic properties -- but high lattice thermal conductivity -- can be investigated in future studies if there is a potential for a large drop in the lattice thermal conductivity with sublattice substitutions.
The machine learning and active sampling schemes are scalable, and future works could expand upon this thesis by including different compound classes in training and screening. This would enlarge the search space for promising thermoelectric compounds, increasing the likelihood of encountering high-efficiency candidates.
It is also possible to combine the two challenges faced in this thesis. A machine learning model can be trained to predict the lattice thermal conductivity of compounds with sublattice substitutions. This would further increase the pool of possible compounds where promising thermoelectric compounds could reside.
Dissimilarity-based learning for complex data
Mokbel B. Dissimilarity-based learning for complex data. Bielefeld: Universität Bielefeld; 2016.
Rapid advances of information technology have entailed an ever-increasing amount of digital data, which raises the demand for powerful data mining and machine learning tools. Due to modern methods for gathering, preprocessing, and storing information, the collected data become more and more complex: a simple vectorial representation and comparison in terms of the Euclidean distance are often no longer appropriate to capture relevant aspects in the data. Instead, problem-adapted similarity or dissimilarity measures refer directly to the given encoding scheme, allowing information constituents to be treated in a relational manner.
This thesis addresses several challenges of complex data sets and their representation in the context of machine learning. The goal is to investigate possible remedies, and propose corresponding improvements of established methods, accompanied by examples from various application domains. The main scientific contributions are the following:
(I) Many well-established machine learning techniques are restricted to vectorial input data only. Therefore, we propose the extension of two popular prototype-based clustering and classification algorithms to non-negative symmetric dissimilarity matrices.
(II) Some dissimilarity measures incorporate a fine-grained parameterization, which allows the comparison scheme to be configured with respect to the given data and the problem at hand. However, finding adequate parameters can be hard or even impossible for human users, due to the intricate effects of parameter changes and the lack of detailed prior knowledge. Therefore, we propose to integrate a metric learning scheme into a dissimilarity-based classifier, which can automatically adapt the parameters of a sequence alignment measure according to the given classification task.
(III) Dimensionality reduction techniques are a valuable instrument for making complex data sets accessible: they can provide an approximate low-dimensional embedding of the given data set and, as a special case, a planar map to visualize the data's neighborhood structure. To assess the reliability of such an embedding, we propose the extension of a well-known quality measure to enable a fine-grained, tractable quantitative analysis, which can be integrated into a visualization. This tool can also help to compare different dissimilarity measures (and parameter settings) when ground truth is not available.
(IV) All techniques are demonstrated on real-world examples from a variety of application domains, including bioinformatics, motion capturing, music, and education.
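Point (I) can be illustrated with a minimal prototype-based clustering that operates directly on a dissimilarity matrix by restricting prototypes to data points, i.e. k-medoids. This is an illustrative stand-in, not the specific algorithms extended in the thesis:

```python
import numpy as np

def k_medoids(D, k, iters=100, seed=0):
    """Prototype-based clustering that needs only a symmetric, non-negative
    dissimilarity matrix D: prototypes are restricted to data points."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        # assign each point to its nearest medoid
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                # new medoid minimizes total dissimilarity within its cluster
                new[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break  # converged
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)
```

Because only pairwise dissimilarities are consulted, the same code runs unchanged whether D comes from Euclidean distances, sequence alignments, or any other problem-adapted measure.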
Histopathologic and proteogenomic heterogeneity reveals features of clear cell renal cell carcinoma aggressiveness
Clear cell renal cell carcinomas (ccRCCs) represent ∼75% of RCC cases and account for most RCC-associated deaths. Inter- and intratumoral heterogeneity (ITH) results in varying prognosis and treatment outcomes. To obtain the most comprehensive profile of ccRCC, we perform integrative histopathologic, proteogenomic, and metabolomic analyses on 305 ccRCC tumor segments and 166 paired adjacent normal tissues from 213 cases. Combining histologic and molecular profiles reveals ITH in 90% of ccRCCs, with 50% demonstrating immune signature heterogeneity. High tumor grade, along with BAP1 mutation, genome instability, increased hypermethylation, and a specific protein glycosylation signature, defines a high-risk disease subset, where UCHL1 expression displays prognostic value. Single-nuclei RNA sequencing of the adverse sarcomatoid and rhabdoid phenotypes uncovers gene signatures and potential insights into tumor evolution. In vitro cell line studies confirm the potential of inhibiting identified phosphoproteome targets. This study molecularly stratifies aggressive histopathologic subtypes that may inform more effective treatment strategies.