30 research outputs found
Privacy and Transparency in Graph Machine Learning: A Unified Perspective
Graph Machine Learning (GraphML), whereby classical machine learning is
generalized to irregular graph domains, has enjoyed a recent renaissance,
leading to a dizzying array of models and their applications in several
domains. With its growing applicability to sensitive domains and regulations by
government agencies for trustworthy AI systems, researchers have started
looking into the issues of transparency and privacy of graph learning. However,
these topics have been mainly investigated independently. In this position
paper, we provide a unified perspective on the interplay of privacy and
transparency in GraphML.
User Fairness in Recommender Systems
Recent works in recommendation systems have focused on diversity in
recommendations as an important aspect of recommendation quality. In this work
we argue that the post-processing algorithms aimed at only improving diversity
among recommendations lead to discrimination among the users. We introduce the
notion of user fairness, which has been overlooked in the literature so far, and
propose measures to quantify it. Our experiments on two diversification
algorithms show that an increase in aggregate diversity results in increased
disparity among the users.
Boilerplate Removal using a Neural Sequence Labeling Model
The extraction of main content from web pages is an important task for
numerous applications, ranging from usability aspects, like reader views for
news articles in web browsers, to information retrieval or natural language
processing. Existing approaches are lacking as they rely on large amounts of
hand-crafted features for classification. This results in models that are
tailored to a specific distribution of web pages, e.g. from a certain time
frame, but lack generalization power. We propose a neural sequence labeling
model that does not rely on any hand-crafted features but takes only the HTML
tags and words that appear in a web page as input. This allows us to present a
browser extension which highlights the content of arbitrary web pages directly
within the browser using our model. In addition, we create a new, more current
dataset to show that our model is able to adapt to changes in the structure of
web pages and outperform the state-of-the-art model. Comment: WWW20 demo paper
The Multiple-orientability Thresholds for Random Hypergraphs
A $k$-uniform hypergraph $H = (V, E)$ is called $\ell$-orientable if there
is an assignment of each edge $e \in E$ to one of its vertices $v \in e$ such
that no vertex is assigned more than $\ell$ edges. Let $H_{n,m,k}$ be a
hypergraph, drawn uniformly at random from the set of all $k$-uniform
hypergraphs with $n$ vertices and $m$ edges. In this paper we establish the
threshold for the $\ell$-orientability of $H_{n,m,k}$ for all $k \ge 3$ and
$\ell \ge 2$, i.e., we determine a critical quantity $c^*_{k,\ell}$ such that
with probability $1 - o(1)$ the graph $H_{n,cn,k}$ has an $\ell$-orientation if
$c < c^*_{k,\ell}$, but fails to have one if $c > c^*_{k,\ell}$.
Our result has various applications including sharp load thresholds for
cuckoo hashing, load balancing with guaranteed maximum load, and massive
parallel access to hard disk arrays. Comment: An extended abstract appeared in the proceedings of SODA 201
Joint learning from multiple information sources for biological problems
Thanks to technological advancements, more and more biological data have been generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricate interplay between the involved molecules and entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically degrade the performance of a predictive model on unseen data and thus limit its applicability in real biological studies.
Human learning is a multi-stage process in which we usually start with simple things. Through knowledge accumulated over time, our cognitive ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports learning to speak correct and meaningful paragraphs. Generally, knowledge acquired from related learning tasks helps boost our learning capability in the current task. Motivated by this phenomenon, in this thesis we study supervised machine learning models for bioinformatics problems that can improve their performance by exploiting multiple related knowledge sources. More specifically, we are concerned with ways to enrich the supervised models' knowledge base with publicly available related data to enhance the computational models' prediction performance.
Our work shares commonalities with existing work in multimodal learning, multi-task learning, and transfer learning. Nevertheless, there are certain differences in some cases. Besides the proposed architectures, we present large-scale experiment setups with consensus evaluation metrics, along with the creation and release of large datasets, to showcase our approaches' superiority. Moreover, we add case studies with detailed analyses in which we make no simplifying assumptions, to demonstrate the systems' utility in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the model's generated prediction results, to facilitate field experts' assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between "Computer Science" and "Biology" that will open a new era of fruitful collaboration between computer scientists and biological field experts.
Multiple choice allocations with small maximum loads
The idea of using multiple choices to improve allocation schemes is now well understood and is often illustrated by the following example. Suppose $n$ balls are allocated to $n$ bins with each ball choosing a bin independently and uniformly at random. The \emph{maximum load}, or the number of balls in the most loaded bin, will then be approximately $\frac{\log n}{\log \log n}$ with high probability. Suppose now the balls are allocated sequentially by placing each ball in the least loaded bin among $k \ge 2$ bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this scenario the maximum load drops to $\frac{\log \log n}{\log k} + O(1)$ with high probability, which is an exponential improvement over the previous case.
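The gap between one choice and several choices can be observed empirically. The following is a minimal simulation sketch, not code from the thesis; the function name `max_load`, the seed handling, and the choice to sample bins with replacement are assumptions made for illustration.

```python
import random


def max_load(n, k, seed=0):
    """Throw n balls into n bins and return the maximum load.

    Each ball samples k bins uniformly at random (with replacement,
    an assumption of this sketch) and is placed in the least loaded
    of them. With k = 1 this is the plain one-choice process.
    """
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(k)]
        # Pick the candidate bin that currently holds the fewest balls.
        target = min(candidates, key=loads.__getitem__)
        loads[target] += 1
    return max(loads)


if __name__ == "__main__":
    n = 100_000
    for k in (1, 2, 3):
        print(f"k={k}: maximum load {max_load(n, k)}")
```

For moderately large `n`, the reported maximum load for `k = 2` should be visibly smaller than for `k = 1`, while moving from `k = 2` to `k = 3` typically changes it only slightly.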
In this thesis we investigate multiple choice allocations from a slightly different perspective. Instead of minimizing the maximum load, we fix the bin capacities and focus on maximizing the number of balls that can be allocated without overloading any bin. In the process that we consider we have $m = \lfloor cn \rfloor$ balls and $n$ bins. Each ball chooses $k$ bins independently and uniformly at random. \emph{Is it possible to assign each ball to one of its choices such that no bin receives more than $\ell$ balls?} For all $k \ge 3$ and $\ell \ge 2$ we give a critical value, $c^*_{k,\ell}$, such that when $c < c^*_{k,\ell}$ an allocation is possible with high probability and when $c > c^*_{k,\ell}$ this is not the case.
In case such an allocation exists, \emph{how quickly can we find it?} Previous work on total allocation time for the case $k \ge 3$ and $\ell = 1$ has analyzed a \emph{breadth first strategy} which is shown to be linear only in expectation. We give a simple and efficient algorithm, which we call \emph{local search allocation} (LSA), to find an allocation for all $k \ge 3$ and $\ell = 1$. Provided the number of balls is below (but arbitrarily close to) the theoretical achievable load threshold, we give a \emph{linear} bound for the total allocation time that holds with high probability.
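Whether a given set of choices admits a valid allocation can be checked with a standard augmenting-path search for bipartite matching with bin capacities. The sketch below is that generic technique, not the local search allocation algorithm or the breadth first strategy discussed in the abstract; the names `allocate`, `choices`, `n_bins`, and `capacity` are hypothetical.

```python
def allocate(choices, n_bins, capacity):
    """Assign each ball to one of its chosen bins, at most `capacity` per bin.

    `choices[i]` lists the bins that ball i may use. Returns a list mapping
    ball -> bin, or None if no valid allocation exists. Uses recursive
    augmenting paths, so it is intended for small instances only.
    """
    occupants = [[] for _ in range(n_bins)]  # balls currently in each bin
    assigned = [None] * len(choices)

    def try_place(ball, visited):
        for b in choices[ball]:
            if b in visited:
                continue
            visited.add(b)
            if len(occupants[b]) < capacity:
                occupants[b].append(ball)
                assigned[ball] = b
                return True
            # Bin is full: try to relocate one occupant to another bin.
            for i, other in enumerate(occupants[b]):
                if try_place(other, visited):
                    occupants[b][i] = ball  # take over the freed slot
                    assigned[ball] = b
                    return True
        return False

    for ball in range(len(choices)):
        if not try_place(ball, set()):
            return None  # no augmenting path: this ball cannot be placed
    return assigned


if __name__ == "__main__":
    demo = [[0, 1], [0, 2], [0, 1]]
    print(allocate(demo, n_bins=3, capacity=1))
```

The search gives an exact yes/no answer for one instance, but its worst-case running time is not linear, which is the cost the abstract's algorithm is concerned with.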
We demonstrate, through simulations, an order of magnitude improvement for total and maximum allocation times when compared to the state of the art method.
Our results find applications in many areas including hashing, load balancing, data management, orientability of random hypergraphs and maximum matchings in a special class of bipartite graphs.
The Multiple-Orientability Thresholds for Random Hypergraphs
A $k$-uniform hypergraph $H = (V, E)$ is called $\ell$-orientable if there is an assignment of each edge $e \in E$ to one of its vertices $v \in e$ such that no vertex is assigned more than $\ell$ edges. Let $H_{n,m,k}$ be a hypergraph, drawn uniformly at random from the set of all $k$-uniform hypergraphs with $n$ vertices and $m$ edges. In this paper we establish the threshold for the $\ell$-orientability of $H_{n,m,k}$ for all $k \ge 3$ and $\ell \ge 2$, that is, we determine a critical quantity $c^*_{k,\ell}$ such that with probability $1 - o(1)$ the graph $H_{n,cn,k}$ has an $\ell$-orientation if $c < c^*_{k,\ell}$, but fails to have one if $c > c^*_{k,\ell}$. Our result has various applications, including sharp load thresholds for cuckoo hashing, load balancing with guaranteed maximum load, and massive parallel access to hard disk arrays.