LIPIcs, Volume 251, ITCS 2023, Complete Volume
Mining Butterflies in Streaming Graphs
This thesis introduces two main-memory systems, sGrapp and sGradd, for performing the fundamental analytic tasks of biclique counting and concept drift detection over a streaming graph. The systems are architected using a data-driven heuristic. To this end, the growth patterns of bipartite streaming graphs are first mined and the emergence principles of streaming motifs are discovered. Next, the discovered principles are (a) explained by a graph generator called sGrow; and (b) used to establish the requirements for efficient, effective, explainable, and interpretable management and processing of streams. sGrow is used to benchmark stream analytics, particularly concept drift detection.
sGrow robustly realizes streaming growth patterns independently of initial conditions, scale and temporal characteristics, and model configurations. Extensive evaluations confirm that sGrapp and sGradd are simultaneously effective and efficient. sGrapp achieves a mean absolute percentage error of at most 0.05/0.14 for the cumulative butterfly count in streaming graphs with uniform/non-uniform temporal distribution, and a processing throughput of 1.5 million data records per second. sGrapp's throughput is 160x higher than, and its estimation error 0.02x that of, the baselines. sGradd demonstrates improving performance over time, achieves a zero false detection rate both when no drift is present and when a drift has already been detected, and detects sequential drifts within zero to a few seconds of their occurrence, regardless of drift intervals.
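The butterfly, a (2,2)-biclique, is the motif whose cumulative count sGrapp estimates. The streaming estimator itself is not reproduced in this abstract; as an illustration of the quantity being approximated, a minimal exact batch counter in Python (the graph representation and function name are my own) might look like:

```python
from collections import defaultdict
from itertools import combinations

def count_butterflies(edges):
    """Count butterflies ((2,2)-bicliques) in a bipartite graph.

    A butterfly is two left vertices plus two right vertices with all
    four edges present. For every right vertex we enumerate pairs of
    its left neighbours; a left pair sharing c right neighbours
    contributes C(c, 2) butterflies.
    """
    adj = defaultdict(set)          # right vertex -> set of left neighbours
    for u, v in edges:              # u: left side, v: right side
        adj[v].add(u)
    pair_count = defaultdict(int)   # (u1, u2) -> number of shared right neighbours
    for neigh in adj.values():
        for u1, u2 in combinations(sorted(neigh), 2):
            pair_count[(u1, u2)] += 1
    return sum(c * (c - 1) // 2 for c in pair_count.values())

# A complete 2x2 bipartite graph contains exactly one butterfly.
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
print(count_butterflies(edges))  # -> 1
```

An exact counter like this is quadratic in vertex degrees, which is precisely why streaming approximations such as sGrapp trade exactness for throughput.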
Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing
Developers often dedicate significant time to maintaining and refactoring
existing code. However, most prior work on generative models for code focuses
solely on creating new code, neglecting the unique requirements of editing
existing code. In this work, we explore a multi-round code auto-editing
setting, aiming to predict edits to a code region based on recent changes
within the same codebase. Our model, Coeditor, is a fine-tuned CodeT5 model
with enhancements specifically designed for code editing tasks. We encode code
changes using a line diff format and employ static analysis to form large
customized model contexts, ensuring appropriate information for prediction. We
collect a code editing dataset from the commit histories of 1650 open-source
Python projects for training and evaluation. In a simplified single-round,
single-edit task, Coeditor significantly outperforms the best code completion
approach -- nearly doubling its exact-match accuracy, despite using a much
smaller model -- demonstrating the benefits of incorporating editing history
for code completion. In a multi-round, multi-edit setting, we observe
substantial gains by iteratively prompting the model with additional user
edits. We open-source our code, data, and model weights to encourage future
research and release a VSCode extension powered by our model for interactive
usage.
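The paper's exact line-diff encoding (its special tokens and their integration with the CodeT5 tokenizer) is not spelled out in this abstract; a rough Python sketch of the general idea, marking each line as kept, deleted, or added, could be built on the standard difflib module:

```python
import difflib

def line_diff(before, after):
    """Encode a code change as a line diff: unchanged lines are
    prefixed with ' ', deleted lines with '-', added lines with '+'.
    (Illustrative only; the paper's own token scheme may differ.)"""
    out = []
    sm = difflib.SequenceMatcher(a=before, b=after)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("delete", "replace"):
            out += ["-" + line for line in before[i1:i2]]
        if tag in ("insert", "replace"):
            out += ["+" + line for line in after[j1:j2]]
        if tag == "equal":
            out += [" " + line for line in before[i1:i2]]
    return out

before = ["def add(a, b):", "    return a + b"]
after = ["def add(a, b):", "    # sum two numbers", "    return a + b"]
print("\n".join(line_diff(before, after)))
```

Feeding such a compact change representation to the model, rather than two full file versions, is what lets large surrounding contexts fit into the input window.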
Planar Disjoint Paths, Treewidth, and Kernels
In the Planar Disjoint Paths problem, one is given an undirected planar graph
together with k vertex pairs (s_1, t_1), ..., (s_k, t_k), and the task is to find k pairwise
vertex-disjoint paths such that the i-th path connects s_i to t_i. We
study the problem through the lens of kernelization, aiming at efficiently
reducing the input size in terms of a parameter. We show that Planar Disjoint
Paths does not admit a polynomial kernel when parameterized by k unless coNP ⊆
NP/poly, resolving an open problem by [Bodlaender, Thomassé,
Yeo, ESA'09]. Moreover, we rule out the existence of a polynomial Turing kernel
unless the WK-hierarchy collapses. Our reduction carries over to the setting of
edge-disjoint paths, where the kernelization status remained open even in
general graphs.
On the positive side, we present a polynomial kernel for Planar Disjoint
Paths parameterized by k + tw, where tw denotes the treewidth of the input
graph. As a consequence of both our results, we rule out the possibility of a
polynomial-time (Turing) treewidth reduction to tw = poly(k) under the same
assumptions. To the best of our knowledge, this is the first hardness result of
this kind. Finally, combining our kernel with the known techniques [Adler,
Kolliopoulos, Krause, Lokshtanov, Saurabh, Thilikos, JCTB'17; Schrijver,
SICOMP'94] yields an alternative (and arguably simpler) proof that Planar
Disjoint Paths can be solved in time 2^{O(k^2)} n^{O(1)}, matching the
result of [Lokshtanov, Misra, Pilipczuk, Saurabh, Zehavi, STOC'20].
Comment: To appear at FOCS'23, 82 pages, 30 figures
Private set intersection: A systematic literature review
Secure Multi-party Computation (SMPC) is a family of protocols which allow some parties to compute a function on their private inputs, obtaining the output at the end and nothing more. In this work, we focus on a particular SMPC problem named Private Set Intersection (PSI). The challenge in PSI is how two or more parties can compute the intersection of their private input sets, while the elements that are not in the intersection remain private. This problem has attracted the attention of many researchers because of its wide variety of applications, contributing to the proliferation of many different approaches. Despite that, current PSI protocols still require heavy cryptographic assumptions that may be unrealistic in some scenarios. In this paper, we perform a Systematic Literature Review of PSI solutions, with the objective of analyzing the main scenarios where PSI has been studied and giving the reader a general taxonomy of the problem together with a general understanding of the most common tools used to solve it. We also analyze the performance using different metrics, trying to determine if PSI is mature enough to be used in realistic scenarios, identifying the pros and cons of each protocol and the remaining open problems.
This work has been partially supported by the projects: BIGPrivDATA (UMA20-FEDERJA-082) from the FEDER Andalucía 2014–2020 Program and SecTwin 5.0 funded by the Ministry of Science and Innovation, Spain, and the European Union (Next Generation EU) (TED2021-129830B-I00). The first author has been funded by the Spanish Ministry of Education under the National F.P.U. Program (FPU19/01118). Funding for open access charge: Universidad de Málaga/CBU
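To make the PSI problem statement concrete, here is a minimal sketch of one classic approach, Diffie-Hellman-style commutative blinding, with both parties simulated in a single process. The prime, the hash-to-group mapping, and the key handling are deliberately simplified for readability and are not production-secure:

```python
import hashlib
from secrets import randbelow

P = 2**521 - 1  # a Mersenne prime; a real deployment would use a vetted group

def h(x):
    """Hash an element into the multiplicative group mod P."""
    d = hashlib.sha256(x.encode()).digest()
    return int.from_bytes(d, "big") % (P - 2) + 2

def psi(set_a, set_b):
    """Semi-honest two-party PSI via commutative (DH-style) blinding.
    Since h(x)^(ka*kb) = h(x)^(kb*ka), doubly-blinded values coincide
    exactly on the intersection, while singly-blinded values reveal
    nothing about non-shared elements."""
    ka = randbelow(P - 3) + 1   # Alice's secret exponent
    kb = randbelow(P - 3) + 1   # Bob's secret exponent
    a_blind = [pow(h(x), ka, P) for x in set_a]                 # Alice -> Bob
    b_blind = {x: pow(h(x), kb, P) for x in set_b}              # Bob -> Alice
    a_double = {pow(v, kb, P) for v in a_blind}                 # Bob re-blinds
    b_double = {x: pow(v, ka, P) for x, v in b_blind.items()}   # Alice re-blinds
    return {x for x, v in b_double.items() if v in a_double}

print(psi({"alice@x.com", "bob@x.com"}, {"bob@x.com", "carol@x.com"}))
# -> {'bob@x.com'}
```

Even this toy version hints at the costs the review measures: one modular exponentiation per element per pass, which is exactly the kind of overhead that motivates the more efficient OT- and hashing-based protocols surveyed.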
Reasoning about quantities and concepts: studies in social learning
We live and learn in a ‘society of mind’. This means that we form beliefs not
just based on our own observations and prior expectations but also based on the
communications from other people, such as our social network peers. Across seven
experiments, I study how people combine their own private observations with other
people’s communications to form and update beliefs about the environment. I will
follow the tradition of rational analysis and benchmark human learning against optimal Bayesian inference at Marr’s computational level. To accommodate human
resource constraints and cognitive biases, I will further contrast human learning
with a variety of process level accounts. In Chapters 2–4, I examine how people
reason about simple environmental quantities. I will focus on the effect of dependent information sources on the success of group and individual learning across a
series of single-player and multi-player judgement tasks. Overall, the results from
Chapters 2–4 highlight the nuances of real social network dynamics and provide
insights into the conditions under which we can expect collective success versus
failures such as the formation of inaccurate worldviews. In Chapter 5, I develop a
more complex social learning task which goes beyond estimation of environmental
quantities and focuses on inductive inference with symbolic concepts. Here, I investigate how people search compositional theory spaces to form and adapt their
beliefs, and how symbolic belief adaptation interfaces with individual and social
learning in a challenging active learning task. Results from Chapter 5 suggest that
people might explore compositional theory spaces using local incremental search;
and that it is difficult for people to use another person’s learning data to improve
upon their hypothesis.
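The optimal Bayesian benchmark invoked above can be made concrete with a toy beta-binomial example. The thesis's tasks are far richer; the function below and its treatment of peer reports are illustrative assumptions only:

```python
def posterior(own_heads, own_tails, peer_reports, a=1, b=1):
    """Posterior-mean belief about a coin's bias from private flips plus
    peers' communicated flip counts, starting from a Beta(a, b) prior.

    Pooling peer reports as if they were one's own observations is the
    'optimal' benchmark under independence; when peers share sources,
    this double-counts evidence, which is one way dependent information
    can mislead a group.
    """
    heads = own_heads + sum(h for h, _ in peer_reports)
    tails = own_tails + sum(t for _, t in peer_reports)
    a, b = a + heads, b + tails
    return a / (a + b)  # mean of the Beta(a, b) posterior

# 3 private heads out of 4 flips, plus two peers' reported (heads, tails)
print(posterior(3, 1, [(2, 2), (1, 3)]))  # -> 0.5
```

Comparing human estimates against such a pooled-evidence ideal, and against discounted variants, is the general shape of the rational-analysis benchmarking described in the abstract.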
Behavior quantification as the missing link between fields: Tools for digital psychiatry and their role in the future of neurobiology
The great behavioral heterogeneity observed between individuals with the same
psychiatric disorder and even within one individual over time complicates both
clinical practice and biomedical research. However, modern technologies are an
exciting opportunity to improve behavioral characterization. Existing
psychiatry methods that are qualitative or unscalable, such as patient surveys
or clinical interviews, can now be collected at a greater capacity and analyzed
to produce new quantitative measures. Furthermore, recent capabilities for
continuous collection of passive sensor streams, such as phone GPS or
smartwatch accelerometer, open avenues of novel questioning that were
previously entirely unrealistic. Their temporally dense nature enables a
cohesive study of real-time neural and behavioral signals.
To develop comprehensive neurobiological models of psychiatric disease, it
will be critical to first develop strong methods for behavioral quantification.
There is huge potential in what can theoretically be captured by current
technologies, but this in itself presents a large computational challenge --
one that will necessitate new data processing tools, new machine learning
techniques, and ultimately a shift in how interdisciplinary work is conducted.
In my thesis, I detail research projects that take different perspectives on
digital psychiatry, subsequently tying ideas together with a concluding
discussion on the future of the field. I also provide software infrastructure
where relevant, with extensive documentation.
Major contributions include scientific arguments and proof of concept results
for daily free-form audio journals as an underappreciated psychiatry research
datatype, as well as novel stability theorems and pilot empirical success for a
proposed multi-area recurrent neural network architecture.
Comment: PhD thesis
Contributions to time series analysis, modelling and forecasting to increase reliability in industrial environments.
356 p. The integration of the Internet of Things into the industrial sector is key to achieving business intelligence. This study focuses on improving or proposing new approaches to increase the reliability of AI solutions based on time-series data in industry. Three phases are addressed: improving data quality, models, and errors. A standard definition of quality metrics is proposed and included in the dqts R package. The steps of time-series modelling are explored, from feature extraction to the choice and application of the most efficient forecasting model. The KNPTS method, based on searching for patterns in the historical data, is presented as an R package for estimating future data. In addition, the use of elastic similarity measures to evaluate regression models is suggested, as is the importance of appropriate metrics in problems with imbalanced classes. The contributions were validated in industrial use cases from different fields: product quality, electricity consumption forecasting, porosity detection, and machine diagnostics.
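KNPTS itself is an R package whose exact algorithm is not described in this abstract; the general idea of forecasting by searching for similar patterns in the historical data can be sketched in Python roughly as follows (function name and parameters are my own):

```python
def knn_forecast(series, window, horizon, k=3):
    """Forecast by pattern matching, in the spirit of k-nearest-neighbour
    time-series methods: find the k historic windows closest (in
    Euclidean distance) to the most recent one, and average the values
    that followed each of them."""
    query = series[-window:]
    candidates = []
    for i in range(len(series) - window - horizon + 1):
        w = series[i:i + window]
        dist = sum((a - b) ** 2 for a, b in zip(w, query)) ** 0.5
        candidates.append((dist, series[i + window:i + window + horizon]))
    candidates.sort(key=lambda c: c[0])          # nearest patterns first
    best = [c[1] for c in candidates[:k]]
    return [sum(vals) / len(vals) for vals in zip(*best)]

# a noiseless periodic series: the forecast continues the pattern
series = [0, 1, 2, 3] * 6
print(knn_forecast(series, window=4, horizon=2, k=1))  # -> [0.0, 1.0]
```

The choice of distance matters here, which is where the elastic similarity measures mentioned above (e.g. alignment-based distances rather than plain Euclidean) come into play.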
2015 GREAT Day Program
SUNY Geneseo’s Ninth Annual GREAT Day.
https://knightscholar.geneseo.edu/program-2007/1009/thumbnail.jp
Algorithms for sparse convolution and sublinear edit distance
In this PhD thesis on fine-grained algorithm design and complexity, we investigate output-sensitive and sublinear-time algorithms for two important problems. (1) Sparse Convolution: Computing the convolution of two vectors is a basic algorithmic primitive with applications across all of Computer Science and Engineering. In the sparse convolution problem we assume that the input and output vectors have at most t nonzero entries, and the goal is to design algorithms with running times dependent on t. For the special case where all entries are nonnegative, which is particularly important for algorithm design, it has been known for twenty years that sparse convolutions can be computed in near-linear randomized time O(t log^2 n). In this thesis we develop a randomized algorithm with running time O(t log t), which is optimal under some mild assumptions, as well as the first near-linear deterministic algorithm for sparse nonnegative convolution. We also present an application of these results, leading to seemingly unrelated fine-grained lower bounds against distance oracles in graphs. (2) Sublinear Edit Distance: The edit distance of two strings is a well-studied similarity measure with numerous applications in computational biology. While computing the edit distance exactly provably requires quadratic time, a long line of research has led to a constant-factor approximation algorithm in almost-linear time. Perhaps surprisingly, it is also possible to approximate the edit distance k within a large factor O(k) in sublinear time O~(n/k + poly(k)). We drastically improve the approximation factor of the known sublinear algorithms from O(k) to k^{o(1)} while preserving the O~(n/k + poly(k)) running time.
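The thesis's O(t log t) algorithm is well beyond a sketch, but the problem itself, together with the simplest output-sensitive baseline, whose O(t_a * t_b) running time is independent of the vector length n, is easy to state in Python:

```python
from collections import defaultdict

def sparse_convolution(a, b):
    """Convolve two sparse vectors given as {index: value} dicts.

    Runs in O(t_a * t_b) time where t_a, t_b are the numbers of
    nonzero entries -- the naive output-sensitive baseline that the
    near-linear O(t log t) algorithms improve upon.
    """
    out = defaultdict(int)
    for i, x in a.items():
        for j, y in b.items():
            out[i + j] += x * y              # coefficient of index i + j
    return {k: v for k, v in out.items() if v != 0}

# (1 + x^1000) * (2 + 3*x^1000) = 2 + 5*x^1000 + 3*x^2000
print(sparse_convolution({0: 1, 1000: 1}, {0: 2, 1000: 3}))
# -> {0: 2, 1000: 5, 2000: 3}
```

Note the dense FFT approach would spend O(n log n) time on vectors of length 2001 here, while the sparse formulation touches only the handful of nonzero entries.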