LIPIcs, Volume 261, ICALP 2023, Complete Volume
On the enhancement of Big Data Pipelines through Data Preparation, Data Quality, and the distribution of Optimisation Problems
Nowadays, data are fundamental for companies, providing operational support by facilitating daily
transactions. Data have also become the cornerstone of strategic decision-making processes in
businesses. Numerous techniques exist for extracting knowledge and value from data. For example,
optimisation algorithms excel at supporting decision-making processes that improve the use of
resources, time, and costs in an organisation. In the current industrial context, organisations
usually rely on business processes to orchestrate their daily activities while collecting large
amounts of information from heterogeneous sources. The support of Big Data technologies (which
are based on distributed environments) is therefore required, given the volume, variety, and
velocity of the data. To extract value from the data, a set of techniques or activities is then
applied in an orderly way and at different stages. This set of techniques or activities, which
facilitates the acquisition, preparation, and analysis of data, is known in the literature as a
Big Data pipeline.
In this thesis, the improvement of three stages of Big Data pipelines is tackled: Data
Preparation, Data Quality assessment, and Data Analysis. These improvements can be addressed from
an individual perspective, by focusing on each stage, or from a more complex and global
perspective, involving the coordination of these stages to create data workflows.
The first stage to improve is Data Preparation, by supporting the preparation of data with
complex structures (i.e., data with various levels of nested structures, such as arrays).
Shortcomings have been found in the literature and in current technologies for transforming
complex data in a simple way. Therefore, this thesis aims to improve the Data Preparation stage
through Domain-Specific Languages (DSLs). Specifically, two DSLs are proposed for different use
cases: one is a general-purpose data transformation language, while the other is aimed
at extracting event logs in a standard format for process mining algorithms.
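As a hedged illustration of the kind of transformation such a DSL automates (this sketch is not one of the proposed DSLs; the record layout and the key-naming scheme are invented), records with nested structures and arrays can be flattened into tabular keys:

```python
# Illustrative sketch: flatten records with nested dicts and arrays into a
# flat key/value layout, a typical Data Preparation step for complex data.

def flatten(record, prefix=""):
    """Recursively flatten nested dicts; explode lists into indexed keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, prefix=f"{name}[{i}]."))
                else:
                    flat[f"{name}[{i}]"] = item
        else:
            flat[name] = value
    return flat

order = {"id": 7, "lines": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}
print(flatten(order))
# {'id': 7, 'lines[0].sku': 'A', 'lines[0].qty': 2, 'lines[1].sku': 'B', 'lines[1].qty': 1}
```

A DSL would let the user state such a transformation declaratively instead of writing the recursion by hand.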
The second area for improvement is the assessment of Data Quality. Depending on the type of Data
Analysis algorithm, poor-quality data can seriously skew the results. A clear example is
optimisation algorithms: if the data are not sufficiently accurate and complete, the search
space can be severely affected. Therefore, this thesis formulates a methodology for modelling
Data Quality rules adjusted to the context of use, as well as a tool that facilitates the
automation of their assessment. This makes it possible to discard data that do not meet the
quality criteria defined by the organisation. In addition, the proposal includes a framework
that helps to select actions to improve the usability of the data.
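A minimal sketch of rule-based quality assessment, assuming rules expressed as named predicates (the rule names, thresholds, and record layout are invented; the thesis tool integrates with Apache Spark, while this sketch is plain Python):

```python
# Each Data Quality rule is a named predicate over a record; records that
# violate any rule are discarded before analysis, keeping the violated rules.
RULES = {
    "completeness": lambda r: r.get("price") is not None,
    "accuracy":     lambda r: r.get("price", 0) >= 0,
    "validity":     lambda r: r.get("currency") in {"EUR", "USD"},
}

def assess(records, rules):
    """Split records into usable and rejected; keep the violated rule names."""
    usable, rejected = [], []
    for r in records:
        failed = [name for name, check in rules.items() if not check(r)]
        (rejected if failed else usable).append((r, failed))
    return usable, rejected

data = [
    {"price": 10, "currency": "EUR"},
    {"price": -3, "currency": "EUR"},
    {"price": 5, "currency": "???"},
]
usable, rejected = assess(data, RULES)
print([f for _, f in rejected])  # [['accuracy'], ['validity']]
```

Keeping the names of the violated rules is what allows a downstream framework to suggest corrective actions per record.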
The third and last proposal involves the Data Analysis stage. Here, the thesis faces the
challenge of supporting the use of optimisation problems in Big Data pipelines. There is a lack
of methodological solutions for computing exhaustive optimisation problems (i.e., those that
guarantee finding an optimal solution by exploring the whole search space) in distributed
environments. The resolution of this type of problem in the Big Data context is computationally
complex and can be NP-complete. This is caused by two factors. On the one hand, the search space
can increase significantly with the amount of data to be processed by the optimisation
algorithms. This challenge is addressed through a technique for generating and grouping problems
with distributed data. On the other hand, processing optimisation problems with complex models
and large search spaces in distributed environments is not trivial; therefore, a proposal is
presented for a particular case of this type of scenario.
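As a hedged illustration of the decomposition idea (not the thesis technique; the knapsack instance and the prefix size are invented), an exhaustive optimisation problem can be split into independent subproblems by fixing a prefix of its decision variables, so that each subproblem can be solved on a separate worker:

```python
# Sketch: decompose an exhaustive 0/1 knapsack search into independent
# subproblems by fixing the first PREFIX decision variables.
from itertools import product

values  = [6, 10, 12, 7]
weights = [1, 2, 3, 2]
CAP = 5
PREFIX = 2  # number of decision variables fixed per subproblem

def solve_subproblem(prefix_bits):
    """Exhaustively explore the remaining variables for one fixed prefix."""
    best = -1
    rest = len(values) - len(prefix_bits)
    for tail in product((0, 1), repeat=rest):
        bits = prefix_bits + tail
        w = sum(b * wt for b, wt in zip(bits, weights))
        v = sum(b * va for b, va in zip(bits, values))
        if w <= CAP and v > best:
            best = v
    return best

# Each prefix is an independent subproblem; on a cluster, each would become
# one distributed task (e.g. one Spark task per prefix).
subproblems = list(product((0, 1), repeat=PREFIX))
print(max(solve_subproblem(p) for p in subproblems))
```

Because the subproblems share no state, the whole search space is still explored exhaustively while the work is spread across workers.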
As a result, this thesis develops methodologies that have been published in scientific journals
and conferences. The methodologies have been implemented in software tools that are integrated
with the Apache Spark data processing engine. The solutions have been validated through tests and use cases with real datasets.
Constraint Programming-based Job Dispatching for Modern HPC Applications
A High-Performance Computing (HPC) job dispatcher is a critical piece of software that assigns finite computing resources to submitted jobs. This resource assignment over time is known as the on-line job dispatching problem in HPC systems. Because the problem is on-line, solutions must be computed in real time, and the time required cannot exceed a certain threshold without affecting normal system functioning. In addition, a job dispatcher must deal with considerable uncertainty in submission times, the number of requested resources, and job durations. Heuristic-based techniques have been broadly used in HPC systems, achieving (sub-)optimal solutions in a short time. However, their scheduling and resource allocation components are separate, which yields decoupled decisions that may cause a performance loss. Optimisation-based techniques are less used for this problem, although they can significantly improve the performance of HPC systems at the expense of higher computation time.
Nowadays, HPC systems are used for modern applications, such as big data analytics and predictive model building, which in general employ many short jobs. However, this information is unknown at dispatching time, and job dispatchers need to process large numbers of such jobs quickly while ensuring high Quality-of-Service (QoS) levels. Constraint Programming (CP) has been shown to be an effective approach for tackling job dispatching problems. However, state-of-the-art CP-based job dispatchers are unable to meet the challenges of on-line dispatching, such as generating dispatching decisions in a brief period and integrating current and past information about the host system.
For these reasons, we propose CP-based dispatchers that are more suitable for HPC systems running modern applications: they generate on-line dispatching decisions in an appropriate time and are able to make effective use of job duration predictions to improve QoS levels, especially for workloads dominated by short jobs.
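As a toy sketch of the trade-off described above (not the proposed dispatchers; the job mix, core count, and the waiting-time objective are invented), an optimisation-based dispatcher can exhaustively enumerate queue orders and be compared against a simple FIFO heuristic:

```python
# Toy on-line dispatching instance: jobs request cores for a predicted
# duration; we compare FIFO order against the best order found exhaustively.
from itertools import permutations

CORES = 4
# (job id, requested cores, predicted duration); the workload is invented.
queue = [("a", 2, 10), ("b", 4, 3), ("c", 2, 10), ("d", 1, 1)]

def schedule(order):
    """List-schedule the jobs in the given order on CORES cores, assuming all
    jobs were submitted at time 0; return the total waiting time."""
    running = []            # (finish time, cores held) per running job
    t, used, wait = 0, 0, 0
    for _job, cores, dur in order:
        while used + cores > CORES:   # advance time until enough cores free up
            running.sort()
            finish, freed = running.pop(0)
            t = max(t, finish)
            used -= freed
        wait += t                     # this job starts at time t
        running.append((t + dur, cores))
        used += cores
    return wait

fifo = schedule(queue)                                # heuristic: submission order
best = min(schedule(p) for p in permutations(queue))  # exhaustive over orders
print(fifo, best)  # the best ordering waits far less than FIFO
```

The gap between the two numbers is exactly why optimisation-based dispatching pays off, and the factorial enumeration is exactly why it must be bounded in time in an on-line setting.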
Judaica Americana: A Bibliography of Publications to 1900
Judaica Americana: A Bibliography of Publications to 1900, with an estimated total of 9,500 entries, chronicles the decades prior to the twentieth century, a formative era for Jewish institutional development at a time when the Jewish community grew from 1,350 persons in 1790 to 1,050,000 in 1900. Taken as a whole, the bibliography provides extensive documentation of American Jewish communal activity. Equally important for the study of Jewish-Christian relations, hundreds of titles, many of them prophetic and proto-Zionist in nature, are included as relevant primary sources for assessing Christian attitudes on the development, history, and testimony of the Jewish religion and the Jewish nation from early times to the close of the nineteenth century. Adventism and millenarian speculation, so pervasive in nineteenth-century America, are well documented in these pages; the same is true of conversionist activity. Creative writing (novels, short stories, dramas, poetry) with Jewish themes or characters forms yet another subject emphasis, one that will prove exceedingly valuable for any extended study of stereotypes and the negative portrayal of the Jew in literature. For the purposes of this bibliography, annual gift books are approached as monographs.
This edition is divided into three sections. The first section contains the chronological file of 1890 to 1900. A second section, “Union List of Nineteenth-Century Jewish Serials Published in the United States,” lists all known Jewish newspapers, serials, yearbooks, and annual reports in the United States with an inception date prior to 1901, regardless of language, and even if issues of these serials no longer exist, or if the serials were merely projected for publication by their would-be sponsors. Included in this section are relevant periodicals with a conversionist or antisemitic focus.
A third section, a supplement, adds to the first edition of Judaica Americana, expanding the project with additional materials identified by Singerman in the years since the first publication. Judaica Americana has been enlarged by more than 3,000 entries drawn from a broad range of genres, including creative writing, the Wandering Jew theme, foreign literature in translation, stereotype-laden dime novels, foreign travel accounts, city and county histories, American memoirs and biographies, phrenology and racial “science,” urban sociology, children’s literature and school readers, humor books, music scores and songsters, missionary accounts, and prophetic millenarian texts, of which there is no shortage. Additional Jewish-interest material embedded in sermon collections, federal documents, almanacs, and annual gift books has also been identified; other researchers are invited to continue probing these potentially rich target areas. Areas for further investigation include broadsides; Jewish social clubs, fraternal orders, and benevolent societies; playbills and event programs; penny songs and song collections; state, county, and city documents; and Masonic lodge histories and biographies.
Automata-theoretic protocol programming
Parallel programming has become essential for writing scalable programs on general hardware. Conceptually, every parallel program consists of workers, which implement primary units of sequential computation, and protocols, which implement the rules of interaction that workers must abide by. As programmers have been writing sequential code for decades, programming workers poses no new fundamental challenges. What is new---and notoriously difficult---is programming of protocols. In this thesis, I study an approach to protocol programming where programmers implement their workers in an existing general-purpose language (GPL), while they implement their protocols in a complementary domain-specific language (DSL). DSLs for protocols enable programmers to express interaction among workers at a higher level of abstraction than the level of abstraction supported by today's GPLs, thereby addressing a number of protocol programming issues with today's GPLs. In particular, in this thesis, I develop a DSL for protocols based on a theory of formal automata and their languages. The specific automata that I consider, called constraint automata, have transition labels with a richer structure than alphabet symbols in classical automata theory. Exactly these richer transition labels make constraint automata suitable for modeling protocols.
Minimization of Perturbations and Parallelization for Planning and Scheduling
In this thesis, we study two approaches that reduce the processing time needed to solve planning and scheduling problems in a constraint programming context. We experimented with several thousand processors to solve the planning and scheduling problem of lumber planing operations. These problems are of great importance for businesses, as they allow them to better manage their production and to save costs related to their operations. The first approach consists in parallelizing the problem-solving algorithm. We propose a new parallelization technique (named PDS) for search strategies that achieves four goals: preserving the order in which the nodes of the search tree are visited, as defined by the sequential algorithm; balancing the workload between the processors; robustness against hardware failures; and the absence of communication between processors during processing. We apply this technique to parallelize the Limited Discrepancy-based Search (LDS) strategy, obtaining Parallel Limited Discrepancy-based Search (PLDS). We then show that this technique can be generalized by applying it to two other search strategies: Depth-Bounded Discrepancy Search (DDS) and Depth-First Search (DFS), obtaining, respectively, Parallel Discrepancy-based Search (PDDS) and Parallel Depth-First Search (PDFS). The parallel algorithms obtained in this way share the workload intrinsically: the difference in workload between the processors is bounded when a branch of the search tree is pruned. Using datasets from industrial partners, we were able to improve the best known solutions.
With the second approach, we developed a method to minimize the changes made to an existing production plan when new information, such as additional orders, is taken into account. Completely replanning the production activities can lead to a very different production plan, which entails additional costs and wasted time for businesses. We study the perturbations caused by replanning using three distance metrics between two production plans: the Hamming distance, the edit distance, and the Damerau-Levenshtein distance. We propose three mathematical models that minimize these perturbations by including each of these metrics as the objective function at replanning time. We apply this approach to the planning and scheduling problem of lumber finishing operations and show that it is faster than replanning with the original model.
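As an illustrative sketch (not the thesis models), the three plan-distance metrics can be computed over production plans encoded as sequences of job identifiers; the plan encoding and job names below are invented:

```python
# Two of the perturbation metrics between production plans: Hamming distance
# and Damerau-Levenshtein distance over sequences of job identifiers.

def hamming(a, b):
    """Number of positions that differ (sequences of equal length)."""
    return sum(x != y for x, y in zip(a, b))

def damerau_levenshtein(a, b):
    """Edit distance with insertions, deletions, substitutions, and
    transpositions of adjacent elements."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

plan_old = ["J1", "J2", "J3", "J4"]
plan_new = ["J1", "J3", "J2", "J4"]
print(hamming(plan_old, plan_new))              # 2
print(damerau_levenshtein(plan_old, plan_new))  # 1
```

Dropping the transposition case from the same table yields the plain edit distance, which scores the adjacent swap as two substitutions; this is why the choice of metric changes what counts as a small perturbation.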
Automata-Theoretic Protocol Programming (With Proofs)
In the early 2000s, hardware manufacturers shifted their attention from manufacturing faster---yet purely sequential---unicore processors to manufacturing slower---yet increasingly parallel---multicore processors. In the wake of this shift, parallel programming became essential for writing scalable programs on general hardware. Conceptually, every parallel program consists of workers, which implement primary units of sequential computation, and protocols, which implement the rules of interaction that workers must abide by. As programmers have been writing sequential code for decades, programming workers poses no new fundamental challenges. What is new---and notoriously difficult---is programming of protocols.
In this thesis, I study an approach to protocol programming where programmers implement their workers in an existing general-purpose language (GPL), while they implement their protocols in a complementary domain-specific language (DSL). DSLs for protocols enable programmers to express interaction among workers at a higher level of abstraction than the level of abstraction supported by today's GPLs, thereby addressing a number of protocol programming issues with today's GPLs. In particular, in this thesis, I develop a DSL for protocols based on a theory of formal automata and their languages. The specific automata that I consider, called constraint automata, have transition labels with a richer structure than alphabet symbols in classical automata theory. Exactly these richer transition labels make constraint automata suitable for modeling protocols.
Constraint automata constitute the (denotational) semantics of the DSL presented in this thesis.
Automata-theoretic protocol programming : parallel computation, threads and their interaction, optimized compilation, [at a] high level of abstraction
In the early 2000s, hardware manufacturers shifted their attention from manufacturing faster—yet purely sequential—unicore processors to manufacturing slower—yet increasingly parallel—multicore processors. In the wake of this shift, parallel programming became essential for writing scalable programs on general hardware. Conceptually, every parallel program consists of workers, which implement primary units of sequential computation, and protocols, which implement the rules of interaction that workers must abide by. As programmers have been writing sequential code for decades, programming workers poses no new fundamental challenges. What is new—and notoriously difficult—is programming of protocols. In this thesis, I study an approach to protocol programming where programmers implement their workers in an existing general-purpose language (GPL), while they implement their protocols in a complementary domain-specific language (DSL). DSLs for protocols enable programmers to express interaction among workers at a higher level of abstraction than the level of abstraction supported by today’s GPLs, thereby addressing a number of protocol programming issues with today’s GPLs. In particular, in this thesis, I develop a DSL for protocols based on a theory of formal automata and their languages. The specific automata that I consider, called constraint automata, have transition labels with a richer structure than alphabet symbols in classical automata theory. Exactly these richer transition labels make constraint automata suitable for modeling protocols.
Constraint automata constitute the (denotational) semantics of the DSL presented in this thesis. On top of this semantics, I use two complementary syntaxes: an existing graphical syntax (based on the coordination language Reo) and a novel textual syntax. The main contribution of this thesis, then, consists of a compiler and four of its optimizations, all formalized and proven correct at the semantic level of constraint automata, using bisimulation. In addition to these theoretical contributions, I also present an implementation of the compiler and its optimizations, which supports Java as the complementary GPL, as plugins for Eclipse. Nothing in the theory developed in this thesis depends on Java, though; any language that supports some form of threading and mutual exclusion may serve as a target for compilation. To demonstrate the practical feasibility of the GPL+DSL approach to protocol programming, I study the performance of the implemented compiler and its optimizations through a number of experiments, including the Java version of the NAS Parallel Benchmarks. The experimental results in these benchmarks show that, with all four optimizations in place, compiler-generated protocol code can compete with hand-crafted protocol code.
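The idea of transition labels richer than alphabet symbols can be sketched as follows; the Python encoding, the port names, and the synchronous-channel example are illustrative assumptions, not the DSL or formalization of the thesis:

```python
# Sketch of a constraint automaton: each transition is labelled with a set of
# synchronising ports and a data constraint over the values observed on them.
# The automaton below models a Reo-style synchronous channel from "A" to "B".

class ConstraintAutomaton:
    def __init__(self, start):
        self.state = start
        self.transitions = []  # (source, ports, constraint, target)

    def add(self, src, ports, constraint, dst):
        self.transitions.append((src, frozenset(ports), constraint, dst))

    def fire(self, observation):
        """Try to fire one transition matching the observed data assignment
        (a dict port -> value); return whether a step was taken."""
        ports = frozenset(observation)
        for src, label, constraint, dst in self.transitions:
            if src == self.state and label == ports and constraint(observation):
                self.state = dst
                return True
        return False

sync = ConstraintAutomaton(start="q0")
sync.add("q0", {"A", "B"}, lambda d: d["A"] == d["B"], "q0")

print(sync.fire({"A": 1, "B": 1}))  # True: A and B synchronise on one datum
print(sync.fire({"A": 1, "B": 2}))  # False: constraint d(A) = d(B) violated
print(sync.fire({"A": 1}))          # False: A cannot fire without B
```

The port set on the label expresses which workers must interact atomically, and the data constraint expresses what may flow between them, which is exactly the structure plain alphabet symbols cannot capture.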
Detecting and Explaining Conflicts in Attributed Feature Models
Product configuration systems are often based on a variability model. The
development of a variability model is a time-consuming and error-prone process.
Given the ongoing development of products, the variability model has to be
adapted frequently. These changes often lead to mistakes, so that some products
can no longer be derived from the model, undesired products become derivable,
or the variability model contains contradictions. In this paper, we propose an
approach to discover and explain contradictions in attributed feature models
efficiently, in order to assist the developer with the correction of mistakes.
We use extended feature models with attributes and arithmetic constraints,
translate them into a constraint satisfaction problem, and explore it for
contradictions. When a contradiction is found, the QuickXplain algorithm is
used to search the constraints for a set of contradicting relations.
Comment: In Proceedings FMSPLE 2015, arXiv:1504.0301
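A minimal sketch of the conflict-extraction step, assuming a toy consistency check in place of a real constraint solver (the `consistent` function and the example constraints are invented): a simplified QuickXplain returns one minimal set of mutually contradicting constraints.

```python
# Simplified QuickXplain: find one minimal conflict among `constraints`,
# given a consistent `background`. Constraints here are toy (variable, value)
# assignments; a set is consistent iff no variable gets two different values.

def consistent(constraints):
    seen = {}
    for var, val in constraints:
        if seen.setdefault(var, val) != val:
            return False
    return True

def quickxplain(background, constraints):
    """Return a minimal conflict among `constraints`, or None if none exists."""
    if consistent(background + constraints):
        return None  # nothing to explain
    if len(constraints) == 1:
        return constraints
    mid = len(constraints) // 2
    left, right = constraints[:mid], constraints[mid:]
    if not consistent(background + left):     # conflict entirely in left half
        return quickxplain(background, left)
    if not consistent(background + right):    # conflict entirely in right half
        return quickxplain(background, right)
    # Conflict spans both halves: minimise each half against the other.
    d2 = quickxplain(background + left, right) or []
    d1 = quickxplain(background + d2, left) or []
    return d1 + d2

conflict = quickxplain([], [("x", 1), ("y", 1), ("x", 2), ("z", 1)])
print(conflict)  # [('x', 1), ('x', 2)]
```

In the paper's setting, `consistent` would invoke the constraint solver on the translated attributed feature model, and the returned set would be the contradicting relations shown to the developer.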
Interim Report of the Transportoptimierung Project Group
This interim report of the Transportoptimierung (transport optimisation) project group documents the development of the project from requirements analysis to design.
Within the project group, the program TROSS (TRansport-Optimierung für Soziale Serviceanbieter; transport optimisation for social service providers) is to be developed for managing and optimising social transport services, for subsequent deployment at the DRK (German Red Cross) in Stuttgart. The program is to record and manage all data relevant to the DRK's transport services. The planning, currently carried out manually, is to become computer-assisted in the future, giving the DRK's planners a better overview of the services and of the deployment of staff and vehicles. In addition, the optimisation of transport services is also to be performed automatically, in order to save laborious manual work or even to obtain better results.