Multidisciplinary perspectives on Artificial Intelligence and the law
This open access book presents an interdisciplinary, multi-authored, edited collection of chapters on Artificial Intelligence ("AI") and the Law. AI technology has come to play a central role in the modern data economy. Through a combination of increased computing power, the growing availability of data and the advancement of algorithms, AI has become an umbrella term for some of the most transformational technological breakthroughs of this age. The importance of AI stems from both the opportunities that it offers and the challenges that it entails. While AI applications hold the promise of economic growth and efficiency gains, they also create significant risks and uncertainty. The potential and perils of AI have thus come to dominate modern discussions of technology and ethics, and although AI was initially allowed to develop largely without guidelines or rules, few would deny that the law is set to play a fundamental role in shaping the future of AI. As the debate over AI is far from over, the need for rigorous analysis has never been greater. This book thus brings together contributors from different fields and backgrounds to explore how the law might provide answers to some of the most pressing questions raised by AI. An outcome of the Católica Research Centre for the Future of Law and its interdisciplinary working group on Law and Artificial Intelligence, it includes contributions by leading scholars in the fields of technology, ethics and the law.
Learning to Represent Patches
Patch representation is crucial for automating various software engineering tasks, such as determining patch accuracy or summarizing code changes. While recent research has employed deep learning for patch representation, focusing on token sequences or Abstract Syntax Trees (ASTs), these approaches often miss the change's semantic intent and the context of the modified lines. To bridge this gap, we introduce a novel method, Patcherizer. It delves into the intentions of context and structure, merging the surrounding code context with two innovative representations that capture the intention in code changes and the intention in AST structural modifications before and after the patch. This holistic representation aptly captures a patch's underlying intentions. Patcherizer employs graph convolutional neural networks for the structural intention graph representation and transformers for the intention sequence representation. We evaluated the versatility of Patcherizer's embeddings in three areas: (1) patch description generation, (2) patch accuracy prediction, and (3) patch intention identification. Our experiments demonstrate the representation's efficacy across all tasks, outperforming state-of-the-art methods. For example, in patch description generation, Patcherizer shows an average boost of 19.39% in BLEU, 8.71% in ROUGE-L, and 34.03% in METEOR scores.
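The two-branch design described above can be illustrated with a minimal sketch: one toy encoder embeds the token sequence of a change, another aggregates over an AST-like graph, and the two vectors are concatenated into a single patch embedding. The encoders, vocabulary, and feature sizes here are illustrative assumptions, not Patcherizer's actual architecture.

```python
# Toy sketch of a two-branch patch representation (sequence + structure).
# Everything here (vocabulary, features, pooling) is an assumption for
# illustration, not the paper's model.
from collections import Counter

VOCAB = ["if", "return", "x", "y", "+", "-", "==", "None"]

def sequence_embedding(tokens):
    """Toy 'intention sequence' encoder: normalized token counts."""
    counts = Counter(t for t in tokens if t in VOCAB)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in VOCAB]

def graph_embedding(nodes, edges):
    """Toy 'structural intention' encoder: one message-passing step
    (each node averages in its neighbors' features), then mean-pooling."""
    neighbors = {i: [] for i in range(len(nodes))}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = []
    for i, feat in enumerate(nodes):
        ns = neighbors[i]
        if ns:
            feat = [(f + sum(nodes[j][k] for j in ns) / len(ns)) / 2
                    for k, f in enumerate(feat)]
        updated.append(feat)
    dim = len(nodes[0])
    return [sum(n[k] for n in updated) / len(updated) for k in range(dim)]

def patch_embedding(tokens, nodes, edges):
    """Concatenate the sequence and structure views into one vector."""
    return sequence_embedding(tokens) + graph_embedding(nodes, edges)

emb = patch_embedding(
    tokens=["if", "x", "==", "None", "return", "y"],
    nodes=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # toy AST node features
    edges=[(0, 1), (1, 2)],
)
print(len(emb))  # 8 sequence dims + 2 graph dims = 10
```

A real implementation would replace both toy encoders with a transformer and a graph convolutional network, but the fusion-by-concatenation step is the same shape.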
LIPIcs, Volume 261, ICALP 2023, Complete Volume
Making Presentation Math Computable
This open access book addresses the issue of translating mathematical expressions from LaTeX to the syntax of Computer Algebra Systems (CAS). Over the past decades, especially in the domain of Science, Technology, Engineering, and Mathematics (STEM), LaTeX has become the de facto standard for typesetting mathematical formulae in publications. Since scientists are generally required to publish their work, LaTeX has become an integral part of today's publishing workflow. On the other hand, modern research increasingly relies on CAS to simplify, manipulate, compute, and visualize mathematics. However, existing LaTeX import functions in CAS are limited to simple arithmetic expressions and are therefore insufficient for most use cases. Consequently, the workflow of experimenting and publishing in the sciences often includes time-consuming and error-prone manual conversions between presentational LaTeX and computational CAS formats. To address the lack of a reliable and comprehensive translation tool between LaTeX and CAS, this thesis makes the following three contributions. First, it provides an approach to enrich LaTeX expressions with sufficient semantic information for translation into CAS syntaxes. Second, it demonstrates LaCASt, the first context-aware LaTeX-to-CAS translation framework. Third, the thesis provides a novel approach to evaluate the performance of LaTeX-to-CAS translations on large-scale datasets with automatic verification of equations in digital mathematical libraries. This is an open access book.
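LaCASt itself is context-aware and far more sophisticated, but the core idea of mapping presentational LaTeX macros onto a CAS syntax can be sketched with a small rewrite system. The rule table below (targeting a Maple-like syntax) is an illustrative assumption, not LaCASt's actual rule set.

```python
# Minimal sketch of rule-based LaTeX -> CAS rewriting (NOT LaCASt itself):
# apply a table of regex rewrite rules until the expression stabilizes.
import re

RULES = [
    (r"\\frac\{([^{}]*)\}\{([^{}]*)\}", r"((\1)/(\2))"),  # fractions
    (r"\\sqrt\{([^{}]*)\}", r"sqrt(\1)"),                 # square roots
    (r"\\sin", "sin"),
    (r"\\cos", "cos"),
    (r"\\pi", "Pi"),
]

def latex_to_cas(expr: str) -> str:
    """Rewrite presentational LaTeX into a CAS-like syntax by applying
    the rule table repeatedly until no rule changes the expression."""
    prev = None
    while prev != expr:
        prev = expr
        for pattern, repl in RULES:
            expr = re.sub(pattern, repl, expr)
    return expr

print(latex_to_cas(r"\frac{\sin(x)}{\cos(x)}"))  # ((sin(x))/(cos(x)))
```

A purely syntactic table like this is exactly what breaks down without context: the same LaTeX macro can denote different CAS functions depending on the surrounding document, which is the gap the thesis's context-aware translation addresses.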
Blockchain security and applications
Cryptocurrencies, such as Bitcoin and Ethereum, have proven to be highly successful. In a cryptocurrency system, transactions and ownership data are stored digitally in a ledger that uses blockchain technology. This technology has the potential to revolutionize the future of financial transactions and decentralized applications. Blockchains have a layered architecture that enables their unique method of authenticating transactions. In this research, we examine three layers, each with its own distinct functionality: the network layer, the consensus layer, and the application layer. The network layer is responsible for exchanging data via a peer-to-peer (P2P) network. In this work, we present a practical yet secure network design. We also study the security and performance of the network and how they affect the overall security and performance of blockchain systems. The consensus layer is in charge of generating and ordering the blocks, as well as guaranteeing that all participants agree on them. We study the existing Proof-of-Stake (PoS) protocols, which follow a single-extension design framework, and present an impossibility result showing that such single-extension protocols cannot achieve both standard security properties (e.g., common prefix) and the best possible unpredictability if the honest players control less than 73% of the stake. To overcome this, we propose a new multi-extension design framework. The application layer consists of programs (e.g., smart contracts) that users can use to build decentralized applications. We construct a protocol on the application layer to enhance the security of federated learning.
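The "single-extension" growth pattern and the common-prefix property mentioned above can be made concrete with a toy chain model: each slot, a leader extends the current tip by exactly one block, honest nodes adopt the longest chain, and two honest chains should agree once each drops its last k blocks. This is a didactic sketch only, not the thesis's protocol or its multi-extension design.

```python
# Toy single-extension chain model (didactic sketch, not a real PoS protocol).
import hashlib

def block_hash(parent: str, slot: int, leader: str) -> str:
    return hashlib.sha256(f"{parent}|{slot}|{leader}".encode()).hexdigest()

def extend(chain, slot, leader):
    """Single extension: append exactly one block to the current tip."""
    parent = chain[-1]["hash"] if chain else "genesis"
    return chain + [{"slot": slot, "leader": leader,
                     "hash": block_hash(parent, slot, leader)}]

def fork_choice(chains):
    """Longest-chain rule: honest nodes adopt the longest chain they see."""
    return max(chains, key=len)

def is_prefix(p, q):
    return len(p) <= len(q) and all(a["hash"] == b["hash"] for a, b in zip(p, q))

def common_prefix(c1, c2, k):
    """Common-prefix property: after a party drops its last k blocks,
    its chain is a prefix of the other party's chain."""
    return is_prefix(c1[:len(c1) - k], c2) or is_prefix(c2[:len(c2) - k], c1)

chain_a = extend(extend([], 1, "alice"), 2, "bob")
chain_b = extend([], 1, "carol")
print(len(fork_choice([chain_a, chain_b])))  # 2: the longer chain wins
```

The impossibility result concerns protocols built on exactly this one-block-per-step extension shape; the proposed multi-extension framework relaxes that constraint.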
Differential evolution of non-coding DNA across eukaryotes and its close relationship with complex multicellularity on Earth
Here, I elaborate on the hypothesis that complex multicellularity (CM, sensu Knoll) is a major evolutionary transition (sensu Szathmary), which has convergently evolved only a few times in Eukarya: within red and brown algae, plants, animals, and fungi. Paradoxically, CM seems to correlate with the expansion of non-coding DNA (ncDNA) in the genome rather than with genome size or the total number of genes. Thus, I investigated the correlation between genome and organismal complexities across 461 eukaryotes under a phylogenetically controlled framework. To that end, I introduce the first formal definitions and criteria to distinguish "unicellularity", "simple" (SM) and "complex" multicellularity. Rather than using the limited available estimations of unique cell types, the 461 species were classified according to our criteria by reviewing their life cycle and body plan development from the literature. Then, I investigated the evolutionary association between genome size and 35 genome-wide features (introns and exons from protein-coding genes, repeats and intergenic regions) describing the coding and ncDNA complexities of the 461 genomes. To that end, I developed "GenomeContent", a program that systematically retrieves massive multidimensional datasets from gene annotations and calculates over 100 genome-wide statistics. R scripts coupled to parallel computing were created to calculate >260,000 phylogenetically controlled pairwise correlations. As previously reported, both repetitive and non-repetitive DNA are found to scale strongly and positively with genome size across most eukaryotic lineages. In contrast to previous studies, I demonstrate that changes in the length and repeat composition of introns are only weakly or moderately associated with changes in genome size at the global phylogenetic scale, while changes in intron abundance (within and across genes) are either not or only very weakly associated with changes in genome size.
Our evolutionary correlations are robust to different phylogenetic regression methods, uncertainties in the tree of eukaryotes, variations in genome size estimates, and randomly reduced datasets. Then, I investigated the correlation between the 35 genome-wide features and the cellular complexity of the 461 eukaryotes with phylogenetic Principal Component Analyses. Our results endorse a genetic distinction between SM and CM in Archaeplastida and Metazoa, but not so clearly in Fungi. Remarkably, complex multicellular organisms and their closest ancestral relatives are characterized by high intron richness, regardless of genome size. Finally, I argue why and how a vast expansion of non-coding RNA (ncRNA) regulators, rather than of novel protein regulators, can promote the emergence of CM in Eukarya. As a proof of concept, I co-developed a novel "ceRNA-motif pipeline" for the prediction of "competing endogenous" ncRNAs (ceRNAs) that regulate microRNAs in plants. We identified three candidate ceRNA motifs, MIM166, MIM171 and MIM159/319, which were found to be conserved across land plants and to be potentially involved in diverse developmental processes and stress responses. Collectively, the findings of this dissertation support our hypothesis that CM on Earth is a major evolutionary transition promoted by the expansion of two major ncDNA classes, introns and regulatory ncRNAs, which might have boosted the irreversible commitment of cell types in certain lineages by canalizing the timing and kinetics of the eukaryotic transcriptome.
Cover page
Abstract
Acknowledgements
Index
1. The structure of this thesis
1.1. Structure of this PhD dissertation
1.2. Publications of this PhD dissertation
1.3. Computational infrastructure and resources
1.4. Disclosure of financial support and information use
1.5. Acknowledgements
1.6. Author contributions and use of impersonal and personal pronouns
2. Biological background
2.1. The complexity of the eukaryotic genome
2.2. The problem of counting and defining "genes" in eukaryotes
2.3. The "function" concept for genes and "dark matter"
2.4. Increases of organismal complexity on Earth through multicellularity
2.5. Multicellularity is a "fitness transition" in individuality
2.6. The complexity of cell differentiation in multicellularity
3. Technical background
3.1. The Phylogenetic Comparative Method (PCM)
3.2. RNA secondary structure prediction
3.3. Some standards for genome and gene annotation
4. What is in a eukaryotic genome? GenomeContent provides a good answer
4.1. Background
4.2. Motivation: an interoperable tool for data retrieval of gene annotations
4.3. Methods
4.4. Results
4.5. Discussion
5. The evolutionary correlation between genome size and ncDNA
5.1. Background
5.2. Motivation: estimating the relationship between genome size and ncDNA
5.3. Methods
5.4. Results
5.5. Discussion
6. The relationship between non-coding DNA and Complex Multicellularity
6.1. Background
6.2. Motivation: How to define and measure complex multicellularity across eukaryotes?
6.3. Methods
6.4. Results
6.5. Discussion
7. The ceRNA motif pipeline: regulation of microRNAs by target mimics
7.1. Background
7.2. A revisited protocol for the computational analysis of Target Mimics
7.3. Motivation: a novel pipeline for ceRNA motif discovery
7.4. Methods
7.5. Results
7.6. Discussion
8. Conclusions and outlook
8.1. Contributions and lessons for the bioinformatics of large-scale comparative analyses
8.2. Intron features are evolutionarily decoupled among themselves and from genome size throughout Eukarya
8.3. "Complex multicellularity" is a major evolutionary transition
8.4. Role of RNA throughout the evolution of life and complex multicellularity on Earth
9. Supplementary Data
Bibliography
Curriculum Scientiae
Selbständigkeitserklärung (declaration of authorship)
Z-Numbers-Based Approach to Hotel Service Quality Assessment
In this study, we analyze the possibility of using Z-numbers for measuring service quality and for decision-making about quality improvement in the hotel industry. Techniques used for these purposes are based on consumer evaluations: expectations and perceptions. As a rule, these evaluations are expressed as crisp numbers (Likert scale) or fuzzy estimates. However, descriptions of respondent opinions based on the crisp or fuzzy number formalism are not always adequate, as the existing methods do not take into account the degree of confidence of respondents in their assessments. A fuzzy approach better describes the uncertainties associated with human perceptions and expectations, and linguistic values are more acceptable than crisp numbers. To capture the subjective nature of both the service quality estimates and the degree of confidence in them, two-component Z-numbers Z = (A, B) were used, as Z-numbers express the opinion of consumers more adequately. The proposed, computationally efficient approach (Z-SERVQUAL, Z-IPA) allows one to determine the quality of services and to identify the factors that require improvement and the areas for further development. The suggested method was applied to evaluate service quality in small and medium-sized hotels in Turkey and Azerbaijan, as illustrated by an example.
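The two-component idea can be sketched in a few lines: each rating is a Z-number Z = (A, B), where A is a fuzzy restriction on the score and B a fuzzy measure of the respondent's confidence. Below, both components are triangular fuzzy numbers and a simple centroid-times-confidence defuzzification turns a SERVQUAL-style perception-minus-expectation gap into a crisp value; this conversion is an illustrative assumption, not the paper's Z-SERVQUAL computation.

```python
# Toy Z-number sketch: Z = (A, B), A = fuzzy score, B = fuzzy confidence.
# The defuzzification below is a simplified assumption for illustration.

def centroid(tri):
    """Centroid of a triangular fuzzy number (a, b, c)."""
    a, b, c = tri
    return (a + b + c) / 3

def z_to_crisp(z):
    """Collapse Z = (A, B) to a crisp score: centroid(A) weighted by
    the confidence centroid(B), with B on a [0, 1] scale."""
    A, B = z
    return centroid(A) * centroid(B)

def servqual_gap(perception_z, expectation_z):
    """SERVQUAL-style gap: crisp perception minus crisp expectation."""
    return z_to_crisp(perception_z) - z_to_crisp(expectation_z)

# A "good" perception (around 4 on a 5-point scale) held very confidently,
# versus a "high" expectation held with only moderate confidence.
perception = ((3.0, 4.0, 5.0), (0.8, 0.9, 1.0))
expectation = ((4.0, 4.5, 5.0), (0.5, 0.6, 0.7))
print(round(servqual_gap(perception, expectation), 3))  # 0.9
```

Note how the confidence component changes the conclusion: with crisp scores alone (4.0 vs. 4.5) the gap would be negative, but weighting by respondent confidence flips its sign.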
From weakly supervised learning to online cataloging (De l'apprentissage faiblement supervisé au catalogage en ligne)
Applied mathematics and machine computations have raised a lot of hope since the recent successes of supervised learning. Many practitioners in industry have been trying to switch from their old paradigms to machine learning. Interestingly, these data scientists spend more time scraping, annotating and cleaning data than fine-tuning models. This thesis is motivated by the following question: can we derive a more generic framework than that of supervised learning in order to learn from cluttered data? This question is approached through the lens of weakly supervised learning, assuming that the bottleneck of data collection lies in annotation. We model weak supervision as giving, rather than a unique target, a set of target candidates. We argue that one should look for an "optimistic" function that matches most of the observations. This allows us to derive a principle to disambiguate partial labels. We also discuss the advantage of incorporating unsupervised learning techniques into our framework, in particular manifold regularization approached through diffusion techniques, for which we derive a new algorithm that scales better with input dimension than the baseline method. Finally, we switch from passive to active weakly supervised learning, introducing the "active labeling" framework, in which a practitioner can query weak information about chosen data. Among other results, we leverage the fact that one does not need full information to access stochastic gradients and perform stochastic gradient descent.
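The "optimistic" principle for partial labels described above can be sketched as an alternating scheme: among each example's candidate label set, commit to the label the current model finds most plausible, then refit, and iterate. The nearest-centroid classifier and the alternating loop here are illustrative assumptions, not the thesis's algorithm.

```python
# Toy partial-label disambiguation via an "optimistic" alternating scheme
# (illustrative sketch; the thesis's actual principle is more general).

def fit_centroids(points, labels, classes):
    """Fit one centroid per class from the currently committed labels."""
    cents = {}
    for c in classes:
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            cents[c] = [sum(x) / len(members) for x in zip(*members)]
    return cents

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def disambiguate(points, candidate_sets, iters=10):
    """Alternate: (1) fit the model to committed labels, (2) re-commit each
    example to its most plausible candidate under the current model."""
    classes = sorted(set().union(*candidate_sets))
    labels = [sorted(s)[0] for s in candidate_sets]  # arbitrary start
    for _ in range(iters):
        cents = fit_centroids(points, labels, classes)
        labels = [min((c for c in cands if c in cents),
                      key=lambda c: dist2(p, cents[c]))
                  for p, cands in zip(points, candidate_sets)]
    return labels

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
candidates = [{"a"}, {"a", "b"}, {"b"}, {"a", "b"}]
print(disambiguate(points, candidates))  # ['a', 'a', 'b', 'b']
```

The ambiguous examples (candidate sets of size two) end up with the label that makes the overall labeling most consistent with the observations, which is the optimistic choice.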
Leveraging Formulae and Text for Improved Math Retrieval
Large collections containing millions of math formulas are available online, but retrieving math expressions from these collections is challenging. Users can use a formula, formula plus text, or a math question to express their math information needs, and the structural complexity of formulas requires specialized processing. Despite the existence of math search systems and online community question-answering websites for math, little is known about mathematical information needs. This research first explores the characteristics of math searches using a general search engine; the findings show how math searches differ from general searches. Then, test collections for math-aware search are introduced. The ARQMath test collections have two main tasks: (1) finding answers to math questions and (2) contextual formula search. Each test collection (ARQMath-1 to -3) uses the same document collection, Math Stack Exchange posts from 2010 to 2018, with different topics for each task. Compared to previous test collections, ARQMath has a much larger number of diverse topics and an improved evaluation protocol. Another key contribution of this research is leveraging text and math information together for improved math information retrieval. Three formula search models that use only the formula, with no context, are introduced. The first model is an n-gram embedding model using both symbol layout tree and operator tree representations. The second model uses tree-edit distance to re-rank the results of the first model. Finally, a learning-to-rank model that leverages full-tree, sub-tree, and vector similarity scores is introduced. To use context, Math Abstract Meaning Representation (MathAMR) is introduced, which generalizes AMR trees to include math formula operations and arguments. This MathAMR is then used for contextualized formula search with a fine-tuned Sentence-BERT model.
The experiments show that tree-edit distance ranking achieves the current state-of-the-art results on the contextual formula search task, and that the MathAMR model can be beneficial for re-ranking. This research also addresses the answer retrieval task, introducing a two-step retrieval model in which similar questions are first found and then answers previously given to those similar questions are ranked. The proposed model fine-tunes two Sentence-BERT models, one for finding similar questions and another for ranking the answers. For the Sentence-BERT models, raw text as well as MathAMR is used.
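The two-step answer-retrieval idea can be sketched compactly: first rank stored questions against the query, then rank the answers attached to the top questions. The paper fine-tunes two Sentence-BERT models for these steps; here a bag-of-words cosine similarity stands in for both encoders, purely as an illustrative assumption.

```python
# Toy two-step answer retrieval (bag-of-words stand-in for Sentence-BERT).
from collections import Counter
from math import sqrt

def embed(text):
    """Stand-in encoder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_answers(query, qa_pairs, top_q=2):
    """Step 1: rank stored questions against the query.
       Step 2: rank the answers of the top questions against the query."""
    q_emb = embed(query)
    ranked_qs = sorted(qa_pairs, key=lambda qa: -cosine(q_emb, embed(qa[0])))
    answers = [a for _, ans in ranked_qs[:top_q] for a in ans]
    return sorted(answers, key=lambda a: -cosine(q_emb, embed(a)))

qa_pairs = [
    ("how to integrate x squared", ["use the power rule: x cubed over three"]),
    ("what is the derivative of sin x", ["the derivative of sin x is cos x"]),
    ("solve quadratic equation", ["use the quadratic formula"]),
]
print(retrieve_answers("derivative of sin x", qa_pairs)[0])
```

In the actual system, each `embed` call would be replaced by one of the two fine-tuned Sentence-BERT encoders, operating over raw text or MathAMR linearizations.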