191 research outputs found

    Rasa-ptbr-boilerplate : FLOSS project that enables brazilian portuguese chatbot development by non-experts

    Undergraduate final project (Trabalho de Conclusão de Curso), Universidade de Brasília, Faculdade UnB Gama (FGA), Software Engineering, 2019. Chatbots have the ability to talk to people by imitating human behavior. Currently, chatbots can perform simple tasks, such as answering questions about a particular context, and complex tasks, such as complete home management. However, developing a chatbot project requires a full team of specialists, which can consume time and resources. Chatbot projects commonly have similar software requirements and differ only in the domain of the specific solution, which could allow the reuse of open source software (OSS) related to chatbots. This work examines how chatbot projects can benefit from reuse at the project level (black-box reuse). It shows that the architecture and dialogues can be strategically combined with the CRISP-DM process model in new conversational contexts and purposes. The main contribution of this work is a chatbot project called Rasa-ptbr-boilerplate, with configurations and technology integrations aimed at reuse, so that non-specialists are able to develop a chatbot as a black box.

    Free and Open Source Software Licensing Requirements and Copyright Infringement Involving Artificial Intelligence Technologies

    There has been much discussion about the copyright protection of AI output and of the AI programming itself. This thesis seeks to demonstrate the importance of considering the copyright status of the data on which large language model AIs (LLMs) are trained. The Copilot class action lawsuit, filed in the US in 2022, serves as an illustration of this dilemma. By analysing the EU and US copyright frameworks, this thesis discusses the problem posed by training AI on publicly available software code protected by free and open-source software (FOSS) licenses. The cornerstone of this thesis is that FOSS is integral to software development and, as such, requires protection from potential exploitation by Big Tech. The thesis analyses the thirteen licences available in the data on which Copilot trains and concludes that eleven of them stipulate attribution as a condition of use. Yet an LLM's programming makes it impossible to track how its output arises, creating a paradox that undermines the principles around which the FOSS movement is centred. Despite this, the legal analysis showed that in the EU, the text and data mining (TDM) exception in articles 3 and 4 of the Directive on Copyright in the Digital Single Market will most likely apply to Copilot and many other LLMs. While it is subject to the author's opt-out right under article 4(3), it effectively means that in the EU, Copilot will not be considered in violation of copyright. In the US, the fair use exception is unlikely to apply as a more holistic evaluation of factors is permitted. The final determination, however, remains to be made by the courts. The normative debates surrounding this topic reflect the difficulty of balancing the competing policy interests. However, this thesis seeks to demonstrate that the importance of FOSS and the effort to promote the quality and accessibility of software should be borne in mind by policymakers.

    On Matching Binary to Source Code

    Reverse engineering of executable binary programs has diverse applications in computer security and forensics, and often involves identifying parts of code that are reused from third party software projects. Identification of code clones by comparing and fingerprinting low-level binaries has been explored in various pieces of work as an effective approach for accelerating the reverse engineering process. Binary clone detection across different environments and computing platforms bears significant challenges, and reasoning about sequences of low-level machine instructions is a tedious and time consuming process. Because of these reasons, the ability of matching reused functions to their source code is highly advantageous, despite being rarely explored to date. In this thesis, we systematically assess the feasibility of automatic binary to source matching to aid the reverse engineering process. We highlight the challenges, elaborate on the shortcomings of existing proposals, and design a new approach that is targeted at addressing the challenges while delivering more extensive and detailed results in a fully automated fashion. By evaluating our approach, we show that it is generally capable of uniquely matching over 50% of reused functions in a binary to their source code in a source database with over 500,000 functions, while narrowing down over 75% of reused functions to at most five candidates in most cases. Finally, we investigate and discuss the limitations and provide directions for future work.

    Patent data driven innovation logic

    Innovation research is conventionally conducted with creativity techniques such as TRIZ, Mind Mapping, Brainstorming, etc. (Dewulf, Baillie 1998). Patent research is typically used to investigate novelty or prior art, and for legal studies. This thesis is at the intersection of creativity techniques and patent data analysis. It describes how to utilise patent data for distilling Innovation Logic and conducting innovation research. Using the patent research tool PatentInspiration (© AULIVE Software NV), the four stages of the Innovation Logic approach have been subjected to text analysis in patent literature. The specific text patterns were identified and documented in several case studies, with one case study running through the whole thesis: the toothbrush. The opportunities and limitations of Patent Data Driven Innovation Research have been documented and discussed. This methodology has been demonstrated within a proposed structural approach to problem solving, technology marketing and innovation research. Furthermore, the potential of artificial idea generation and artificial creativity was examined and debated for the purpose of computer aided creativity. This thesis examines and confirms three claims. CLAIM 1: PROPERTIES AND FUNCTIONS CAN BE ADJECTIVES AND VERBS IN PATENT LITERATURE. CLAIM 2: PATENT DATA ANALYSIS AUGMENTS THE FULL INNOVATION LOGIC PROCESS. CLAIM 3: ARTIFICIAL INNOVATION METHODS CAN BE FUELED BY PATENT DATA. Patent data can be text mined, acting as a global brain consisting of over 100 million invention documents. It is possible to use this existing data to reverse engineer thinking methodologies, allowing scientists and engineers to solve new problems, invent new products or processes, or find new markets for existing technologies.
Patent Data Driven Innovation Logic will demonstrate a systematic innovation approach that combines the force of contemporary data mining methods on patent literature with a structured innovation research methodology.

    Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data

    Operational data from software development, social networks and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person and user profiles that vary in different systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds. To achieve a realistic representation of the networks that represent these communities, accurate identities are essential. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, which include more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they differ from, and occur at much higher rates than, those in other domains. Existing techniques relying on string comparisons can only disambiguate Synonyms, but not Homonyms, which are common in software activity traces. Therefore, we introduce measures of behavioral fingerprinting to improve the accuracy of Synonym resolution and to disambiguate Homonyms. Fingerprints are constructed from the traces of developers' activities, such as the style of writing in commit messages, the patterns in the files modified and the projects participated in by developers, and the patterns related to the timing of the developers' activity. Furthermore, to address the lack of training data necessary for the supervised learning approaches that are used in disambiguation, we design a specific active learning procedure that minimizes the manual effort necessary to create training data in the domain of developer identity matching. We extensively evaluate the proposed approach, using over 16,000 OpenStack developers in 1200 projects, against commercial and the most recent research approaches, and further against recent research on a much larger sample of over 2,000,000 IDs.
Results demonstrate that our method is significantly better than both the recent research and commercial methods. We also conduct experiments to demonstrate that such erroneous data have a significant impact on developer networks. We hope that the proposed approach will expedite research progress in the domain of software engineering, especially in applications for which graphs of social networks are critical.