Rasa-ptbr-boilerplate: FLOSS project that enables Brazilian Portuguese chatbot development by non-experts
Undergraduate final-year project (Trabalho de Conclusão de Curso), Universidade de Brasília, Faculdade UnB Gama (FGA), Software Engineering, 2019.

Chatbots have the ability to talk to people by imitating human behavior. Currently, chatbots can perform simple tasks, such as answering questions about a particular context, as well as complex tasks, such as complete home management. However, developing a chatbot project requires a full team of many experts, which consumes time and resources. Chatbot projects commonly share similar software requirements and differ only in the domain of the specific solution, which makes them good candidates for reusing chatbot-related open-source software (OSS).

This work examines how chatbot projects can benefit from reuse at the project level (black-box reuse). It shows that the architecture and dialogues can be strategically combined, with the help of the CRISP-DM process model, in new contexts and for new conversational purposes. The main contribution of this work is a chatbot project called Rasa-ptbr-boilerplate, with configurations and technology integrations aimed at reuse, so that non-specialists can develop a chatbot as a black box.
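The black-box reuse described above (a fixed engine plus swappable domain data) can be illustrated with a toy sketch. This is not the actual boilerplate, which delegates NLU and dialogue management to the Rasa framework; the intents, examples and responses below are invented.

```python
from typing import Optional

# Domain data: the only part a non-expert would edit when reusing the
# project as a black box. These intents and responses are invented.
DOMAIN = {
    "greet": {"examples": ["oi", "olá", "bom dia"],
              "response": "Olá! Como posso ajudar?"},
    "goodbye": {"examples": ["tchau", "até logo"],
                "response": "Até logo!"},
}

def classify(message: str) -> Optional[str]:
    """Naive intent classifier: pick the intent whose examples share
    the most words with the message (a stand-in for Rasa's NLU)."""
    words = set(message.lower().split())
    best, best_overlap = None, 0
    for intent, data in DOMAIN.items():
        overlap = sum(1 for ex in data["examples"]
                      if words & set(ex.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = intent, overlap
    return best

def reply(message: str) -> str:
    intent = classify(message)
    return DOMAIN[intent]["response"] if intent else "Desculpe, não entendi."
```

Reusing the project for a new conversational domain then amounts to replacing DOMAIN, while the engine (classify/reply) stays untouched.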
Free and Open Source Software Licensing Requirements and Copyright Infringement Involving Artificial Intelligence Technologies
There has been much discussion about the copyright protection of AI output and of the AI programming itself. This thesis seeks to demonstrate the importance of considering the copyright status of the data on which large language model (LLM) AIs are trained. The Copilot class action lawsuit, filed in the US in 2022, serves as a good illustration of this dilemma, which this thesis leverages.

By analysing the EU and US copyright frameworks, this thesis discusses the problem posed by training AI machine learning models on publicly available software code protected by free and open-source software (FOSS) licences. The cornerstone of the thesis is that FOSS is integral to software development and, as such, requires protection from potential exploitation by Big Tech. The thesis analyses the thirteen licences present in the data on which Copilot trains and concludes that eleven of them stipulate attribution as a condition of use. Yet an LLM's programming makes it impossible to track how its output arises, creating a paradox that undermines the principles around which the FOSS movement is centred.

Despite this, the legal analysis showed that in the EU, the text and data mining (TDM) exception in articles 3 and 4 of the Directive on Copyright in the Digital Single Market will most likely apply to Copilot and many other LLMs. While the exception is subject to the author's opt-out right under article 4(3), it effectively means that in the EU, Copilot will not be considered in violation of copyright. In the US, the fair use exception is unlikely to apply, as a more holistic evaluation of factors is permitted there. The final determination, however, remains to be made by the courts.

The normative debates surrounding this topic reflect the difficulty of balancing competing policy interests. This thesis nonetheless seeks to demonstrate that the importance of FOSS, and the effort to promote the quality and accessibility of software, should be borne in mind by policymakers.
Beyond Similar Code: Leveraging Social Coding Websites
Programmers often write code similar to code already written somewhere else. Code search tools can help developers find similar solutions and identify possible improvements, and good search results rely on valid data collection. Social coding websites, such as the question-and-answer forum Stack Overflow (SO) and the project repository host GitHub, are popular destinations when programmers look for how to achieve certain programming tasks. Over the years, SO and GitHub have accumulated an enormous knowledge base of, and around, code. Since these software artifacts are publicly available, it is possible to leverage them in code search tools. This dissertation explores the opportunities of leveraging software artifacts from social coding websites to search for not just similar, but related, code. Programmers query SO and GitHub extensively to search for suitable code for reuse; however, not much is known about the usability or quality of the code available on each website. The dissertation first investigates under what circumstances the software artifacts found on social coding websites can be leveraged for purposes other than their immediate use by developers. It points out a number of problems that need to be addressed before those artifacts can be leveraged for code search and development tools: triviality, fragility, and duplication dominate these artifacts. Once these problems are addressed, however, a considerable amount of good-quality artifacts remains. Moreover, SO and GitHub are not merely two separate data resources; together, they belong to a larger software development process: the same users who rely on GitHub's facilities often seek support on SO for their problems, then return to GitHub to apply the knowledge acquired.

This dissertation further studies the crossover of software artifacts between SO and GitHub, and categorizes the adaptations from an SO code snippet to its GitHub counterparts. Existing search tools only recommend other code locations that are syntactically or semantically similar to the given code; they do not reason about other kinds of relevant code a developer should also pay attention to, e.g., auxiliary code needed to accomplish a complete task. With good-quality software artifacts and the crossover between the two systems available, the dissertation presents two approaches that leverage these artifacts to search for related code. Aroma indexes GitHub projects, takes a partial code snippet as input, searches the corpus for methods containing the partial snippet, and clusters and intersects the search results to produce recommendations. Aroma is evaluated on randomly selected queries created from the GitHub corpus, as well as on queries derived from SO code snippets; it recommends related code for error checking and handling, object configuration, etc. Furthermore, a user study is conducted in which industrial developers are asked to complete programming tasks using Aroma and provide feedback. The results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently. CodeAid reuses the crossover between SO and GitHub and recommends related code outside of a method body: for each SO snippet used as a query, CodeAid retrieves the code fragments co-occurring with its GitHub counterparts and clusters them to recommend common ones. 74% of the common co-occurring code fragments represent related functionality that should be included in code search results. Three major types of relevancy are identified: complementary, supplementary, and alternative methods.
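The first stage attributed to Aroma above (searching a corpus for methods that contain a partial snippet) can be sketched roughly as follows. The real Aroma operates on simplified parse trees and feature vectors, not raw tokens; the corpus, query, and overlap threshold below are made up for illustration.

```python
# Rough, token-based sketch of snippet containment search in the spirit
# of Aroma's first stage. The real system uses parse-tree features;
# this toy corpus and the 0.8 threshold are invented.

def tokens(code: str) -> set:
    """Crude tokenizer: split on whitespace after stripping parens."""
    return set(code.replace("(", " ").replace(")", " ").split())

# Toy corpus: method name -> method body.
CORPUS = {
    "open_and_read": "f = open(path) ; data = f.read() ; f.close()",
    "open_and_write": "f = open(path) ; f.write(data) ; f.close()",
    "parse_json": "data = json.loads(text)",
}

def search(partial_snippet: str, min_overlap: float = 0.8) -> list:
    """Return names of corpus methods containing most query tokens."""
    query = tokens(partial_snippet)
    hits = []
    for name, body in CORPUS.items():
        overlap = len(query & tokens(body)) / len(query)
        if overlap >= min_overlap:
            hits.append(name)
    return hits
```

Aroma would then cluster the hits and intersect each cluster with the query to extract the common extension worth recommending; that step is omitted here.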
On Matching Binary to Source Code
Reverse engineering of executable binary programs has diverse applications in computer security and forensics, and often involves identifying parts of code that are reused from third-party software projects. Identifying code clones by comparing and fingerprinting low-level binaries has been explored in various pieces of work as an effective approach for accelerating the reverse engineering process.

Binary clone detection across different environments and computing platforms bears significant challenges, and reasoning about sequences of low-level machine instructions is a tedious and time-consuming process. For these reasons, the ability to match reused functions to their source code is highly advantageous, despite being rarely explored to date.

In this thesis, we systematically assess the feasibility of automatic binary-to-source matching to aid the reverse engineering process. We highlight the challenges, elaborate on the shortcomings of existing proposals, and design a new approach that addresses the challenges while delivering more extensive and detailed results in a fully automated fashion. Evaluating our approach, we show that it can uniquely match over 50% of reused functions in a binary to their source code in a source database of over 500,000 functions, while narrowing over 75% of reused functions down to at most five candidates in most cases. Finally, we investigate and discuss the limitations and provide directions for future work.
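One common family of binary-to-source matching techniques scores candidate source functions by features that survive compilation, such as referenced string literals. The abstract does not specify the thesis's actual feature set, so the following is only a generic sketch; the function names, strings, and ranking scheme are invented.

```python
# Generic sketch of binary-to-source function matching using features
# that typically survive compilation (here: string literals). This is
# not the thesis's method; the source database below is invented.

def score(binary_features: set, source_features: set) -> float:
    """Jaccard similarity between two feature sets."""
    union = binary_features | source_features
    if not union:
        return 0.0
    return len(binary_features & source_features) / len(union)

# Source database: function name -> string literals it references.
SOURCE_DB = {
    "log_error": {"error: %s", "errno=%d"},
    "load_cfg": {"config.ini", "missing key: %s"},
    "usage": {"usage: %s [options]"},
}

def match(binary_strings: set, top_k: int = 5) -> list:
    """Rank source functions by feature overlap; return top candidates.
    top_k=5 mirrors the 'at most five candidates' result above."""
    ranked = sorted(SOURCE_DB,
                    key=lambda name: score(binary_strings, SOURCE_DB[name]),
                    reverse=True)
    return ranked[:top_k]
```

A real system would combine several such feature channels (constants, call-graph shape, instruction-derived signatures) before ranking, since string literals alone are absent from many functions.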
Patent data driven innovation logic
Innovation research is conventionally conducted with creativity techniques such as TRIZ, Mind Mapping, Brainstorming, etc. (Dewulf, Baillie 1998). Patent research is typically used for novelty or prior-art searches and for legal studies.

This thesis sits at the intersection of creativity techniques and patent data analysis. It describes how to utilise patent data for distilling Innovation Logic and conducting innovation research.
Using the patent research tool PatentInspiration (© AULIVE Software NV), the 4 different stages of the Innovation Logic approach have been subjected to text analysis in patent literature. The specific text patterns were identified and documented on several case studies, with one case study across the whole thesis: the toothbrush. The opportunities and limitations of Patent Data Driven Innovation Research have been documented and discussed.
This methodology has been demonstrated within a proposed structural approach to problem solving, technology marketing and innovation research. Furthermore, the potential of artificial idea generation and artificial creativity was examined and debated for the purpose of computer aided creativity.
This thesis examines and confirms three claims:
CLAIM 1: PROPERTIES AND FUNCTIONS CAN BE ADJECTIVES AND VERBS IN PATENT LITERATURE
CLAIM 2: PATENT DATA ANALYSIS AUGMENTS THE FULL INNOVATION LOGIC PROCESS
CLAIM 3: ARTIFICIAL INNOVATION METHODS CAN BE FUELED BY PATENT DATA
Patent data can be text mined, acting as a global brain of over 100 million invention documents. This existing data can be used to reverse engineer thinking methodologies, allowing scientists and engineers to solve new problems, invent new products or processes, or find new markets for existing technologies. Patent Data Driven Innovation Logic demonstrates a systematic innovation approach that combines the force of contemporary data mining methods applied to patent literature with a structured innovation research methodology.
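Claim 1, that properties and functions surface as adjectives and verbs in patent text, implies a simple text-mining step. The thesis carries this out with the PatentInspiration tool; the sketch below instead uses a tiny hand-rolled lexicon, and both the word lists and the example sentence (in the spirit of the toothbrush case study) are made up.

```python
# Toy illustration of Claim 1: mining a patent sentence for property
# words (adjectives) and function words (verbs). The thesis uses
# PatentInspiration; this hand-rolled lexicon and the sample sentence
# are invented for illustration only.

PROPERTY_ADJECTIVES = {"flexible", "bristled", "ergonomic", "rotatable"}
FUNCTION_VERBS = {"cleans", "rotates", "grips", "dispenses"}

def mine(sentence: str) -> dict:
    """Split a sentence into words and bucket them by lexicon."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return {
        "properties": sorted(w for w in words if w in PROPERTY_ADJECTIVES),
        "functions": sorted(w for w in words if w in FUNCTION_VERBS),
    }
```

At patent-corpus scale, the lexicons would be replaced by a part-of-speech tagger, but the extraction pattern (adjectives as properties, verbs as functions) is the same.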
Mining software repositories to determine the impact of team factors on the structural attributes of software
This thesis was submitted for the award of PhD and was awarded by Brunel University London.

Software development is intrinsically a human activity, and the role of the development team has been established as among the most decisive of all project success factors. Prior research has shown empirically that team size and stability are linked to stakeholder satisfaction, team productivity and fault-proneness. Team size is usually measured as the number of developers who modify the source code of a project, while team stability is typically a function of the cumulative time each team member has worked with their fellow team members. There is, however, limited research investigating the impact of these factors on software maintainability, a crucial aspect given that up to 80% of development budgets are consumed in the maintenance phase of the lifecycle. This research sheds light on how these aspects of team composition influence the structural attributes of the developed software, which in turn drive the maintenance costs of software. The thesis asserts that new and broader insights can be gained by measuring these internal attributes of the software rather than taking the more traditional approach of measuring its external attributes. This can also enable practitioners to measure and monitor key indicators throughout the development lifecycle, taking remedial action where appropriate. Within this research, the GoogleCode open-source forge is mined and a sample of 1,480 Java projects is selected for further study. Using the Chidamber and Kemerer design metrics suite, the impact of development team size and stability on the internal structural attributes of software is isolated and quantified. Drawing on prior research correlating these internal attributes with external attributes, the impact on maintainability is deduced.

This research finds that the structural attributes established to correlate with fault-proneness (coupling, cohesion and modularity) degrade as team size increases or team stability decreases. That degradation in the internal attributes of the software is associated with a deterioration in the sub-attributes of maintainability: changeability, understandability, testability and stability.
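The core analysis step described above, relating team size to a Chidamber-Kemerer structural metric across projects, can be sketched with standard-library Python. The per-project data points below are invented; the thesis itself mines 1,480 GoogleCode Java projects.

```python
# Sketch of the correlation step: relating team size to a CK structural
# metric (here, mean coupling between objects, CBO) across projects.
# The (team_size, mean_CBO) pairs are invented for illustration.
from statistics import mean

PROJECTS = [(2, 3.1), (4, 4.0), (6, 5.2), (9, 6.8), (12, 7.9)]

def pearson(pairs):
    """Pearson correlation coefficient of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# A strong positive coefficient would indicate coupling rising (i.e.
# degrading) with team size, consistent with the thesis's finding.
r = pearson(PROJECTS)
```

In practice a rank correlation (e.g. Spearman) is often preferred for such data, since the relationship need not be linear; the pipeline shape is the same either way.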
Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data
Operational data from software development, social networks and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person, and user profiles that vary across systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds, and accurate identities are essential for a realistic representation of the networks these communities form. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, comprising more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they differ from, and occur at much higher rates than in, other domains. Existing techniques relying on string comparisons can only disambiguate synonyms, not homonyms, which are common in software activity traces. We therefore introduce measures of behavioral fingerprinting to improve the accuracy of synonym resolution and to disambiguate homonyms. Fingerprints are constructed from the traces of developers' activities, such as the style of writing in commit messages, the patterns in the files modified and the projects participated in, and the patterns in the timing of the developers' activity. Furthermore, to address the lack of training data needed by the supervised learning approaches used in disambiguation, we design an active learning procedure that minimizes the manual effort necessary to create training data in the domain of developer identity matching. We extensively evaluate the proposed approach on over 16,000 OpenStack developers in 1,200 projects against commercial tools and the most recent research approaches, and further against recent research on a much larger sample of over 2,000,000 IDs.

Results demonstrate that our method is significantly better than both the recent research and the commercial methods. We also conduct experiments demonstrating that such erroneous data have a significant impact on developer networks. We hope that the proposed approach will expedite research progress in the domain of software engineering, especially in applications for which graphs of social networks are critical.
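The behavioral fingerprints described above combine several activity channels. The sketch below combines just two of them, activity timing and commit-message vocabulary, to score whether two developer IDs belong to one person; the weighting, threshold, and traces are invented, and the actual work trains supervised models over richer features.

```python
# Sketch of behavioral fingerprinting for developer identity matching.
# Only two of the listed channels are used (timing, message style), and
# the weight/threshold values are made up for illustration.

def hour_histogram(commit_hours):
    """Normalized 24-bin histogram of commit hours (timing fingerprint)."""
    hist = [0.0] * 24
    for h in commit_hours:
        hist[h] += 1
    total = sum(hist)
    return [c / total for c in hist]

def histogram_similarity(a, b):
    """Overlap of two normalized histograms, in [0, 1]."""
    return sum(min(x, y) for x, y in zip(a, b))

def vocabulary_similarity(msgs_a, msgs_b):
    """Jaccard similarity of commit-message vocabularies (style proxy)."""
    va = {w for m in msgs_a for w in m.lower().split()}
    vb = {w for m in msgs_b for w in m.lower().split()}
    return len(va & vb) / len(va | vb) if va | vb else 0.0

def same_developer(id_a, id_b, weight=0.5, threshold=0.6):
    """Score two IDs; above the (made-up) threshold, flag as one person."""
    timing = histogram_similarity(hour_histogram(id_a["hours"]),
                                  hour_histogram(id_b["hours"]))
    style = vocabulary_similarity(id_a["messages"], id_b["messages"])
    return weight * timing + (1 - weight) * style >= threshold
```

Because these signals are independent of the recorded name or email, the same score works for both error types: merging synonyms (one person, several IDs) and splitting homonyms (one ID, several people).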