
    Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

    Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models on generated data, it generally relies on simple class-conditional prompts, which can limit the diversity of the generated data and inherit the systematic biases of the LLM. We therefore investigate training data generation with diversely attributed prompts (e.g., prompts that specify attributes such as length and style), which have the potential to yield diverse, attribute-controlled data. Our investigation focuses on datasets with high cardinality and diverse domains, where we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. We also present a comprehensive empirical study of data generation covering vital aspects such as bias, diversity, and efficiency, and highlight three key observations: first, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; second, attribute diversity plays a pivotal role in enhancing model performance; and third, attributed prompts match the performance of simple class-conditional prompts while using only 5% of the ChatGPT querying cost of the latter. We release the generated dataset and the prompts used to facilitate future research. The data and code are available at https://github.com/yueyu1030/AttrPrompt. Comment: Work in progress. A shorter version is accepted to the ICML DMLR workshop.
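    The abstract above contrasts simple class-conditional prompts with attributed prompts. Purely as an illustration, the sketch below shows how such prompts might be assembled before being sent to an LLM; the attribute names, attribute values, and prompt wording are invented for this example and are not the paper's actual AttrPrompt configuration.

        import random

        # Hypothetical attribute pools; the paper's real attributes and values differ.
        ATTRIBUTES = {
            "length": ["short", "medium", "long"],
            "style": ["formal", "conversational", "technical"],
            "location": ["North America", "Europe", "Asia", "Africa", "South America"],
        }

        def class_conditional_prompt(label: str) -> str:
            # Baseline: only the target class conditions the generation.
            return f"Write a news article about {label}."

        def attributed_prompt(label: str) -> str:
            # Sample one value per attribute so repeated queries cover diverse combinations.
            picks = {name: random.choice(values) for name, values in ATTRIBUTES.items()}
            return (f"Write a {picks['length']}, {picks['style']} news article about "
                    f"{label}, set in {picks['location']}.")

        if __name__ == "__main__":
            random.seed(0)
            print(class_conditional_prompt("technology"))
            for _ in range(3):
                print(attributed_prompt("technology"))

    Under this scheme, diversity comes from the sampled attribute combinations rather than from repeating an identical class-conditional query many times.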

    Insights into software development approaches: mining Q&A repositories

    © 2023 The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY), https://creativecommons.org/licenses/by/4.0/. Context: Software practitioners adopt approaches such as DevOps, Scrum, and Waterfall for high-quality software development. However, limited research has explored software development approaches through practitioners' discussions on Q&A forums. Objective: We conducted an empirical study of developers' discussions on Q&A forums to gain insights into software development approaches in practice. Method: We analyzed 13,903 developer posts across the Stack Overflow (SO), Software Engineering Stack Exchange (SESE), and Project Management Stack Exchange (PMSE) forums. A mixed-methods approach, combining topic modeling (Latent Dirichlet Allocation, LDA) with qualitative analysis, is used to identify frequently discussed software development approach topics, their trends (popular and difficult topics), and the challenges practitioners face when adopting different software development approaches. Findings: We identified 15 frequently mentioned software development approach topics on Q&A sites and observed a rising trend for the top-3 most difficult topics, which require more attention. Finally, our study identified 49 challenges faced by practitioners while deploying various software development approaches, and we created a thematic map to represent these findings. Conclusions: The study findings serve as a useful resource for practitioners to overcome challenges, stay informed about current trends, and ultimately improve the quality of the software products they develop. Peer reviewed.
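    Since the study's method centers on LDA topic modeling of forum posts, a minimal sketch of that kind of pipeline is given below, assuming the gensim library and posts that have already been tokenized and stop-word filtered. The toy documents and parameter choices are illustrative, not the authors' configuration; only the 15-topic count comes from the abstract.

        from gensim import corpora
        from gensim.models import LdaModel

        # Toy stand-ins for preprocessed (tokenized, stop-word-filtered) Q&A posts.
        posts = [
            ["scrum", "sprint", "backlog", "planning"],
            ["devops", "pipeline", "deploy", "monitoring"],
            ["waterfall", "requirements", "phase", "documentation"],
            ["scrum", "standup", "retrospective", "team"],
        ]

        # Map tokens to ids and convert each post to a bag-of-words vector.
        dictionary = corpora.Dictionary(posts)
        corpus = [dictionary.doc2bow(tokens) for tokens in posts]

        # The study reports 15 topics; a real run would tune this, e.g. with coherence scores.
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=15,
                       passes=10, random_state=42)

        # Inspect a few topics as weighted word lists.
        for topic_id, words in lda.print_topics(num_topics=3, num_words=5):
            print(topic_id, words)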

    Challenges and Barriers of Using Low Code Software for Machine Learning

    As big data grows ubiquitous across many domains, more and more stakeholders seek to develop Machine Learning (ML) applications on their data. The success of an ML application usually depends on close collaboration between ML experts and domain experts. However, the shortage of ML engineers remains a fundamental problem. Low-code machine learning tools/platforms (a.k.a. AutoML) aim to democratize ML development for domain experts by automating many repetitive tasks in the ML pipeline. This research presents an empirical study of around 14k posts (questions + accepted answers) from Stack Overflow (SO) that contained AutoML-related discussions. We examine how these topics are spread across the various Machine Learning Life Cycle (MLLC) phases, along with their popularity and difficulty. The study offers several interesting findings. First, we find 13 AutoML topics that we group into four categories. The MLOps topic category (43% of questions) is the largest, followed by Model (28%), Data (27%), and Documentation (2%). Second, most questions are asked during the Model training (29%) (i.e., implementation) and Data preparation (25%) MLLC phases. Third, AutoML practitioners find the MLOps topic category most challenging, especially topics related to model deployment & monitoring and the automated ML pipeline. These findings have implications for all three AutoML stakeholders: AutoML researchers, AutoML service vendors, and AutoML developers. Academia-industry collaboration can improve different aspects of AutoML, such as better DevOps/deployment support and tutorial-based documentation.
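    Popularity and difficulty in Stack Overflow studies of this kind are commonly proxied by view counts or scores and by the share of questions without an accepted answer. The sketch below computes such proxies with pandas over a hypothetical per-question table; the column names and numbers are assumptions, not the paper's data or exact metrics.

        import pandas as pd

        # Hypothetical extract of AutoML-tagged questions, one row per question.
        questions = pd.DataFrame({
            "topic": ["MLOps", "MLOps", "Model", "Data", "Documentation"],
            "view_count": [1200, 800, 450, 600, 150],
            "score": [5, 2, 3, 1, 0],
            "has_accepted": [False, False, True, True, False],
        })

        # Aggregate per topic: views/score as popularity proxies,
        # percentage of questions without an accepted answer as a difficulty proxy.
        summary = questions.groupby("topic").agg(
            avg_views=("view_count", "mean"),
            avg_score=("score", "mean"),
            pct_unanswered=("has_accepted", lambda s: 100 * (~s).mean()),
        )

        print(summary.sort_values("pct_unanswered", ascending=False))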

    Designing and Deploying Internet of Things Applications in the Industry: An Empirical Investigation

    The Internet of Things (IoT) aims to bring connectivity to almost every object found in the physical space. It extends connectivity to everyday things and opens up the possibility of monitoring, tracking, connecting to, and interacting with industrial assets more efficiently. In industry today, connected sensor networks monitor logistics movements and manufacturing machines, and help organizations improve their efficiency and reduce costs. However, designing and implementing an IoT network is still a very challenging task. We are witnessing a high level of fragmentation in the IoT landscape: developers regularly complain about the difficulty of integrating the diverse technologies of the various objects found in IoT systems, and about the lack of clear guidelines and/or practices for developing and deploying safe and efficient IoT applications. Analyzing and understanding the issues related to the development and deployment of the Internet of Things is therefore essential to allow the industry to reach its full potential. In this thesis, we examine IoT practitioners' discussions on the popular Q&A websites Stack Overflow and Stack Exchange to understand the challenges and issues they face when developing and deploying different IoT applications. Next, we examine the lack of interoperability among technologies developed for IoT, study the challenges that their integration poses, and provide guidelines for practitioners interested in connecting IoT networks and devices to develop various services and applications. Since security is central to the success of this technology, we also investigate the security threats and challenges across the different layers of the IoT system architecture and propose countermeasures. Finally, we conduct a series of experiments to understand the advantages and trade-offs of serverful and serverless deployments of IoT applications, in order to provide practitioners with evidence-based guidelines and recommendations on such deployments. The results presented in this thesis represent a first important step towards a deep understanding of these very promising technologies. We believe that our recommendations and suggestions will help practitioners and technology builders improve the quality of IoT software and systems. We also hope that our results can help IoT communities and consortia establish standards and guidelines for the development, maintenance, and evolution of IoT software and systems.
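    The abstract mentions experiments comparing serverful and serverless deployments but does not reproduce the benchmarking setup here. Purely as a hedged illustration, the sketch below measures end-to-end request latency against two placeholder endpoints; the URLs and sample count are invented, and a real comparison would also track cost, throughput, and cold-start behaviour.

        import statistics
        import time
        import urllib.request

        # Placeholder endpoints standing in for a serverful and a serverless deployment.
        ENDPOINTS = {
            "serverful": "http://example.com/iot/ingest",
            "serverless": "https://example.com/function/ingest",
        }

        def measure_latency(url: str, samples: int = 20) -> tuple[float, float]:
            """Return (median, worst) request latency in seconds over `samples` requests."""
            timings = []
            for _ in range(samples):
                start = time.perf_counter()
                urllib.request.urlopen(url, timeout=10).read()
                timings.append(time.perf_counter() - start)
            return statistics.median(timings), max(timings)

        if __name__ == "__main__":
            for name, url in ENDPOINTS.items():
                median, worst = measure_latency(url)
                # Cold starts typically inflate the worst case for serverless targets.
                print(f"{name}: median={median * 1000:.1f} ms, worst={worst * 1000:.1f} ms")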

    Information bias and trust in bitcoin speculation

    The Internet pervades modern life, offering opportunities to connect, to inform, and to be informed. As the range and number of online information sources explode, how people select and interpret information has become a pertinent area of study, not least in light of the prevalence of fake news. People are well known to act upon information they believe to be trustworthy, and where the decision to act incurs risk, an inability to accurately select and assess the credibility of information presents a challenge. Bitcoin, the nascent cryptocurrency, is a domain in which profound financial risk abounds. Even for those armed with experience and knowledge, there are numerous challenges to assessing risk, especially as sources of Bitcoin information can be partisan and of questionable accuracy. Within the domain of Bitcoin speculation, this thesis asks the central research question: are people able to select and correctly evaluate the information they rely upon to make decisions? In addressing this question, the thesis applies a psychological model of informational trust to Bitcoin speculators and offers two fundamental contributions. First, these users are able to identify relevant news without relying on confirmation bias. Second, a notable percentage of users do not evaluate the credibility of online news by expertly interpreting its fundamentals but instead defer their trust to the source news website or to a broader trust in information on the Internet. For these users, chance or luck may mean that they are basing their decisions upon factually accurate news, but this position makes them particularly vulnerable to fake news spread via sources they trust. This susceptibility provides evidence to support further security research into both the prevalence of, and countermeasures for, fake news.

    Information & Records Management and Blockchain Technology: Understanding its Potential

    This MSc dissertation researched the extent to which Blockchain technology is, or might become, a useful tool for information and records management (IRM). In undertaking this research, I had three aims in mind:
    • To explain the current state of knowledge and use of Blockchain technology within IRM around the world;
    • To investigate why Blockchain technology was or was not being used in the IRM community/profession; and
    • To explore whether there is potential for further use of Blockchain technology in IRM.
    This topic was selected because there is very little academic or practitioner writing on the role of Blockchain within an IRM context. The aims are investigated through quantitative research methods, using an online questionnaire to survey IRM professionals about their knowledge and use of Blockchain (or the lack thereof) and the drivers of and obstacles to that knowledge and use. My research found that Blockchain technology is little used: very few respondents actually work with it or have experienced it as a records management tool. At this point in time it is too early to draw definitive conclusions about the degree to which Blockchain is or might become a critical tool for IRM.