455 research outputs found
A Simple and Effective Method of Cross-Lingual Plagiarism Detection
We present a simple cross-lingual plagiarism detection method applicable to a
large number of languages. The presented approach leverages open multilingual
thesauri for candidate retrieval task and pre-trained multilingual BERT-based
language models for detailed analysis. The method does not rely on machine
translation and word sense disambiguation when in use, and therefore is
suitable for a large number of languages, including under-resourced languages.
The effectiveness of the proposed approach is demonstrated for several existing
and new benchmarks, achieving state-of-the-art results for French, Russian, and
Armenian languages
Academic writing for IT students
This textbook is intended for Master and PhD Information Technology students (B1-C1 level of English proficiency). The instructions of how to write a research paper in English and the relevant exercises are given. The peculiarities of each section of a paper are presented. The exercises are based on real science materials taken from peer-reviewed journals. The subject area covers a wide scope of different Information Technology domains
Security considerations in the open source software ecosystem
Open source software plays an important role in the software supply chain, allowing stakeholders to
utilize open source components as building blocks in their software, tooling, and infrastructure. But
relying on the open source ecosystem introduces unique challenges, both in terms of security and trust,
as well as in terms of supply chain reliability.
In this dissertation, I investigate approaches, considerations, and encountered challenges of stakeholders in the context of security, privacy, and trustworthiness of the open source software supply
chain. Overall, my research aims to empower and support software experts with the knowledge and
resources necessary to achieve a more secure and trustworthy open source software ecosystem. In the
first part of this dissertation, I describe a research study investigating the security and trust practices
in open source projects by interviewing 27 owners, maintainers, and contributors from a diverse set
of projects to explore their behind-the-scenes processes, guidance and policies, incident handling, and
encountered challenges, finding that participants’ projects are highly diverse in terms of their deployed
security measures and trust processes, as well as their underlying motivations. More on the consumer
side of the open source software supply chain, I investigated the use of open source components in
industry projects by interviewing 25 software developers, architects, and engineers to understand their
projects’ processes, decisions, and considerations in the context of external open source code, finding
that open source components play an important role in many of the industry projects, and that most
projects have some form of company policy or best practice for including external code. On the side of
end-user focused software, I present a study investigating the use of software obfuscation in Android
applications, which is a recommended practice to protect against plagiarism and repackaging. The
study leveraged a multi-pronged approach including a large-scale measurement, a developer survey, and
a programming experiment, finding that only 24.92% of apps are obfuscated by their developer, that
developers do not fear theft of their own apps, and have difficulties obfuscating their own apps. Lastly,
to involve end users themselves, I describe a survey with 200 users of cloud office suites to investigate
their security and privacy perceptions and expectations, with findings suggesting that users are generally
aware of basic security implications, but lack technical knowledge for envisioning some threat models.
The key findings of this dissertation include that open source projects have highly diverse security
measures, trust processes, and underlying motivations. That the projects’ security and trust needs are
likely best met in ways that consider their individual strengths, limitations, and project stage, especially
for smaller projects with limited access to resources. That open source components play an important
role in industry projects, and that those projects often have some form of company policy or best
practice for including external code, but developers wish for more resources to better audit included
components.
This dissertation emphasizes the importance of collaboration and shared responsibility in building and maintaining the open source software ecosystem, with developers, maintainers, end users,
researchers, and other stakeholders alike ensuring that the ecosystem remains a secure, trustworthy, and
healthy resource for everyone to rely on
An investigation into the environmental sustainability of the South African ornamental horticultural industry
The ornamental horticultural industry makes use of natural resources to grow plants and produce allied products to sell to consumers, landscapers, retail garden centres, hardware stores, supermarkets, and government, but at what cost to the environment?
The aim of this work was to determine the current environmental awareness of growers and garden centre retailers within the ornamental horticultural industry in South Africa. Followed by an investigation into the current business practices that promote sustainable natural resource use and management as well as the obstacles and challenges that the industry faces with implementing legislation and recommendations of best practices. The study was conducted over an 18-month period and 41 growers and retail garden centres in eight of the provinces in South Africa (Appendix 10) participated in research. In each case, the study participant was asked to complete the questionnaire and where possible, a site visit was conducted and / or a semi-structured interview as well as participatory observations followed to give a comprehensive overview of the sustainability practices of the businesses. These results were then compared to international best practices and similar research conducted globally by the ornamental horticultural industry. A review of international best practices in the ornamental horticultural industry showed six environmental resources namely soil, water, fertilizers, pesticides, energy, and waste. This was seen to be common to most studies involved in the production, growth, maintenance and sales of plants and allied products. This information was used to compile a best management practice manual for South African ornamental horticulture with guidelines and practical examples for conserving and managing natural resource usage and reducing the environmental impacts of the industry.
Much research has been done on the exploitation and degradation of resources due to urbanisation, industrial activities, and agricultural practices. The resources are essential to the ornamental horticultural industry but if exploited or misused, can have detrimental effects on the environmental productivity of the industry and ultimately the “Sustainable Development Goals” prescribed by the United Nations. The linking of the relevant sustainable development goals to the 9 key factors of the green economy strategized by the South African government will enable the ornamental horticultural industry to play a greater part in the green and circular economy by providing nature-based solutions to environmental problems that it is facing such as climate change and pollution.Environmental SciencesD. Phil. (Environmental Management
Constitutions of Value
Gathering an interdisciplinary range of cutting-edge scholars, this book addresses legal constitutions of value.
Global value production and transnational value practices that rely on exploitation and extraction have left us with toxic commons and a damaged planet. Against this situation, the book examines law’s fundamental role in institutions of value production and valuation. Utilising pathbreaking theoretical approaches, it problematizes mainstream efforts to redeem institutions of value production by recoupling them with progressive values. Aiming beyond radical critique, the book opens up the possibility of imagining and enacting new and different value practices.
This wide-ranging and accessible book will appeal to international lawyers, socio-legal scholars, those working at the intersections of law and economy and others, in politics, economics, environmental studies and elsewhere, who are concerned with rethinking our current ideas of what has value, what does not, and whether and how value may be revalued
A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4
Large language models (LLMs) are a special class of pretrained language
models obtained by scaling model size, pretraining corpus and computation.
LLMs, because of their large size and pretraining on large volumes of text
data, exhibit special abilities which allow them to achieve remarkable
performances without any task-specific training in many of the natural language
processing tasks. The era of LLMs started with OpenAI GPT-3 model, and the
popularity of LLMs is increasing exponentially after the introduction of models
like ChatGPT and GPT4. We refer to GPT-3 and its successor OpenAI models,
including ChatGPT and GPT4, as GPT-3 family large language models (GLLMs). With
the ever-rising popularity of GLLMs, especially in the research community,
there is a strong need for a comprehensive survey which summarizes the recent
research progress in multiple dimensions and can guide the research community
with insightful future research directions. We start the survey paper with
foundation concepts like transformers, transfer learning, self-supervised
learning, pretrained language models and large language models. We then present
a brief overview of GLLMs and discuss the performances of GLLMs in various
downstream tasks, specific domains and multiple languages. We also discuss the
data labelling and data augmentation abilities of GLLMs, the robustness of
GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with
multiple insightful future research directions. To summarize, this
comprehensive survey paper will serve as a good resource for both academic and
industry people to stay updated with the latest research related to GPT-3
family large language models.Comment: Preprint under review, 58 page
Analyzing the Unanalyzable: an Application to Android Apps
In general, software is unreliable. Its behavior can deviate from users’ expectations because of bugs, vulnerabilities, or even malicious code. Manually vetting software is a challenging, tedious, and highly-costly task that does not scale. To alleviate excessive costs and analysts’ burdens, automated static analysis techniques have been proposed by both the research and practitioner communities making static analysis a central topic in software engineering. In the meantime, mobile apps have considerably grown in importance. Today, most humans carry software in their pockets, with the Android operating system leading the market. Millions of apps have been proposed to the public so far, targeting a wide range of activities such as games, health, banking, GPS, etc. Hence, Android apps collect and manipulate a considerable amount of sensitive information, which puts users’ security and privacy at risk. Consequently, it is paramount to ensure that apps distributed through public channels (e.g., the Google Play) are free from malicious code. Hence, the research and practitioner communities have put much effort into devising new automated techniques to vet Android apps against malicious activities over the last decade. Analyzing Android apps is, however, challenging. On the one hand, the Android framework proposes constructs that can be used to evade dynamic analysis by triggering the malicious code only under certain circumstances, e.g., if the device is not an emulator and is currently connected to power. Hence, dynamic analyses can -easily- be fooled by malicious developers by making some code fragments difficult to reach. On the other hand, static analyses are challenged by Android-specific constructs that limit the coverage of off-the-shell static analyzers. The research community has already addressed some of these constructs, including inter-component communication or lifecycle methods. However, other constructs, such as implicit calls (i.e., when the Android framework asynchronously triggers a method in the app code), make some app code fragments unreachable to the static analyzers, while these fragments are executed when the app is run. Altogether, many apps’ code parts are unanalyzable: they are either not reachable by dynamic analyses or not covered by static analyzers. In this manuscript, we describe our contributions to the research effort from two angles: ① statically detecting malicious code that is difficult to access to dynamic analyzers because they are triggered under specific circumstances; and ② statically analyzing code not accessible to existing static analyzers to improve the comprehensiveness of app analyses. More precisely, in Part I, we first present a replication study of a state-of-the-art static logic bomb detector to better show its limitations. We then introduce a novel hybrid approach for detecting suspicious hidden sensitive operations towards triaging logic bombs. We finally detail the construction of a dataset of Android apps automatically infected with logic bombs. In Part II, we present our work to improve the comprehensiveness of Android apps’ static analysis. More specifically, we first show how we contributed to account for atypical inter-component communication in Android apps. Then, we present a novel approach to unify both the bytecode and native in Android apps to account for the multi-language trend in app development. Finally, we present our work to resolve conditional implicit calls in Android apps to improve static and dynamic analyzers
LASSO – an observatorium for the dynamic selection, analysis and comparison of software
Mining software repositories at the scale of 'big code' (i.e., big data) is a challenging activity. As well as finding a suitable software corpus and making it programmatically accessible through an index or database, researchers and practitioners have to establish an efficient analysis infrastructure and precisely define the metrics and data extraction approaches to be applied. Moreover, for analysis results to be generalisable, these tasks have to be applied at a large enough scale to have statistical significance, and if they are to be repeatable, the artefacts need to be carefully maintained and curated over time. Today, however, a lot of this work is still performed by human beings on a case-by-case basis, with the level of effort involved often having a significant negative impact on the generalisability and repeatability of studies, and thus on their overall scientific value.
The general purpose, 'code mining' repositories and infrastructures that have emerged in recent years represent a significant step forward because they automate many software mining tasks at an ultra-large scale and allow researchers and practitioners to focus on defining the questions they would like to explore at an abstract level. However, they are currently limited to static analysis and data extraction techniques, and thus cannot support (i.e., help automate) any studies which involve the execution of software systems. This includes experimental validations of techniques and tools that hypothesise about the behaviour (i.e., semantics) of software, or data analysis and extraction techniques that aim to measure dynamic properties of software.
In this thesis a platform called LASSO (Large-Scale Software Observatorium) is introduced that overcomes this limitation by automating the collection of dynamic (i.e., execution-based) information about software alongside static information. It features a single, ultra-large scale corpus of executable software systems created by amalgamating existing Open Source software repositories and a dedicated DSL for defining abstract selection and analysis pipelines. Its key innovations are integrated capabilities for searching for selecting software systems based on their exhibited behaviour and an 'arena' that allows their responses to software tests to be compared in a purely data-driven way. We call the platform a 'software observatorium' since it is a place where the behaviour of large numbers of software systems can be observed, analysed and compared
- …