5,660 research outputs found

    A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

    Measuring and evaluating source code similarity is a fundamental software engineering activity with a broad range of applications, including but not limited to code recommendation and the detection of duplicate code, plagiarism, malware, and code smells. This paper presents a systematic literature review and meta-analysis of code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools working with eight different techniques in five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while many other programming languages have no support at all. A noteworthy finding was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight are publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and attention to multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to maintenance. Comment: 49 pages, 10 figures, 6 tables
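Token-based comparison is one classic family of techniques in this space. As a purely illustrative sketch (not drawn from any specific tool in the review; all function names here are invented), a token n-gram Jaccard measure over two code fragments might look like:

```python
import re

def tokens(src):
    """Crude lexer: identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", src)

def ngrams(seq, n=3):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard index over token n-grams of two code fragments."""
    ga, gb = ngrams(tokens(a), n), ngrams(tokens(b), n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Real clone detectors typically normalise identifiers and literals before comparison so that renamed (Type-2) clones still score highly; the sketch above only captures near-textual similarity.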

    A novel gluten knowledge base of potential biomedical and health-related interactions extracted from the literature: using machine learning and graph analysis methodologies to reconstruct the bibliome

    Background Alongside their nutritional properties and broad availability, cereal crops have been associated with different alimentary disorders and symptoms, with the majority of the responsibility attributed to gluten. Gluten-related literature therefore continues to be produced at ever-growing rates, driven in part by recent exploratory studies that link gluten to non-traditional diseases and by the popularity of gluten-free diets, making it increasingly difficult to access and analyse practical and structured information. In this context, the accelerated discovery of novel advances in diagnosis and treatment, together with exploratory studies, produces a favourable scenario for disinformation and misinformation. Objectives Aligned with the European Union strategy “Delivering on EU Food Safety and Nutrition in 2050”, which emphasizes the inextricable links between imbalanced diets, increased exposure to unreliable and misleading sources of information, and increased dependency on reliable sources of information, this paper presents GlutKNOIS, a public and interactive literature-based database that reconstructs and represents the experimental biomedical knowledge extracted from the gluten-related literature. The developed platform combines knowledge from external databases, bibliometric statistics and social media discussion to propose a novel and enhanced way to search, visualise and analyse potential biomedical and health-related interactions in the gluten domain.
Methods For this purpose, the presented study applies a semi-supervised curation workflow that combines natural language processing techniques, machine learning algorithms, ontology-based normalization and integration approaches, named entity recognition methods, and graph knowledge reconstruction methodologies to process, classify, represent and analyse the experimental findings contained in the literature, complemented by data from the social discussion. Results and conclusions In total, 5814 documents were manually annotated and 7424 were fully automatically processed to reconstruct the first online gluten-related knowledge database of evidenced health-related interactions that produce health or metabolic changes. In addition, the automatic processing of the literature, combined with the proposed knowledge representation methodologies, has the potential to assist in the revision and analysis of years of gluten research. The reconstructed knowledge base is public and accessible at https://sing-group.org/glutknois/. Funding: Fundação para a Ciência e a Tecnologia | Ref. UIDB/50006/2020; Xunta de Galicia | Ref. ED481B-2019-032; Xunta de Galicia | Ref. ED431G2019/06; Xunta de Galicia | Ref. ED431C 2022/03; Universidade de Vigo/CISU

    Specificity of the innate immune responses to different classes of non-tuberculous mycobacteria

    Mycobacterium avium is the most common nontuberculous mycobacterium (NTM) species causing infectious disease. Here, we characterized an M. avium infection model in zebrafish larvae and compared it to M. marinum infection, a model of tuberculosis. M. avium bacteria are efficiently phagocytosed and frequently induce granuloma-like structures in zebrafish larvae. Although macrophages can respond to both mycobacterial infections, their migration speed is faster in infections caused by M. marinum. Tlr2 plays a conserved role in most aspects of the defense against both mycobacterial infections. However, Tlr2 affects the migration speed of macrophages and neutrophils to infection sites with M. marinum, an effect not observed with M. avium. Using RNAseq analysis, we found distinct transcriptome responses in cytokine-cytokine receptor interaction for M. avium and M. marinum infection. In addition, we found differences in gene expression in metabolic pathways, phagosome formation, matrix remodeling, and apoptosis in response to these mycobacterial infections. In conclusion, we characterized a new M. avium infection model in zebrafish that can be further used to study pathological mechanisms of NTM-caused diseases

    A Deep Learning Approach to Evaluating Disease Risk in Coronary Bifurcations

    Cardiovascular disease represents a large burden on modern healthcare systems, requiring significant resources for patient monitoring and clinical interventions. It has been shown that the blood flow through coronary arteries, shaped by the artery geometry unique to each patient, plays a critical role in the development and progression of heart disease. However, popular and well-tested cardiovascular disease risk models such as Framingham and QRISK3 are not able to take these differences into account when predicting disease risk. Over the last decade, medical imaging and image processing have advanced to the point that non-invasive high-resolution 3D imaging is routinely performed for any patient suspected of coronary artery disease. This allows for the construction of virtual 3D models of the coronary anatomy, and in-silico analysis of blood flow within the coronaries. However, several challenges still exist which preclude large-scale patient-specific simulations, necessary for incorporating haemodynamic risk metrics as part of disease risk prediction. In particular, despite a large amount of available coronary medical imaging, extraction of the structures of interest from medical images remains a manual and laborious task. There is significant variation in how geometric features of the coronary arteries are measured, which makes comparisons between different studies difficult. Modelling blood flow conditions in the coronary arteries likewise requires manual preparation of the simulations and significant computational cost. This thesis aims to address these challenges. 
The "Automated Segmentation of Coronary Arteries (ASOCA)" establishes a benchmark dataset of coronary arteries and their associated 3D reconstructions, which is currently the largest openly available dataset of coronary artery models and offers a wide range of applications such as computational modelling, 3D printed for experiments, developing, and testing medical devices such as stents, and Virtual Reality applications for education and training. An automated computational modelling workflow is developed to set up, run and postprocess simulations on the Left Main Bifurcation and calculate relevant shape metrics. A convolutional neural network model is developed to replace the computational fluid dynamics process, which can predict haemodynamic metrics such as wall shear stress in minutes, compared to several hours using traditional computational modelling reducing the computation and labour cost involved in performing such simulations

    Monitoring the spread of antibiotic resistance in wastewater

    BACKGROUND: Antibiotic resistant bacterial infections are causing a growing amount of morbidity and mortality. Effective control and prevention rely on good data on the current burden of antibiotic resistance (ABR). Traditional ABR surveillance from phenotypic, passive, hospital-based testing may not adequately represent the resistome of the general population. Wastewater metagenomics has been proposed as a new type of surveillance to overcome this limitation. It generates rich, quantitative information on the bacterial species and resistance genes of a whole community. Large wastewater metagenomic datasets are now available to monitor and explore drivers of ABR in the community. However, questions remain about how to collect, analyse, and interpret these novel datasets. In this thesis, I aimed to 1) address key unknowns in wastewater data, including sources of resistance, environmental resistance dynamics, and what statistical models describe the distribution of the data well, and 2) investigate global and local patterns in wastewater resistance and identify potential community and hospital drivers. METHODS: I used a systematic review to find evidence in the literature for dissemination of ABR from hospitals to wastewater. I next developed a compartmental transmission model to investigate environmental resistance dynamics and its impact on human ABR levels. I implemented a multi-response statistical model to correlate hospital-based surveillance (EARS-Net) data with resistance gene abundance in sewage samples from around the world analysed with metagenomics by the Global Sewage Surveillance Project. Finally, I used a paired sampling design and multiple statistical methods to compare the resistome of sewage from hospitals, communities, and wastewater treatment plants (WWTPs) in Scotland. I also investigated the links between ABR in humans and antibiotic consumption in the modelling and data analysis chapters. 
RESULTS: I found increasing evidence in primary research that resistant bacteria and resistance genes can be disseminated from hospital patients to wastewater and into natural water sources. Modelling the dynamics of ABR in an environmental reservoir indicated that the environment can theoretically influence human ABR levels as much as or more than an animal reservoir, and mitigate intervention impacts. Combining EARS-Net and sewage metagenomic data indicated that some types of ABR are positively correlated in sewage and hospitals (such as aminoglycosides), but many are not (such as vancomycin and aminopenicillins). The paired sampling study demonstrated that hospital and community sewage resistomes are distinct, and WWTPs mostly reflect community sewage resistomes. I found mixed evidence for an impact of antimicrobial consumption on human ABR levels. Overall, the impact of antibiotic consumption at the population level appears to be small in these datasets. CONCLUSIONS: Wastewater metagenomics is a valuable way of monitoring ABR in the community. It can indicate the composition of the reservoir of ABR in the general population and what drives it. However, hospital rather than mixed municipal effluent may need to be collected to monitor clinical resistance patterns. To make the most of this new source of data, more flexible modelling frameworks are needed that account for wastewater-metagenomics-specific factors such as high dimensionality and overdispersion. Comparing resistance patterns in hospitals to community sewage implied that patients and/or the hospital environment may present unique and strong selection pressures for resistance. Finally, we also show that differential antibiotic consumption alone cannot explain the observed patterns in resistance abundance on the national or international level
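The thesis's compartmental model is not reproduced here, but the qualitative finding that an environmental reservoir can feed back into human ABR levels can be illustrated with a toy two-compartment model. All parameter names and values below are hypothetical, chosen only to make the feedback visible:

```python
def simulate(beta_hh=0.2, beta_eh=0.05, beta_he=0.05, gamma=0.1,
             decay=0.05, r0=0.01, e0=0.0, dt=0.1, steps=5000):
    """Euler integration of a toy human/environment resistance model.

    r: fraction of humans carrying resistant bacteria
    e: level of resistance in the environmental reservoir
    beta_hh: human-to-human transmission; beta_eh: environment-to-human;
    beta_he: shedding from humans into the environment; gamma: clearance;
    decay: loss of resistance from the environment.
    """
    r, e = r0, e0
    for _ in range(steps):
        dr = (beta_hh * r + beta_eh * e) * (1 - r) - gamma * r
        de = beta_he * r - decay * e
        r += dt * dr
        e += dt * de
    return r, e
```

With these illustrative parameters the system settles at an endemic equilibrium (r = e = 0.6); setting beta_eh to zero lowers the human carriage level, mirroring the modelled influence of the environmental reservoir on human ABR.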

    Endogenous measures for contextualising large-scale social phenomena: a corpus-based method for mediated public discourse

    This work presents an interdisciplinary methodology for developing endogenous measures of group membership through analysis of pervasive linguistic patterns in public discourse. Focusing on political discourse, this work critiques the conventional approach to the study of political participation, which is premised on decontextualised, exogenous measures to characterise groups. Considering the theoretical and empirical weaknesses of decontextualised approaches to large-scale social phenomena, this work suggests that contextualisation using endogenous measures might provide a complementary perspective to mitigate such weaknesses. This work develops a sociomaterial perspective on political participation in mediated discourse as affiliatory action performed through language. While the affiliatory function of language is often performed consciously (such as statements of identity), this work is concerned with unconscious features (such as patterns in lexis and grammar). This work argues that pervasive patterns in such features that emerge through socialisation are resistant to change and manipulation, and thus might serve as endogenous measures of sociopolitical contexts, and thus of groups. In terms of method, the work takes a corpus-based approach to the analysis of data from the Twitter messaging service, whereby patterns in users' speech are examined statistically in order to trace potential community membership. The method is applied in the US state of Michigan during the second half of 2018, 6 November 2018 being the date of midterm (i.e. non-Presidential) elections in the United States. The corpus is assembled from the original posts of 5,889 users, who are nominally geolocalised to 417 municipalities. These users are clustered according to pervasive language features. Comparing the linguistic clusters according to the municipalities they represent reveals regular sociodemographic differentials across clusters. 
This is understood as an indication of social structure, suggesting that endogenous measures derived from pervasive patterns in language may indeed offer a complementary, contextualised perspective on large-scale social phenomena
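The clustering step can be sketched in miniature. The snippet below is purely illustrative (the work's actual feature set and clustering method are not reproduced; the word list and function names are invented): it profiles users by function-word frequencies and groups them with a minimal k-means:

```python
from collections import Counter
import math
import random

# A tiny illustrative feature set; real stylometric profiles use many more features.
FUNCTION_WORDS = ["the", "of", "and", "to", "i", "you", "we", "not", "so", "like"]

def profile(posts):
    """Relative frequency of each function word across a user's posts."""
    words = " ".join(posts).lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def kmeans(vectors, k=2, iters=20, seed=0):
    """Minimal k-means; returns a cluster label for each input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[min(range(k), key=lambda c: math.dist(v, centroids[c]))].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors]
```

In practice such profiles would span hundreds of lexical and grammatical features, and the number of clusters would be chosen by model-fit criteria rather than fixed in advance.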

    Pediatric and Adolescent Nephrology Facing the Future: Diagnostic Advances and Prognostic Biomarkers in Everyday Practice

    The Special Issue entitled “Pediatric and adolescent nephrology facing the future: diagnostic advances and prognostic biomarkers in everyday practice” contains articles written in the era before COVID-19 had become a major clinical problem in children. Now that we know its multifaceted clinical course, its complications concerning the kidneys, and the childhood-specific post-COVID pediatric inflammatory multisystem syndrome (PIMS), the value of diagnostic and prognostic biomarkers in the pediatric area should be appreciated, and their importance ought to increase

    LASSO – an observatorium for the dynamic selection, analysis and comparison of software

    Mining software repositories at the scale of 'big code' (i.e., big data) is a challenging activity. As well as finding a suitable software corpus and making it programmatically accessible through an index or database, researchers and practitioners have to establish an efficient analysis infrastructure and precisely define the metrics and data extraction approaches to be applied. Moreover, for analysis results to be generalisable, these tasks have to be applied at a large enough scale to have statistical significance, and if they are to be repeatable, the artefacts need to be carefully maintained and curated over time. Today, however, a lot of this work is still performed by human beings on a case-by-case basis, with the level of effort involved often having a significant negative impact on the generalisability and repeatability of studies, and thus on their overall scientific value. The general purpose, 'code mining' repositories and infrastructures that have emerged in recent years represent a significant step forward because they automate many software mining tasks at an ultra-large scale and allow researchers and practitioners to focus on defining the questions they would like to explore at an abstract level. However, they are currently limited to static analysis and data extraction techniques, and thus cannot support (i.e., help automate) any studies which involve the execution of software systems. This includes experimental validations of techniques and tools that hypothesise about the behaviour (i.e., semantics) of software, or data analysis and extraction techniques that aim to measure dynamic properties of software. In this thesis a platform called LASSO (Large-Scale Software Observatorium) is introduced that overcomes this limitation by automating the collection of dynamic (i.e., execution-based) information about software alongside static information. 
It features a single, ultra-large-scale corpus of executable software systems, created by amalgamating existing Open Source software repositories, and a dedicated DSL for defining abstract selection and analysis pipelines. Its key innovations are integrated capabilities for searching for and selecting software systems based on their exhibited behaviour, and an 'arena' that allows their responses to software tests to be compared in a purely data-driven way. We call the platform a 'software observatorium' since it is a place where the behaviour of large numbers of software systems can be observed, analysed and compared
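The arena idea, comparing candidate implementations purely by their observed responses to the same tests, can be sketched as follows. This is a simplified illustration, not the platform's actual DSL or API; the function name and structure are invented:

```python
def arena(candidates, tests):
    """Run every candidate against the same inputs and group candidates
    by identical observed behaviour (a purely data-driven comparison)."""
    by_behaviour = {}
    for name, fn in candidates.items():
        responses = []
        for args in tests:
            try:
                responses.append(("ok", repr(fn(*args))))
            except Exception as exc:  # a crash is also observable behaviour
                responses.append(("err", type(exc).__name__))
        by_behaviour.setdefault(tuple(responses), []).append(name)
    return list(by_behaviour.values())
```

Candidates whose response sequences match are behaviourally equivalent with respect to the given tests, which is the kind of execution-based comparison the platform is designed to automate at scale.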

    Challenges and perspectives of hate speech research

    This book is the result of a conference that could not take place. It is a collection of 26 texts that address and discuss the latest developments in international hate speech research from a wide range of disciplinary perspectives. This includes case studies from Brazil, Lebanon, Poland, Nigeria, and India, theoretical introductions to the concepts of hate speech, dangerous speech, incivility, toxicity, extreme speech, and dark participation, as well as reflections on methodological challenges such as scraping, annotation, datafication, implicity, explainability, and machine learning. As such, it provides a much-needed forum for cross-national and cross-disciplinary conversations in what is currently a very vibrant field of research