
    Drawing Elena Ferrante's Profile. Workshop Proceedings, Padova, 7 September 2017

    Elena Ferrante is an internationally acclaimed Italian novelist whose real identity has been kept secret by E/O publishing house for more than 25 years. Owing to her popularity, major Italian and foreign newspapers have long tried to discover her real identity. However, only a few attempts have been made to foster a scientific debate on her work. In 2016, Arjuna Tuzzi and Michele Cortelazzo led an Italian research team that conducted a preliminary study and collected a well-founded, large corpus of Italian novels comprising 150 works published in the last 30 years by 40 different authors. Moreover, they shared their data with a select group of international experts on authorship attribution, profiling, and analysis of textual data: Maciej Eder and Jan Rybicki (Poland), Patrick Juola (United States), Vittorio Loreto and his research team, Margherita Lalli and Francesca Tria (Italy), George Mikros (Greece), Pierre Ratinaud (France), and Jacques Savoy (Switzerland). The chapters of this volume report the results of this endeavour that were first presented during the international workshop Drawing Elena Ferrante's Profile in Padua on 7 September 2017 as part of the 3rd IQLA-GIAT Summer School in Quantitative Analysis of Textual Data. The fascinating research findings suggest that Elena Ferrante’s work definitely deserves “many hands” as well as an extensive effort to understand her distinct writing style and the reasons for her worldwide success.

    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 201
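
To make the idea of a syntactic profile concrete, here is a minimal sketch in Python. It is not FlashProfile and does not use PROSE: it is a much simpler stand-in that groups strings by a fixed set of token classes, closer to the "pre-defined patterns" approach the abstract describes as prior work. The token classes and the function names `coarse_pattern` and `profile` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical token classes, chosen for illustration only.
TOKEN_CLASSES = [
    (re.compile(r"[0-9]+"), "<digits>"),
    (re.compile(r"[A-Za-z]+"), "<letters>"),
    (re.compile(r"\s+"), "<space>"),
]


def coarse_pattern(s):
    """Replace each run of digits, letters, or whitespace with a class name.

    Any other character (punctuation, symbols) is kept literally.
    """
    out = []
    i = 0
    while i < len(s):
        for regex, name in TOKEN_CLASSES:
            m = regex.match(s, i)
            if m:
                out.append(name)
                i = m.end()
                break
        else:
            # No token class matched at position i: keep the character as-is.
            out.append(s[i])
            i += 1
    return "".join(out)


def profile(strings):
    """Group strings that share the same coarse pattern."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[coarse_pattern(s)].append(s)
    return dict(clusters)


if __name__ == "__main__":
    data = ["2017-09-07", "2015-01-30", "Sep 2015", "N/A"]
    for pattern, members in profile(data).items():
        print(pattern, members)
```

Unlike the technique in the abstract, this sketch offers no control over granularity: the number of clusters is fixed by the token classes rather than requested by the user.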

    Building a Corpus of 2L English for Automatic Assessment: the CLEC Corpus

    In this paper we describe the CLEC corpus, an ongoing project set up at the University of Cádiz with the purpose of building up a large corpus of English as a 2L classified according to CEFR proficiency levels and formed to train statistical models for automatic proficiency assessment. The goal of this corpus is twofold: on the one hand, it will be used as a data resource for the development of automatic text classification systems and, on the other, it has been used as a means of teaching innovation techniques.

    Detecting Real-World Influence Through Twitter

    In this paper, we investigate the issue of detecting the real-life influence of people based on their Twitter account. We propose an overview of common Twitter features used to characterize such accounts and their activity, and show that these are inefficient in this context. In particular, retweets and followers numbers, and Klout score are not relevant to our analysis. We thus propose several Machine Learning approaches based on Natural Language Processing and Social Network Analysis to label Twitter users as Influencers or not. We also rank them according to a predicted influence level. Our proposals are evaluated over the CLEF RepLab 2014 dataset, and outmatch state-of-the-art ranking methods. Comment: 2nd European Network Intelligence Conference (ENIC), Sep 2015, Karlskrona, Sweden

    Using geographic profiling to locate elusive nocturnal animals: A case study with spectral tarsiers

    © 2015 The Zoological Society of London. Estimates of biodiversity, population size, population density and habitat use have important implications for management of both species and habitats, yet are based on census data that can be extremely difficult to collect. Traditional assessment techniques are often limited by time and money and by the difficulties of working in certain habitats, and species become more difficult to find as population size decreases. Particular difficulties arise when studying elusive species with cryptic behaviours. Here, we show how geographic profiling (GP) - a statistical tool originally developed in criminology to prioritize large lists of suspects in cases of serial crime - can be used to address these problems. We ask whether GP can be used to locate sleeping sites of spectral tarsiers Tarsius tarsier in Sulawesi, Southeast Asia, using as input the positions at which tarsier vocalizations were recorded in the field. This novel application of GP is potentially of value as tarsiers are cryptic and nocturnal and can easily be overlooked in habitat assessments (e.g. in dense rainforest). Our results show that GP provides a useful tool for locating sleeping sites of this species, and indeed analysis of a preliminary dataset during field work strongly suggested the presence of a sleeping tree at a previously unknown location; two sleeping trees were subsequently found within 5 m of the predicted site. We believe that GP can be successfully applied to locating the nests, dens or roosts of elusive animals such as tarsiers, potentially improving estimates of population size with important implications for management of both species and habitats. We thank Operation Wallacea for supporting S.C.F. in this project and for providing logistical support for the fieldwork, and Aidan Kelsey for invaluable assistance in the field. We thank the Indonesian Institute of Sciences (LIPI) and Kementerian Riset dan Teknologi Republik Indonesia (RISTEK) for providing permission to undertake the work (RISTEK permit no. 211/SIP/FRP/SM/VI/2013), and Balai Konservasi Sumber Daya Alam (BKSDA) for their assistance.

    Terminal restriction fragment length polymorphism is an “old school” reliable technique for swift microbial community screening in anaerobic digestion

    The microbial community in anaerobic digestion has been analysed through microbial fingerprinting techniques, such as terminal restriction fragment length polymorphism (TRFLP), for decades. In the last decade, high-throughput 16S rRNA gene amplicon sequencing has replaced these techniques, but the time-consuming and complex nature of high-throughput techniques is a potential bottleneck for full-scale anaerobic digestion application, when monitoring community dynamics. Here, the bacterial and archaeal TRFLP profiles were compared with 16S rRNA gene amplicon profiles (Illumina platform) of 25 full-scale anaerobic digestion plants. The α-diversity analysis revealed a higher richness based on Illumina data, compared with the TRFLP data. This coincided with a clear difference in community organisation, Pareto distribution, and co-occurrence network statistics, i.e., betweenness centrality and normalised degree. The β-diversity analysis showed a similar clustering profile for the Illumina, bacterial TRFLP and archaeal TRFLP data, based on different distance measures and independent of phylogenetic identification, with pH and temperature as the two key operational parameters determining microbial community composition. The combined knowledge of temporal dynamics and projected clustering in the β-diversity profile, based on the TRFLP data, distinctly showed that TRFLP is a reliable technique for swift microbial community dynamics screening in full-scale anaerobic digestion plants.

    A Distance-Based Test of Association Between Paired Heterogeneous Genomic Data

    Due to rapid technological advances, a wide range of different measurements can be obtained from a given biological sample including single nucleotide polymorphisms, copy number variation, gene expression levels, DNA methylation and proteomic profiles. Each of these distinct measurements provides the means to characterize a certain aspect of biological diversity, and a fundamental problem of broad interest concerns the discovery of shared patterns of variation across different data types. Such data types are heterogeneous in the sense that they represent measurements taken at very different scales or described by very different data structures. We propose a distance-based statistical test, the generalized RV (GRV) test, to assess whether there is a common and non-random pattern of variability between paired biological measurements obtained from the same random sample. The measurements enter the test through distance measures which can be chosen to capture particular aspects of the data. An approximate null distribution is proposed to compute p-values in closed-form and without the need to perform costly Monte Carlo permutation procedures. Compared to the classical Mantel test for association between distance matrices, the GRV test has been found to be more powerful in a number of simulation settings. We also report on an application of the GRV test to detect biological pathways in which genetic variability is associated with variation in gene expression levels in ovarian cancer samples, and present results obtained from two independent cohorts.
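
For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of a classical Mantel permutation test in plain Python. It is not the GRV test, whose statistic and closed-form null distribution are not given in the abstract. It illustrates the Monte Carlo permutation procedure that the GRV test is designed to avoid. The function names, the one-sided p-value, and the defaults `n_perm=999` and `seed=0` are assumptions made for illustration.

```python
import random
from math import sqrt


def upper_triangle(d):
    """Return the entries above the diagonal of a square matrix, row by row."""
    n = len(d)
    return [d[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant input assumed)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel_test(d1, d2, n_perm=999, seed=0):
    """One-sided Mantel test for positive association between two distance matrices.

    Rows and columns of d2 are permuted together, which relabels the samples
    while keeping d2 a valid distance matrix.
    """
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    idx = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        permuted = [[d2[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    points = [0, 1, 3, 7, 12]
    d1 = [[abs(a - b) for b in points] for a in points]
    d2 = [[2 * v for v in row] for row in d1]
    print(mantel_test(d1, d2))
```

The cost of this procedure grows with the number of permutations, which is the motivation the abstract gives for an approximate null distribution.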