3,083 research outputs found

    Clustering by compression

    We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric and normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we present new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Comment: LaTeX, 27 pages, 20 figures
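
    The distance described in this abstract can be sketched in a few lines. The sketch below uses the standard formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It assumes Python's zlib as the compressor; the function names and sample inputs are hypothetical, and this is not the authors' published software.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample inputs: two similar texts and one unrelated byte pattern.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"the quick brown fox leaps over the lazy cat " * 20
noise = bytes(range(256)) * 4

if __name__ == "__main__":
    print("similar texts:  ", ncd(text_a, text_b))
    print("unrelated data: ", ncd(text_a, noise))
```

    Values near 0 indicate that one input compresses well given the other; values near 1 indicate little shared structure. A pairwise matrix of such values is the kind of input a hierarchical clustering step would consume.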

    The Effects of the ADD Label on Teachers' Attitudes and Expectations

    Attention Deficit Disorder (ADD) is rapidly becoming an important educational issue. Although much research has been conducted into the effects of labelling and teachers' attitudes and expectations on children's academic and social behaviour, little research has been conducted into the relationship between the label 'ADD' and teachers' attitudes and expectations. The main purpose of this study was to determine the effects of the ADD label on teachers' attitudes and expectations for children with ADD. In addition, the effects of teachers' personal characteristics on their attitudes and expectations for children with ADD, and teachers' perceptions of issues surrounding ADD, were investigated. The study was conducted utilising self-report data collected from instruments consisting of one of two vignettes describing the typical ADD behaviours of a hypothetical child, and a Likert-type rating scale. Primary school teachers exposed to the vignette containing the ADD label formed the experimental group, while those who completed the vignette without the ADD label formed the control group. The results revealed the ADD label and teachers' personal characteristics had no effect on their attitudes and expectations regarding children with ADD. The results also showed teachers feel they need more resources (e.g., information, teaching strategies, support) in order to meet the needs of children with learning and behaviour disorders such as ADD.

    Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

    This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.

    Citations versus expert opinions: Citation analysis of Featured Reviews of the American Mathematical Society

    Peer review and citation metrics are two means of gauging the value of scientific research, but the lack of publicly available peer review data makes the comparison of these methods difficult. Mathematics can serve as a useful laboratory for considering these questions because as an exact science, there is a narrow range of reasons for citations. In mathematics, virtually all published articles are post-publication reviewed by mathematicians in Mathematical Reviews (MathSciNet) and so the data set was essentially the Web of Science mathematics publications from 1993 to 2004. For a decade, especially important articles were singled out in Mathematical Reviews for featured reviews. In this study, we analyze the bibliometrics of elite articles selected by peer review and by citation count. We conclude that the two notions of significance described by being a featured review article and being highly cited are distinct. This indicates that peer review and citation counts give largely independent determinations of highly distinguished articles. We also consider whether hiring patterns of subfields and mathematicians' interest in subfields reflect subfields of featured review or highly cited articles. We reexamine data from two earlier studies in light of our methods for implications on the peer review/citation count relationship to a diversity of disciplines. Comment: 21 pages, 3 figures, 4 tables

    The construction of a Business English curriculum, relevant to the workplace, and making use of word processing in place of handwriting

    Since the Thailand economic crisis in 1997, there has been a sense of urgency expressed in many areas of society that businesses must modernize their practices and focus more on international trade and communication. Two important components of the changes required are better use of Information and Communication Technologies (ICT) and better use of the English language for business communication. In the education arena, this has translated into the need to provide graduates with better skills in the use of English and computers. These two skill areas come together naturally in the study of Business English. In Thailand, Rajabhat Institutes have a major responsibility for the training of business professionals and for the improvement of local communities. Therefore, research is required to determine how best Thai Rajabhat may improve the provision of Business English to better serve the needs of employing organizations and the local community. This study set out to conduct research to address this area of concern.

    The elephant in the record: On the multiplicity of data recording work

    This article focuses on the production side of clinical data work, or data recording work, and in particular on its multiplicity in terms of data variability. We report the findings from two case studies aimed at assessing the multiplicity that can be observed when the same medical phenomenon is recorded by multiple competent experts, yet the recorded data enable the knowledgeable management of illness trajectories. This multiplicity is often framed in terms of the latent unreliability of medical data and then treated as a problem to solve. We argue that practitioners in the health informatics field must gain a greater awareness of the natural variability of data inscribing work, assess it, and design solutions that allow actors on both sides of clinical data work, that is, the production and care, as well as the primary and secondary uses of data, to aptly inform each other’s practices. Cabitza, F.; Locoro, A.; Alderighi, C.; Rasoini, R.; Compagnone, D.; Berjano, P.

    Feasibility of Melville Marginalia Authorship Differentiation

    We examine the feasibility of using image processing techniques to determine differentiation in authorship of historical pencil marks. Pencil marks with unattributed and attributed authorship are segmented from digital images of historical books. Analysis is performed on five features that are extracted from the vertical pencil marks, with those features used as a basis for authorship of marks. These marks consist of single stroke marks that are interspersed in the same document. We describe the challenges of the digital format that we were given and the steps taken in using autonomous segmentation to save pixel locations of marks. Five mark features are chosen and extracted: Average Intensity, Stroke Width, Blurriness, Stroke Curvature, and Stroke Angle. Features are then analyzed with the use of different histograms, 2D scatter plots of feature space, and comparing and contrasting the two groups of marks. C-means clustering is performed on the feature spaces of both groups. Semi-supervised clustering is used to test if we can predict the clustering. We then use two forms of cluster validity, Davies-Bouldin Index and Silhouette, to produce a confidence value on the number of clusters and their membership. Then we look at the histograms and 2D scatter plots with the Melville’s Marginalia Online attributed and unattributed labels applied. The extracted features show patterns and trends within the marks that could be used to group marks. Specifically, Stroke Curvature became a dominant feature that showed promise for differentiating marks created by different authors. Extracting features has the potential to be used with high confidence in separating marks by author.
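
    One of the two cluster validity measures named in this abstract, the silhouette, can be illustrated with a minimal sketch. It assumes Euclidean distance and toy 2D points standing in for the extracted mark features; the function name and data are hypothetical and do not reproduce the authors' pipeline.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of 2D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

if __name__ == "__main__":
    print("good labelling:", silhouette(points, good_labels))
    print("bad labelling: ", silhouette(points, bad_labels))
```

    Scores close to 1 suggest compact, well-separated clusters, so comparing the score across candidate cluster counts is one way to obtain the kind of confidence value the abstract mentions.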