Clustering by compression
We present a new method for clustering based on compression. The method
does not use subject-specific features or background knowledge, and works as
follows: First, we determine a universal similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed data files
(singly and in pairwise concatenation). Second, we apply a hierarchical
clustering method. The NCD is universal in that it is not restricted to a
specific application area, and works across application area boundaries. A
theoretical precursor, the normalized information distance, co-developed by one
of the authors, is provably optimal but uses the non-computable notion of
Kolmogorov complexity. We propose precise notions of a similarity metric and
a normal compressor, and show that the NCD based on a normal compressor is a
similarity metric that approximates universality. To extract a hierarchy of clusters from
the distance matrix, we determine a dendrogram (binary tree) by a new quartet
method and a fast heuristic to implement it. The method is implemented and
available as public software, and is robust under choice of different
compressors. To substantiate our claims of universality and robustness, we
report evidence of successful application in areas as diverse as genomics,
virology, languages, literature, music, handwritten digits, astronomy, and
combinations of objects from completely different domains, using statistical,
dictionary, and block sorting compressors. In genomics we presented new
evidence for major questions in Mammalian evolution, based on
whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta
hypothesis against the Theria hypothesis.

Comment: LaTeX, 27 pages, 20 figures
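The NCD computation the abstract describes, from the lengths of singly and pairwise-concatenated compressed files, can be sketched in a few lines. The following is a minimal illustration using Python's zlib as a stand-in statistical/dictionary compressor; it is not the authors' released software, and the sample byte strings are invented for the example.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length under a fixed compressor."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar objects compress well when concatenated, so their NCD is smaller.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy dog " * 20
c = bytes((i * 89 + 41) % 256 for i in range(900))  # poorly compressible noise

print(f"NCD(a, b) = {ncd(a, b):.3f}")
print(f"NCD(a, c) = {ncd(a, c):.3f}")
```

Because real compressors are imperfect approximations of Kolmogorov complexity, values can slightly exceed 1; what matters for the hierarchical clustering step is the relative ordering of distances in the matrix.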
Information Overload: An Overview
For almost as long as there has been recorded information, there has been a perception that humanity has been overloaded by it. Concerns about 'too much to read' have been expressed for many centuries, and made more urgent since the arrival of ubiquitous digital information in the late twentieth century. The historical perspective is a necessary corrective to the often, and wrongly, held view that overload is associated solely with the modern digital information environment, and with social media in particular. However, as society fully experiences Floridi's Fourth Revolution and moves into hyper-history (with society dependent on, and defined by, information and communication technologies) and the infosphere (an information environment distinguished by a seamless blend of online and offline information activity), individuals and societies are dependent on, and formed by, information in an unprecedented way, and information overload needs to be taken more seriously than ever. Overload has been claimed to be both the major issue of our time and a complete non-issue. It has been cited as an important factor in, inter alia, science, medicine, education, politics, governance, business and marketing, planning for smart cities, access to news, personal data tracking, home life, use of social media, and online shopping, and has even influenced literature. The information overload phenomenon has been known by many different names, including: information overabundance, infobesity, infoglut, data smog, information pollution, information fatigue, social media fatigue, social media overload, information anxiety, library anxiety, infostress, infoxication, reading overload, communication overload, cognitive overload, information violence, and information assault.
There is no single generally accepted definition, but information overload can best be understood as the situation which arises when there is so much relevant and potentially useful information available that it becomes a hindrance rather than a help. Its essential nature has not changed with changing technology, though its causes and proposed solutions have changed much. The best ways of avoiding overload, individually and socially, appear to lie in a variety of coping strategies, such as filtering, withdrawing, queuing, and 'satisficing'. Better design of information systems, effective personal information management, and the promotion of digital and media literacies also have a part to play. Overload may perhaps best be overcome by seeking a mindful balance in consuming information, and in finding understanding.
The Effects of the ADD Label on Teachers\u27 Attitudes and Expectations
Attention Deficit Disorder (ADD) is rapidly becoming an important educational issue. Although much research has been conducted into the effects of labelling and teachers' attitudes and expectations on children's academic and social behaviour, little research has been conducted into the relationship between the label 'ADD' and teachers' attitudes and expectations. The main purpose of this study was to determine the effects of the ADD label on teachers' attitudes and expectations for children with ADD. In addition, the effects of teachers' personal characteristics on their attitudes and expectations for children with ADD, and teachers' perceptions of issues surrounding ADD, were investigated. The study was conducted utilising self-report data collected from instruments consisting of one of two vignettes describing the typical ADD behaviours of a hypothetical child, and a Likert-type rating scale. Primary school teachers exposed to the vignette containing the ADD label formed the experimental group, while those who completed the vignette without the ADD label formed the control group. The results revealed the ADD label and teachers' personal characteristics had no effect on their attitudes and expectations regarding children with ADD. The results also showed teachers feel they need more resources (e.g., information, teaching strategies, support) in order to meet the needs of children with learning and behaviour disorders such as ADD.
Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm
This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a transcription cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
Citations versus expert opinions: Citation analysis of Featured Reviews of the American Mathematical Society
Peer review and citation metrics are two means of gauging the value of
scientific research, but the lack of publicly available peer review data makes
the comparison of these methods difficult. Mathematics can serve as a useful
laboratory for considering these questions because as an exact science, there
is a narrow range of reasons for citations. In mathematics, virtually all
published articles are post-publication reviewed by mathematicians in
Mathematical Reviews (MathSciNet) and so the data set was essentially the Web
of Science mathematics publications from 1993 to 2004. For a decade, especially
important articles were singled out in Mathematical Reviews for featured
reviews. In this study, we analyze the bibliometrics of elite articles selected
by peer review and by citation count. We conclude that the two notions of
significance described by being a featured review article and being highly
cited are distinct. This indicates that peer review and citation counts give
largely independent determinations of highly distinguished articles. We also
consider whether hiring patterns of subfields and mathematicians' interest in
subfields reflect subfields of featured review or highly cited articles. We
reexamine data from two earlier studies in light of our methods, for their
implications on the peer review/citation count relationship across a diversity
of disciplines.

Comment: 21 pages, 3 figures, 4 tables
The construction of a Business English curriculum, relevant to the workplace, and making use of word processing in place of handwriting
Since the Thailand economic crisis in 1997 there has been a sense of urgency expressed in many areas of society that businesses must modernize their practices and focus more on international trade and communication. Two important components of the changes required are better use of Information and Communication Technologies (ICT) and better use of the English language for business communication. In the education arena this has translated into the need to provide graduates with better skills in the use of English and computers. These two skill areas come together naturally in the study of Business English. In Thailand, Rajabhat Institutes have a major responsibility for the training of business professionals and for the improvement of local communities. Therefore, research is required to determine how Thai Rajabhat Institutes may best improve the provision of Business English to better serve the needs of employing organizations and the local community. This study set out to conduct research to address this area of concern.
The elephant in the record: On the multiplicity of data recording work
This article focuses on the production side of clinical data work, or data recording work, and in particular on its multiplicity in terms of data variability. We report the findings from two case studies aimed at assessing the multiplicity that can be observed when the same medical phenomenon is recorded by multiple competent experts, yet the recorded data enable the knowledgeable management of illness trajectories. Although this variability is often framed in terms of the latent unreliability of medical data, and then treated as a problem to solve, we argue that practitioners in the health informatics field must gain a greater awareness of the natural variability of data inscribing work, assess it, and design solutions that allow actors on both sides of clinical data work (production and care, as well as the primary and secondary uses of data) to aptly inform each other's practices.

Cabitza, F.; Locoro, A.; Alderighi, C.; Rasoini, R.; Compagnone, D.; Berjano, P.
Feasibility of Melville Marginalia Authorship Differentiation
We examine the feasibility of using image processing techniques to determine differentiation in authorship of historical pencil marks. Pencil marks with unattributed and attributed authorship are segmented from digital images of historical books. Analysis is performed on five features that are extracted from the vertical pencil marks, with those features used as a basis for authorship of marks. These marks consist of single-stroke marks that are interspersed in the same document. We describe the challenges of the digital format that we were given and the steps taken in using autonomous segmentation to save the pixel locations of marks. Five mark features are chosen and extracted: Average Intensity, Stroke Width, Blurriness, Stroke Curvature, and Stroke Angle. Features are then analyzed with the use of different histograms and 2D scatter plots of the feature space, comparing and contrasting the two groups of marks. C-means clustering is performed on the feature spaces of both groups. Semi-supervised clustering is used to test whether we can predict the clustering. We then use two forms of cluster validity, the Davies-Bouldin Index and the Silhouette, in order to produce a confidence value on the number of clusters and their membership. Then we look at the histograms and 2D scatter plots with the Melville's Marginalia Online attributed and unattributed labels applied. Extracting features shows patterns and trends within the marks that could be used to group marks. In particular, Stroke Curvature became a dominant feature that showed promise for differentiating marks created by different authors. Feature extraction has the potential to be used with high confidence in separating marks by author.
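The cluster-validity step named in this abstract can be illustrated with a minimal, from-scratch sketch of the Silhouette coefficient (one of the two indices mentioned; Davies-Bouldin works analogously from within- and between-cluster scatter). The feature vectors and labels below are invented toy data, not the marginalia features from the study.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.
    For each point: a = mean distance to its own cluster,
    b = mean distance to the nearest other cluster,
    s = (b - a) / max(a, b); values near 1 indicate tight, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same)
        other = {}  # distances grouped by foreign cluster label
        for j, q in enumerate(points):
            if labels[j] != labels[i]:
                other.setdefault(labels[j], []).append(math.dist(p, q))
        b = min(sum(d) / len(d) for d in other.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters in a 2D feature space.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])  # matches the true grouping
bad = silhouette(pts, [0, 1, 0, 1, 0, 1])   # scrambled labels
print(f"good labeling: {good:.3f}, scrambled labeling: {bad:.3f}")
```

Comparing the score across candidate numbers of clusters (and against scrambled labelings) is what turns such an index into a confidence value for cluster count and membership.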