
    On the accuracy of language trees

    Historical linguistics aims to infer the most likely phylogenetic tree of languages from information about their evolutionary relatedness. The available information typically consists of lists of homologous (lexical, phonological, syntactic) features or characters for many different languages. From this perspective, the reconstruction of language trees is an example of an inverse problem: starting from present, incomplete, and often noisy information, one aims to infer the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models, so as to have full access to the evolutionary process one is going to infer. This procedure has an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution suits them best. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here, conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. To assess the accuracy of the inferred trees, we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores, we quantify the relative performance of the distance-based algorithms considered. Further, we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally, we draw some conclusions about where the accuracy of reconstructions in historical linguistics stands and about the leading directions to improve it.
    Comment: 36 pages, 14 figures
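As a point of reference for the "standard definitions of distances between trees" mentioned in this abstract, the following is a minimal sketch of one such standard score, the clade-based Robinson-Foulds distance between two rooted trees. It does not reproduce the two generalizations the paper introduces. The nested-tuple tree encoding, the function names `clades` and `rf_distance`, and the toy language groupings are all assumptions made for illustration.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested tuples.

    Leaves are strings; every internal node contributes the set of leaves
    below it as one clade. The encoding is an illustrative assumption.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def rf_distance(tree_a, tree_b):
    """Count the clades that appear in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must be defined over the same set of leaves")
    return len(clades_a ^ clades_b)


# Toy example: an inferred tree compared with a hypothetical expert grouping.
inferred = ((("Spanish", "Italian"), "French"), ("English", "German"))
expert = ((("Spanish", "French"), "Italian"), ("English", "German"))
print(rf_distance(inferred, expert))
```

A distance of zero means the two trees group the languages identically; each misplaced grouping adds to the score.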

    Engineering data compendium. Human perception and performance. User's guide

    The concept underlying the Engineering Data Compendium was the product of a research and development program (the Integrated Perceptual Information for Designers project) aimed at facilitating the application of basic research findings in human performance to the design of military crew systems. The principal objective was to develop a workable strategy for (1) identifying and distilling information of potential value to system design from the existing research literature, and (2) presenting this technical information in a way that would aid its accessibility, interpretability, and applicability by systems designers. The present four volumes of the Engineering Data Compendium represent the first implementation of this strategy. This is the first volume, the User's Guide, containing a description of the program and instructions for its use.

    Whose Tweet? Authorship analysis of micro-blogs and other short-form messages

    Approaches to authorship attribution have traditionally been constrained by the size of the message to which they can be successfully applied, making them unsuitable for analysing shorter messages such as SMS text messages, micro-blogs (e.g. Twitter) or instant messaging. Having many potential authors of a number of texts (as in, for example, an online context) has also proved problematic for traditional descriptive methods, which have tended to succeed only in cases where there is a small and closed set of possible authors. This paper reports the findings of a project which aimed to develop and automate techniques from forensic linguistics that have been successfully applied to the analysis of short message content in criminal cases. Using data drawn from UK-focused online groups within Twitter, the research extends the applicability of Grant’s (2007; 2010) stylistic and statistical techniques for the analysis of authorship of short texts into the online environment. Initial identification of distinctive textual features commonly found within short messages allows for the development of a taxonomy which can then be used when calculating the ‘distance’ between messages containing instances of these feature types. The end result is an automated process with a high level of success in assigning tweets to the correct author. The research has the potential to extend the scope of reliable and valid authorship analysis into hitherto unexplored contexts. Given the relative anonymity of the internet and the availability of cloaking technology, linguistic research of this nature represents a crucial contribution to the investigative toolkit.
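The idea of a feature-based ‘distance’ between messages can be illustrated with a small sketch. The abstract does not specify the measure used, so the Jaccard distance below is an assumption, as are the feature labels, the author names, and the helper names `jaccard_distance` and `attribute`.

```python
def jaccard_distance(features_a, features_b):
    """Return 1 minus the share of feature types the two sets have in common."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # two empty feature sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def attribute(message_features, author_profiles):
    """Pick the author whose known feature set is closest to the message."""
    return min(
        author_profiles,
        key=lambda author: jaccard_distance(message_features, author_profiles[author]),
    )


# Hypothetical feature types observed in each candidate author's known tweets.
profiles = {
    "author_1": {"emoticon", "initialism"},
    "author_2": {"letter_repetition", "g_clipping"},
}
disputed = {"emoticon", "initialism", "g_clipping"}
print(attribute(disputed, profiles))
```

A real system would need many more feature types and a way to handle messages that match no candidate well; this sketch only shows the nearest-profile step.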

    Analysis and Forecasting of Trending Topics in Online Media Streams

    Among the vast information available on the web, social media streams capture what people currently pay attention to and how they feel about certain topics. Awareness of such trending topics plays a crucial role in multimedia systems such as trend-aware recommendation and automatic vocabulary selection for video concept detection systems. Correctly utilizing trending topics requires a better understanding of their various characteristics in different social media streams. To this end, we present the first comprehensive study across three major online and social media streams (Twitter, Google, and Wikipedia), covering thousands of trending topics during an observation period of an entire year. Our results indicate that, depending on one's requirements, one does not necessarily have to turn to Twitter for information about current events, and that some media streams strongly emphasize content of specific categories. As our second key contribution, we further present a novel approach for the challenging task of forecasting the life cycle of trending topics at the very moment they emerge. Our fully automated approach is based on a nearest neighbor forecasting technique exploiting our assumption that semantically similar topics exhibit similar behavior. We demonstrate on a large-scale dataset of Wikipedia page view statistics that forecasts by the proposed approach are about 9-48k views closer to the actual viewing statistics compared to baseline methods and achieve a mean average percentage error of 45-19% for time periods of up to 14 days.
    Comment: ACM Multimedia 201
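The nearest neighbor idea in this abstract can be sketched as follows: forecast a new topic's views by averaging the view series of the most similar past topics. This is a toy illustration under stated assumptions, not the authors' pipeline. The cosine similarity over hand-made topic vectors, the value of `k`, the sample numbers, and the error measure (taken here to be the usual mean absolute percentage error) are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def forecast(query_vec, history, k=2):
    """Average the view series of the k past topics most similar to the query.

    history is a list of (topic_vector, daily_views) pairs whose series
    all have the same length.
    """
    nearest = sorted(history, key=lambda item: cosine(query_vec, item[0]), reverse=True)[:k]
    horizon = len(nearest[0][1])
    return [sum(series[day] for _, series in nearest) / len(nearest) for day in range(horizon)]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical past topics: (semantic vector, page views over two days).
past_topics = [
    ([1.0, 0.0], [100, 50]),
    ([0.9, 0.1], [200, 100]),
    ([0.0, 1.0], [10, 10]),
]
prediction = forecast([1.0, 0.0], past_topics, k=2)
print(prediction, mape([160, 70], prediction))
```

Averaging neighbors is the simplest aggregation choice; weighting each neighbor by its similarity would be a natural variation.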

    Publishing in paleontology

    The structure of paleontological publishing depends basically on the fact that paleontology (a) is a very broad subject but employs relatively few specialists, (b) needs both massive idiographic representation and a proportional increase in nomothetic discussion, and (c) is divided between the earth sciences and the life sciences. Publishing is still carried out through outdated series covering varied topics, and paleontologists are only slowly beginning to understand the need for structured presentation and for channeling research results. Symposium volumes contribute considerably to the deterioration of paleontological publishing because of their insufficient circulation, inadequate quality control, and the insufficient accessibility of their articles through secondary services. Insufficient dissemination is, however, admirably compensated by the circulation of offprints channeled through catalogues and newsletters. Synoptic publication offers an imminent solution to the economic problem of idiographic paleontology, but it is not gaining ground. Nevertheless, the burial of idiographic paleontology in the "grey literature" is not yet complete. The lowering of standards in school education has repercussions for literary style, terminology, and nomenclature. Internationalism is gaining ground and should be promoted. Idiographic paleontology will advance more slowly than other branches of the natural sciences in adapting from print on paper to microfiche and electronic communication. This is due to the inherent need for adequate illustrations and simultaneous comparison, and also to the lack of procedures for handling successively updated material and to the requirements of the codes of biological nomenclature.