
    Deep R Programming

    Deep R Programming is a comprehensive course on one of the most popular languages in data science (statistical computing, graphics, machine learning, data wrangling and analytics). It introduces the base language in-depth and is aimed at ambitious students, practitioners, and researchers who would like to become independent users of this powerful environment. This textbook is a non-profit project. Its online and PDF versions are freely available at . This early draft is distributed in the hope that it will be useful. (Comment: draft v0.2.1, 2023-04-27.)

    A Framework for Benchmarking Clustering Algorithms

    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at
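    The core of such an evaluation is comparing a predicted partition against one or more reference labelings and keeping the best match. A minimal sketch in Python (the framework itself exposes a Python API, but the function names below are illustrative assumptions, not its actual interface):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree
    (the pair is either together in both, or apart in both)."""
    assert len(labels_a) == len(labels_b)
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

def best_agreement(pred, references):
    """A dataset may ship several equally valid reference labelings;
    score a prediction against the best-matching one."""
    return max(rand_index(pred, ref) for ref in references)

# Label values do not matter, only the grouping: [0,0,1,1] matches [1,1,2,2].
score = best_agreement([0, 0, 1, 1], [[0, 1, 0, 1], [1, 1, 2, 2]])  # -> 1.0
```

    In practice one would use an adjusted-for-chance variant (e.g. the adjusted Rand index), but the pair-counting idea is the same.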

    stringi: Fast and Portable Character String Processing in R

    Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician's or data scientist's repertoire to complement their numerical computing and data wrangling skills.
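    stringi is an R package, so as a language-neutral illustration of the Unicode normalization problem that its ICU backend solves, here is a sketch using Python's standard-library unicodedata module:

```python
import unicodedata

# "café" can be encoded two ways: bytes differ, meaning is identical.
composed = "caf\u00e9"      # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by a combining acute accent (U+0301)

# Naive comparison sees two different code-point sequences.
print(composed == decomposed)                                 # False
# After NFC normalization the two strings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

    Without such normalization, pattern searching, sorting, and deduplication silently miss matches; this is exactly the class of pitfalls a portable ICU-based toolkit handles for the user.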

    Clustering with minimum spanning trees: How good can it be?

    Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they can be meaningful in data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can overall be very competitive. Next, instead of proposing yet another algorithm that performs well on a limited set of examples, we review, study, extend, and generalise the existing state-of-the-art MST-based partitioning schemes, which leads to a few new and interesting approaches. It turns out that the Genie method and the information-theoretic approaches often outperform the non-MST algorithms such as k-means, Gaussian mixtures, spectral clustering, BIRCH, and classical hierarchical agglomerative procedures.
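    The basic recipe behind many MST-based partitioning schemes is: build the tree, remove its k-1 heaviest edges, and take the connected components as the k clusters. A minimal sketch (plain Prim's algorithm plus union-find; the Genie and information-theoretic variants discussed in the paper use more refined edge-selection rules):

```python
import math

def mst_edges(points):
    """Prim's algorithm: return the n-1 edges (i, j, dist) of the
    Euclidean minimum spanning tree of a small point set."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist(*e))
        edges.append((i, j, dist(i, j)))
        in_tree.add(j)
    return edges

def mst_clusters(points, k):
    """Drop the k-1 longest MST edges; connected components = clusters."""
    kept = sorted(mst_edges(points), key=lambda e: e[2])[:len(points) - k]
    parent = list(range(len(points)))          # union-find over kept edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

# Two well-separated pairs split into two clusters.
labels = mst_clusters([(0, 0), (0, 1), (5, 0), (5, 1)], k=2)
```

    This naive construction is quadratic per step; practical implementations use dedicated MST algorithms, but the cut-the-longest-edges principle is the same.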

    Hierarchical Clustering with OWA-based Linkages, the Lance-Williams Formula, and Dendrogram Inversions

    Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.
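    An OWA operator applies a weight vector to the *sorted* values, so different weight generators recover different classical linkages as special cases. A small sketch (the function names and the scalar setting are illustrative assumptions, not the paper's notation):

```python
def owa(values, weights):
    """Ordered weighted average: weights are applied to the values
    sorted in decreasing order."""
    assert len(values) == len(weights) and abs(sum(weights) - 1.0) < 1e-9
    return sum(w * v for w, v in zip(weights, sorted(values, reverse=True)))

def owa_linkage(cluster_a, cluster_b, dist, weight_gen):
    """Intercluster distance = OWA of all pairwise point distances."""
    d = [dist(a, b) for a in cluster_a for b in cluster_b]
    return owa(d, weight_gen(len(d)))

# Classical linkages as special weight generators:
complete = lambda n: [1.0] + [0.0] * (n - 1)   # all weight on the largest distance
single   = lambda n: [0.0] * (n - 1) + [1.0]   # all weight on the smallest distance
average  = lambda n: [1.0 / n] * n             # uniform weights -> mean distance
```

    Intermediate weight vectors (e.g. mass on the few largest distances, or trimming the extremes) yield the nearest/farthest-neighbour and trimmed-mean linkages mentioned in the abstract.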

    The use of fuzzy relations in the assessment of information resources producers' performance

    The producers assessment problem has many important practical instances: it is an abstract model for intelligent systems evaluating, e.g., the quality of computer software repositories, web resources, social networking services, and digital libraries. Each producer's performance is determined not only by the overall quality of the items they have produced, but also by the number of such items (which may differ between agents). Recent theoretical results indicate that the use of aggregation operators in the process of ranking and evaluating producers may not necessarily lead to fair and plausible outcomes. Therefore, to overcome some weaknesses of the most commonly applied approach, in this preliminary study we advocate a fuzzy preference relation-based setting and indicate why it may provide better control over the assessment process.
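    One simple way to build a fuzzy preference relation over producers is from the fraction of pairwise quality comparisons that one producer's items win against another's. This is only an illustrative sketch of the general idea, not the construction studied in the paper:

```python
def preference_degree(a, b):
    """Degree (in [0, 1]) to which producer a is preferred to producer b:
    the fraction of item-quality comparisons that a wins."""
    wins = sum(x > y for x in a for y in b)
    return wins / (len(a) * len(b))

def rank_producers(producers):
    """Rank producers by total outgoing preference (a simple scoring
    of the fuzzy relation; many other exploitation rules exist)."""
    names = list(producers)
    score = {n: sum(preference_degree(producers[n], producers[m])
                    for m in names if m != n)
             for n in names}
    return sorted(names, key=score.get, reverse=True)

# Each producer = a list of per-item quality scores; lengths may differ.
ranking = rank_producers({"A": [5, 5, 4], "B": [5, 1], "C": [2, 2, 2, 2]})
```

    Unlike collapsing each producer to a single aggregated number first, the pairwise relation keeps information about *how often* and *against whom* a producer wins, which is the extra control the abstract alludes to.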

    Vector valued information measures and integration with respect to fuzzy vector capacities

    Integration with respect to vector-valued fuzzy measures is used to define and study information measuring tools. Motivated by some current developments in Information Science, we apply the integration of scalar functions with respect to vector-valued fuzzy measures, also called vector capacities. Bartle-Dunford-Schwartz integration (for the additive case) and Choquet-type integration (for the non-additive case) are considered, showing that these formalisms can be used to define and develop vector-valued impact measures. Examples related to existing bibliometric tools as well as to new measuring indices are given.
    The authors thank Prof. Dr. Olvido Delgado and the referee for their valuable comments and suggestions; the first author gratefully acknowledges the support of the Ministerio de Economía, Industria y Competitividad (Spain) under project MTM2016-77054-C2-1-P.
    Sánchez Pérez, E. A., & Szwedek, R. (2019). Vector valued information measures and integration with respect to fuzzy vector capacities. Fuzzy Sets and Systems, 355, 1-25. https://doi.org/10.1016/j.fss.2018.05.004
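    For the non-additive case, the discrete Choquet integral sorts the integrand and weights each increment by the capacity of the coalition of remaining items. A scalar-valued sketch (the paper works with vector-valued capacities; restricting to scalars here is a simplifying assumption for illustration):

```python
def choquet(values, capacity):
    """Discrete Choquet integral of `values` (indexed 0..n-1) with respect
    to a monotone set function `capacity` mapping frozensets to [0, 1]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, prev = 0.0, 0.0
    for pos, i in enumerate(order):
        coalition = frozenset(order[pos:])   # items whose value is >= values[i]
        total += (values[i] - prev) * capacity(coalition)
        prev = values[i]
    return total

# With an additive capacity the Choquet integral collapses to a weighted sum;
# non-additive capacities model interaction between the criteria.
uniform_additive = lambda s: len(s) / 3
needs_all = lambda s: 1.0 if len(s) == 3 else 0.0   # pessimistic: min operator
```

    With `uniform_additive`, choquet([1, 2, 3], ...) is the plain mean; with `needs_all` it returns the minimum, a first taste of how capacities encode aggregation attitudes.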

    Mathematical properties of weighted impact factors based on measures of prestige of the citing journals

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11192-015-1741-0
    An abstract construction for general weighted impact factors is introduced. We show that the classical weighted impact factors are particular cases of our model, but that it can also be used to define new impact measuring tools for other sources of information, such as repositories of datasets, providing the mathematical support for a new family of altmetrics. Our aim is to show the main mathematical properties of this class of impact measuring tools, which hold as consequences of their mathematical structure and do not depend on the definition of any given index in use today. In order to show the power of our approach in a well-known setting, we apply our construction to analyze the stability of the ordering induced in a list of journals by the 2-year impact factor (IF2). We study how this ordering changes when the criterion defining it is the numerical value of a new weighted impact factor in which IF2 is used to define the weights. We prove that, if we assume that the weight associated with a citing journal increases with its IF2, then the ordering given in the list by the new weighted impact factor coincides with the order defined by the IF2. We give a quantitative bound for the errors committed. We also show two examples of weighted impact factors defined by weights associated with the prestige of the citing journal for the fields of MATHEMATICS and MEDICINE, GENERAL AND INTERNAL, checking whether they satisfy the increasing behaviour mentioned above.
    Ferrer Sapena, A., Sánchez Pérez, E. A., González, L. M., Peset Mancebo, M. F., & Aleixandre Benavent, R. (2015). Mathematical properties of weighted impact factors based on measures of prestige of the citing journals. Scientometrics, 105(3), 2089-2108. https://doi.org/10.1007/s11192-015-1741-0