Search CORE

7,125 research outputs found

Prompting the data transformation activities for cluster analysis on collections of documents

Author: Cerquitelli T.
Chiusano S.
Di Corso E.
Ventura F.
Publication venue: CEUR Workshop Proceedings
Publication date: 01/01/2017

In this work we argue for a new self-learning engine able to suggest to the analyst good transformation methods and weighting schemas for a given data collection. This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on an engine capable of exploring different data weighting schemas (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), evaluating and comparing solutions through different quality indices (e.g., weighted Silhouette), and presenting the top three solutions to the analyst. SELF-DATA will also include a knowledge database storing results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses. SELF-DATA’s current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. The preliminary validation performed on 4 collections of documents highlights that the TF-IDF and logarithmic entropy weighting methods are effective at measuring item relevance with sparse datasets, and that the LSI method outperforms PCA in the presence of a larger feature domain.
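As an illustration of the two weighting schemas named in the abstract above, the following is a minimal sketch in plain Python, not the SELF-DATA implementation (which runs on Apache Spark). The function names, the whitespace tokenizer, and the toy corpus are assumptions made for this example. TF-IDF is computed as normalized term frequency times log(N / df); log-entropy uses the local weight log(1 + tf) and the global weight 1 + sum_j(p_ij log p_ij) / log N, where p_ij is the share of a term's occurrences that fall in document j.

```python
import math
from collections import Counter


def term_counts(docs):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(docs):
    """Normalized term frequency times inverse document frequency."""
    counts = term_counts(docs)
    n_docs = len(counts)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    return [
        {t: (tf / sum(c.values())) * math.log(n_docs / df[t]) for t, tf in c.items()}
        for c in counts
    ]


def log_entropy(docs):
    """Local weight log(1 + tf) times an entropy-based global weight."""
    counts = term_counts(docs)
    n_docs = len(counts)
    # Global frequency: total occurrences of each term in the collection.
    gf = Counter()
    for c in counts:
        gf.update(c)
    global_w = {}
    for term, total in gf.items():
        entropy = sum(
            (c[term] / total) * math.log(c[term] / total)
            for c in counts
            if c[term] > 0
        )
        # With a single document the entropy term is undefined; fall back to 1.
        global_w[term] = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
    return [
        {t: math.log(1 + tf) * global_w[t] for t, tf in c.items()}
        for c in counts
    ]


if __name__ == "__main__":
    docs = ["spark cluster spark", "cluster analysis", "topic cluster"]
    print(tf_idf(docs)[0])
    print(log_entropy(docs)[0])
```

In this toy corpus, a term spread evenly over every document ("cluster") receives a weight of about zero under both schemas, while a term concentrated in one document ("spark") keeps a positive weight. This is the sense in which both schemas measure item relevance in sparse collections.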

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Useful ToPIC: Self-tuning strategies to enhance Latent Dirichlet Allocation

Author: Cerquitelli Tania
DI CORSO Evelina
Proto Stefano
Ventura Francesco
Publication venue: IEEE

ToPIC (Tuning of Parameters for Inference of Concepts) is a distributed self-tuning engine whose aim is to cluster collections of textual data into correlated groups of documents through a topic modeling methodology (i.e., LDA). ToPIC includes automatic strategies to relieve the end-user of the burden of selecting proper values for the overall analytics process. ToPIC's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. As a case study, ToPIC has been validated on three real collections of textual documents characterized by different distributions. The experimental results show the effectiveness and efficiency of the proposed solution in analyzing collections of documents without tuning algorithm parameters and in discovering cohesive and well-separated groups of documents with a similar topic.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Simplifying Text Mining Activities: Scalable and Self-Tuning Methodology for Topic Detection and Characterization

Author: Bartolomeo Vacchetti
Evelina Di Corso
Paolo Bethaz
Stefano Proto
Tania Cerquitelli
Publication venue: MDPI AG
Publication date: 01/01/2022

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification

Author: Cao Lele
Catovic Armin
Granroth-Wilding Mark
McCornack Andrew
Rocha Dhiana Deva Cavacanti
Stahl Richard Anselmo
von Ehrenheim Vilhelm
Publication date: 25/09/2023

In the investment industry, it is often essential to carry out fine-grained company similarity quantification for a range of purposes, including market mapping, competitor analysis, and mergers and acquisitions. We propose and publish a knowledge graph, named CompanyKG, to represent and learn diverse company features and relations. Specifically, 1.17 million companies are represented as nodes enriched with company description embeddings; and 15 different inter-company relations result in 51.06 million weighted edges. To enable a comprehensive assessment of methods for company similarity quantification, we have devised and compiled three evaluation tasks with annotated test sets: similarity prediction, competitor retrieval and similarity ranking. We present extensive benchmarking results for 11 reproducible predictive methods categorized into three groups: node-only, edge-only, and node+edge. To the best of our knowledge, CompanyKG is the first large-scale heterogeneous graph dataset originating from a real-world investment platform, tailored for quantifying inter-company similarity.

Comment: Paper (13 pages, 5 figures and 2 tables) + Appendix (18 pages, 4 figures and 5 tables)

arXiv.org e-Print Archive

Text miner's little helper: scalable self-tuning methodologies for knowledge exploration

Author: DI CORSO Evelina
Publication venue: Politecnico di Torino

The abstract is in the attachment.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Agents and artefacts in the emerging electric vehicle space

Author: Alboni Fabrizio
Bonifati Giovanni
Carreto Sanginés Jorge
Pavone Pasquale
Russo Margherita
Simonazzi Annamaria
Publication venue: Inderscience Publishers
Publication date: 01/01/2022

After COP 21, the targets for reducing CO2 emissions have boosted the commitment of governments and companies to developing alternative technologies for the mobility of people and goods. Electric vehicles are at the heart of this transformation, which is profoundly affecting the characteristics of agents and artefacts. The aim of the paper is to identify the relevant domains of this transformation, and to identify what characterises the space of the agents and artefacts of the electric vehicle and their interactions, as oriented by the public policies promoted by the various countries. The paper presents the results of a multidimensional textual analysis of the news published in English by electrive.com, a daily newsletter covering a wide range of relevant information on developments in electric transport in Europe and beyond. These results are a preliminary step for the analysis of the social, economic, organisational and technological changes related to sustainable mobility.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

NMC Horizon Report: 2017 Library Edition

Publication venue: University of Applied Sciences (HTW) Chur
Publication date: 06/06/2017

What is on the five-year horizon for academic and research libraries? Which trends and technology developments will drive transformation? What are the critical challenges and how can we strategize solutions? These questions regarding technology adoption and educational change steered the discussions of 77 experts to produce the NMC Horizon Report: 2017 Library Edition, in partnership with the University of Applied Sciences (HTW) Chur, Technische Informationsbibliothek (TIB), ETH Library, and the Association of College & Research Libraries (ACRL). Six key trends, six significant challenges, and six developments in technology profiled in this report are poised to impact library strategies, operations, and services with regard to learning, creative inquiry, research, and information management. The three sections of this report constitute a reference and technology planning guide for librarians, library leaders, library staff, policymakers, and technologists.

IssueLab

The NMC Horizon Report: 2015 Library Edition

Author: New Media Consortium
Publication date: 01/01/2015

ÉDUQ

Practicing Integrity

Publication date: 22/01/2021

The IT University of Copenhagen's Repository