36 research outputs found
Text miner's little helper: scalable self-tuning methodologies for knowledge exploration
The abstract is in the attachment.
Characterizing Thermal Energy Consumption through Exploratory Data Mining Algorithms
Nowadays large volumes of energy data are continuously collected through a variety of meters from different smart-city environments. Such data have great potential to influence the overall energy balance of our communities by optimizing building energy consumption and by raising people's awareness of energy waste. This paper presents FARTEC, a data mining engine based on exploratory and unsupervised data mining algorithms to characterize building energy consumption together with meteorological conditions. FARTEC exploits a joint approach coupling cluster analysis and association rules. First, a partitional clustering algorithm is applied to weather conditions to discover groups of thermal energy consumption that occurred in similar weather conditions. Each computed cluster is then locally characterized through a set of association rules to ease the manual inspection of the most interesting correlations between thermal consumption and weather conditions. FARTEC also includes a categorization of the rules into a few groups according to their meaning. Each group is determined by the data features appearing in the rule. The experimental evaluation performed on real datasets demonstrates the effectiveness of the proposed approach in discovering interesting knowledge items to raise people's awareness of their energy consumption.
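The two-step pipeline described in this abstract (cluster the weather conditions, then mine association rules inside each cluster) can be sketched roughly as follows. This is an illustrative sketch only, not FARTEC's implementation: k-means stands in for the unspecified partitional algorithm, rules are limited to one item on each side, and every name, threshold, and data value below is hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def association_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (lhs, rhs, support, confidence) with one item on each side."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue  # the item pair is too rare in this cluster
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, count / n, confidence))
    return sorted(rules)


def rules_per_cluster(weather, items, k):
    """Cluster the weather vectors, then mine rules separately in each cluster."""
    labels = kmeans(weather, k)
    return {
        c: association_rules([t for t, lab in zip(items, labels) if lab == c])
        for c in range(k)
    }


# Hypothetical records: (outdoor temperature in °C, relative humidity in %).
weather = [(-2.0, 80.0), (0.0, 75.0), (1.0, 85.0), (15.0, 40.0), (17.0, 45.0), (16.0, 50.0)]
# Hypothetical discretized items describing the same records.
items = [
    {"temp=cold", "heat=high"},
    {"temp=cold", "heat=high"},
    {"temp=cold", "heat=high"},
    {"temp=mild", "heat=low"},
    {"temp=mild", "heat=low"},
    {"temp=mild", "heat=medium"},
]

if __name__ == "__main__":
    for cluster, rules in rules_per_cluster(weather, items, k=2).items():
        for lhs, rhs, support, confidence in rules:
            print(f"cluster {cluster}: {lhs} -> {rhs} (sup={support:.2f}, conf={confidence:.2f})")
```

In practice the weather features would normally be rescaled before clustering, so that no single attribute dominates the distance.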
Analyzing spatial data from twitter during a disaster
Social media can be an invaluable help in a mass emergency, but handling the information can be challenging. One major concern is identifying posts related to the area, or pinning them on a map. This exploratory study analyzes the spatial data coming with tweets during two natural disasters, an earthquake and a hurricane. Geo-tagged tweets are confirmed to be a small fraction of all tweets, and disasters within a limited region appear to be a niche topic in the whole stream. The results can help researchers and practitioners design tools to identify these messages.
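As a rough illustration of the kind of measurement this study reports, the sketch below computes the share of geo-tagged tweets in a collection and keeps those falling inside a bounding box. It assumes tweets shaped like Twitter API v1.1 objects, where `coordinates` is either null or a GeoJSON point in `[longitude, latitude]` order; the sample tweets, the bounding box, and the function names are hypothetical and are not taken from the study.

```python
def in_bbox(lon_lat, bbox):
    """True if a [longitude, latitude] pair lies inside (min_lon, min_lat, max_lon, max_lat)."""
    lon, lat = lon_lat
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def geotag_summary(tweets, bbox):
    """Return (share of geo-tagged tweets, geo-tagged tweets located inside bbox)."""
    geotagged = [t for t in tweets if t.get("coordinates")]
    in_area = [t for t in geotagged if in_bbox(t["coordinates"]["coordinates"], bbox)]
    share = len(geotagged) / len(tweets) if tweets else 0.0
    return share, in_area


# Hypothetical sample: only two of the four tweets carry a point location.
sample_tweets = [
    {"id": 1, "text": "strong shaking here", "coordinates": {"type": "Point", "coordinates": [11.0, 44.8]}},
    {"id": 2, "text": "thinking of everyone there", "coordinates": {"type": "Point", "coordinates": [-74.0, 40.7]}},
    {"id": 3, "text": "earthquake news", "coordinates": None},
    {"id": 4, "text": "is everyone ok?", "coordinates": None},
]
# Hypothetical area of interest: (min_lon, min_lat, max_lon, max_lat).
area = (10.0, 44.0, 12.0, 45.5)

if __name__ == "__main__":
    share, in_area = geotag_summary(sample_tweets, area)
    print(f"geo-tagged share: {share:.0%}, inside area: {[t['id'] for t in in_area]}")
```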
All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection
Natural disasters have become more frequent during the past 20 years due to significant climate changes. These natural events are hotly debated on social networks like Twitter, and a huge number of short text messages are continuously and promptly exchanged with personal opinions, descriptions of the natural events and their corresponding consequences. The analysis of these large and complex data could help policy-makers to better understand the event as well as to set priorities. However, the correct configuration of the tweet mining process is still challenging due to variable data distribution and the availability of a large number of algorithms with different specific parameters. The analyst needs to perform a large number of experiments to identify the best configuration for the overall knowledge discovery process. Innovative, scalable, and parameter-free solutions need to be explored to streamline the analytics process. This paper presents an enhanced version of PASTA (a distributed self-tuning engine) applied to a crisis tweet collection to group a corpus of tweets into cohesive and well-separated clusters with minimal analyst intervention. Experiments performed on real data collected during natural disasters show the effectiveness of PASTA in discovering interesting groups of correlated tweets without the analyst having to select either the algorithms or their parameters.
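The abstract does not say how PASTA tunes itself. One common way to pick a clustering configuration without analyst input is to try several cluster counts and keep the one whose result is most cohesive and best separated, for example as measured by the average silhouette. The sketch below illustrates that general idea only; it is not PASTA's method. The 2-D points are hypothetical stand-ins for tweet vectors, and all function names are made up for the example.

```python
import math


def kmeans(points, k, iters=25):
    """Deterministic k-means seeded with k points spread evenly over the input order."""
    centroids = [points[(i * len(points)) // k] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def silhouette(points, labels):
    """Average silhouette: near 1 means cohesive, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)  # singleton cluster, or no other cluster to compare with
            continue
        a = sum(own) / len(own)                            # cohesion
        b = min(sum(d) / len(d) for d in dists.values())   # separation
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


def self_tune(points, candidates=(2, 3, 4)):
    """Pick the cluster count with the highest average silhouette."""
    best_k = max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))
    return best_k, kmeans(points, best_k)


# Hypothetical 2-D stand-ins for tweet vectors: three well-separated groups.
vectors = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

if __name__ == "__main__":
    best_k, labels = self_tune(vectors)
    print(best_k, labels)
```

The same selection loop could range over algorithms as well as parameter values, since the silhouette only needs the points and the resulting labels.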
Towards automated visualisation of scientific literature
Nowadays, an exponential growth in biological data has been recorded, including both structured and unstructured data. One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured textual corpora to effectively support the decision-making process. Since the emergence of topic modelling, new and interesting approaches to compactly represent the content of a document collection have been proposed. However, the effective exploitation of the proposed strategies requires a lot of expertise. This paper presents a new scalable and exploratory data visualisation engine, named ACE-HEALTH (AutomatiC Exploration of textual collections for HEALTH-care), whose aim is to make medical document collections easy to analyse through Latent Dirichlet Allocation. To streamline the analytics process and enhance the effectiveness of data and knowledge exploration, a variety of data visualisation techniques have been integrated into the engine to provide navigable informative dashboards without requiring any a priori knowledge of the analytics techniques. Preliminary results obtained on a real PubMed collection show the effectiveness of ACE-HEALTH in correctly capturing the high-level overview of textual medical collections through innovative visualisation techniques.
Useful ToPIC: Self-tuning strategies to enhance Latent Dirichlet Allocation
ToPIC (Tuning of Parameters for Inference of Concepts) is a distributed self-tuning engine whose aim is to cluster collections of textual data into correlated groups of documents through a topic modeling methodology (i.e., LDA). ToPIC includes automatic strategies to relieve the end-user of the burden of selecting proper values for the overall analytics process. ToPIC's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. As a case study, ToPIC has been validated on three real collections of textual documents characterized by different distributions. The experimental results show the effectiveness and efficiency of the proposed solution in analyzing collections of documents without tuning algorithm parameters and in discovering cohesive and well-separated groups of documents with a similar topic.
Exploring energy performance certificates through visualization
Energy Performance Certificates (EPCs) provide interesting information on the standard-based calculation of energy performance and on the thermo-physical and geometrical properties of a building. Because of the volume of available data (issued as open data) and the heterogeneity of the attributes, the exploration of these energy-related data collections is challenging. This paper presents INDICE (INformative DynamiC dashboard Engine), a new data visualization framework able to automatically explore large collections of EPCs. INDICE explores EPCs through both querying and analytics tasks, and intuitively presents the output through informative dashboards. The latter include dynamic and interactive maps along with various informative charts, allowing different stakeholders (e.g., domain and non-domain expert users) to explore and interpret the extracted knowledge at different spatial granularity levels. The objective of INDICE is to create energy maps useful for the characterization of the energy performance of buildings located in different areas. The experimental evaluation, performed on a real set of EPCs related to a major region in the North West of Italy, demonstrates the effectiveness of INDICE in exploring an EPC dataset through different data and knowledge visualization techniques.