99 research outputs found
Leveraging full-text article exploration for citation analysis
Scientific articles often include in-text citations quoting from external sources. When the cited source is an article, the citation context can be analyzed by exploring the article's full text. To quickly access the key information, researchers are often interested in identifying the sections of the cited article that are most pertinent to the text surrounding the citation in the citing article. This paper first performs a data-driven analysis of the correlation between the textual content of the sections of the cited article and the text snippet where the citation is placed. The results of the correlation analysis show that the title and abstract of the cited article are likely to include content highly similar to the citing snippet. However, the subsequent sections of the paper often include cited text snippets as well. Hence, there is a need to understand the extent to which an exploration of the full text of the cited article would be beneficial to gain insights into the citing snippet, considering also the fact that full-text access could be restricted. To this end, we then propose a classification approach to automatically predict whether the cited snippets in the full text of the paper contain a significant amount of new content beyond the abstract and title. The proposed approach could support researchers in leveraging full-text article exploration for citation analysis. The experiments conducted on real scientific articles show promising results: the classifier has a 90% chance of correctly distinguishing between the full-text exploration case and the title-and-abstract-only case
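The section-versus-snippet comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure the authors use, so the bag-of-words cosine similarity below is an assumption, and the function names (`bag_of_words`, `cosine_similarity`, `rank_sections`) and the toy section texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(citing_snippet, sections):
    """Rank the sections of a cited article by similarity to the citing snippet."""
    snippet_vec = bag_of_words(citing_snippet)
    scores = {
        name: cosine_similarity(snippet_vec, bag_of_words(text))
        for name, text in sections.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, invented section texts (not taken from any real article).
sections = {
    "title": "Graph neural networks for citation recommendation",
    "abstract": "We propose a graph neural network that recommends citations.",
    "methods": "The optimizer uses a learning rate of 0.001 and dropout.",
}
ranking = rank_sections("graph neural networks recommend citations", sections)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy input the title and abstract rank above the methods section, which mirrors the kind of pattern the abstract reports. A real analysis would likely need stemming or learned embeddings rather than raw token counts.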
Predicting student academic performance by means of associative classification
The Learning Analytics community has recently paid particular attention to the early prediction of learners’ performance. An established approach entails training classification models on past learner-related data in order to predict the exam success rate of a student well before the end of the course. Early predictions allow teachers to put in place targeted actions, e.g., supporting at-risk students to avoid exam failures or course dropouts. Although several machine learning and data mining solutions have been proposed to learn accurate predictors from past data, the interpretability and explainability of the best performing models are often limited. Therefore, in most cases, the reasons behind classifiers’ decisions remain unclear. This paper proposes an Explainable Learning Analytics solution to analyze learner-generated data acquired by our technical university, which relies on a blended learning model. It adopts classification techniques to predict early the success rate of about 5000 students who were enrolled in the first-year courses of our university. It proposes to apply associative classifiers at different time points and to explore the characteristics of the models that led to the assignment of pass or fail success rates. Thanks to their inherent interpretability, associative models can be manually explored by domain experts with the twofold aim of validating classifier outcomes through local rule-based explanations and identifying at-risk/successful student profiles by interpreting the global rule-based model. The results of an in-depth empirical evaluation demonstrate that associative models (i) perform as well as the best performing classification models, and (ii) give relevant insights into the per-student success rate assignments
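Associative classifiers are built from class association rules of the form "if these attribute values co-occur, then predict this class", usually ranked by support and confidence. The sketch below shows how those two measures are computed for a single rule. It is not the authors' pipeline: the function `rule_stats`, the attribute names, and the student records are all invented for illustration.

```python
def rule_stats(records, antecedent, label):
    """Return (support, confidence) of the rule `antecedent -> label`.

    records: list of (set_of_items, class_label) pairs.
    support: fraction of all records that match the antecedent AND the label.
    confidence: fraction of antecedent-matching records that carry the label.
    """
    covered = [lbl for items, lbl in records if antecedent <= items]
    correct = sum(1 for lbl in covered if lbl == label)
    support = correct / len(records) if records else 0.0
    confidence = correct / len(covered) if covered else 0.0
    return support, confidence


# Toy, invented learner records (hypothetical attributes and outcomes).
records = [
    ({"quiz1=high", "videos=many"}, "pass"),
    ({"quiz1=high", "videos=few"}, "pass"),
    ({"quiz1=low", "videos=few"}, "fail"),
    ({"quiz1=low", "videos=many"}, "fail"),
    ({"quiz1=high", "videos=many"}, "fail"),
]

support, confidence = rule_stats(records, {"quiz1=high"}, "pass")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Because each rule is a readable condition with two interpretable scores, a domain expert can inspect the rule that fired for a given student, which is the kind of local explanation the abstract refers to.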
Speech Analysis of Language Varieties in Italy
Italy exhibits rich linguistic diversity across its territory due to the distinct regional languages spoken in different areas. Recent advances in self-supervised learning provide new opportunities to analyze Italy's linguistic varieties using speech data alone. This includes the potential to leverage representations learned from large amounts of data to better examine nuances between closely related linguistic varieties. In this study, we focus on automatically identifying the geographic region of origin of speech samples drawn from Italy's diverse language varieties. We leverage self-supervised learning models to tackle this task and analyze differences and similarities between Italy's regional languages. In doing so, we also seek to uncover new insights into the relationships among these diverse yet closely related varieties, which may help linguists understand their interconnected evolution and regional development over time and space. To improve the discriminative ability of learned representations, we evaluate several supervised contrastive learning objectives, both as pre-training steps and as additional fine-tuning objectives. Experimental evidence shows that pre-trained self-supervised models can effectively identify regions from speech recordings. Additionally, incorporating contrastive objectives during fine-tuning improves classification accuracy and yields embeddings that distinctly separate regional varieties, demonstrating the value of combining self-supervised pre-training and contrastive learning for this task
Machine learning supported next-maintenance prediction for industrial vehicles
Industrial and construction vehicles require tight periodic maintenance operations. Their schedule depends on vehicle characteristics and usage. The latter can be accurately monitored through various on-board devices, enabling the application of Machine Learning techniques to analyze vehicle usage patterns and design predictive analytics. This paper presents a data-driven application to automatically schedule the periodic maintenance operations of industrial vehicles. It aims to predict, for each vehicle and date, the actual remaining days until the next maintenance is due. Our Machine Learning solution is designed to address the following challenges: (i) the non-stationarity of the per-vehicle utilization time series, which limits the effectiveness of classic scheduling policies, and (ii) the potential lack of historical data for those vehicles that have recently been added to the fleet, which hinders the learning of accurate predictors from past data. Preliminary results collected in a real industrial scenario demonstrate the effectiveness of the proposed solution on heterogeneous vehicles. The system we propose here is currently under deployment, enabling further tests and tunings
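The prediction target named in this abstract, the remaining days until the next maintenance for each vehicle and date, can be illustrated with a small labelling sketch. It assumes that maintenance falls due once a fixed number of utilization hours has accumulated. That rule, the function name `remaining_days_labels`, and the sample values are assumptions for illustration, not details from the paper.

```python
def remaining_days_labels(daily_hours, interval_hours):
    """Label each day with the number of days left until maintenance is due.

    Assumption: maintenance is due on the first day on which the hours
    accumulated since the previous maintenance reach `interval_hours`.
    Days after the last completed cycle stay None (target not yet known).
    """
    labels = [None] * len(daily_hours)
    cumulative = 0.0
    cycle_start = 0
    for day, hours in enumerate(daily_hours):
        cumulative += hours
        if cumulative >= interval_hours:
            # Back-fill the labels for every day in the cycle that just ended.
            for d in range(cycle_start, day + 1):
                labels[d] = day - d
            cycle_start = day + 1
            cumulative = 0.0  # simplification: overshoot hours are discarded
    return labels


# Invented utilization series (hours per day) for a single vehicle.
daily_hours = [4, 0, 6, 5, 5, 2]
labels = remaining_days_labels(daily_hours, interval_hours=10)
print(labels)
```

With irregular usage, the same hour budget spans a different number of calendar days in each cycle, which is one reason a fixed calendar schedule can be a poor fit for non-stationary utilization.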
NEMICO: Mining network data through cloud-based data mining techniques
Thanks to the rapid advances in Internet-based applications and in data acquisition and storage technologies, petabyte-sized network data collections are becoming more and more common, thus prompting the need for scalable data analysis solutions. By leveraging today’s ubiquitous many-core computer architectures and the increasingly popular cloud computing paradigm, the applicability of data mining algorithms to these large volumes of network data can be scaled up to gain interesting insights. This paper proposes NEMICO, a comprehensive Big Data mining system targeted to network traffic flow analyses (e.g., traffic flow characterization, anomaly detection, multiple-level pattern mining). NEMICO comprises new approaches that contribute to a paradigm shift in distributed data mining by addressing the most challenging issues related to Big Data, such as data sparsity, horizontal scaling, and parallel computation
Heterogeneous industrial vehicle usage predictions: A real case
Predicting future vehicle usage based on the analysis of CAN bus data is a popular data mining application. Many of the usage indicators, like the utilization hours, are non-stationary time series. To predict their values, recent approaches based on Machine Learning combine multiple data features describing engine status, travels, and roads. While most of the proposed solutions address car and truck usage prediction, a smaller body of work has been devoted to industrial and construction vehicles, which are usually characterized by more complex and heterogeneous usage patterns. This paper describes a real case study performed on a 4-year CAN bus dataset collecting usage data about 2 250 construction vehicles of various types and models. We apply a statistics-based approach to select the most discriminating data features. Separately for each vehicle, we train regression algorithms on historical data enriched with contextual information. The achieved results demonstrate the effectiveness of the proposed solution
- …