Evaluating Soccer Match Prediction Models: A Deep Learning Approach and Feature Optimization for Gradient-Boosted Trees
Machine learning models have become increasingly popular for predicting the
results of soccer matches; however, the lack of publicly available benchmark
datasets has made model evaluation challenging. The 2023 Soccer Prediction
Challenge required the prediction of match results first in terms of the exact
goals scored by each team, and second, in terms of the probabilities for a win,
draw, and loss. The original training set of matches and features, which was
provided for the competition, was augmented with additional matches that were
played between 4 April and 13 April 2023, representing the period after which
the training set ended, but prior to the first matches that were to be
predicted (upon which the performance was evaluated). A CatBoost model was
employed using pi-ratings as the features, which were initially identified as
the optimal choice for calculating the win/draw/loss probabilities. Notably,
deep learning models have frequently been disregarded in this particular task.
Therefore, in this study, we aimed to assess the performance of a deep learning
model and determine the optimal feature set for a gradient-boosted tree model.
The model was trained using the most recent five years of data, and three
training and validation sets were used in a hyperparameter grid search. The
results from the validation sets show that our model had strong performance and
stability compared to previously published models from the 2017 Soccer
Prediction Challenge for win/draw/loss prediction
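The abstract above mentions a hyperparameter grid search over three training and validation sets. A minimal Python sketch of that general idea follows, assuming time-ordered data with an expanding training window. The names `rolling_splits`, `grid_search`, and `fit_and_score`, the split sizes, and the choice to average validation losses are illustrative assumptions, not the authors' implementation.

```python
from itertools import product
from statistics import mean


def rolling_splits(n_samples, n_splits=3, val_size=2):
    """Return a list of (train_indices, val_indices) for time-ordered data.

    Each validation block directly follows its training block, and later
    splits train on more history (an expanding window). Assumed scheme.
    """
    splits = []
    for k in range(n_splits, 0, -1):
        val_end = n_samples - (k - 1) * val_size
        val_start = val_end - val_size
        if val_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train_idx = list(range(0, val_start))
        val_idx = list(range(val_start, val_end))
        splits.append((train_idx, val_idx))
    return splits


def grid_search(param_grid, splits, fit_and_score):
    """Return (best_params, best_mean_loss) over all parameter combinations.

    fit_and_score(params, train_idx, val_idx) is a caller-supplied
    (hypothetical) function returning a validation loss; lower is better.
    """
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = mean(fit_and_score(params, tr, va) for tr, va in splits)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best
```

In practice `fit_and_score` would train a model (for example a gradient-boosted tree) on the training indices and return a loss on the validation indices; here it is left to the caller so that the sketch stays independent of any particular library.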
Boosting gets full Attention for Relational Learning
More often than not in benchmark supervised ML, tabular data is flat, i.e.
consists of a single (rows, columns) file, but cases abound in the
real world where observations are described by a set of tables with structural
relationships. Neural nets-based deep models are a classical fit to incorporate
general topological dependence among description features (pixels, words,
etc.), but their underperformance relative to tree-based models on tabular data
is still well documented. In this paper, we introduce an attention mechanism for
structured data that blends well with tree-based models in the training context
of (gradient) boosting. Each aggregated model is a tree whose training involves
two steps: first, simple tabular models are learned descending tables in a
top-down fashion with boosting's class residuals on tables' features. Second,
what has been learned progresses back bottom-up via attention and aggregation
mechanisms, progressively crafting new features that complete at the end the
set of observation features over which a single tree is learned; boosting's
iteration clock is then incremented and new class residuals are computed.
Experiments on simulated and real-world domains display the competitiveness of
our method against a state of the art containing both tree-based and neural
nets-based models
Machine Learning Applications to Predict Road Crash and Soccer Game Outcomes
Machine learning has become a cutting-edge and widely studied field of data science in recent years across many industries and disciplines. In this thesis, two problems (1. crash severity prediction and 2. soccer game outcome prediction) were investigated using a set of machine learning approaches: Ridge regression, Lasso regression, Support Vector Machine (SVM), Neural Network (NN), and Random Forest (RF).
The first study focused on investigating the critical factors affecting crash severity using comprehensive state-wide time-series traffic crash data. The dataset covers crashes that occurred in the state of Connecticut between 1995 and 2014. Traffic crashes are an increasing cause of death and injury in the world. The overall purposes of the first study were to propose, develop, and implement machine learning approaches for predicting the injury severity levels of people involved in the crashes and to investigate the important crash predictors contributing to injury severity. The predictor variables included road and vehicle conditions, characteristics of drivers and passengers, and environmental conditions. Results indicate that RF provided the best prediction accuracy, 73.85%, in classifying a crash by its severity: fatal, injury, or property damage only. In addition to the overall comparison of the proposed machine learning approaches in terms of accuracy, the prediction results were combined with the economic loss of each severity level to provide managerial insights on estimating the financial consequences of traffic crashes. RF also provided the importance of each predictor in affecting the severity levels of the people involved. The ejection status of the driver or passenger was found to be the most crucial factor leading to the most severe injuries. In addition, a time-series analysis of the 20-year crash data was conducted. The analysis results demonstrated that the prediction accuracy of RF increased over the period, and the importance of some predictors also changed. From the perspective of policy making, strict inspection for drunk driving and drug use could lead to substantial road safety improvement. Ejection status is the essential risk factor affecting the fatal and incapacitating severity levels. The use of seat belts significantly reduces the risk of passengers being ejected from the vehicle when a crash occurs.
In the second study, game data from the five most recent seasons of three major leagues were scraped from whoscore.com. The leagues were two top European leagues, the Spanish La Liga and the English Premier League (EPL), and one US league, Major League Soccer (MLS). The purpose of the study was to develop statistically credible machine learning approaches to predict a soccer game outcome and to investigate the significance of predictors (game statistics). Unlike previous closely related studies, the proposed machine learning models were applied not only to the combined dataset of the three leagues but also to each league separately, to compare the prediction performance and important predictors. The best prediction performance was achieved by NN, with an accuracy of 85.71% (+/- 0.73%) on the combined dataset. For each league, RF had the best performance. RF also provided the importance of each predictor. The results showed that the home-field advantage was more evident in the MLS games than in the other two European leagues. The home team or away team factor was the most critical predictor that affected the MLS games. Although it was also an important predictor for La Liga and EPL games, the most influential predictor there was the difference in the number of shots on target between the home team and away team. For the three leagues, the number of crosses was the most significant pass type, and the difference in the rate of cards per foul was the most crucial card situation. The referee primarily determines the difference in the rate of cards per foul. For the European leagues (La Liga and EPL), the differences in the number of counter attacks and open plays were consequential attempt types affecting a game result, while in the MLS, the difference in the number of set pieces was the most crucial predictor variable.
Overall, the results of the two studies indicated that the proposed machine learning approaches yielded effective prediction performance for crash severity and soccer outcome prediction. RF had slightly superior prediction performance among the five machine learning models for both studies. Even though the two problem domains came from different industries or policy-making areas, the proposed machine learning approaches effectively dealt with the complexity of the data in terms of dimensionality and time-series nature
Using social media big data for tourist demand forecasting: A new machine learning analytical approach
This study explores the possibility of using a machine learning approach to analysing social media big data for tourism demand forecasting. We demonstrate how to extract the main topics discussed on Twitter and calculate the mean sentiment score for each topic as the proxy of the general attitudes towards those topics, which are then used for predicting tourist arrivals. We choose Sydney, Australia as the case for testing the performance and validity of our proposed forecasting framework. The study reveals key topics discussed in social media that can be used to predict tourist arrivals in Sydney. The study has both theoretical implications for tourist behavioural research and practical implications for destination marketing
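The abstract above describes computing a mean sentiment score per topic as a proxy for general attitudes. A minimal Python sketch of that aggregation step follows, assuming each post has already been assigned one topic and one numeric sentiment score. The function name `mean_sentiment_by_topic` and the input shape are hypothetical; the topic extraction and sentiment scoring themselves are not shown.

```python
from collections import defaultdict


def mean_sentiment_by_topic(scored_posts):
    """Average sentiment per topic.

    scored_posts: iterable of (topic, sentiment_score) pairs, where the
    topic labels and scores come from upstream steps (assumed here).
    Returns a dict mapping each topic to its mean sentiment score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for topic, score in scored_posts:
        totals[topic] += score
        counts[topic] += 1
    return {topic: totals[topic] / counts[topic] for topic in totals}
```

The resulting per-topic means could then be used as predictor variables in a forecasting model for tourist arrivals, which is the role the abstract assigns to them.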
When Moneyball Meets the Beautiful Game: A Predictive Analytics Approach to Exploring Key Drivers for Soccer Player Valuation
To measure the market value of a professional soccer (i.e., association football) player is of great interest to soccer clubs. Several gaps emerge from the existing soccer transfer market research. Economics literature only tests the underlying hypotheses between a player’s market value or wage and a few economic factors. Finance literature provides very theoretical pricing frameworks. Sports science literature uncovers numerous pertinent attributes and skills but gives limited insights into valuation practice. The overarching research question of this work is: what are the key drivers of player valuation in the soccer transfer market? To lay the theoretical foundations of player valuation, this work synthesizes the literature in market efficiency and equilibrium conditions, pricing theories and risk premium, and sports science. Predictive analytics is the primary methodology in conjunction with open-source data and exploratory analysis. Several machine learning algorithms are evaluated based on the trade-offs between predictive accuracy and model interpretability. XGBoost, the best model for player valuation, yields the lowest RMSE and the highest adjusted R2. SHAP values identify the most important features in the best model both at a collective level and at an individual level. This work shows a handful of fundamental economic and risk factors have a more substantial effect on player valuation than a large number of sports science factors. Within sports science factors, general physiological and psychological attributes appear to be more important than soccer-specific skills. Theoretically, this work proposes a conceptual framework for soccer player valuation that unifies sports business research and sports science research. Empirically, the predictive analytics methodology deepens our understanding of the value drivers of soccer players.
Practically, this work enhances transparency and interpretability in the valuation process and could be extended into a player recommender framework for talent scouting. In summary, this work has demonstrated that the application of analytics can improve decision-making efficiency in player acquisition and profitability of soccer clubs
Machine learning methods in sport injury prediction and prevention: a systematic review
Purpose: Injuries are common in sports and can have significant physical, psychological and financial consequences.
Machine learning (ML) methods could be used to improve injury prediction and allow proper approaches to injury
prevention. The aim of our study was therefore to perform a systematic review of ML methods in sport injury prediction and prevention.
Methods: A search of the PubMed database was performed on March 24th 2020. Eligible articles included original
studies investigating the role of ML for sport injury prediction and prevention. Two independent reviewers screened
articles, assessed eligibility, risk of bias and extracted data. Methodological quality and risk of bias were determined by
the Newcastle–Ottawa Scale. Study quality was evaluated using the GRADE working group methodology.
Results: Eleven out of 249 studies met inclusion/exclusion criteria. Different ML methods were used (tree-based
ensemble methods (n=9), Support Vector Machines (n=4), Artificial Neural Networks (n=2)). The classification
methods were facilitated by preprocessing steps (n=5) and optimized using over- and undersampling methods
(n=6), hyperparameter tuning (n=4), feature selection (n=3) and dimensionality reduction (n=1). Injury predictive
performance ranged from poor (Accuracy=52%, AUC=0.52) to strong (AUC=0.87, f1-score=85%).
Conclusions: Current ML methods can be used to identify athletes at high injury risk and be helpful to detect the
most important injury risk factors. Methodological quality of the analyses was sufficient in general, but could be further improved. More effort should be put in the interpretation of the ML models
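The review above reports performance as accuracy, AUC, and F1-score. For readers comparing those figures, a minimal Python sketch of the three metrics for a binary classifier follows, using their standard definitions. The function names and the convention that label 1 marks the positive (injured) class are assumptions for illustration and are not taken from the reviewed studies.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking, which is why the review labels AUC=0.52 as poor. Accuracy alone can be misleading when injuries are rare, which is one reason AUC and F1 are reported alongside it.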
Exploring Large Language Models for Human Mobility Prediction under Public Events
Public events, such as concerts and sports games, can be major attractors for
large crowds, leading to irregular surges in travel demand. Accurate human
mobility prediction for public events is thus crucial for event planning as
well as traffic or crowd management. While rich textual descriptions about
public events are commonly available from online sources, it is challenging to
encode such information in statistical or machine learning models. Existing
methods are generally limited in incorporating textual information, handling
data sparsity, or providing rationales for their predictions. To address these
challenges, we introduce a framework for human mobility prediction under public
events (LLM-MPE) based on Large Language Models (LLMs), leveraging their
unprecedented ability to process textual data, learn from minimal examples, and
generate human-readable explanations. Specifically, LLM-MPE first transforms
raw, unstructured event descriptions from online sources into a standardized
format, and then segments historical mobility data into regular and
event-related components. A prompting strategy is designed to direct LLMs in
making and rationalizing demand predictions considering historical mobility and
event features. A case study is conducted for Barclays Center in New York City,
based on publicly available event information and taxi trip data. Results show
that LLM-MPE surpasses traditional models, particularly on event days, with
textual data significantly enhancing its accuracy. Furthermore, LLM-MPE offers
interpretable insights into its predictions. Despite the great potential of
LLMs, we also identify key challenges including misinformation and high costs
that remain barriers to their broader adoption in large-scale human mobility
analysis