    OddAssist - An eSports betting recommendation system

    It is widely accepted that sports betting has been around for as long as sport itself. As far back as the 1st century, circuses hosted chariot races and fans bet on who they thought would emerge victorious. As technology evolved, so did sports and, above all, bookmakers. Thanks to mass digitization, bookmakers are now available online, from anywhere, which makes this market inherently more tempting. This transition has turned sports betting into a multi-billion-dollar industry that can rival the sports industry itself. Similarly, younger generations are increasingly attached to the digital world, including electronic sports (eSports); young men are now more likely to follow eSports than traditional sports. Counter-Strike: Global Offensive, the video game on which this dissertation focuses, is one of the pillars of this industry: during 2022, 15 million dollars were distributed in tournament prizes and viewership peaked at 2 million concurrent viewers. This factor, combined with the digitization of bookmakers, makes the eSports betting market extremely appealing for exploring machine learning techniques, since young people who follow this type of sport also find it easy to bet online. In this dissertation, a betting recommendation system is proposed, implemented, tested, and validated. It considers the match history of each team, the odds of several bookmakers, and the general sentiment of fans in a discussion forum. The individual machine learning models achieved good results on their own: the match history model reached an accuracy of 66.66% with an expected calibration error of 2.10%, and the bookmaker odds model reached an accuracy of 65.05% with an expected calibration error of 2.53%. Combining the models through stacking increased the accuracy to 67.62% but worsened the expected calibration error to 5.19%. 
On the other hand, merging the datasets and training a new, stronger model on that data improved the accuracy to 66.81% with an expected calibration error of 2.67%. The solution is thoroughly tested in a betting simulation covering 2500 matches. The system's final odds are compared with the odds of the bookmakers and the expected long-term return is computed. A bet is placed only if that expected return is above a certain threshold. This strategy, called positive expected value betting, was applied at multiple thresholds and the results were compared. While the stacking solution did not perform well in a betting environment, the match history model achieved profits from 8% to 90%, the odds model from 13% to 211%, and the dataset merging solution from 11% to 77%, all depending on the minimum expected value threshold. This work therefore resulted in several machine learning approaches capable of profiting from Counter-Strike: Global Offensive bets in the long term.
It is widely accepted that sports betting has existed for as long as sport itself. Even in the first century, circuses hosted chariot races and fans bet on who they thought would emerge victorious, much like today's horse races. As technology evolved, sports evolved and, above all, so did bookmakers. Owing to the wave of mass digitization, bookmakers became available online, from anywhere, which makes this market inherently more tempting. Indeed, this transition propelled sports betting into a multi-billion-dollar industry that can now be compared to the sports industry itself. Similarly, younger generations are increasingly connected to the digital world, including digital sports (eSports). 
Counter-Strike: Global Offensive, the video game on which this dissertation focuses, is one of the main drivers of this industry: during 2022, 15 million dollars were distributed in tournament prizes and concurrent viewership peaked at 2 million. Although this reality is less pronounced in Portugal, in several countries young adult men are more likely to follow eSports than traditional sports. This factor, combined with the digitization of bookmakers, makes the eSports betting market very appealing for exploring machine learning techniques, since the young people who follow this type of sport find it easy to bet online. This dissertation proposes, implements, tests, and validates a betting recommendation system that considers each team's match history, the odds of several bookmakers, and the general sentiment of fans in a discussion forum, HLTV. To this end, three machine learning systems were initially developed. To evaluate them, the period from October 2020 to March 2023 was considered, which corresponds to 2500 matches. However, over such a long test period, the competitiveness of the teams varies considerably. To prevent the models from becoming obsolete during this period, they were retrained at least once a month throughout the test period. The first machine learning system focuses on prediction from previous results, that is, the history of matches between teams. The best solution was to incorporate the players into the prediction, together with the team's ranking, while giving more weight to the most recent matches. This approach, using logistic regression, achieved an accuracy of 66.66% with an expected calibration error of 2.10%. 
The second system compiles the odds of the various bookmakers and makes predictions based on patterns in their variations. In this case, incorporating the individual bookmakers reached an accuracy of 65.88% using logistic regression; however, that model was worse calibrated than the model that used the average of the odds with a gradient boosting machine, which showed an accuracy of 65.06% but better calibration metrics, with an expected calibration error of 2.53%. The third system is based on the sentiment of fans on the HLTV forum. First, GPT 3.5 is used to extract the sentiment of each comment, with an overall accuracy of 84.28%. However, considering only the comments classified as conclusive, the accuracy is 91.46%. Once classified, the comments are passed to a support vector machine model that incorporates the commenter and their accuracy in previous matches. This solution correctly predicted only 59.26% of the cases, with an expected calibration error of 3.22%. To aggregate the predictions of these three models, two approaches were tested. First, a new model was trained on the predictions of the others (stacking), obtaining an accuracy of 67.62% but with an expected calibration error of 5.19%. In the second approach, the data used to train the three individual models are merged, and a new model is trained on this more complex dataset. This approach, using a support vector machine, obtained a lower accuracy, 66.81%, but a lower expected calibration error, 2.67%. Finally, the approaches are put to the test in a betting simulator, where each system makes a prediction and compares it with the odds offered by the bookmakers. The simulation is run for several minimum expected return thresholds, where the systems bet only if the expected rate of return of the odds exceeds the threshold. 
The system's final odds are then compared with the bookmakers' odds and, if a bookmaker offers higher odds, a bet is placed. This strategy is called positive expected value betting, that is, betting at odds that are too high relative to the probability of the outcome and that generate profit in the long term. In this simulation, for a minimum rate of 5%, the best results came from the models built on the bookmakers' odds, with profits between 13% and 211%; followed by the historical data model, which profited between 8% and 90%; and finally the composite model, with profits between 11% and 77%. This work thus resulted in several machine learning-based systems capable of making a long-term profit betting on Counter-Strike: Global Offensive.
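
The positive expected value rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's simulator: the function names, the `Bet` record, and the example bookmaker names are all hypothetical. It assumes decimal odds, a unit stake, and a model probability `p_win`, so the system's own "fair" odds are `1 / p_win` and the expected profit per unit staked is `p_win * odds - 1`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Bet:
    bookmaker: str         # hypothetical bookmaker identifier
    odds: float            # decimal odds offered by that bookmaker
    expected_value: float  # expected profit per unit staked


def fair_odds(p_win: float) -> float:
    """Decimal odds at which a bet with win probability p_win breaks even."""
    return 1.0 / p_win


def expected_value(p_win: float, decimal_odds: float) -> float:
    """Expected profit per unit staked: win (odds - 1) with prob. p_win, lose 1 otherwise."""
    return p_win * (decimal_odds - 1.0) - (1.0 - p_win)


def pick_bet(p_win: float, offers: Mapping[str, float], min_ev: float) -> Optional[Bet]:
    """Take the highest offered odds and bet only if the expected value exceeds min_ev."""
    bookmaker, odds = max(offers.items(), key=lambda item: item[1])
    ev = expected_value(p_win, odds)
    return Bet(bookmaker, odds, ev) if ev > min_ev else None
```

For instance, with a predicted win probability of 0.55 the fair odds are about 1.82, so an offer of 2.00 has an expected value of roughly 0.10 per unit and clears a 5% threshold, whereas an offer of 1.80 has a negative expected value and is skipped. Raising `min_ev` trades the number of bets for a larger safety margin against miscalibrated probabilities, which is why calibration error matters alongside accuracy.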

    Temporal models for mining, ranking and recommendation in the Web

    Because they offer a first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets, i.e., the Web, collaborative knowledge bases and social networks, have emerged as gold mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, ranging from user intent understanding and document ranking to advanced recommendations. There are two semantically close and important constituents when modeling along the time dimension, i.e., entity and event. Time crucially serves as the context for changes driven by happenings and phenomena (events) that relate to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words, resolving the uncertainty confounded by temporal changes, is a compelling task to support consistent user satisfaction. In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events to serve the end tasks. Specifically, we make the following contributions in this thesis: (1) Query recommendation and document ranking in the Web - we address the issues of suggesting entity-centric queries and of ranking effectiveness around the time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former, and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter. (2) Entity relatedness in Wikipedia - we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events. 
Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, link-based graph and content) and uses the collective attention from user navigation as the supervision. (3) Graph-based ranking and temporal anchor-text mining in Web Archives - we tackle the problem of discovering important documents along the time-span of Web Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework. The model accounts for the incomplete link structure and natural time lagging in Web Archives in mining the temporal authority. (4) Methods for enhancing predictive models at an early stage in social media and the clinical domain - we investigate several methods to control model instability and enrich contexts of predictive models in the “cold-start” period. We demonstrate their effectiveness for the rumor detection and blood glucose prediction cases respectively. Overall, the findings presented in this thesis demonstrate the importance of tracking the temporal dynamics surrounding salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations, and boost ranking and recommendation effectiveness over time.

    Essays on Latent Variable Models and Roll Call Scaling

    This dissertation comprises three essays on latent variable models and Bayesian statistical methods for the study of American legislative institutions and the more general problems of measurement and model comparison. In the first paper, I explore the dimensionality of latent variables in the context of roll call scaling. The dimensionality of ideal points is an aspect of roll call scaling which has received significant attention due to its impact on both substantive and spatial interpretations of estimates. I find that previous evidence for unidimensional ideal points is a product of the Scree procedure. I propose a new varying dimensions model of legislative voting and a corresponding Bayesian nonparametric estimation procedure (BPIRT) that allows for probabilistic inference on the number of dimensions. Using this approach, I show that there is strong evidence for multidimensional ideal points in the U.S. Congress and that using only a single dimension misses much of the disagreement that occurs within parties. I reexamine theories of U.S. legislative voting and find that empirical evidence for these models is conditional on unidimensionality. In the second paper, I expand on the varying dimensions model of legislative voting and explore the role of group dependencies in legislative voting. Assumptions about independence of observations in the scaling model ignore the possibility that members of the voting body have shared incentives to vote as a group and lead to problems in estimating ideal points and corresponding latent dimensions. I propose a new ideal point model, clustered beta process IRT (C-BPIRT), that explicitly allows for group contributions in the underlying spatial model of voting. I derive a corresponding empirical model that uses flexible Bayesian nonparametric priors to estimate group effects in ideal points and the corresponding dimensionality of the ideal points. I apply this model to the 107th U.S. House (2001 - 2003) and the 88th U.S. 
House (1963 - 1965) and show how modeling group dynamics improves the estimation and interpretation of ideal points. Similarly, I show that existing methods of ideal point estimation produce results that are substantively misaligned with historical studies of the U.S. Congress. In the third and final paper, I dive into the more general problem of Bayesian model comparison and marginal likelihood computation. Various methods of computing the marginal likelihood exist, such as importance sampling or variational methods, but they frequently provide inaccurate results. I demonstrate that point estimates for the marginal likelihood achieved using importance sampling are inaccurate in settings where the joint posterior is skewed. I propose a light extension to the variational method that treats the marginal likelihood as a random variable and create a set of intervals on the marginal likelihood which do not share the same inaccuracies. I show that these new intervals, called kappa bounds, provide a computationally efficient and accurate way to estimate the marginal likelihood under arbitrarily complex Bayesian model specifications. I show the superiority of kappa bounds estimates of the marginal likelihood through a series of simulated and real-world data examples, including comparing measurement models that estimate latent variables from ordered discrete survey data.
PhD, Political Science, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163023/1/kamcal_1.pd
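
To make the importance sampling estimate of the marginal likelihood concrete, here is a minimal Python sketch on a toy model. Everything in it is an assumption chosen for illustration and does not come from the dissertation: the model is `y_i ~ Normal(theta, 1)` with prior `theta ~ Normal(0, 1)`, the proposal is the prior itself (so each importance weight reduces to the likelihood), and the function names are hypothetical. The toy model is conjugate, so the exact log marginal likelihood is available for comparison. The sketch does not implement kappa bounds, and it does not reproduce the skewed-posterior settings in which the abstract reports importance sampling to be inaccurate.

```python
import math
import random


def log_likelihood(theta, ys):
    """Log density of the data under y_i ~ Normal(theta, 1)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2 for y in ys)


def log_marginal_exact(ys):
    """Closed form for this toy model: y ~ Normal(0, I + 11^T)."""
    n = len(ys)
    total = sum(ys)
    total_sq = sum(y * y for y in ys)
    quad = total_sq - total * total / (1 + n)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n) - 0.5 * quad


def log_marginal_importance(ys, num_samples=20000, seed=0):
    """Importance sampling with the prior as proposal: average the likelihood over prior draws."""
    rng = random.Random(seed)
    log_weights = [log_likelihood(rng.gauss(0.0, 1.0), ys) for _ in range(num_samples)]
    # Log-sum-exp keeps the average numerically stable.
    largest = max(log_weights)
    mean_weight = sum(math.exp(lw - largest) for lw in log_weights) / num_samples
    return largest + math.log(mean_weight)
```

Calling both functions on a short list such as `[0.5, -0.2, 0.3]` gives two values that should agree closely, because here the prior overlaps the posterior well. The estimator degrades when the proposal covers the posterior poorly, since a handful of draws then dominate the average.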

    Flavour-universal search for heavy neutral leptons with a deep neural network-based displaced jet tagger with the CMS experiment

    This thesis describes a search for long-lived heavy neutral leptons using a dataset of 137/fb collected during the 2016-2018 proton-proton runs with the CMS detector. The search uses a final state containing two leptons and at least one hadronic jet. This is the first analysis at the Large Hadron Collider which considers universal mixing between the Standard Model and heavy neutral lepton species. The search makes heavy use of a deep neural network-based displaced jet tagging algorithm, originally developed to target heavy long-lived gluino decays. The tagger was trained on both simulation and proton-proton collision data using the domain adaptation technique, which significantly improved the modelling of its output in simulation. The tagger has excellent performance for a range of long-lived particle lifetimes and generalises well to various flavours of displaced jets. In this analysis, the backgrounds are estimated in an entirely data-driven manner. No evidence for heavy neutral leptons is observed, and upper limits are set for a wide range of heavy neutral lepton mass, lifetime, and mixing scenarios. This is the most sensitive search for heavy neutral leptons in the 1–12 GeV mass range to date.

    Discovery in Physics

    Volume 2 covers knowledge discovery in particle and astroparticle physics. Instruments gather petabytes of data, and machine learning is used to process this data and to detect relevant examples efficiently. The physical knowledge is encoded in simulations used to train the machine learning models. The interpretation of the learned models serves to expand the physical knowledge, resulting in a cycle of theory enhancement.

    Task-based parser output combination : workflow and infrastructure

    This dissertation introduces the method of task-based parser output combination as a device to enhance the reliability of automatically generated syntactic information for further processing tasks. Parsers, i.e. tools generating syntactic analyses, are usually based on reference data. Typically these are modern news texts. However, the data relevant for applications or tasks beyond parsing often differs from this standard domain, or only specific phenomena from the syntactic analysis are actually relevant for further processing. In these cases, the reliability of the parsing output might deviate essentially from the expected outcome on standard news text. Studies for several levels of analysis in natural language processing have shown that combining systems from the same analysis level outperforms the best involved single system. This is due to different error distributions of the involved systems which can be exploited, e.g. in a majority voting approach. In other words: for an effective combination, the involved systems have to be sufficiently different. In these combination studies, usually the complete analyses are combined and evaluated. However, to be able to combine the analyses completely, a full mapping of their structures and tagsets has to be found. The need for a full mapping either restricts the degree to which the participating systems are allowed to differ or it results in information loss. Moreover, the evaluation of the combined complete analyses does not reflect the reliability achieved in the analysis of the specific aspects needed to resolve a given task. This work presents an abstract workflow which can be instantiated based on the respective task and the available parsers. The approach focusses on the task-relevant aspects and aims at increasing the reliability of their analysis. Moreover, this focus allows a combination of more diverging systems, since no full mapping of the structures and tagsets from the single systems is needed. 
The usability of this method is also increased by focussing on the output of the parsers: it is not necessary for the users to reengineer the tools. Instead, off-the-shelf parsers and parsers for which no configuration options or sources are available to the users can be included. As a result, the method is applicable to a broad range of applications. For instance, it can be applied to tasks from the growing field of Digital Humanities, where the focus is often on tasks different from syntactic analysis.
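
The idea of combining only the task-relevant part of each parser's output can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the dissertation's workflow or infrastructure: the function `combine_task_answers`, the extractor functions, the three toy parser outputs, and the `min_votes` parameter are all hypothetical. Each parser keeps its own output format and label set, a small task-specific extractor pulls out the one piece of information the task needs, and a majority vote is taken over those answers, so no full mapping between structures and tagsets is required.

```python
from collections import Counter
from typing import Any, Callable, Hashable, Optional, Sequence

Extractor = Callable[[Any], Optional[Hashable]]


def combine_task_answers(
    analyses: Sequence[Any],
    extractors: Sequence[Extractor],
    min_votes: int = 2,
) -> Optional[Hashable]:
    """Majority vote over task-relevant answers extracted from each parser's output."""
    votes = Counter()
    for analysis, extract in zip(analyses, extractors):
        answer = extract(analysis)
        if answer is not None:  # a parser may offer no usable analysis for this task
            votes[answer] += 1
    if not votes:
        return None
    answer, count = votes.most_common(1)[0]
    return answer if count >= min_votes else None


# Hypothetical outputs of three parsers for "the dog sleeps"; the task is to find
# the subject of "sleeps". Formats and labels differ on purpose.
triples_a = [("sleeps", "nsubj", "dog"), ("dog", "det", "the")]  # (head, label, dependent)
nested_b = {"sleeps": {"SB": "dog"}}                              # head -> {label: dependent}
triples_c = [("sleeps", "obj", "dog"), ("dog", "det", "the")]     # mislabels the subject


def subject_from_triples(triples, verb="sleeps", label="nsubj"):
    return next((dep for head, rel, dep in triples if head == verb and rel == label), None)


def subject_from_nested(tree, verb="sleeps", label="SB"):
    return tree.get(verb, {}).get(label)


subject = combine_task_answers(
    [triples_a, nested_b, triples_c],
    [subject_from_triples, subject_from_nested, subject_from_triples],
)
```

In this toy case two of the three parsers agree on the subject, so the vote returns it even though the third analysis is wrong. Requiring a minimum number of votes is one simple way to trade coverage for reliability; other combination schemes would fit the same interface.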

    Automatic Identification of Online Predators in Chat Logs by Anomaly Detection and Deep Learning

    Providing a safe environment for juveniles and children in online social networks is considered a major factor in improving public safety. Due to the prevalence of online conversations, mitigating the undesirable effects of juvenile abuse in cyberspace has become essential. Using automatic ways to address this kind of crime is challenging and demands efficient and scalable data mining techniques. The problem can be cast as a combination of textual preprocessing in data/text mining and binary classification in machine learning. This thesis proposes two machine learning approaches to deal with the following two issues in the domain of online predator identification: 1) The first problem is gathering a comprehensive set of negative training samples, which is unrealistic due to the nature of the problem. This problem is addressed by applying an existing method for semi-supervised anomaly detection that allows training on only one class label. The method was tested on two datasets; 2) The second issue is improving the performance of current binary classification methods in terms of classification accuracy and F1-score. In this regard, we have customized a deep learning approach, the Convolutional Neural Network, for use in this domain. Using this approach, we show that the classification performance (F1-score) is improved by almost 1.7% compared to a Support Vector Machine classifier. Two different datasets were used in the empirical experiments: PAN-2012 and SQ (Sûreté du Québec). The former is a large public dataset that has been used extensively in the literature and the latter is a small dataset collected from the Sûreté du Québec.
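
Since the abstract reports results in terms of both accuracy and F1-score, a short Python sketch shows how the two can diverge. The helper names and the toy labels are hypothetical, and the premise that the positive (predator) class is rare is an assumption made for illustration, not a figure from the thesis.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

With two positives among ten toy conversations, a classifier that never flags anyone still reaches 80% accuracy but an F1-score of zero, which is why F1 is the more informative metric when the class of interest is rare.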

    Towards A Computational Intelligence Framework in Steel Product Quality and Cost Control

    Steel is a fundamental raw material for all industries. It is widely used in various fields, including construction, bridges, ships, containers, medical devices and cars. However, the production process of iron and steel is very complex, consisting of four processes: ironmaking, steelmaking, continuous casting and rolling. It is also extremely complicated to control the quality of steel during the full manufacturing process. Therefore, the quality control of steel is considered a huge challenge for the whole steel industry. This thesis studies quality control, taking the case of Nanjing Iron and Steel Group, and then provides new approaches for quality analysis, management and control of the industry. At present, Nanjing Iron and Steel Group has established a quality management and control system, which oversees many systems involved in steel manufacturing. It poses a high statistical requirement for business professionals, resulting in limited use of the system. A large amount of quality data has been collected in each system. At present, all systems mainly focus on processing and analyzing the data after the manufacturing process, and quality problems are mainly detected by sampling-based experiments. This method cannot detect product quality issues in a timely manner or predict hidden quality issues in advance. In the quality control system, the responsibilities and functions of the different information systems involved are intricate. Each information system is merely responsible for storing the data of its corresponding functions. Hence, the data in each information system is relatively isolated, forming a data island. Iron and steel production belongs to the process industry. The data in multiple information systems can be combined to analyze and predict the quality of products in depth and provide an early warning alert. 
Therefore, it is necessary to introduce new product quality control methods in the steel industry. With the waves of Industry 4.0 and intelligent manufacturing, intelligent technology has also been introduced in the field of quality control to improve the competitiveness of iron and steel enterprises. Applying intelligent technology can generate accurate quality analysis and optimal prediction results based on the data distributed in the factory and determine the online adjustment of the production process. This not only improves product quality control, but also helps reduce product costs. Inspired by this, this thesis provides an in-depth discussion in three chapters: (1) how to use artificial intelligence algorithms to evaluate the quality grade of scrap steel used as raw material is studied in chapter 3; (2) the probability that a longitudinal crack occurs on the surface of a continuous casting slab is studied in chapter 4; (3) the prediction of the mechanical properties of finished steel plate is studied in chapter 5. These three chapters serve as the technical support for quality control in iron and steel production.