Long-tail Relation Extraction via Knowledge Graph Embeddings and Graph Convolution Networks
We propose a distantly supervised relation extraction approach for
long-tailed, imbalanced data which is prevalent in real-world settings. Here,
the challenge is to learn accurate "few-shot" models for classes existing at
the tail of the class distribution, for which little data is available.
Inspired by the rich semantic correlations between classes at the long tail and
those at the head, we take advantage of the knowledge from data-rich classes at
the head of the distribution to boost the performance of the data-poor classes
at the tail. First, we propose to leverage implicit relational knowledge among
class labels from knowledge graph embeddings and learn explicit relational
knowledge using graph convolution networks. Second, we integrate that
relational knowledge into a relation extraction model via a coarse-to-fine
knowledge-aware attention mechanism. Results on a large-scale benchmark
dataset show that our approach significantly outperforms other baselines,
especially for long-tail relations.
Comment: To be published in NAACL 201
Large-scale diversity estimation through surname origin inference
The study of surnames as both linguistic and geographical markers of the past
has proven valuable in several research fields spanning from biology and
genetics to demography and social mobility. This article builds upon the
existing literature to conceive and develop a surname origin classifier based
on a data-driven typology. This enables us to explore a methodology to describe
large-scale estimates of the relative diversity of social groups, especially
when such data is scarcely available. We subsequently analyze the
representativeness of surname origins for 15 socio-professional groups in
France
OddAssist - An eSports betting recommendation system
It is globally accepted that sports betting has been around for as long as the sport itself. Back in
the 1st century, circuses hosted chariot races and fans would bet on who they thought would
emerge victorious. As technology evolved, sports evolved and, above all, so did the
bookmakers. Owing to mass digitization, bookmakers are now available online, from
anywhere, which makes this market inherently more tempting. In fact, this transition has
propelled sports betting into a multi-billion-dollar industry that can rival the sports
industry itself.
Similarly, younger generations are increasingly attached to the digital world, including
electronic sports – eSports. In fact, young men are more likely to follow eSports than traditional
sports. Counter-Strike: Global Offensive, the video game on which this dissertation focuses, is
one of the pillars of this industry: during 2022, 15 million dollars were distributed in
tournament prizes and concurrent viewership peaked at 2 million. This factor, combined
with the digitization of bookmakers, makes the eSports betting market extremely appealing for
exploring machine learning techniques, since the young people who follow this type of sport also
find it easy to bet online.
In this dissertation, a betting recommendation system is proposed, implemented, tested, and
validated, which considers the match history of each team, the odds of several bookmakers and
the general feeling of fans in a discussion forum.
The individual machine learning models performed well on their own. More specifically,
the match history model achieved an accuracy of 66.66% with an expected calibration error of
2.10%, and the bookmaker odds model achieved an accuracy of 65.05% with an expected
calibration error of 2.53%.
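The abstract reports expected calibration error without defining it. As a minimal sketch, assuming the common binned definition for a binary win/loss prediction (the dissertation's exact binning is not given here), the metric could be computed as below. The function name, the `n_bins` default, and the input layout are illustrative assumptions, not the author's implementation.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions (assumed definition, not the author's code).

    probs  -- predicted probability that team A wins, one per match
    labels -- 1 if team A actually won, 0 otherwise
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0
        # Confidence is the probability assigned to the predicted side.
        confidence = p if predicted == 1 else 1.0 - p
        correct = 1.0 if predicted == y else 0.0
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(k for _, k in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the matches.
        ece += (len(members) / total) * abs(accuracy - avg_confidence)
    return ece
```

A low value means the predicted probabilities can be read as win rates, which matters here because the betting decision depends on the probability itself and not only on which team is predicted to win.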
Combining the models through stacking increased the accuracy to 67.62% but worsened the
expected calibration error to 5.19%. On the other hand, merging the datasets and training a
new, stronger model on that data improved the accuracy to 66.81% and had an expected
calibration error of 2.67%.
The solution is thoroughly tested in a betting simulation covering 2500 matches. The
system's final odds are compared with the bookmakers' odds and the expected long-term
return is computed. A bet is placed only if that expected return is above a certain threshold. This
strategy, called positive expected value betting, was applied at multiple thresholds and the results
were compared.
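The decision rule described above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's simulator: it assumes decimal odds, a flat one-unit stake, and a model that outputs a win probability; the function names and the 5% default threshold are hypothetical.

```python
def expected_value(p_win, decimal_odds):
    """Expected profit per unit staked, assuming decimal odds."""
    return p_win * decimal_odds - 1.0


def should_bet(p_win, decimal_odds, min_ev=0.05):
    """Bet only when the expected return exceeds the chosen threshold."""
    return expected_value(p_win, decimal_odds) > min_ev


def simulate(matches, min_ev=0.05, stake=1.0):
    """Run a flat-stake simulation.

    matches -- iterable of (p_win, decimal_odds, won) tuples, where p_win is the
               model's probability for the side being considered and won is a bool.
    Returns (total profit, number of bets placed).
    """
    profit = 0.0
    bets = 0
    for p_win, decimal_odds, won in matches:
        if not should_bet(p_win, decimal_odds, min_ev):
            continue
        bets += 1
        # A winning bet returns stake * odds, so the net gain is stake * (odds - 1).
        profit += stake * (decimal_odds - 1.0) if won else -stake
    return profit, bets
```

Raising `min_ev` trades volume for selectivity: fewer bets are placed, but each one has a larger estimated edge, which is why the abstract reports profits as ranges across thresholds.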
While the stacking solution did not perform well in a betting environment, the match history model
prevailed with profits from 8% to 90%; the odds model had profits ranging from 13% to 211%;
and the dataset merging solution profited from 11% to 77%, all depending on the minimum
expected value threshold.
This work therefore produced several machine learning approaches capable of profiting
from Counter Strike: Global Offensive bets in the long term.
It is widely accepted that sports betting has existed for as long as sport itself.
Even in the first century, circuses hosted chariot races and fans bet on whoever they
thought would emerge victorious, much like today's horse races.
As technology evolved, sports evolved and, above all, so did the
bookmakers. With the wave of mass digitization, bookmakers became
available online, from anywhere, which makes this market inherently more
tempting. In fact, this transition has propelled the sports betting industry into a
multi-billion-dollar industry that can now even be compared to the sports industry itself.
Similarly, younger generations are increasingly connected to the digital world, including
digital sports – eSports. Counter-Strike: Global Offensive, the video game on which this
dissertation focuses, is one of the major drivers of this industry: during 2022, 15
million dollars were distributed in tournament prizes and concurrent viewership
peaked at 2 million. Although this trend is less pronounced in Portugal, in
several countries young adult men are more likely to follow
eSports than traditional sports. This factor, combined with the digitization of bookmakers,
makes the eSports betting market very appealing for exploring machine
learning techniques, since the young people who follow this type of sport find it
easy to bet online.
In this dissertation, a betting recommendation system is proposed, implemented, tested, and
validated; it considers each team's match history, the odds of several
bookmakers, and the general sentiment of fans on a discussion forum, HLTV. To this end,
three machine learning systems were initially developed.
To evaluate these systems, the period from October 2020 to March
2023 was considered, which corresponds to 2500 matches. However, because the test period is so long,
team competitiveness varies considerably over it. To prevent the models
from becoming obsolete, they were retrained at least once
a month throughout the test period.
The first machine learning system predicts from previous
results, that is, the match history between the teams. The best solution was to incorporate the
players into the prediction, together with the team ranking, while giving more weight to the most
recent matches. This approach, using logistic regression, achieved an accuracy of 66.66%
with an expected calibration error of 2.10%.
The second system compiles the odds from the various bookmakers and makes predictions based on
patterns in their variations. Here, the model that incorporated the individual bookmakers reached an
accuracy of 65.88% using logistic regression, but it was more poorly calibrated than the
model that used the average of the odds with a gradient boosting machine, which showed an
accuracy of 65.06% but better calibration metrics, with an expected calibration error of
2.53%.
The third system is based on fan sentiment on the HLTV forum. First,
GPT 3.5 is used to extract the sentiment of each comment, with an overall accuracy of
84.28%. Considering only the comments classified as conclusive, however, the accuracy is 91.46%.
Once classified, the comments are passed to a
support vector machine model that incorporates the commenter and their accuracy in previous
matches. This solution correctly predicted only 59.26% of cases, with an expected
calibration error of 3.22%.
To aggregate the predictions of these three models, two approaches were tested.
First, a new model was trained on the predictions of the others
(stacking), obtaining an accuracy of 67.62% but an expected calibration error
of 5.19%. In the second approach, the data used to train
the three individual models are merged, and a new model is trained on this more complex
dataset. This approach, using a support vector machine, obtained a lower
accuracy, 66.81%, but also a lower expected calibration error, 2.67%.
Finally, the approaches are put to the test in a betting simulator, in which each
system makes a prediction and compares it with the odds offered by the bookmakers. The simulation is
run for several minimum expected return thresholds, with the systems betting only
when the expected rate of return of the odds exceeds the threshold.
The system's final odds are then compared with the bookmakers' odds and, if any bookmaker
offers higher odds, a bet is placed. This strategy is called positive expected value
betting: bets whose odds are too high relative to the probability of the outcome
occurring, and which generate long-term profit. In this simulation, the best results, for
a minimum rate of 5%, came from the models built from the bookmakers' odds,
with profits between 13% and 211%; followed by the historical data model, which profited between 8% and 90%; and
finally the composite model, with profits between 11% and 77%.
Thus, this work resulted in several machine learning-based systems capable of
making a long-term profit betting on Counter Strike: Global Offensive
LOL: An Investigation into Cybernetic Humor, or: Can Machines Laugh?
The mechanisms of humour have been the subject of much study and investigation, starting with and up to our days. Much of this work is based on literary theories, put forward by some of the most eminent philosophers and thinkers of all times, or medical theories, investigating the impact of humor on brain activity or behaviour. Recent functional neuroimaging studies, for instance, have investigated the process of comprehending and appreciating humor by examining functional activity in distinctive regions of brains stimulated by joke corpora. Yet, there is precious little work on the computational side, possibly due to the less hilarious nature of computer scientists as compared to men of letters and sawbones. In this paper, we set to investigate whether literary theories of humour can stand the test of algorithmic laughter. Or, in other words, we ask ourselves the vexed question: Can machines laugh?
We attempt to answer that question by testing whether an algorithm - namely, a neural network - can "understand" humour, and in particular whether it is possible to automatically identify abstractions that are predicted to be relevant by established literary theories about the mechanisms of humor. Notice that we do not focus here on distinguishing humorous from serious statements - a feat that is clearly way beyond the capabilities of the average human voter, not to mention the average machine - but rather on identifying the underlying mechanisms and triggers that are postulated to exist by literary theories, by verifying if similar mechanisms can be learned by machines
Weakly Supervised Reasoning by Neuro-Symbolic Approaches
Deep learning has largely improved the performance of various natural
language processing (NLP) tasks. However, most deep learning models are
black-box machinery, and lack explicit interpretation. In this chapter, we will
introduce our recent progress on neuro-symbolic approaches to NLP, which
combine different schools of AI, namely, symbolism and connectionism.
Generally, we will design a neural system with symbolic latent structures for
an NLP task, and apply reinforcement learning or its relaxation to perform
weakly supervised reasoning in the downstream task. Our framework has been
successfully applied to various tasks, including table query reasoning,
syntactic structure reasoning, information extraction reasoning, and rule
reasoning. For each application, we will introduce the background, our
approach, and experimental results.
Comment: Compendium of Neurosymbolic Artificial Intelligence, 665--692, 2023,
IOS Pres
Detection of Hate Speech in Videos Using Machine Learning
With the growth of the internet and social media, people have multiple platforms on which to share their thoughts and opinions about various subjects freely. However, this freedom of speech is misused to direct hate towards individuals or groups of people because of their race, religion, gender, etc. The rise of hate speech has led to conflicts and cases of cyberbullying, prompting many organizations to look for effective solutions to this problem.
Developments in the field of machine learning and deep learning have piqued the interest of researchers, leading them to research and implement solutions to solve the problem of hate speech. Currently, machine learning techniques are applied to textual data to detect hate speech. With the ample use of video sharing sites, there is a need to find a way to detect hate speech in videos.
This project deals with classification of videos into normal or hateful categories based on the spoken content of the videos. The video dataset is built using a crawler to search and download videos based on offensive words that are specified as keywords. The audio is extracted from the videos and is converted into textual format using a speech-to-text converter to obtain a transcript of the videos.
Experiments are conducted by training four models with three different feature sets extracted from the dataset. The models are evaluated by computing the specified evaluation metrics. These metrics indicate that the random forest classifier delivers the best results in classifying videos