107 research outputs found

    Reinforcement Learning with Human Feedback for Realistic Traffic Simulation

    In light of the challenges and costs of real-world testing, autonomous vehicle developers often rely on testing in simulation to build reliable systems. A key element of effective simulation is the incorporation of realistic traffic models that align with human knowledge, which has proven challenging because realism must be balanced against diversity. This work addresses the problem by developing a framework that employs reinforcement learning with human feedback (RLHF) to enhance the realism of existing traffic models. The study also identifies two main challenges: capturing the nuances of human preferences on realism, and unifying diverse traffic simulation models. To tackle these issues, we propose using human feedback for alignment and employ RLHF because of its sample efficiency. We also introduce the first dataset for realism alignment in traffic modeling to support such research. Our framework, named TrafficRLHF, generates realistic traffic scenarios that are well aligned with human preferences, as corroborated by comprehensive evaluations on the nuScenes dataset. Comment: 9 pages, 4 figures

    Anytime Discovery of a Diverse Set of Patterns with Monte Carlo Tree Search

    The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that make it possible to elicit such interesting patterns from labeled data. One question remains fairly open: how should an accurate heuristic search technique be selected when exhaustive enumeration of the pattern space is infeasible? Existing approaches use beam search, sampling, and genetic algorithms to discover a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns that are of high quality and sufficiently different from each other are discovered. Our main contribution is to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). MCTS can be seen as an exhaustive search guided by random simulations, which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.
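
The abstract above frames pattern mining as a search that MCTS can explore under a limited budget. The following is only a minimal sketch of that idea, not the authors' algorithm: the toy data, the choice of weighted relative accuracy (WRAcc) as the quality measure, the UCB1 selection rule, and all names (`DATA`, `wracc`, `Node`, `mcts`) are assumptions made for illustration.

```python
import math
import random

# Hypothetical labeled transactions: (set of items, class label).
DATA = [
    ({"a", "b"}, 1), ({"a", "c"}, 1), ({"a", "b", "c"}, 1), ({"a"}, 1),
    ({"b"}, 0), ({"c"}, 0), ({"b", "c"}, 0), (set(), 0),
]
ITEMS = sorted({item for items, _ in DATA for item in items})


def wracc(pattern):
    """Weighted relative accuracy of an itemset pattern for class 1 (assumed measure)."""
    n = len(DATA)
    covered = [label for items, label in DATA if pattern <= items]
    if not covered:
        return 0.0
    p_class = sum(label for _, label in DATA) / n
    return (len(covered) / n) * (sum(covered) / len(covered) - p_class)


class Node:
    """Search-tree node; each child refines its parent's pattern by one item."""

    def __init__(self, pattern, parent=None):
        self.pattern = pattern
        self.parent = parent
        self.children = []
        self.untried = [i for i in ITEMS if i not in pattern]
        self.visits = 0
        self.total = 0.0

    def ucb1(self, c=0.5):
        # Mean reward plus an exploration bonus (standard UCB1 form).
        bonus = math.sqrt(math.log(self.parent.visits) / self.visits)
        return self.total / self.visits + c * bonus


def mcts(budget, seed=0):
    """Run `budget` iterations; return every evaluated pattern, best first."""
    rng = random.Random(seed)
    root = Node(frozenset())
    found = {}  # pattern -> quality; usable whenever the search is stopped
    for _ in range(budget):
        node = root
        # Selection: descend through fully expanded nodes.
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb1)
        # Expansion: add one unexplored refinement.
        if node.untried:
            item = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.pattern | {item}, parent=node)
            node.children.append(child)
            node = child
        # Simulation: random further refinements, rewarding the best quality seen.
        pattern = node.pattern
        reward = found[pattern] = wracc(pattern)
        rest = [i for i in ITEMS if i not in pattern]
        rng.shuffle(rest)
        for item in rest[: rng.randint(0, len(rest))]:
            pattern = pattern | {item}
            found[pattern] = wracc(pattern)
            reward = max(reward, found[pattern])
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return sorted(found.items(), key=lambda kv: -kv[1])
```

Calling `mcts(50)` returns the evaluated patterns ranked by quality, and a smaller budget simply returns a shorter list, which is the anytime behaviour the abstract refers to. This sketch lets the same itemset appear on several branches; a serious implementation would need to handle that redundancy.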

    Cross-Lingual Knowledge Transfer for Clinical Phenotyping

    Clinical phenotyping enables the automatic extraction of clinical conditions from patient records, which can be beneficial to doctors and clinics worldwide. However, current state-of-the-art models are mostly applicable to clinical notes written in English. We therefore investigate cross-lingual knowledge transfer strategies for carrying out this task in clinics that do not use the English language and have only a small amount of in-domain data available. We evaluate these strategies for a Greek and a Spanish clinic, leveraging clinical notes from different clinical domains such as cardiology, oncology, and the ICU. Our results reveal two strategies that outperform the state of the art: translation-based methods in combination with domain-specific encoders, and cross-lingual encoders plus adapters. We find that these strategies perform especially well for classifying rare phenotypes, and we advise on which method to prefer in which situation. Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness. Comment: LREC 2022 submission: January 202

    Explainable methods for knowledge graph refinement and exploration via symbolic reasoning

    Knowledge Graphs (KGs) have applications in many domains such as Finance, Manufacturing, and Healthcare. While recent efforts have created large KGs, their content is far from complete and sometimes includes invalid statements. Therefore, it is crucial to refine the constructed KGs to enhance their coverage and accuracy via KG completion and KG validation. It is also vital to provide human-comprehensible explanations for such refinements, so that humans can trust the KG quality. Enabling KG exploration, by search and browsing, is also essential for users to understand the value and limitations of a KG for downstream applications. However, the large size of KGs makes KG exploration very challenging. While the type taxonomy of KGs is a useful asset along these lines, it remains insufficient for deep exploration. In this dissertation we tackle the aforementioned challenges of KG refinement and KG exploration by combining logical reasoning over the KG with other techniques such as KG embedding models and text mining. Through this combination, we introduce methods that provide human-understandable output. Concretely, we introduce methods to tackle KG incompleteness by learning exception-aware rules over the existing KG. The learned rules are then used to infer missing links in the KG accurately. Furthermore, we propose a framework for constructing human-comprehensible explanations for candidate facts from both KG and text. The extracted explanations are used to ensure the validity of KG facts. Finally, to facilitate KG exploration, we introduce a method that combines KG embeddings with rule mining to compute informative entity clusters with explanations.

    Knowledge graphs have many applications in various domains, for example in finance and healthcare. However, knowledge graphs are incomplete and also contain invalid data. High coverage and correctness require new methods for knowledge graph completion and knowledge graph validation. Together, the two tasks are referred to as knowledge graph refinement. An important aspect here is the explainability and comprehensibility of knowledge graph content for users. In applications, exploration of knowledge graphs by users is also of particular importance. Searching and navigating the graph helps users to better understand the knowledge content and its limitations. Because of the huge number of entities and facts, knowledge graph exploration is a challenge. Taxonomic type systems help, but they are not sufficient for deeper exploration. This dissertation addresses the challenges of knowledge graph refinement and knowledge graph exploration through algorithmic inference over the knowledge graph. It extends logical reasoning and combines it with other methods, in particular neural knowledge graph embeddings and text mining. These new methods produce output with explanations for users. In particular, the dissertation makes the following contributions:
    • For knowledge graph completion, we present ExRuL, a method for revising Horn rules by adding exception conditions to the rule bodies. The extended rules can infer new facts and thus close gaps in the knowledge graph. Experiments with large knowledge graphs show that this method substantially reduces errors in the derived facts and provides user-friendly explanations.
    • With RuLES, we present a rule learning method that is based on probabilistic representations of missing facts. The method iteratively extends the rules induced from a knowledge graph by combining neural knowledge graph embeddings with information from text corpora. New metrics for rule quality are used during rule generation. Experiments show that RuLES substantially improves the quality of the learned rules and of their predictions.
    • To support knowledge graph validation, we present ExFaKT, a framework for constructing explanations for candidate facts. The method uses rules to transform candidates into a set of statements that are easier to find and to validate or refute. The output of ExFaKT is a set of semantic evidence for candidate facts, extracted from text corpora and the knowledge graph. Experiments show that the transformations clearly improve the yield and quality of the discovered explanations. The generated explanations support both manual knowledge graph validation by curators and automatic validation.
    • To support knowledge graph exploration, we present ExCut, a method for generating informative entity clusters with explanations using knowledge graph embeddings and automatically induced rules. A cluster explanation consists of a combination of relations between the entities that identifies the cluster. ExCut simultaneously improves cluster quality and cluster explainability by iteratively interleaving the learning of embeddings and rules. Experiments show that ExCut computes high-quality clusters and that the cluster explanations are informative for users.
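
To make the idea of an exception-aware rule concrete, the following minimal sketch applies one Horn rule to a toy knowledge graph, with and without an exception condition in the rule body. It is an illustration only and not the dissertation's implementation: the triples, the rule, the exception, and the function name `infer_lives_in` are all hypothetical.

```python
# Toy KG as (subject, predicate, object) triples; every fact here is hypothetical.
KG = {
    ("ann", "marriedTo", "bob"), ("bob", "livesIn", "berlin"),
    ("ann", "livesIn", "berlin"),
    ("cat", "marriedTo", "dan"), ("dan", "livesIn", "paris"),
    ("cat", "type", "researcher"),
}


def infer_lives_in(kg, exception=None):
    """Apply livesIn(X, Z) <- marriedTo(X, Y), livesIn(Y, Z).

    If `exception` is given, the body also requires `not type(X, exception)`,
    so the rule is not applied to subjects of that type.
    Returns only facts that are not already in the KG.
    """
    inferred = set()
    for x, pred, y in kg:
        if pred != "marriedTo":
            continue
        if exception is not None and (x, "type", exception) in kg:
            continue  # exception condition blocks the rule for this subject
        for y2, pred2, z in kg:
            if pred2 == "livesIn" and y2 == y and (x, "livesIn", z) not in kg:
                inferred.add((x, "livesIn", z))
    return inferred
```

On this toy graph the plain rule predicts that `cat` lives in `paris`, while the revised rule with the exception `researcher` makes no prediction for `cat`. Whether such an exception is justified is what a method in the style described above would have to learn from the data.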

    Methods for Finding Interesting Nodes in Networks (Menetelmiä mielenkiintoisten solmujen löytämiseen verkostoista)

    With the increasing amount of graph-structured data available, finding interesting objects, i.e., nodes in graphs, becomes more and more important. In this thesis we focus on finding interesting nodes and sets of nodes in graphs or networks. We propose several definitions of node interestingness as well as different methods to find such nodes. Specifically, we propose to consider nodes as interesting based on their relevance and non-redundancy or representativeness w.r.t. the graph topology, as well as based on their characterisation for a class, such as a given node attribute value. Identifying nodes that are relevant, but non-redundant to each other is motivated by the need to get an overview of different pieces of information related to a set of given nodes. Finding representative nodes is of interest, e.g. when the user needs or wants to select a few nodes that abstract the large set of nodes. Discovering nodes characteristic for a class helps to understand the causes behind that class. Next, four methods are proposed to find a representative set of interesting nodes. The first one incrementally picks one interesting node after another. The second iteratively changes the set of nodes to improve its overall interestingness. The third method clusters nodes and picks a medoid node as a representative for each cluster. Finally, the fourth method contrasts diverse sets of nodes in order to select nodes characteristic for their class, even if the classes are not identical across the selected nodes. The first three methods are relatively simple and are based on the graph topology and a similarity or distance function for nodes. For the second and third, the user needs to specify one parameter, either an initial set of k nodes or k, the size of the set. 
The fourth method assumes attributes and class attributes for each node, a class-related interestingness measure, and possible sets of nodes that the user wants to contrast, such as sets of nodes that represent different time points. All four methods are flexible and generic. They can, in principle, be applied to any weighted graph or network regardless of what the nodes, edges, weights, or attributes represent. Application areas for the methods developed in this thesis include word co-occurrence networks, biological networks, social networks, data traffic networks, and the World Wide Web. As an illustrative example, consider a word co-occurrence network. There, finding terms (nodes in the graph) that are relevant to some given nodes, e.g. branch and root, may help to identify different shared contexts such as botany, mathematics, and linguistics. A real-life application lies in biology, where finding nodes (biological entities, e.g. biological processes or pathways) that are relevant to other, given nodes (e.g. some genes or proteins) may help in identifying biological mechanisms that are possibly shared by both the genes and proteins.

    The dissertation deals with methods for network mining. Its aim is to find interesting information in weighted networks. Text corpora, biological data, connections between people, or the internet, for example, can be viewed as weighted networks. In such networks, nodes represent concepts (e.g. words, genes, people, or web pages) and edges represent the relationships between them (e.g. two words occur in the same sentence, a gene codes for a protein, two people are friends, or a web page links to another web page). Edge weights can correspond, for example, to the strength or reliability of a connection. The dissertation presents various definitions of node interestingness based on the network structure or on node attributes, as well as several methods for finding interesting nodes. Interestingness can be defined, for example, as relevance with respect to some given nodes and, on the other hand, as the mutual dissimilarity of the interesting nodes. For example, a so-called greedy method can be used to find mutually dissimilar nodes one at a time. The results of the dissertation can be applied, for example, to a word network obtained by processing text data, in which the connection between two words is stronger the more often they tend to occur together in the same sentences. Different contexts of use and even different meanings of words can then be found automatically. If the target word is, say, "juuri" (Finnish for "root"), then words that are related to it but unrelated to each other include "puu" ("tree"; biological sense: the root of a plant), "yhtälö" ("equation"; mathematical sense: the solution, or root, of an equation), and "indoeurooppalainen" ("Indo-European"; linguistic sense: the stem, or root, of a word). Such methods can be applied, for example, in a search engine: the results for the query "juuri" would include results from contexts of use that are as different as possible, so that the meaning intended by the user is more likely to be covered. An important application area for the methods of the dissertation is biological networks, in which nodes represent biological concepts (e.g. genes, proteins, or diseases) and edges represent the relationships between them (e.g. a gene codes for a protein, or a protein is active in a particular disease). The methods can be used, for example, to search for biological mechanisms that affect diseases by locating a representative set of other nodes that are connected in the network to the disease and to genes possibly related to it. These nodes can help biologists understand possible links between the genes and the disease and thus focus their further research on the most promising genes, proteins, and so on. The definitions of node interestingness presented in the dissertation, and the methods proposed for finding such nodes, are general and can in principle be applied to any network regardless of what the nodes, edges, or weights represent. Experiments on different networks show that the methods find interesting nodes.
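
The first method described above picks relevant but mutually non-redundant nodes one at a time. The sketch below illustrates that greedy idea on a tiny word network. It is not the thesis's algorithm: the similarity values, the scoring rule (relevance discounted by the highest similarity to an already selected node), and the names `SIM`, `sim`, and `greedy_select` are assumptions chosen for illustration.

```python
# Hypothetical symmetric similarities in [0, 1] between words in a co-occurrence network.
SIM = {
    ("root", "tree"): 0.9, ("root", "plant"): 0.8,
    ("root", "equation"): 0.7, ("root", "stem"): 0.6,
    ("tree", "plant"): 0.9, ("tree", "equation"): 0.1, ("tree", "stem"): 0.2,
    ("plant", "equation"): 0.1, ("plant", "stem"): 0.2,
    ("equation", "stem"): 0.1,
}


def sim(u, v):
    """Look up the similarity of two nodes; unknown pairs count as unrelated."""
    if u == v:
        return 1.0
    return SIM.get((u, v), SIM.get((v, u), 0.0))


def greedy_select(query, candidates, k):
    """Pick up to k nodes that are relevant to `query` but dissimilar to each other."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(v):
            # Relevance to the query, discounted by redundancy with earlier picks.
            redundancy = max((sim(v, s) for s in selected), default=0.0)
            return sim(query, v) * (1.0 - redundancy)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the query `root` and the candidates `tree`, `plant`, `equation`, and `stem`, the sketch selects `tree`, `equation`, and `stem`: `plant` is highly relevant but nearly redundant with `tree`, so it is passed over in favour of nodes from other contexts.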

    Evaluation of Feature Selection Techniques for Breast Cancer Risk Prediction

    This study evaluates several feature ranking techniques together with several machine learning classifiers in order to identify relevant factors regarding the probability of contracting breast cancer and to improve the performance of breast cancer risk prediction models in a healthy population. The dataset, with 919 cases and 946 controls, comes from the MCC-Spain study and includes only environmental and genetic features. Breast cancer is a major public health problem. Our aim is to analyze which factors in the cancer risk prediction model are the most important for breast cancer prediction. Likewise, quantifying the stability of feature selection methods becomes essential before trying to gain insight into the data. This paper assesses several feature selection algorithms in terms of performance for a set of predictive models. Furthermore, their robustness is quantified to analyze both the similarity between the feature selection rankings and their own stability. The ranking provided by the SVM-RFE approach leads to the best performance in terms of the area under the ROC curve (AUC) metric. The top 47 ranked features obtained with this approach, fed to the Logistic Regression classifier, achieve an AUC of 0.616. This is an improvement of 5.8% in comparison with the full feature set. Furthermore, the SVM-RFE ranking technique turned out to be highly stable (as did Random Forest), whereas Relief and the wrapper approaches are quite unstable. This study demonstrates that the stability and performance of the model should be studied together: Random Forest and SVM-RFE turned out to be the most stable algorithms, but in terms of model performance SVM-RFE outperforms Random Forest.

    The study was partially funded by the "Accion Transversal del Cancer", approved by the Spanish Ministry Council on 11 October 2007, by the Instituto de Salud Carlos III-FEDER (PI08/1770, PI08/0533, PI08/1359, PS09/00773, PS09/01286, PS09/01903, PS09/02078, PS09/01662, PI11/01403, PI11/01889, PI11/00226, PI11/01810, PI11/02213, PI12/00488, PI12/00265, PI12/01270, PI12/00715, PI12/00150), by the Fundación Marqués de Valdecilla (API 10/09), by the ICGC International Cancer Genome Consortium CLL, by the Junta de Castilla y León (LE22A10-2), by the Consejería de Salud of the Junta de Andalucía (PI-0571), by the Conselleria de Sanitat of the Generalitat Valenciana (AP 061/10), by the Recercaixa (2010ACUP 00310), by the Regional Government of the Basque Country, by European Commission grants FOOD-CT-2006-036224-HIWATE, by the Spanish Association Against Cancer (AECC) Scientific Foundation, and by the Catalan Government DURSI grant 2009SGR1489. Samples: Biological samples were stored at the Parc de Salut MAR Biobank (MARBiobanc; Barcelona), which is supported by Instituto de Salud Carlos III FEDER (RD09/0076/00036), and at the Public Health Laboratory from Gipuzkoa and the Basque Biobank. Sample collection was also supported by the Xarxa de Bancs de Tumors de Catalunya sponsored by Pla Director d'Oncologia de Catalunya (XBTC). Biological samples were stored at the "Biobanco La Fe", which is supported by Instituto de Salud Carlos III (RD 09 0076/00021), and at FISABIO biobanking, which is supported by Instituto de Salud Carlos III (RD09 0076/00058).
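
The abstract above quantifies how stable each feature ranking technique is, but does not say which stability measure was used. As one possible illustration, the sketch below scores stability as the mean pairwise Jaccard index of the top-k features across rankings obtained from different resamples. The choice of the Jaccard index, the feature names, and the function names are assumptions, not details taken from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability(rankings, k):
    """Mean pairwise Jaccard index of the top-k features across rankings.

    `rankings` is a list of feature lists, each ordered from most to least
    important, e.g. one ranking per bootstrap resample of the data.
    """
    if len(rankings) < 2:
        raise ValueError("need at least two rankings to measure stability")
    tops = [ranking[:k] for ranking in rankings]
    pairs = list(combinations(tops, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value of 1.0 means every resample selects the same top-k features, and values near 0.0 mean the selected sets barely overlap. The same `jaccard` function can also compare the top-k sets produced by two different ranking techniques.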

    How can SMEs benefit from big data? Challenges and a path forward

    Big data is big news, and large companies in all sectors are making significant advances in their customer relations, product selection and development, and consequent profitability by using this valuable commodity. Small and medium enterprises (SMEs) have proved to be slow adopters of the new technology of big data analytics and are in danger of being left behind. In Europe, SMEs are a vital part of the economy, and the challenges they encounter need to be addressed as a matter of urgency. This paper identifies barriers to SME uptake of big data analytics and recognises the complex challenge these barriers pose to all stakeholders, including national and international policy makers and the IT, business management, and data science communities. The paper proposes a big data maturity model for SMEs as a first step towards an SME roadmap to data analytics. It considers the state of the art of IT with respect to usability and usefulness for SMEs and discusses how SMEs can overcome the barriers preventing them from adopting existing solutions. The paper then considers management perspectives and the role of maturity models in enhancing and structuring the adoption of data analytics in an organisation. The history of total quality management is reviewed to inform the core aspects of implanting a new paradigm. The paper concludes with recommendations to help SMEs develop their big data capability and enable them to continue as the engines of European industrial and business success. Copyright © 2016 John Wiley & Sons, Ltd. Peer reviewed. Postprint (author's final draft)