
    Effect of heuristics on serendipity in path-based storytelling with linked data

    Path-based storytelling with Linked Data on the Web lets users discover concepts in an entertaining and educational way. Given a query context, many state-of-the-art pathfinding approaches aim to tell a story that coincides with the user's expectations by investigating paths over Linked Data on the Web. By taking serendipity in storytelling into account, we aim to improve and tailor existing approaches to better fit user expectations, so that users can discover interesting knowledge without feeling unsure or even lost in the story facts. To this end, we propose to optimize both the estimation of links between facts and the selection of facts in a story, increasing the consistency and relevancy of links between facts through additional domain delineation and refinement steps. To address multiple aspects of serendipity, we propose and investigate combinations of weights and heuristics in the paths that form the essential building blocks of each story. Our experimental findings with stories based on DBpedia indicate improvements when applying the optimized algorithm.
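    The abstract does not spell out the optimized algorithm; as a rough, hypothetical illustration of the underlying idea of weighted path search between two concepts in a Linked Data graph, the Python sketch below runs Dijkstra over a toy set of DBpedia-style triples, with made-up edge weights standing in for the consistency/relevancy heuristics.

```python
import heapq

# Toy Linked Data graph: (subject, predicate, object) triples with a
# hypothetical "link relevance" weight (lower = more consistent/relevant).
triples = [
    ("dbr:Mozart", "dbo:birthPlace", "dbr:Salzburg", 1.0),
    ("dbr:Salzburg", "dbo:country", "dbr:Austria", 2.0),
    ("dbr:Mozart", "dbo:influencedBy", "dbr:Haydn", 1.5),
    ("dbr:Haydn", "dbo:birthPlace", "dbr:Austria", 1.0),
]

# Build an undirected adjacency list (a simplification), keeping the predicate
# so the resulting path can be read as a chain of story facts.
graph = {}
for s, p, o, w in triples:
    graph.setdefault(s, []).append((o, p, w))
    graph.setdefault(o, []).append((s, p, w))

def story_path(graph, start, goal):
    """Dijkstra search returning the cheapest chain of facts from start to goal."""
    frontier = [(0.0, start, [])]  # (cost so far, current node, facts used)
    visited = set()
    while frontier:
        cost, node, facts = heapq.heappop(frontier)
        if node == goal:
            return cost, facts
        if node in visited:
            continue
        visited.add(node)
        for neighbor, predicate, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(
                    frontier,
                    (cost + weight, neighbor, facts + [(node, predicate, neighbor)]),
                )
    return float("inf"), []

cost, facts = story_path(graph, "dbr:Mozart", "dbr:Austria")
for s, p, o in facts:
    print(f"{s} --{p}--> {o}")
```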

    Mining Novellas from PubMed Abstracts using a Storytelling Algorithm

    Motivation: There is now a multitude of articles, published across a diversity of journals, providing information about genes, proteins, pathways, and entire processes. Each article investigates a particular subset of a biological process, but to gain insight into the functioning of a system as a whole, we must computationally integrate information across multiple publications. This is especially important in problems such as modeling cross-talk in signaling networks, designing drug therapies for combinatorial selectivity, and unraveling the role of gene interactions in deleterious phenotypes, where the cost of performing combinatorial screens is exorbitant. Results: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for unraveling combinatorial relationships. It involves the systematic application of a 'storytelling' algorithm followed by compression of the stories into 'novellas'. Given a start and an end publication, typically with little or no overlap in content, storytelling identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. Stories discovered in this way provide an argued approach to relating distant concepts through compositions of related concepts. The chains of links employed by stories are then mined to find frequently reused sub-stories, which can be compressed to yield novellas, or compact templates of connections. We demonstrate a successful application of storytelling and novella finding to modeling combinatorial relationships between the introduction of extracellular factors and downstream cellular events. Availability: A story visualizer, suitable for interactive exploration of the stories and novellas described in this paper, is available for demo/download at https://bioinformatics.cs.vt.edu/storytelling.
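    As a minimal, hypothetical sketch of the chaining idea (not the paper's actual algorithm), the Python fragment below greedily hops from a start document to an end document, requiring a minimum word overlap between neighboring documents; the documents and the threshold are invented.

```python
# Greedy "storytelling" chain: each hop must overlap the current document
# (Jaccard word overlap) and we prefer hops that move toward the end document.

def words(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def storytelling_chain(docs, start, end, min_overlap=0.1):
    """Return a chain of document ids from start to end with overlapping content."""
    chain, current, visited = [start], start, {start}
    while current != end:
        candidates = [
            d for d in docs
            if d not in visited
            and jaccard(words(docs[current]), words(docs[d])) >= min_overlap
        ]
        if not candidates:
            return None  # no story found under this overlap threshold
        # Prefer the candidate most similar to the end document.
        current = max(candidates, key=lambda d: jaccard(words(docs[d]), words(docs[end])))
        visited.add(current)
        chain.append(current)
    return chain

# Made-up mini "abstracts" standing in for PubMed documents.
docs = {
    "A": "extracellular growth factor binds receptor tyrosine kinase",
    "B": "receptor tyrosine kinase activates ras and map kinase cascade",
    "C": "map kinase cascade drives transcription of cell cycle genes",
}
print(storytelling_chain(docs, "A", "C"))  # -> ['A', 'B', 'C']
```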

    Big Data Techniques in Science Education and What Story Google Trends Tells Us About Science?

    The intention of this work is to provide a quick overview of what Big Data is and to present a few examples of techniques through which it can contribute to Science Education. Google offers Google Trends (GT), a free analysis tool that allows users to sort through several years of Google search queries from around the world and obtain a graphical plot showing the popularity of chosen search terms over both region and time. Three kinds of data, organized by time, region, and frequency of search, are evaluated in terms of their compatibility with a form of “correlation analysis”. A few techniques for extracting meaning from them are exemplified through geographical searches for ‘Solar Eclipse’ in the USA and through temporal searches for the term ‘research’ in the period 2013-2017. In addition, and as the main study, an experiment was conducted to replicate with Big Data and GT the survey by Taşdere, Özsevgeç, and Turkmen on the Nature of Science (NoS). To that end, the same nine concepts they selected were searched in GT. Two-way correlation analysis was performed on these words, and pairs with a Pearson correlation of 0.8 or higher were used to build a conceptual network. Three main levels emerge in our hierarchical conceptual network and, as a result of this structuring, a story can be told: at the most publicly understandable level, science is seen as associated with ‘laws’; at a less visible level, research is associated with ‘building theories’; and, at an even less understood level, scientists do experiments to test hypotheses, which are confirmed or not by observation, an image of scientists' work shaped to a large degree by popular media.
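    The conceptual-network construction described above can be sketched in a few lines: compute pairwise Pearson correlations between search-interest time series and keep pairs with r >= 0.8 as edges. The Python below is a toy illustration with made-up Trends values, not the study's data.

```python
import numpy as np

# Hypothetical weekly Google Trends interest values for a few NoS-related terms
# (invented numbers; real data would come from a Google Trends export).
series = {
    "science":    [60, 62, 65, 70, 68, 72],
    "law":        [58, 61, 64, 69, 67, 71],
    "hypothesis": [20, 25, 22, 30, 28, 27],
    "experiment": [22, 27, 23, 31, 30, 29],
}

terms = list(series)
data = np.array([series[t] for t in terms], dtype=float)
corr = np.corrcoef(data)  # pairwise Pearson correlation matrix (rows = terms)

# Keep only pairs with r >= 0.8, as in the conceptual-network construction.
edges = [
    (terms[i], terms[j], round(corr[i, j], 2))
    for i in range(len(terms))
    for j in range(i + 1, len(terms))
    if corr[i, j] >= 0.8
]
print(edges)
```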

    Compositional Mining of Multi-Relational Biological Datasets

    High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both of these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.
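    As a loose, hypothetical illustration of composing two vocabulary-shift primitives (not the paper's implementation), the Python sketch below takes the gene set of a pretend bicluster and redescribes it against toy GO-style annotations using Jaccard overlap; gene names, annotations, and the threshold are invented.

```python
# Step 1 (pretend output of a biclustering run): genes co-expressed under stress.
bicluster_genes = {"HSPA1A", "HSPA1B", "DNAJB1", "BAG3"}

# Step 2: candidate descriptors from another domain (toy GO-style annotations).
go_annotations = {
    "GO:response_to_heat": {"HSPA1A", "HSPA1B", "DNAJB1", "HSPH1"},
    "GO:protein_folding":  {"HSPA1A", "DNAJB1", "BAG3", "CCT2"},
    "GO:cell_cycle":       {"CDK1", "CCNB1", "PLK1"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Redescription step: keep GO terms whose gene set overlaps the bicluster enough,
# i.e. re-express the bicluster's extension in a second vocabulary.
redescriptions = {
    term: round(jaccard(genes, bicluster_genes), 2)
    for term, genes in go_annotations.items()
    if jaccard(genes, bicluster_genes) >= 0.5
}
print(redescriptions)
```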

    Découverte de définitions dans le web des données (Definition discovery in the web of data)

    Get PDF
    In this thesis, we are interested in the web of data and in the knowledge units that can potentially be discovered within it. The web of data can be considered as a very large graph consisting of connected RDF triple databases. An RDF triple, denoted (subject, predicate, object), represents a relation (the predicate) existing between two resources (the subject and the object). Resources can belong to one or more classes, where a class aggregates resources sharing common characteristics. These RDF triple databases can thus be seen as interconnected knowledge bases. Most of the time, these knowledge bases are built collaboratively by human users. This is particularly the case of DBpedia, a central knowledge base within the web of data, which encodes Wikipedia content in RDF format. DBpedia is built from two types of Wikipedia data: on the one hand, (semi-)structured data such as infoboxes, and, on the other hand, categories, which are thematic clusters of manually generated pages. However, the semantics of categories in DBpedia, that is, the reason a human agent has bundled resources together, is rarely made explicit. In fact, considering a class, a software agent has access to the resources that are grouped together, i.e. the class extension, but it generally does not have access to the "reasons" underlying such a grouping, i.e. it does not have the class intension. Considering a category as a class of resources, we aim at discovering an intensional description of the category. More precisely, given a class extension, we search for the related intension. The (extension, intension) pair which is produced provides the final definition and enables classification-based reasoning for software agents. This can be expressed in terms of necessary and sufficient conditions: if x belongs to the class C, then x has the property P (necessary condition), and if x has the property P, then it belongs to the class C (sufficient condition). Two complementary data mining methods allow us to materialize the discovery of definitions: association rule mining and redescription mining. In this thesis, we first present a state of the art on association rules and redescriptions. Next, we propose an adaptation of each data mining method for the task of definition discovery. We then detail a set of experiments on DBpedia and compare the two approaches qualitatively and quantitatively. Finally, we discuss how the discovered definitions can be added to DBpedia to improve its quality in terms of consistency and completeness.
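    The necessary-and-sufficient-condition view of a definition can be illustrated with a small, hypothetical sketch; the resources, properties, and category below are invented, and the thesis itself relies on association rule and redescription mining over DBpedia rather than this brute-force check.

```python
# Toy resource descriptions: each DBpedia-style resource maps to a set of
# (property, value) pairs (all made up for illustration).
resources = {
    "dbr:Paris":  {("dbo:type", "City"), ("dbo:country", "France"), ("dbo:capitalOf", "France")},
    "dbr:Lyon":   {("dbo:type", "City"), ("dbo:country", "France")},
    "dbr:Berlin": {("dbo:type", "City"), ("dbo:country", "Germany")},
}
category_extension = {"dbr:Paris", "dbr:Lyon"}  # e.g. a "Cities in France" category

def definition(extension, resources):
    """Return (property, value) pairs that are necessary and sufficient for membership."""
    members = [resources[r] for r in extension]
    candidates = set.intersection(*members)  # necessary: shared by every member
    return {
        pv for pv in candidates               # sufficient: nothing outside the class has it
        if all(pv not in props for r, props in resources.items() if r not in extension)
    }

print(definition(category_extension, resources))  # -> {('dbo:country', 'France')}
```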