
    Effect of heuristics on serendipity in path-based storytelling with linked data

    Path-based storytelling with Linked Data on the Web lets users discover concepts in an entertaining and educational way. Given a query context, many state-of-the-art pathfinding approaches aim to tell a story that coincides with the user's expectations by investigating paths over Linked Data on the Web. By taking serendipity in storytelling into account, we aim to improve and tailor existing approaches to better fit user expectations, so that users can discover interesting knowledge without feeling unsure or even lost in the story facts. To this end, we propose to optimize both the estimation of links between facts and the selection of facts in a story, increasing the consistency and relevancy of links between facts through additional domain delineation and refinement steps. To address multiple aspects of serendipity, we propose and investigate combinations of weights and heuristics in the paths that form the essential building blocks of each story. Our experimental findings with stories based on DBpedia indicate improvements when applying the optimized algorithm.
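    The abstract does not spell out the optimized algorithm; as a rough, hypothetical illustration of the underlying idea of weighted path search between two concepts in a Linked Data graph, the Python sketch below runs Dijkstra over a toy set of DBpedia-style triples, with made-up edge weights standing in for the consistency/relevancy heuristics.

```python
import heapq

# Toy Linked Data graph: (subject, predicate, object) triples with a
# hypothetical "link relevance" weight (lower = more consistent/relevant).
triples = [
    ("dbr:Mozart", "dbo:birthPlace", "dbr:Salzburg", 1.0),
    ("dbr:Salzburg", "dbo:country", "dbr:Austria", 2.0),
    ("dbr:Mozart", "dbo:influencedBy", "dbr:Haydn", 1.5),
    ("dbr:Haydn", "dbo:birthPlace", "dbr:Austria", 1.0),
]

# Build an undirected adjacency list (a simplification), keeping the predicate
# so the resulting path can be read as a chain of story facts.
graph = {}
for s, p, o, w in triples:
    graph.setdefault(s, []).append((o, p, w))
    graph.setdefault(o, []).append((s, p, w))

def story_path(graph, start, goal):
    """Dijkstra search returning the cheapest chain of facts from start to goal."""
    frontier = [(0.0, start, [])]  # (cost so far, current node, facts used)
    visited = set()
    while frontier:
        cost, node, facts = heapq.heappop(frontier)
        if node == goal:
            return cost, facts
        if node in visited:
            continue
        visited.add(node)
        for neighbor, predicate, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(
                    frontier,
                    (cost + weight, neighbor, facts + [(node, predicate, neighbor)]),
                )
    return float("inf"), []

cost, facts = story_path(graph, "dbr:Mozart", "dbr:Austria")
for s, p, o in facts:
    print(f"{s} --{p}--> {o}")
```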

    Mining Novellas from PubMed Abstracts using a Storytelling Algorithm

    Motivation: There is now a multitude of articles, published across a diversity of journals, providing information about genes, proteins, pathways, and entire processes. Each article investigates a particular subset of a biological process, but to gain insight into the functioning of a system as a whole, we must computationally integrate information across multiple publications. This is especially important in problems such as modeling cross-talk in signaling networks, designing drug therapies for combinatorial selectivity, and unraveling the role of gene interactions in deleterious phenotypes, where the cost of performing combinatorial screens is exorbitant. Results: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for unraveling combinatorial relationships. It involves the systematic application of a 'storytelling' algorithm followed by compression of the stories into 'novellas'. Given a start and an end publication, typically with little or no overlap in content, storytelling identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. Stories discovered in this way provide an argued approach to relating distant concepts through compositions of related concepts. The chains of links employed by stories are then mined to find frequently reused sub-stories, which can be compressed to yield novellas, or compact templates of connections. We demonstrate a successful application of storytelling and novella finding to modeling combinatorial relationships between the introduction of extracellular factors and downstream cellular events. Availability: A story visualizer, suitable for interactive exploration of the stories and novellas described in this paper, is available for demo/download at https://bioinformatics.cs.vt.edu/storytelling.
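    As a minimal, hypothetical sketch of the chaining idea (not the paper's actual algorithm), the Python fragment below greedily hops from a start document to an end document, requiring a minimum word overlap between neighboring documents; the documents and the threshold are invented.

```python
# Greedy "storytelling" chain: each hop must overlap the current document
# (Jaccard word overlap) and we prefer hops that move toward the end document.

def words(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def storytelling_chain(docs, start, end, min_overlap=0.1):
    """Return a chain of document ids from start to end with overlapping content."""
    chain, current, visited = [start], start, {start}
    while current != end:
        candidates = [
            d for d in docs
            if d not in visited
            and jaccard(words(docs[current]), words(docs[d])) >= min_overlap
        ]
        if not candidates:
            return None  # no story found under this overlap threshold
        # Prefer the candidate most similar to the end document.
        current = max(candidates, key=lambda d: jaccard(words(docs[d]), words(docs[end])))
        visited.add(current)
        chain.append(current)
    return chain

# Made-up mini "abstracts" standing in for PubMed documents.
docs = {
    "A": "extracellular growth factor binds receptor tyrosine kinase",
    "B": "receptor tyrosine kinase activates ras and map kinase cascade",
    "C": "map kinase cascade drives transcription of cell cycle genes",
}
print(storytelling_chain(docs, "A", "C"))  # -> ['A', 'B', 'C']
```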

    Big Data Techniques in Science Education and What Story Google Trends Tells Us About Science?

    The intention of this work is to provide a quick overview of what Big Data is and to present a few examples of techniques through which it can contribute to Science Education. Google offers Google Trends (GT), a free analysis tool that allows users to sort through several years of Google search queries from around the world and obtain a graphical plot showing the popularity of chosen search terms over both region and time. Three kinds of data, organized by time, region, and frequency of search, are evaluated in terms of their compatibility with a form of “correlation analysis”. A few techniques for extracting meaning from them are exemplified through geographical searches for ‘Solar Eclipse’ in the USA and through temporal searches for the term ‘research’ in the period 2013-2017. In addition, and as the main study, an experiment was conducted to replicate with Big Data and GT the survey by Taşdere, Özsevgeç, and Turkmen on the Nature of Science (NoS). To that end, the same nine concepts they selected were searched in GT. Two-way correlation analysis was performed on these words, and pairs with a Pearson correlation of 0.8 or higher were used to build a conceptual network. Three main levels emerge in our hierarchical conceptual network and, as a result of this structuring, a story can be told: at the most publicly understandable level, science is seen as associated with ‘laws’; at a less visible level, research is associated with ‘building theories’; and, at an even less understood level, scientists do experiments to test hypotheses, which are confirmed or not by observation, an image of scientists' work shaped to a large degree by popular media.
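    The conceptual-network construction described above can be sketched in a few lines: compute pairwise Pearson correlations between search-interest time series and keep pairs with r >= 0.8 as edges. The Python below is a toy illustration with made-up Trends values, not the study's data.

```python
import numpy as np

# Hypothetical weekly Google Trends interest values for a few NoS-related terms
# (invented numbers; real data would come from a Google Trends export).
series = {
    "science":    [60, 62, 65, 70, 68, 72],
    "law":        [58, 61, 64, 69, 67, 71],
    "hypothesis": [20, 25, 22, 30, 28, 27],
    "experiment": [22, 27, 23, 31, 30, 29],
}

terms = list(series)
data = np.array([series[t] for t in terms], dtype=float)
corr = np.corrcoef(data)  # pairwise Pearson correlation matrix (rows = terms)

# Keep only pairs with r >= 0.8, as in the conceptual-network construction.
edges = [
    (terms[i], terms[j], round(corr[i, j], 2))
    for i in range(len(terms))
    for j in range(i + 1, len(terms))
    if corr[i, j] >= 0.8
]
print(edges)
```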

    Compositional Mining of Multi-Relational Biological Datasets

    High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both of these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.
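    As a loose, hypothetical illustration of composing two vocabulary-shift primitives (not the paper's implementation), the Python sketch below takes the gene set of a pretend bicluster and redescribes it against toy GO-style annotations using Jaccard overlap; gene names, annotations, and the threshold are invented.

```python
# Step 1 (pretend output of a biclustering run): genes co-expressed under stress.
bicluster_genes = {"HSPA1A", "HSPA1B", "DNAJB1", "BAG3"}

# Step 2: candidate descriptors from another domain (toy GO-style annotations).
go_annotations = {
    "GO:response_to_heat": {"HSPA1A", "HSPA1B", "DNAJB1", "HSPH1"},
    "GO:protein_folding":  {"HSPA1A", "DNAJB1", "BAG3", "CCT2"},
    "GO:cell_cycle":       {"CDK1", "CCNB1", "PLK1"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Redescription step: keep GO terms whose gene set overlaps the bicluster enough,
# i.e. re-express the bicluster's extension in a second vocabulary.
redescriptions = {
    term: round(jaccard(genes, bicluster_genes), 2)
    for term, genes in go_annotations.items()
    if jaccard(genes, bicluster_genes) >= 0.5
}
print(redescriptions)
```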

    Découverte de définitions dans le web des données (Definition discovery in the web of data)

    Get PDF
    In this thesis, we are interested in the web of data and in the knowledge units that can potentially be discovered within it. The web of data can be considered as a very large graph consisting of connected RDF triple databases. An RDF triple, denoted (subject, predicate, object), represents a relation (the predicate) existing between two resources (the subject and the object). Resources can belong to one or more classes, where a class aggregates resources sharing common characteristics. These RDF triple databases can thus be seen as interconnected knowledge bases. Most of the time, these knowledge bases are built collaboratively by human users. This is particularly the case of DBpedia, a central knowledge base within the web of data, which encodes Wikipedia content in RDF format. DBpedia is built from two types of Wikipedia data: on the one hand, (semi-)structured data such as infoboxes, and, on the other hand, categories, which are thematic clusters of manually generated pages. However, the semantics of categories in DBpedia, that is, the reason a human agent has bundled resources together, is rarely made explicit. In fact, considering a class, a software agent has access to the resources that are grouped together, i.e. the class extension, but it generally does not have access to the "reasons" underlying such a grouping, i.e. it does not have the class intension. Considering a category as a class of resources, we aim at discovering an intensional description of the category. More precisely, given a class extension, we search for the related intension. The (extension, intension) pair which is produced provides the final definition and enables classification-based reasoning for software agents. This can be expressed in terms of necessary and sufficient conditions: if x belongs to the class C, then x has the property P (necessary condition), and if x has the property P, then it belongs to the class C (sufficient condition). Two complementary data mining methods allow us to materialize the discovery of definitions: association rule mining and redescription mining. In this thesis, we first present a state of the art on association rules and redescriptions. Next, we propose an adaptation of each data mining method for the task of definition discovery. We then detail a set of experiments on DBpedia and compare the two approaches qualitatively and quantitatively. Finally, we discuss how the discovered definitions can be added to DBpedia to improve its quality in terms of consistency and completeness.
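    The necessary-and-sufficient-condition view of a definition can be illustrated with a small, hypothetical sketch; the resources, properties, and category below are invented, and the thesis itself relies on association rule and redescription mining over DBpedia rather than this brute-force check.

```python
# Toy resource descriptions: each DBpedia-style resource maps to a set of
# (property, value) pairs (all made up for illustration).
resources = {
    "dbr:Paris":  {("dbo:type", "City"), ("dbo:country", "France"), ("dbo:capitalOf", "France")},
    "dbr:Lyon":   {("dbo:type", "City"), ("dbo:country", "France")},
    "dbr:Berlin": {("dbo:type", "City"), ("dbo:country", "Germany")},
}
category_extension = {"dbr:Paris", "dbr:Lyon"}  # e.g. a "Cities in France" category

def definition(extension, resources):
    """Return (property, value) pairs that are necessary and sufficient for membership."""
    members = [resources[r] for r in extension]
    candidates = set.intersection(*members)  # necessary: shared by every member
    return {
        pv for pv in candidates               # sufficient: nothing outside the class has it
        if all(pv not in props for r, props in resources.items() if r not in extension)
    }

print(definition(category_extension, resources))  # -> {('dbo:country', 'France')}
```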