210 research outputs found

    Mapping Big Data into Knowledge Space with Cognitive Cyber-Infrastructure

    Big data research has attracted great attention in science, technology, industry and society. It is developing alongside the evolving scientific paradigm, the fourth industrial revolution, and the transformational innovation of technologies. However, its nature and fundamental challenge have not been recognized, and its own methodology has not been formed. This paper explores and answers the following questions: What is big data? What are the basic methods for representing, managing and analyzing big data? What is the relationship between big data and knowledge? Can we find a mapping from big data into knowledge space? What kind of infrastructure is required to support not only big data management and analysis but also knowledge discovery, sharing and management? What is the relationship between big data and the scientific paradigm? What is the nature and fundamental challenge of big data computing? A multi-dimensional perspective is presented toward a methodology of big data computing. Comment: 59 pages

    The Example Guru: Suggesting Examples to Novice Programmers in an Artifact-Based Context

    Programmers in artifact-based contexts could likely benefit from skills that they do not realize exist. We define artifact-based contexts as contexts where programmers have a goal project, like an application or game, which they must figure out how to accomplish and can change along the way. Unlike task-based contexts, artifact-based contexts do not have quantifiable goal states, such as the solution to a puzzle or the resolution of a bug. Currently, programmers in artifact-based contexts have to seek out information, but may be unaware of useful information or choose not to seek out new skills. This is especially problematic for young novice programmers in blocks programming environments, which often lack even minimal in-context support, such as auto-complete or in-context documentation. Novices programming independently in these environments often plateau in the programming skills and API methods they use. This work aims to encourage novices in artifact-based programming contexts to explore new API methods and skills. One way to support novices may be with examples, as examples are effective for learning and highly available. In order to better understand how to use examples for supporting novice programmers, I first ran two studies exploring novices' use of and focus on example code. I used those results to design a system called the Example Guru. The Example Guru suggests example snippets to novice programmers that contain previously unused API methods or code concepts. Finally, I present an approach for semi-automatically generating content for this type of suggestion system. This approach reduces the amount of expert effort required to create suggestions.
This work contains three contributions: 1) a better understanding of the difficulties novices have using example code, 2) a system that encourages exploration and use of new programming skills, and 3) an approach for generating content for a suggestion system with less expert effort.
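The core idea of suggesting examples that contain previously unused API methods can be illustrated with a minimal sketch. This is not the Example Guru's actual implementation: the example names, the API method names, and the ranking rule (most new methods first) are all assumptions made for illustration.

```python
# Hypothetical example library: example name -> API methods it demonstrates.
EXAMPLES = {
    "spin": {"turn", "doInOrder"},
    "walk": {"move", "say"},
    "fade": {"setOpacity", "doTogether", "move"},
}


def suggest_examples(used_methods, examples, limit=2):
    """Return (example, new_methods) pairs for examples that contain at
    least one API method the programmer has not used yet.

    Ranking by number of new methods is an assumption of this sketch,
    not something stated in the abstract.
    """
    candidates = []
    for name, methods in examples.items():
        new_methods = sorted(methods - used_methods)
        if new_methods:  # skip examples that teach nothing new
            candidates.append((name, new_methods))
    # Most new methods first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-len(item[1]), item[0]))
    return candidates[:limit]


if __name__ == "__main__":
    for name, new in suggest_examples({"move", "say"}, EXAMPLES):
        print(name, new)
```

A programmer who has only used `move` and `say` would be offered `fade` and `spin`, while `walk` is skipped because it contains nothing new.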

    Relevance, Rhetoric, and Argumentation: A Cross-Disciplinary Inquiry into Patterns of Thinking and Information Structuring

    This dissertation research is a multidisciplinary inquiry into topicality, involving an in-depth examination of literatures and empirical data and an inductive development of a faceted typology (containing 227 fine-grained topical relevance relationships and 33 types of presentation relationship). This inquiry investigates a large variety of topical connections beyond topic matching, takes a closer look at the structure of a topic, achieves an enriched understanding of topicality and relevance, and induces a cohesive topic-oriented information architecture that is meaningful across topics and domains. The findings from the analysis contribute to the foundational work of information organization, intellectual access / information retrieval, and knowledge discovery. Using qualitative content analysis, the inquiry focuses on meaning and deep structure. Phase 1: develop a unified theory-grounded typology of topical relevance relationships through close reading of literature and synthesis of thinking from communication, rhetoric, cognitive psychology, education, information science, argumentation, logic, law, medicine, and art history. Phase 2: in-depth qualitative analysis of empirical relevance datasets in oral history, clinical question answering, and art image tagging, to examine manifestations of the theory-grounded typology in various contexts and to further refine the typology; the three relevance datasets were chosen to achieve variation in form, domain, and context. The typology of topical relevance relationships is structured with three major facets: Functional role: the role a piece of information plays in the overall structure of a topic or an argument; Mode of reasoning: how information contributes to the user's reasoning about a topic; Semantic relationship: how information connects to a topic semantically. This inquiry demonstrated that topical relevance, with its close linkage to thinking and reasoning, is central to many disciplines.
The multidisciplinary approach allows synthesis and examination from new angles, leading to an integrated scheme of relevance relationships or a system of thinking that informs each individual discipline. The scheme resulting from the synthesis can be used to improve text and image understanding, knowledge organization and retrieval, reasoning, argumentation, and thinking in general, by people and machines.

    Variations in expertise: Implications for the design of assistance systems

    After a critical review of studies on expertise, the paper presents an investigation of the differences between two experts in the same domain (the design of procedures for manufacturing parts). The observed differences concern the comparisons the experts make between domain objects, their justifications of the rules they adopt (technical vs. pragmatic justifications, naïve physics reasoning), and their categorical knowledge (the logic, level, and extension of the categorization). Differences are attributed to the prior experience of the two experts (workshop vs. laboratory). Implications for knowledge elicitation and for the design of assistance tools are presented.

    Vector representation of Internet domain names using Word embedding techniques

    Word embeddings are a well-known set of techniques widely used in natural language processing (NLP). This thesis explores the use of word embeddings in a new scenario. A vector space model (VSM) for Internet domain names (DNS) is created by taking core ideas from NLP techniques and applying them to real anonymized DNS log queries from a large Internet Service Provider (ISP). The main goal is to find semantically similar domains using only information from DNS queries, without any other knowledge about the content of those domains. A set of transformations, organized as a detailed preprocessing pipeline with eight specific steps, is defined to move the original problem to a problem in the NLP field. Once the preprocessing pipeline is applied and the DNS log files are transformed into a standard text corpus, we show that state-of-the-art techniques for word embeddings can be successfully applied in order to build what we call a DNS-VSM (a vector space model for Internet domain names). Different word embedding techniques are evaluated in this work: Word2Vec (with Skip-Gram and CBOW architectures), App2Vec (with a CBOW architecture and adding time gaps between DNS queries), and FastText (which includes sub-word information). The obtained results are compared using various metrics from Information Retrieval theory, and the quality of the learned vectors is validated against a third-party source, namely the similar-sites service offered by Alexa Internet, Inc. Due to intrinsic characteristics of domain names, we found that FastText is the best option for building a vector space model for DNS. Furthermore, its performance (considering the top 3 most similar learned vectors for each domain) is compared against two baseline methods: Random Guessing (returning any domain name from the dataset at random) and Zero Rule (always returning the same most popular domains), outperforming both of them considerably.
The results presented in this work can be useful in many engineering activities, with practical application in many areas. Some examples include website recommendations based on similar sites, competitive analysis, identification of fraudulent or risky sites, parental-control systems, UX improvements (based on recommendations, spell correction, etc.), click-stream analysis, representation and clustering of users' navigation profiles, and optimization of cache systems in recursive DNS resolvers, among others. Finally, as a contribution to the research community, a set of DNS-VSM vectors trained on a dataset similar to the one used in this thesis is released and made available for download through the github page in [1]. With this we hope that further work and research can be done using these vectors.
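The idea of turning DNS logs into a text corpus can be illustrated with a minimal sketch. It is not the thesis's eight-step pipeline, whose details the abstract does not give. The log format, the `max_gap` threshold, and the domain names are hypothetical, and plain co-occurrence counts stand in for learned embeddings such as Word2Vec or FastText.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical records: (client_id, unix_timestamp, queried_domain).
LOGS = [
    ("c1", 0, "news.example"), ("c1", 5, "sports.example"),
    ("c1", 9, "weather.example"), ("c2", 2, "news.example"),
    ("c2", 6, "sports.example"), ("c3", 1, "bank.example"),
    ("c3", 4, "pay.example"), ("c1", 900, "bank.example"),
    ("c1", 905, "pay.example"),
]


def build_sentences(logs, max_gap=60):
    """Group each client's queries into 'sentences' of domain tokens,
    starting a new sentence when two consecutive queries are more than
    max_gap seconds apart (an assumed threshold)."""
    by_client = defaultdict(list)
    for client, ts, domain in logs:
        by_client[client].append((ts, domain))
    sentences = []
    for client in sorted(by_client):
        current, last_ts = [], None
        for ts, domain in sorted(by_client[client]):
            if last_ts is not None and ts - last_ts > max_gap:
                sentences.append(current)
                current = []
            current.append(domain)
            last_ts = ts
        if current:
            sentences.append(current)
    return sentences


def cooccurrence_vectors(sentences, window=2):
    """Represent each domain by counts of the domains seen near it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, target in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vectors[target][sent[j]] += 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(vectors, domain, k=3):
    """Return the k domains whose vectors are closest to `domain`."""
    target = vectors.get(domain, Counter())
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != domain]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored[:k]]


if __name__ == "__main__":
    vecs = cooccurrence_vectors(build_sentences(LOGS))
    print(most_similar(vecs, "news.example", k=2))
```

In practice the sentences produced by a step like `build_sentences` would be handed to an embedding library rather than to raw co-occurrence counts. The counts are used here only to keep the sketch free of dependencies.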