1,188 research outputs found

    Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art

    Information extraction, data integration, and uncertain data management are distinct research areas that have received considerable attention over the last two decades. Many studies have tackled these areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information. Handling uncertainty in the extraction and integration process is an important issue for enhancing the quality of the data in such integrated systems. This article presents the state of the art in these areas, identifies their common ground, and shows how to integrate information extraction and data integration under the umbrella of uncertainty management.

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions
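
    As a rough illustration of the indexing and matching steps mentioned in the abstract above, the following minimal sketch groups entity descriptions into blocks by shared tokens and then compares only the pairs that share a block. It is not taken from the survey: token blocking, Jaccard similarity, the 0.5 threshold, the function names, and the sample records are all assumptions chosen for illustration.

```python
from itertools import combinations


def tokens(description):
    """Lowercased word tokens of one entity description."""
    return set(description.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def token_blocking(descriptions):
    """Indexing step: map each token to the ids of descriptions containing it."""
    blocks = {}
    for idx, text in enumerate(descriptions):
        for tok in tokens(text):
            blocks.setdefault(tok, set()).add(idx)
    return blocks


def resolve(descriptions, threshold=0.5):
    """Matching step: compare only pairs that co-occur in at least one block."""
    candidates = set()
    for ids in token_blocking(descriptions).values():
        candidates.update(combinations(sorted(ids), 2))
    return sorted(
        (i, j)
        for i, j in candidates
        if jaccard(tokens(descriptions[i]), tokens(descriptions[j])) >= threshold
    )


# Hypothetical sample descriptions, for illustration only.
records = [
    "Stanley Kubrick director",
    "kubrick stanley film director",
    "Ridley Scott director",
]
matches = resolve(records)
```

    Here `matches` holds the index pairs judged to describe the same entity. Blocking avoids comparing pairs that share no token at all, which is the kind of cost reduction an indexing step is meant to provide at scale.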

    Application of decision trees and multivariate regression trees in design and optimization

    Induction of decision trees and regression trees is a powerful technique not only for performing ordinary classification and regression analysis but also for discovering the often complex knowledge which describes the input-output behavior of a learning system in qualitative forms. In the area of classification (discrimination analysis), a new technique called IDea is presented for performing incremental learning with decision trees. It is demonstrated that IDea's incremental learning can greatly reduce the spatial complexity of a given set of training examples. Furthermore, it is shown that this reduction in complexity can also be used as an effective tool for improving the learning efficiency of other types of inductive learners such as standard backpropagation neural networks. In the area of regression analysis, a new methodology for performing multiobjective optimization has been developed. Specifically, we demonstrate that multiple-objective optimization through induction of multivariate regression trees is a powerful alternative to the conventional vector optimization techniques. Furthermore, in an attempt to investigate the effect of various types of splitting rules on the overall performance of the optimizing system, we present a tree partitioning algorithm which utilizes a number of techniques derived from diverse fields of statistics and fuzzy logic. These include: two multivariate statistical approaches based on dispersion matrices, an information-theoretic measure of covariance complexity which is typically used for obtaining multivariate linear models, two newly formulated fuzzy splitting rules based on Pearson's parametric and Kendall's nonparametric measures of association, Bellman and Zadeh's fuzzy decision-maximizing approach within an inductive framework, and finally, the multidimensional extension of a widely used fuzzy entropy measure. The advantages of this new approach to optimization are highlighted by presenting three examples which respectively deal with the design of a three-bar truss, a beam, and an electric discharge machining (EDM) process.


    Human cognition is exciting: it is a mash-up of several neural phenomena that drive our ability to constantly reason and make inferences about the surrounding world. In cognitive computer science, Commonsense Reasoning is the term given to our ability to infer uncertain events and reason about Cognitive Knowledge. Introducing Commonsense into intelligent systems has been desired for years, but the mechanism for doing so remains a scientific puzzle. Some implicitly believe that language understanding is enough to achieve some level of Commonsense [90]. On less common ground, others think that enriching language with Knowledge Graphs might be enough for human-like reasoning [63], while still others believe that human-like reasoning can only be truly captured with symbolic rules and logical deduction powered by Knowledge Bases, such as taxonomies and ontologies [50]. We focus on integrating Commonsense Knowledge into Language Models, because we believe that this integration is a step towards a beneficial embedding of Commonsense Reasoning in interactive Intelligent Systems, such as conversational assistants. Conversational assistants, such as Alexa from Amazon, are user-driven systems. A more human-like interaction is therefore strongly desired in order to capture the user's attention and empathy. We believe that such humanistic characteristics can be achieved by introducing stronger Commonsense Knowledge and Reasoning, so as to engage fruitfully with users. To this end, we introduce a new family of models, the Relation-Aware BART (RA-BART), which combines the language generation abilities of BART [51] with explicit Commonsense Knowledge extracted from Commonsense Knowledge Graphs to further extend the human-like capabilities of these models. We evaluate our model on three different tasks: Abstractive Question Answering, Text Generation conditioned on certain concepts, and a Multi-Choice Question Answering task. We find that, on generation tasks, RA-BART outperforms models without knowledge enrichment; however, it underperforms on the multi-choice question answering task. Our project can be consulted in our open-source, public GitHub repository (Explicit Commonsense).


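    The full-text data was converted to PDF from image files created through the National Diet Library's fiscal year 2010 (Heisei 22) digitization of doctoral dissertations. Kyoto University, 0048; new system, doctorate by dissertation; Doctor of Engineering; Otsu No. 8652; Ron-Ko-Haku No. 2893; call number 新制||工||968 (University Library); UT51-94-R411. Examiners: Professor Makoto Nagao (chief examiner), Professor Shuji Doshita, Professor Katsuo Ikeda. Conferred under Article 4, Paragraph 2 of the Degree Regulations. Doctor of Engineering, Kyoto University. DFA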

    Concept drift learning and its application to adaptive information filtering

    Tracking the evolution of user interests is an instance of the concept drift learning problem. Keeping track of multiple interest categories is a natural phenomenon as well as an interesting tracking problem, because interests can emerge and diminish over different time frames. The first part of this dissertation presents the Multiple Three-Descriptor Representation (MTDR) algorithm, a novel algorithm for learning concept drift that is built specifically for tracking the dynamics of multiple target concepts in the information filtering domain. The learning process of the algorithm combines long-term and short-term interest (concept) models in an attempt to benefit from the strengths of both. The MTDR algorithm improves over existing concept drift learning algorithms in the domain. Being able to track multiple target concepts with only a few examples poses an even more important and challenging problem, because casual users tend to be reluctant to provide the examples needed, and learning from few labeled examples is generally difficult. The second part presents a computational Framework for Extending Incomplete Labeled Data Stream (FEILDS). The system modularly extends the capability of an existing concept drift learner to deal with an incompletely labeled data stream. It expands the learner's original input stream with relevant unlabeled data; the process generates a new stream with improved learnability. FEILDS employs a concept formation system to organize its input stream into a concept (cluster) hierarchy. The system uses the concept and cluster hierarchy to identify the instance's concept and the unlabeled data relevant to a concept. It also adopts the persistence assumption from temporal reasoning to infer the relevance of concepts. Empirical evaluation indicates that FEILDS is able to improve the performance of existing learners, particularly when learning from a stream with few labeled examples. Lastly, a new concept formation algorithm, one of the key components of the FEILDS architecture, is presented. The main idea is to discover intrinsic hierarchical structures regardless of the class distribution and the shape of the input stream. Experimental evaluation shows that the algorithm is relatively robust to input ordering, consistently producing a hierarchy of high quality.

    Computer integrated documentation

    The main technical issues of the Computer Integrated Documentation (CID) project are presented. The problem of automating document management and maintenance is analyzed from both an artificial intelligence viewpoint and a human factors viewpoint. Possible technologies for CID are reviewed: conventional approaches to indexing and information retrieval; hypertext; and knowledge-based systems. A particular effort was made to provide an appropriate representation for contextual knowledge. This representation is used to generate context on hypertext links. Thus, indexing in CID is context sensitive. The implementation of the current version of CID is described. It includes a hypertext database, a knowledge-based management and maintenance system, and a user interface. A series of theoretical considerations is also presented, covering navigation in hyperspace, acquisition of indexing knowledge, generation and maintenance of large documentation, and relation to other work.

    Type-driven Synthesis of Evolving Data Mode

    Modern commercial software is often framed under the umbrella of data-centric applications. Data-centric applications define data as the main and permanent asset. These applications use a single data model for application functionality, data management, and analytical activities, which is built before the applications. Moreover, since applications are temporary, in contrast to data, there is a need to continuously evolve and change the data schema to accommodate new functionality. In this sense, the continuously evolving (rich) feature set that is expected of state-of-the-art applications is intrinsically bound not only by the amount of available data but also by its structure, its internal dependencies, and the ability to transparently and uniformly grow and evolve data representations and their properties on the fly. The GOLEM project aims to produce new methods of program automation integrated in the development of data-centric applications in low-code frameworks. In this context, one of the key targets for automation is the data layer itself, encompassing the data layout and its integrity constraints, as well as validation and access control rules. The aim of this dissertation, which is integrated in GOLEM, is to develop a synthesis framework that, based on high-level specifications, correctly defines and evolves a rich data layer component by means of high-level operations. The construction of the framework was approached by defining a specification language to express richly typed specifications, a target language that is the goal of synthesis, and a type-directed synthesis procedure based on proof-search concepts. The range of real database operations the framework is able to synthesize is demonstrated through a case study. In a component-based synthesis style, with an extensible library of base operations on database tables (specified using the target language) in context, the case study shows that the synthesis framework is capable of expressing and solving a wide variety of data schema creation and evolution problems.