Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art
Information extraction, data integration, and uncertain data management are distinct research areas that have received considerable attention over the last two decades. Most research has tackled these areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information, and handling uncertainty in the extraction and integration processes is an important issue for enhancing data quality in such integrated systems. This article presents the state of the art of these research areas, identifies their common ground, and shows how to integrate information extraction and data integration under the cover of uncertainty management.
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.
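The indexing-then-matching structure of an ER workflow, as surveyed above, can be sketched in a few lines. This is an illustrative toy, not any specific method from the survey: the records, blocking key, and similarity threshold are all invented for the example.

```python
from difflib import SequenceMatcher

# Toy records: entity resolution must link rows that describe the same
# real-world person despite surface variation in the name.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Boston"},
    {"id": 2, "name": "Jon Smith",      "city": "Boston"},
    {"id": 3, "name": "Maria Garcia",   "city": "Madrid"},
    {"id": 4, "name": "M. Garcia",      "city": "Madrid"},
]

def blocking_key(rec):
    # Indexing step: group candidates by last-name initial + city,
    # avoiding the quadratic all-pairs comparison on large data.
    return (rec["name"].split()[-1][0].lower(), rec["city"].lower())

def match(a, b, threshold=0.6):
    # Matching step: fuzzy name similarity, compared only within a block.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= threshold

blocks = {}
for rec in records:
    blocks.setdefault(blocking_key(rec), []).append(rec)

pairs = []
for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if match(block[i], block[j]):
                pairs.append((block[i]["id"], block[j]["id"]))

print(pairs)
```

Real systems replace both steps with far more sophisticated blocking schemes and learned matchers, but the two-phase shape is the same.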
Application of decision trees and multivariate regression trees in design and optimization
Induction of decision trees and regression trees is a powerful technique not only for performing ordinary classification and regression analysis but also for discovering the often complex knowledge which describes the input-output behavior of a learning system in qualitative form.

In the area of classification (discriminant analysis), a new technique called IDea is presented for performing incremental learning with decision trees. It is demonstrated that IDea's incremental learning can greatly reduce the spatial complexity of a given set of training examples. Furthermore, it is shown that this reduction in complexity can also be used as an effective tool for improving the learning efficiency of other types of inductive learners, such as standard backpropagation neural networks.

In the area of regression analysis, a new methodology for performing multiobjective optimization has been developed. Specifically, we demonstrate that multiple-objective optimization through induction of multivariate regression trees is a powerful alternative to conventional vector optimization techniques. Furthermore, in an attempt to investigate the effect of various types of splitting rules on the overall performance of the optimizing system, we present a tree partitioning algorithm which utilizes a number of techniques derived from diverse fields of statistics and fuzzy logic. These include: two multivariate statistical approaches based on dispersion matrices; an information-theoretic measure of covariance complexity, typically used for obtaining multivariate linear models; two newly formulated fuzzy splitting rules based on Pearson's parametric and Kendall's nonparametric measures of association; Bellman and Zadeh's fuzzy decision-maximizing approach within an inductive framework; and finally, the multidimensional extension of a widely used fuzzy entropy measure.

The advantages of this new approach to optimization are highlighted by presenting three examples which respectively deal with the design of a three-bar truss, a beam, and an electric discharge machining (EDM) process.
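To make the splitting-rule idea concrete, here is a minimal univariate sketch that selects a regression-tree split by variance reduction, a simple scalar analogue of the dispersion-matrix criteria mentioned above. The data and the criterion are invented for illustration; this is not the paper's IDea algorithm or its fuzzy splitting rules.

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    # Try each midpoint between consecutive sorted x values and keep the
    # threshold minimising the weighted child variance (i.e. the split
    # giving the largest variance reduction).
    best_t, best_score = None, float("inf")
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        t = (xs[order[k - 1]] + xs[order[k]]) / 2
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(xs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Two clear clusters in x; the best split falls between them.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]
print(best_split(xs, ys))
```

A multivariate regression tree replaces the scalar variance with a dispersion measure over the response vector, but the search over candidate thresholds has the same shape.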
Talk Commonsense to Me! Enriching Language Models with Commonsense Knowledge
Human cognition is fascinating: it is a mesh of several neural phenomena that drive our ability to constantly reason and infer about the surrounding world. In cognitive computer science, Commonsense Reasoning is the term given to our ability to infer uncertain events and reason about cognitive knowledge. Introducing Commonsense into intelligent systems has long been desired, but the mechanism for this introduction remains a scientific jigsaw. Some implicitly believe that language understanding is enough to achieve some level of Commonsense [90]. On less common ground, others think that enriching language with Knowledge Graphs might be enough for human-like reasoning [63], while still others believe that human-like reasoning can only be truly captured with symbolic rules and logical deduction powered by Knowledge Bases, such as taxonomies and ontologies [50]. We focus on integrating Commonsense Knowledge into Language Models, because we believe this integration is a step towards a beneficial embedding of Commonsense Reasoning in interactive intelligent systems, such as conversational assistants.

Conversational assistants, such as Amazon's Alexa, are user-driven systems. A more human-like interaction is therefore strongly desired, to truly capture the user's attention and empathy. We believe that such humanistic characteristics can be achieved through the introduction of stronger Commonsense Knowledge and Reasoning, so as to engage fruitfully with users.

To this end, we introduce a new family of models, Relation-Aware BART (RA-BART), which leverages the language generation abilities of BART [51] together with explicit Commonsense Knowledge extracted from Commonsense Knowledge Graphs, in order to further extend the capabilities of these models.

We evaluate our model on three different tasks: abstractive question answering, text generation conditioned on given concepts, and a multiple-choice question answering task. We find that, on generation tasks, RA-BART outperforms models without knowledge enrichment; however, it underperforms on the multiple-choice question answering task.
Our project can be consulted in our open-source, public GitHub repository (Explicit Commonsense).
On Sentence Analysis Techniques in Speech Translation (音声翻訳における文解析技法について)
Doctoral dissertation (Doctor of Engineering), Kyoto University. Examining committee: Professors Makoto Nagao (長尾 真), Shuji Doshita (堂下 修司), and Katsuo Ikeda (池田 克夫).
Concept drift learning and its application to adaptive information filtering
Tracking the evolution of user interests is a problem instance of concept drift learning. Keeping track of multiple interest categories is a natural phenomenon as well as an interesting tracking problem because interests can emerge and diminish at different time frames. The first part of this dissertation presents a Multiple Three-Descriptor Representation (MTDR) algorithm, a novel algorithm for learning concept drift especially built for tracking the dynamics of multiple target concepts in the information filtering domain. The learning process of the algorithm combines the long-term and short-term interest (concept) models in an attempt to benefit from the strength of both models. The MTDR algorithm improves over existing concept drift learning algorithms in the domain.
Being able to track multiple target concepts from only a few examples poses an even more important and challenging problem, because casual users tend to be reluctant to provide the examples needed, and learning from sparsely labeled data is generally difficult. The second part presents a computational Framework for Extending Incomplete Labeled Data Stream (FEILDS). The system modularly extends the capability of an existing concept drift learner in dealing with an incompletely labeled data stream. It expands the learner's original input stream with relevant unlabeled data; the process generates a new stream with improved learnability. FEILDS employs a concept formation system for organizing its input stream into a concept (cluster) hierarchy. The system uses this hierarchy to identify each instance's concept and the unlabeled data relevant to a concept. It also adopts the persistence assumption from temporal reasoning to infer the relevance of concepts. Empirical evaluation indicates that FEILDS is able to improve the performance of existing learners, particularly when learning from a stream with few labeled data.

Lastly, a new concept formation algorithm, one of the key components in the FEILDS architecture, is presented. The main idea is to discover intrinsic hierarchical structures regardless of the class distribution and the shape of the input stream. Experimental evaluation shows that the algorithm is relatively robust to input ordering, consistently producing a hierarchy of high quality.
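The long-term/short-term combination idea behind interest tracking can be sketched minimally. This is not the MTDR algorithm itself: the blend weights, window size, and data are invented for illustration.

```python
from collections import deque

class DriftingInterest:
    """Blend a long-term interest model with a short-term one."""

    def __init__(self, window=3):
        self.count = 0
        self.long_term = 0.0               # running mean over the whole stream
        self.recent = deque(maxlen=window) # sliding short-term window

    def update(self, relevance):
        self.count += 1
        # Incremental mean update for the long-term model.
        self.long_term += (relevance - self.long_term) / self.count
        self.recent.append(relevance)

    def score(self):
        # Weight the short-term view more heavily so recent drift shows up
        # quickly, while the long-term view keeps stable interests alive.
        short = sum(self.recent) / len(self.recent)
        return 0.3 * self.long_term + 0.7 * short

m = DriftingInterest()
for r in [1, 1, 1, 1, 0, 0, 0]:  # the user's interest in a topic fades
    m.update(r)
print(round(m.score(), 3))
```

After the interest fades, the blended score drops sharply even though the long-term mean is still above one half, which is the qualitative behavior a drift tracker needs.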
A study of instance-based algorithms for supervised learning tasks: mathematical, empirical, and psychological evaluations
This dissertation introduces a framework for specifying instance-based algorithms that can solve supervised learning tasks. These algorithms input a sequence of instances and yield a partial concept description, which is represented by a set of stored instances and associated information. This description can be used to predict values for subsequently presented instances. The thesis of this framework is that extensional concept descriptions and lazy generalization strategies can support efficient supervised learning behavior.

The instance-based learning framework consists of three components. The pre-processor component transforms an instance into a more palatable form for the performance component, which computes the instance's similarity to a set of stored instances and yields a prediction for its target value(s). The similarity and prediction functions thereby impose generalizations on the stored instances to derive predictions inductively. The learning component assesses the accuracy of these predictions and updates partial concept descriptions to improve their predictive accuracy.

This framework is evaluated in four ways. First, its generality is evaluated by mathematically determining the classes of symbolic concepts and numeric functions that can be closely approximated by IB_1, a simple algorithm specified by this framework. Second, the framework is empirically evaluated for its ability to specify algorithms that improve IB_1's learning efficiency. Significant efficiency improvements are obtained by instance-based algorithms that reduce storage requirements, tolerate noisy data, and learn domain-specific similarity functions, respectively. Alternative component definitions for these algorithms are empirically analyzed in a set of five high-level parameter studies. Third, the framework is evaluated for its ability to specify psychologically plausible process models for categorization tasks. Results from subject experiments indicate a positive correlation between a model's ability to utilize attribute correlation information and its ability to explain psychological phenomena. Finally, the framework is evaluated for its ability to explain and relate a dozen prominent instance-based learning systems. The survey shows that this framework requires only slight modifications to fit these highly diverse systems. Relationships with edited nearest neighbor algorithms, case-based reasoners, and artificial neural networks are also described.
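The core of the IB_1 idea, storing every training instance and predicting from the single most similar stored one, fits in a few lines. The data and similarity function here are invented toys, not the dissertation's experimental setup.

```python
def similarity(a, b):
    # Negative Euclidean distance: higher means more similar.
    return -sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class IB1:
    def __init__(self):
        self.stored = []  # the extensional concept description

    def train(self, instance, label):
        # Lazy generalization: just remember the instance, no model fitting.
        self.stored.append((instance, label))

    def predict(self, instance):
        # Return the label of the most similar stored instance.
        _, label = max(self.stored, key=lambda s: similarity(s[0], instance))
        return label

clf = IB1()
for inst, lab in [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B")]:
    clf.train(inst, lab)
print(clf.predict((1, 1)))
```

The framework's other components (pre-processing, noise tolerance, learned similarity functions) would wrap or replace `similarity` and the storage policy, but the performance component above is the skeleton.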
Computer integrated documentation
The main technical issues of the Computer Integrated Documentation (CID) project are presented. The problem of automating document management and maintenance is analyzed both from an artificial intelligence viewpoint and from a human factors viewpoint. Possible technologies for CID are reviewed: conventional approaches to indexing and information retrieval; hypertext; and knowledge-based systems. A particular effort was made to provide an appropriate representation for contextual knowledge. This representation is used to generate context on hypertext links; thus, indexing in CID is context sensitive. The implementation of the current version of CID is described. It includes a hypertext database, a knowledge-based management and maintenance system, and a user interface. A series of theoretical considerations is also presented, covering navigation in hyperspace, acquisition of indexing knowledge, generation and maintenance of large documentation, and relations to other work.
Type-driven Synthesis of Evolving Data Models
Modern commercial software is often framed under the umbrella of data-centric applications.
Data-centric applications define data as the main and permanent asset. These
applications use a single data model for application functionality, data management, and
analytical activities, a model that is built before the applications themselves.
Moreover, since applications are temporary, in contrast to data, there is the need to
continuously evolve and change the data schema to accommodate new functionality. In
this sense, the continuously evolving (rich) feature set that is expected of state-of-the-art
applications is intrinsically bound by not only the amount of available data but also by
its structure, its internal dependencies, and by the ability to transparently and uniformly
grow and evolve data representations and their properties on the fly.
The GOLEM project aims to produce new methods of program automation integrated
in the development of data-centric applications in low-code frameworks. In this context,
one of the key targets for automation is the data layer itself, encompassing the data layout
and its integrity constraints, as well as validation and access control rules.
The aim of this dissertation, which is integrated in GOLEM, is to develop a synthesis
framework that, based on high-level specifications, correctly defines and evolves a
rich data layer component by means of high-level operations. The construction of the
framework was approached by defining a specification language to express richly-typed
specifications, a target language which is the goal of synthesis and a type-directed synthesis
procedure based on proof-search concepts.
The range of real database operations the framework is able to synthesize is demonstrated
through a case study. In a component-based synthesis style, with an extensible
library of base operations on database tables (specified using the target language) in context,
the case study shows that the synthesis framework is capable of expressing and
solving a wide variety of data schema creation and evolution problems.
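The kind of high-level schema-evolution operation such a framework targets can be illustrated with a hypothetical "add column" step. The schema representation, function name, and default-value migration policy below are invented for illustration; they are not the dissertation's specification or target language.

```python
def add_column(schema, rows, name, col_type, default):
    """Evolve a table schema and migrate existing rows in one step.

    The operation is pure: it returns a new schema and new rows,
    leaving the originals untouched, so evolutions compose safely.
    """
    new_schema = dict(schema, **{name: col_type})
    new_rows = [dict(row, **{name: default}) for row in rows]
    return new_schema, new_rows

# A tiny "users" table before the evolution.
users_schema = {"id": "int", "name": "string"}
users_rows = [{"id": 1, "name": "Ada"}]

# Evolve: add an "email" column, backfilling existing rows.
schema2, rows2 = add_column(users_schema, users_rows, "email", "string", "")
print(schema2)
print(rows2)
```

A type-directed synthesizer would derive the call to such an operation (and its migration argument) automatically from a richly typed specification of the desired target schema, rather than having the developer write it by hand.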