8 research outputs found

    Automating Text Encapsulation Using Deep Learning

    Data is important in every form, be it communication, reviews, news articles, social media, machine data, or real-time data. With the emergence of Covid-19, a pandemic unlike any seen in recent times, information pours in from all directions on the internet, and it is often overwhelming to determine which data to read and follow. Another crucial task is separating factual data from the distorted data circulating widely. The title or short description of this data plays a key role: such descriptions can deceive a user with unwanted information. The user is then likely to share this information with colleagues and family, and if they too are unaware, the false information can spread like wildfire. Deep learning models can play a vital role in automatically encapsulating a description and providing an accurate overview, which the end user can then use to decide whether that piece of information should be consumed. This research presents an efficient deep learning model for automating text encapsulation and compares it with existing systems in terms of data, features, and points of failure. It aims to condense text excerpts more accurately.
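    As an illustration of the encapsulation step, the sketch below condenses a long description into a short overview with a generic pretrained abstractive summarizer. This is a minimal sketch, not the authors' system: the paper does not specify its architecture here, and the model checkpoint and length bounds below are assumptions.

```python
# A minimal sketch of automated text encapsulation with a pretrained
# abstractive summarizer (Hugging Face transformers). The model choice
# and length bounds are assumptions, not the paper's own system.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

description = (
    "With the emergence of Covid-19, information pours in from all "
    "directions on the internet, and it is overwhelming to determine "
    "which data to read and follow, or which headlines are factual."
)

# Bound the generated overview so it stays a short, consumable summary.
result = summarizer(description, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```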

    Exploiting general-purpose background knowledge for automated schema matching

    The schema matching task is an integral part of the data integration process and is usually its first step. Schema matching is typically complex and time-consuming, and it is therefore, for the most part, carried out by humans. One reason for the low degree of automation is that schemas are often defined with deep background knowledge that is not itself present within the schemas. Overcoming this missing background knowledge is a core challenge in automating the data integration process. In this dissertation, the task of matching semantic models, so-called ontologies, with the help of external background knowledge is investigated in depth in Part I. Throughout the thesis, the focus lies on large, general-purpose resources, since domain-specific resources are rarely available for most domains. Besides new knowledge resources, the thesis also explores new strategies for exploiting such resources. A technical base for the development and comparison of matching systems is presented in Part II. The framework introduced there allows for simple, modularized matcher development (with background knowledge sources) and for extensive evaluation of matching systems. Among the largest structured sources of general-purpose background knowledge are knowledge graphs, which have grown significantly in size in recent years. However, exploiting such graphs is not trivial. In Part III, knowledge graph embeddings are explored, analyzed, and compared, and multiple improvements to existing approaches are presented. In Part IV, numerous concrete matching systems that exploit general-purpose background knowledge are presented, and exploitation strategies and resources are analyzed and compared. The dissertation closes with a perspective on real-world applications.
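    One concrete exploitation strategy of the kind studied here is to propose correspondences between schema elements whose labels are close in a shared embedding space. The sketch below is a minimal illustration under that assumption; the embedding table is invented for the example and stands in for a real general-purpose resource such as pretrained knowledge graph embeddings.

```python
# A minimal sketch of embedding-based schema matching: labels from two
# schemas are compared in a shared vector space, and pairs above a
# similarity threshold become candidate correspondences. The vectors
# below are invented placeholders for a real general-purpose resource.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical pretrained label embeddings.
EMB = {
    "Author":   np.array([0.90, 0.10, 0.05]),
    "Writer":   np.array([0.85, 0.15, 0.10]),
    "Document": np.array([0.10, 0.90, 0.20]),
    "Paper":    np.array([0.15, 0.85, 0.25]),
}

def match(source, target, threshold=0.95):
    """Return candidate correspondences (source, target, similarity)."""
    return [
        (s, t, round(cosine(EMB[s], EMB[t]), 3))
        for s in source
        for t in target
        if cosine(EMB[s], EMB[t]) >= threshold
    ]

print(match(["Author", "Document"], ["Writer", "Paper"]))
# keeps the cross-schema pairs (Author, Writer) and (Document, Paper)
```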

    Integrating VerbNet into a deep realizer

    The goal of natural language generation (NLG) is to produce understandable text in natural language from non-linguistic data. Generation essentially consists of two tasks: first, determining the content of the message to transmit, and second, selecting the words and syntactic constructions that will convey it; this second task is called linguistic realization. To generate text that is as natural as possible, an NLG system must be equipped with rich lexical resources. Maximum flexibility in realization requires access to the combinatorial properties of the lexical units of a given language. Because verbs are at the core of each utterance and generally control the structure of the sentence, their properties should be encoded so that generated text can exploit the full richness of a language. Moreover, verbs have unpredictable combinatorial properties, which is why they must be stored in a dictionary. This thesis concerns the integration of VerbNet, a rich lexical resource describing English verbs and their syntactic behavior, into GenDR, a deep realizer based on Meaning-Text theory. To carry out this implementation, we used the Python programming language to extract VerbNet's data and adapt it to GenDR. In this way, we integrated 274 syntactic frames and 6,393 English verbs into GenDR.
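    A minimal sketch of the extraction step is shown below, assuming the standard VerbNet XML distribution, in which a class file carries its member verbs as MEMBER elements and its syntactic frames as DESCRIPTION elements; the file path and printed output are illustrative, and this is not the thesis's actual extraction script.

```python
# A minimal sketch of extracting member verbs and syntactic frames from
# one VerbNet class file using Python's standard library. Element and
# attribute names (VNCLASS, MEMBER/@name, DESCRIPTION/@primary) follow
# the VerbNet XML distribution; the file path is an assumption.
import xml.etree.ElementTree as ET

def extract_class(path):
    """Return (verbs, frames) for one VerbNet class file."""
    root = ET.parse(path).getroot()   # e.g. <VNCLASS ID="give-13.1">
    verbs = [m.get("name") for m in root.iter("MEMBER")]
    frames = [d.get("primary") for d in root.iter("DESCRIPTION")]
    return verbs, frames

verbs, frames = extract_class("verbnet/give-13.1.xml")
print(verbs)   # e.g. ['give', 'lend', ...]
print(frames)  # e.g. ['NP V NP PP.recipient', ...]
```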

    Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora

    The amount of information available through the Internet has grown significantly over the last decade. This information comes from various sources, such as scientific experiments in particle accelerators, flight data recorded from commercial aircraft, or sets of documents from a given domain, such as medical articles, newspaper headlines, or social network content. Due to the volume of data that must be analyzed, search engines need new tools that allow users to obtain the desired information in a timely and accurate manner. One approach is to annotate documents with their relevant expressions. Relevant expressions can be extracted from natural language text documents by semantic, syntactic, or statistical techniques. Although the latter tend to be less accurate, they have the advantage of being language-independent. This investigation was performed in the context of LocalMaxs, a statistical and therefore language-independent method capable of extracting relevant expressions from natural language corpora. Due to the large volume of data involved, however, sequential implementations of such techniques have severe limitations in both execution time and memory space. In this thesis we propose a distributed architecture and strategies for parallel implementations of statistical extraction of relevant expressions from large corpora. We developed a methodology for modeling and evaluating those strategies, based on empirical and theoretical approaches to estimating the statistical distribution of n-grams in natural language corpora. These approaches guided the design and evaluation of the parallel and distributed LocalMaxs implementations on cluster and cloud computing platforms. The implementation alternatives were compared with respect to precision and recall, and to performance metrics, namely execution time, parallel speedup, and sizeup. The performance results indicate almost linear speedup and sizeup over the range of large corpus sizes tested.
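    To make the statistical core concrete, the sketch below scores bigrams with the SCP_f glue and keeps those whose glue exceeds that of every trigram containing them, the bigram case of the LocalMaxs criterion. It is a simplified, sequential illustration under stated assumptions: probabilities are normalized by corpus length for brevity, and the thesis's actual contribution, the parallel and distributed implementation, is not shown.

```python
# A simplified, sequential sketch of the statistical core of LocalMaxs.
# Each n-gram gets an SCP_f "glue"; a bigram is kept as a relevant
# expression when its glue exceeds that of every trigram containing it
# (the bigram case of the LocalMaxs criterion). Normalizing all
# probabilities by corpus length is a simplification for brevity.
from collections import Counter

def ngrams(tokens, n):
    return list(zip(*(tokens[i:] for i in range(n))))

def build_counts(tokens, max_n=3):
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}

def scp_f(gram, counts, total):
    """SCP_f glue: p(w1..wn)^2 over the average probability of all
    ways of splitting the n-gram into a left and a right part."""
    n = len(gram)
    p = counts[n][gram] / total
    avp = sum(
        (counts[i][gram[:i]] / total) * (counts[n - i][gram[i:]] / total)
        for i in range(1, n)
    ) / (n - 1)
    return p * p / avp

def relevant_bigrams(tokens):
    counts = build_counts(tokens)
    total = len(tokens)
    kept = []
    for bg in counts[2]:
        glue = scp_f(bg, counts, total)
        supersets = [tg for tg in counts[3] if tg[:2] == bg or tg[1:] == bg]
        if all(glue > scp_f(tg, counts, total) for tg in supersets):
            kept.append((" ".join(bg), round(glue, 4)))
    return sorted(kept, key=lambda pair: -pair[1])

corpus = "the new york times reported from new york city".split()
print(relevant_bigrams(corpus))  # 'new york' scores highest
```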