83 research outputs found

    Entity-Linking via Graph-Distance Minimization

    Get PDF
    Entity-linking is a natural-language-processing task that consists in identifying the entities mentioned in a piece of text, linking each to an appropriate item in some knowledge base; when the knowledge base is Wikipedia, the problem comes to be known as wikification (in this case, items are wikipedia articles). One instance of entity-linking can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between chosen items. Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; nonetheless, under some restrictive assumptions, it turns out to be solvable in linear time. For the general case, we propose two heuristics: one tries to enforce the above assumptions and another one is based on the notion of hitting distance; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset.Comment: In Proceedings GRAPHITE 2014, arXiv:1407.7671. The second and third authors were supported by the EU-FET grant NADINE (GA 288956

    Nanosistemas dendríticos para transporte de fármacos

    Get PDF
    A medida que aumenta la esperanza de vida aparecen nuevos retos en la medicina que requieren tratamientos más complejos capaces de mejorar la solubilidad de los medicamentos, superar barreras biológicas y, desarrollar formulaciones más eficientes y de acción más prolongada. El uso de dendrímeros ha supuesto un importante avance debido a su gran potencial. Por si solos pueden usarse como sistemas de transporte o unidos a otras moléculas, formar parte de sistemas más complejos como micelas o vesículas. Esta tesis doctoral se centrará en la síntesis de dendrímeros y copolímeros de bloque dendrítico, así como en la preparación y caracterización de nanosistemas dendríticos, la encapsulación de fármacos o biomoléculas y el estudio in vitro e in vivo de sus propiedades

    Using graph distances for named-entity linking

    Get PDF
    Entity-linking is a natural-language-processing task that consists in identifying strings of text that refer to a particular item in some reference knowledge base. When the knowledge base is Wikipedia, the problem is also referred to as wikification (in this case, items are Wikipedia articles). Entity-linking consists conceptually of many different phases: identifying the portions of text that may refer to an entity (sometimes called "entity detection"), determining a set of concepts (candidates) from the knowledge base that may match each such portion, and choosing one candidate for each set; the latter step, known as candidate selection, is the phase on which this paper focuses. One instance of candidate selection can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between the selected items. Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; we propose several heuristics trying to optimize similar easier objective functions; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset. Finally, in the appendix, we show an exact linear time algorithm that works under some more restrictive assumptions

    Index compression for information retrielval systems

    Get PDF
    [Abstract] Given the increasing amount of information that is available today, there is a clear need for Information Retrieval (IR) systems that can process this information in an efficient and effective way. Efficient processing means minimising the amount of time and space required to process data, whereas effective processing means identifying accurately which information is relevant to the user and which is not. Traditionally, efficiency and effectiveness are at opposite ends (what is beneficial to efficiency is usually harmful to effectiveness, and vice versa), so the challenge of IR systems is to find a compromise between efficient and effective data processing. This thesis investigates the efficiency of IR systems. It suggests several novel strategies that can render IR systems more efficient by reducing the index size of IR systems, referred to as index compression. The index is the data structure that stores the information handled in the retrieval process. Two different approaches are proposed for index compression, namely document reordering and static index pruning. Both of these approaches exploit document collection characteristics in order to reduce the size of indexes, either by reassigning the document identifiers in the collection in the index, or by selectively discarding information that is less relevant to the retrieval process by pruning the index. The index compression strategies proposed in this thesis can be grouped into two categories: (i) Strategies which extend state of the art in the field of efficiency methods in novel ways. (ii) Strategies which are derived from properties pertaining to the effectiveness of IR systems; these are novel strategies, because they are derived from effectiveness as opposed to efficiency principles, and also because they show that efficiency and effectiveness can be successfully combined for retrieval. The main contributions of this work are in indicating principled extensions of state of the art in index compression, and also in suggesting novel theoretically-driven index compression techniques which are derived from principles of IR effectiveness. All these techniques are evaluated extensively, in thorough experiments involving established datasets and baselines, which allow for a straight-forward comparison with state of the art. Moreover, the optimality of the proposed approaches is addressed from a theoretical perspective.[Resumen] Dada la creciente cantidad de información disponible hoy en día, existe una clara necesidad de sistemas de Recuperación de Información (RI) que sean capaces de procesar esa información de una manera efectiva y eficiente. En este contexto, eficiente significa cantidad de tiempo y espacio requeridos para procesar datos, mientras que efectivo significa identificar de una manera precisa qué información es relevante para el usuario y cual no lo es. Tradicionalmente, eficiencia y efectividad se encuentran en polos opuestos - lo que es beneficioso para la eficiencia, normalmente perjudica la efectividad y viceversa - así que un reto para los sistemas de RI es encontrar un compromiso adecuado entre el procesamiento efectivo y eficiente de los datos. Esta tesis investiga el problema de la eficiencia de los sistemas de RI. Sugiere diferentes estrategias novedosas que pueden permitir la reducción de los índices de los sistemas de RI, enmarcadas dentro da las técnicas conocidas como compresión de índices. El índice es la estructura de datos que almacena la información utilizada en el proceso de recuperación. Se presentan dos aproximaciones diferentes para la compresión de los índices, referidas como reordenación de documentos y pruneado estático del índice. Ambas aproximaciones explotan características de colecciones de documentos para reducir el tamaño final de los índices, mediante la reasignación de los identificadores de los documentos de la colección o bien descartando selectivamente la información que es "menos relevante" para el proceso de recuperación. Las estrategias de compresión propuestas en este tesis se pueden agrupar en dos categorías: (i) estrategias que extienden el estado del arte en la eficiencia de una manera novedosa y (ii) estrategias derivadas de propiedades relacionadas con los principios de la efectividad en los sistemas de RI; estas estrategias son novedosas porque son derivadas desde principios de la efectividad como contraposición a los de la eficiencia, e porque revelan como la eficiencia y la efectividad pueden ser combinadas de una manera efectiva para la recuperación de información. Las contribuciones de esta tesis abarcan la elaboración de técnicas del estado del arte en compresión de índices y también en la derivación de técnicas de compresión basadas en fundamentos teóricos derivados de los principios de la efectividad de los sistemas de RI. Todas estas técnicas han sido evaluadas extensamente con numerosos experimentos que involucran conjuntos de datos y técnicas de referencia bien establecidas en el campo, las cuales permiten una comparación directa con el estado del arte. Finalmente, la optimalidad de las aproximaciones presentadas es tratada desde una perspectiva teórica

    Bias Beyond English: Counterfactual Tests for Bias in Sentiment Analysis in Four Languages

    Get PDF
    Sentiment analysis (SA) systems are used in many products and hundreds of languages. Gender and racial biases are well-studied in English SA systems, but understudied in other languages, with few resources for such studies. To remedy this, we build a counterfactual evaluation corpus for gender and racial/migrant bias in four languages. We demonstrate its usefulness by answering a simple but important question that an engineer might need to answer when deploying a system: What biases do systems import from pre-trained models when compared to a baseline with no pre-training? Our evaluation corpus, by virtue of being counterfactual, not only reveals which models have less bias, but also pinpoints changes in model bias behaviour, which enables more targeted mitigation strategies. We release our code and evaluation corpora to facilitate future research

    A comparative performance evaluation of different implementations of the SOAP protocol

    Get PDF
    Abstract—This paper presents a study evaluation of the SOAP [1] protocol performance between two different implementations: Java (Axis2) [2] and Erlang. This comparison has been carried out using several testbeds with input and output data of different sizes. More concretely, we developed three different web services representing typical scenarios likely to be found in real environments. The evaluation is two-fold: we measured both the number of requests per second answered (throughput) by each server and the response to a common server workload, mixing stress and stand-by phases. The Erlang [3] functional programming language claims to be especifically designed and suited for distributed, reliable and soft real-time concurrent systems. Morever, its built-in lightweight processes management and easeness of replication within distributed environments stand out Erlang as an appealing choice for service oriented architectures (SOAs) [4]. On the other hand, we compared this new approximation with the well-known Apache Axis2 project, as it is widely employed on the Web Services field by the Java community. This work allows us to conclude that the Erlang server is more suitable when the computational cost of the web service is low, whereas the Axis2 server is more efficient as the service workload increases. I

    IntentsKB: A Knowledge Base of Entity-Oriented Search Intents

    Full text link
    We address the problem of constructing a knowledge base of entity-oriented search intents. Search intents are defined on the level of entity types, each comprising of a high-level intent category (property, website, service, or other), along with a cluster of query terms used to express that intent. These machine-readable statements can be leveraged in various applications, e.g., for generating entity cards or query recommendations. By structuring service-oriented search intents, we take one step towards making entities actionable. The main contribution of this paper is a pipeline of components we develop to construct a knowledge base of entity intents. We evaluate performance both component-wise and end-to-end, and demonstrate that our approach is able to generate high-quality data.Comment: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM'18), 2018. 4 pages. 2 figure

    Information extraction from multimedia web documents: an open-source platform and testbed

    No full text
    The LivingKnowledge project aimed to enhance the current state of the art in search, retrieval and knowledge management on the web by advancing the use of sentiment and opinion analysis within multimedia applications. To achieve this aim, a diverse set of novel and complementary analysis techniques have been integrated into a single, but extensible software platform on which such applications can be built. The platform combines state-of-the-art techniques for extracting facts, opinions and sentiment from multimedia documents, and unlike earlier platforms, it exploits both visual and textual techniques to support multimedia information retrieval. Foreseeing the usefulness of this software in the wider community, the platform has been made generally available as an open-source project. This paper describes the platform design, gives an overview of the analysis algorithms integrated into the system and describes two applications that utilise the system for multimedia information retrieval
    corecore