Coreference detection in XML metadata
Preserving data quality is an important issue in data collection management. A crucial issue here is the detection of duplicate objects (called coreferent objects), which describe the same entity but in different ways. In this paper we present a method for detecting coreferent objects in metadata, in particular in XML schemas. Our approach consists of comparing the paths from the root element to a given element in the schema. Each path precisely defines the context and location of a specific element in the schema. Path matching is based on the comparison of the individual steps of which paths are composed. The uncertainty about the matching of steps is expressed with possibilistic truth values and aggregated using the Sugeno integral. The discovered coreference of paths can help in determining the coreference of different XML schemas.
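The aggregation step can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the per-step match degrees and the cardinality-based fuzzy measure are assumptions made purely for the example.

```python
def sugeno_integral(scores, measure):
    """Sugeno integral of per-step match degrees w.r.t. a fuzzy measure.

    scores: list of match degrees in [0, 1], one per path step.
    measure: function mapping a frozenset of step indices to [0, 1].
    """
    # Sort step indices by ascending match degree.
    idx = sorted(range(len(scores)), key=lambda i: scores[i])
    best = 0.0
    for k, i in enumerate(idx):
        # A_k = the set of steps whose score is >= the k-th smallest score.
        a_k = frozenset(idx[k:])
        best = max(best, min(scores[i], measure(a_k)))
    return best

def card_measure(n):
    """Cardinality-based ("symmetric") fuzzy measure: mu(A) = |A| / n."""
    return lambda s: len(s) / n

# Hypothetical match degrees for a three-step path comparison.
steps = [0.9, 0.6, 1.0]
agg = sugeno_integral(steps, card_measure(len(steps)))
```

With a cardinality-based measure, the Sugeno integral acts as a robust, median-like consensus over the step-match degrees rather than an average, which suits ordinal possibilistic truth values.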
Programming language semantics as a foundation for Bayesian inference
Bayesian modelling, in which our prior belief about the distribution on model parameters
is updated by observed data, is a popular approach to statistical data analysis.
However, writing specific inference algorithms for Bayesian models by hand is time-consuming
and requires significant machine learning expertise.
Probabilistic programming promises to make Bayesian modelling easier and more
accessible by letting the user express a generative model as a short computer program
(with random variables), leaving inference to the generic algorithm provided by the
compiler of the given language. However, it is not easy to design a probabilistic programming
language correctly and define the meaning of programs expressible in it.
Moreover, the inference algorithms used by probabilistic programming systems usually
lack formal correctness proofs and bugs have been found in some of them, which
limits the confidence one can have in the results they return.
In this work, we apply ideas from the areas of programming language theory and
statistics to show that probabilistic programming can be a reliable tool for Bayesian
inference. The first part of this dissertation concerns the design, semantics and type
system of a new, substantially enhanced version of the Tabular language. Tabular is a
schema-based probabilistic language, which means that instead of writing a full program,
the user only has to annotate the columns of a schema with expressions generating
corresponding values. By adopting this paradigm, Tabular aims to be user-friendly,
but this unusual design also makes it harder to define the syntax and semantics correctly
and reason about the language. We define the syntax of a version of Tabular extended
with user-defined functions and pseudo-deterministic queries, design a dependent type
system for this language and endow it with a precise semantics. We also extend Tabular
with a concise formula notation for hierarchical linear regressions, define the type
system of this extended language and show how to reduce it to pure Tabular.
In the second part of this dissertation, we present the first correctness proof for a
Metropolis-Hastings sampling algorithm for a higher-order probabilistic language. We
define a measure-theoretic semantics of the language by means of an operationally-defined
density function on program traces (sequences of random variables) and a map
from traces to program outputs. We then show that the distribution of samples returned
by our algorithm (a variant of “Trace MCMC” used by the Church language) matches
the program semantics in the limit.
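The core idea of the second part, a density defined on program traces plus a map from traces to outputs, can be illustrated with a drastically simplified sketch. This is not the dissertation's algorithm or language: below, a "trace" holds a single random variable for a toy conjugate Gaussian model, and a symmetric random-walk proposal replaces the single-site resampling of Trace MCMC.

```python
import math
import random

random.seed(0)

def log_density(trace, data):
    """Log-density of a trace under a toy model: mu ~ N(0,1); x_i ~ N(mu,1)."""
    (mu,) = trace
    logp = -0.5 * mu * mu                    # prior N(0, 1), up to a constant
    for x in data:
        logp += -0.5 * (x - mu) ** 2         # likelihood N(mu, 1)
    return logp

def output(trace):
    """Map from traces to program outputs (here, just the parameter mu)."""
    return trace[0]

def trace_mh(data, steps=20000, scale=0.5):
    """Metropolis-Hastings over traces with a symmetric Gaussian proposal."""
    trace = [0.0]
    samples = []
    for _ in range(steps):
        proposal = [trace[0] + random.gauss(0.0, scale)]
        # Symmetric proposal, so the acceptance ratio is a density ratio.
        if math.log(random.random()) < log_density(proposal, data) - log_density(trace, data):
            trace = proposal
        samples.append(output(trace))
    return samples

data = [1.8, 2.1, 2.4]
post = trace_mh(data)
# For this conjugate model, the posterior mean of mu is sum(data)/(n+1) = 1.575.
est = sum(post[5000:]) / len(post[5000:])
```

In the limit of many steps, the empirical distribution of `output(trace)` samples matches the posterior the density defines, which is exactly the kind of correspondence the correctness proof establishes for the full higher-order language.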
Implementing the Duty Trip Support Application
We are in the process of developing an agent and ontology-based Duty Trip Support application. The goal of this paper is to consider issues arising when implementing such a system. In addition to the description of our current implementation, which is also critically analyzed, other possible approaches are considered as well.
Keywords: software agents, agent systems, ontologies, transport objects, agent-non-agent integration.
System SINUS – an open tool for building bibliographic databases
The aim of this paper is to present a new open tool for building bibliographic databases. The SINUS system, developed by the Poznań Supercomputing and Networking Center, was initially created to meet the needs of managing data about the scientific publications of Poznań University of Technology staff. In the paper we present the basic functional assumptions of the system, its current functionality, and future development directions.
Recalibrating classifiers for interpretable abusive content detection
Dataset and code for the paper, 'Recalibrating classifiers for interpretable abusive content detection' by Vidgen et al. (2020) -- to appear at the NLP + CSS workshop at EMNLP 2020.
We provide:
1,000 annotated tweets, sampled using the Davidson classifier, with 50 tweets drawn from each of 20 score bins of width 0.05, from a dataset of tweets directed against MPs in the UK 2017 General Election
1,000 annotated tweets, sampled using the Perspective classifier, with 50 tweets drawn from each of 20 score bins of width 0.05, from a dataset of tweets directed against MPs in the UK 2017 General Election
Code for recalibration in R and Stan.
Annotation guidelines for both datasets.
Paper abstract
We investigate the use of machine learning classifiers for detecting online abuse in empirical research. We show that uncalibrated classifiers (i.e. where the 'raw' scores are used) align poorly with human evaluations. This limits their usefulness for understanding the dynamics, patterns and prevalence of online abuse. We examine two widely used classifiers (created by Perspective and Davidson et al.) on a dataset of tweets directed against candidates in the UK's 2017 general election.
A Bayesian approach is presented to recalibrate the raw scores from the classifiers, using probabilistic programming and newly annotated data. We argue that evaluating interpretability and recalibrating scores are integral to the application of abusive content classifiers.
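A minimal sketch of what score recalibration involves, assuming a Platt-style logistic model fitted by maximum likelihood rather than the paper's full Bayesian treatment in Stan; the raw scores and human annotations below are invented for illustration.

```python
import math

def recalibrate(raw_scores, labels, lr=0.1, epochs=2000):
    """Fit p(abusive) = sigmoid(a * logit(raw) + b) by gradient descent.

    A Platt-style stand-in for a Bayesian recalibration: it returns point
    estimates of (a, b) instead of a posterior over them.
    """
    eps = 1e-6
    logits = [math.log((s + eps) / (1 - s + eps)) for s in raw_scores]
    a, b = 1.0, 0.0
    n = len(labels)
    for _ in range(epochs):
        ga = gb = 0.0
        for z, y in zip(logits, labels):
            p = 1.0 / (1.0 + math.exp(-(a * z + b)))
            # Gradient of the mean negative log-likelihood.
            ga += (p - y) * z / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# Hypothetical raw classifier scores and binary human annotations.
raw = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
human = [0, 0, 0, 1, 1, 1]
a, b = recalibrate(raw, human)
```

After fitting, a raw score is passed through `sigmoid(a * logit(raw) + b)` to obtain a probability aligned with human judgments; a Bayesian version would additionally propagate uncertainty in `a` and `b` into each calibrated score.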
Semantical mapping of attribute values for data integration
Nowadays the amount of data is increasing very fast. Moreover, useful information is scattered over multiple sources. Therefore, automatic data integration that guarantees high data quality is extremely important. One of the crucial operations in the integration of information from independent databases is the detection of different representations of the same piece of information (called coreferent data) and the translation of the representation of data from one source into the representation of the other source. That translation is also known as object mapping. In this paper, we investigate automatic mapping methods for attributes whose values may need semantic comparison and can be sorted by means of an order relation that reflects a notion of generality. These mapping methods are investigated closely in terms of their effectiveness. An experimental evaluation of our method shows that using different mapping methods can enlarge the set of true positive mappings.
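A toy sketch of the kind of mapping described, under the assumption that the generality order is given as an explicit child-to-parent taxonomy; the `parent` table, vocabularies, and values are all hypothetical and stand in for whatever order relation the sources actually share.

```python
# Hypothetical generality order: child -> parent (more general value).
parent = {
    "poodle": "dog", "beagle": "dog",
    "dog": "mammal", "cat": "mammal",
    "mammal": "animal",
}

def generalizations(value):
    """All values more general than `value`, from most to least specific."""
    out = []
    while value in parent:
        value = parent[value]
        out.append(value)
    return out

def map_value(value, target_vocab):
    """Map a source value into a target vocabulary.

    Returns the value itself if the target knows it, otherwise its most
    specific generalization present in the target, otherwise None.
    """
    if value in target_vocab:
        return value
    for g in generalizations(value):
        if g in target_vocab:
            return g
    return None

target = {"dog", "cat", "animal"}
# "poodle" is unknown to the target, so it maps to the nearest
# more general value the target does have: "dog".
```

Walking up the order relation like this trades precision for coverage: each mapped value remains semantically correct (only more general), which is why such methods can enlarge the set of true positive mappings.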
Fibers as carriers of microbial particles
Background: The aim of the study was to assess the ability of natural, synthetic and semi-synthetic fibers to transport microbial particles. Material and Methods: Simultaneous sampling of settled dust and aerosols was carried out in 3 industrial facilities processing natural (cotton, silk, flax, hemp), synthetic (polyamide, polyester, polyacrylonitrile, polypropylene) and semi-synthetic (viscose) fibrous materials; 2 stables where horses and sheep were bred; 4 homes where dogs or cats were kept; and 1 zoo lion pavilion. All samples were analyzed in the laboratory for their microbiological purity. The isolated strains were qualitatively identified. To identify the structure and arrangement of fibers that may support the transport of microbial particles, a scanning electron microscopy analysis was performed. Results: Both settled and airborne fibers transported analogous microorganisms. All synthetic, semi-synthetic and silk fibers, present as separated threads with a smooth surface, were free from microbial contamination. Natural fibers with loose packing and a rough surface (e.g., wool, horse hair), sheaf packing and a septated surface (e.g., flax, hemp) or present as twisted ribbons with a corrugated surface (cotton) were able to carry up to 9×10⁵ cfu/g of aerobic bacteria, 3.4×10⁴ cfu/g of anaerobic bacteria and 6.3×10⁴ cfu/g of fungi, including pathogenic strains classified by Directive 2000/54/EC in hazard group 2. Conclusions: As plant and animal fibers are contaminated with a significant number of microorganisms, including pathogens, all of them should be mechanically eliminated from the environment. In factories, if the manufacturing process allows, they should be replaced by synthetic or semi-synthetic fibers. To avoid unwanted exposure to harmful microbial agents on fibers, containment measures that efficiently limit their presence and dissemination in both occupational and non-occupational environments should be introduced. Med Pr 2015;66(4):511–52