181 research outputs found

    Pattern masking for dictionary matching

    Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is becoming increasingly important in application areas such as record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary D of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, and we are asked to compute a smallest set K ⊆ {1,…,ℓ} such that, if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from D. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We thus approach the problem from a more practical perspective. We show a combinatorial O((dℓ)^{|K|/3} + dℓ)-time and O(dℓ)-space algorithm for PMDM for |K| = O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails [Abboud et al., SIAM J. Comput. 2018; Lincoln et al., SODA 2018]. We also generalize this algorithm to the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. Note that PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from D. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time O(2^ℓ d). We present a data structure for PMDM that answers queries over D in time O(2^{ℓ/2}(2^{ℓ/2} + τ)) and requires space O(2^ℓ d²/τ² + 2^{ℓ/2} d), for any parameter τ ∈ [1,d].
We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time O(d^{1/4+ε})-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture.
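To make the problem statement concrete, here is a minimal brute-force sketch of PMDM in Python. It is not the paper's algorithm or data structure; it only follows the definition in the abstract by trying candidate position sets in order of increasing size, in the spirit of the simple exponential-time approach. The function name `pmdm_brute_force` and the sample data are assumptions made for illustration.

```python
from itertools import combinations


def pmdm_brute_force(dictionary, q, z):
    """Return a smallest set K of 1-based positions such that masking q at K
    makes q match at least z dictionary strings, or None if no such K exists.

    Illustrative exhaustive search only (exponential in len(q)).
    """
    ell = len(q)
    if any(len(s) != ell for s in dictionary):
        raise ValueError("all dictionary strings must have the same length as q")

    # For each dictionary string, record the positions where it differs from q.
    # A masked q matches the string exactly when K covers all those positions.
    mismatches = [
        frozenset(i + 1 for i in range(ell) if s[i] != q[i]) for s in dictionary
    ]

    # Try candidate sets from smallest to largest, so the first hit is minimum.
    for size in range(ell + 1):
        for candidate in combinations(range(1, ell + 1), size):
            k = set(candidate)
            if sum(1 for m in mismatches if m <= k) >= z:
                return k
    return None


# Hypothetical example data.
sample_dictionary = ["aab", "abb", "bbb"]
print(pmdm_brute_force(sample_dictionary, "aaa", 2))
```

With z = 1 the size of the returned set equals the smallest Hamming distance between q and any dictionary string, which mirrors the connection to dictionary matching with mismatches noted in the abstract.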

    Preparation of name and address data for record linkage using hidden Markov models

    BACKGROUND: Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs). METHODS: HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems. RESULTS: Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse when used with simpler name data. Possible reasons for this poorer performance are discussed. CONCLUSION: Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.
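As a rough illustration of how an HMM can assign address fields to lexicon-tagged tokens, the sketch below runs standard Viterbi decoding over a toy model. It is not the software described in the abstract: the state names, the token tags (`NU`, `UN`, `WT`) and every probability are hypothetical values chosen for the example, whereas the paper's models were trained on real data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for a non-empty tag sequence."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [
        {s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}
    ]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximises the path probability into s.
            prob, back = max(
                ((prev[r][0] * trans_p[r].get(s, 0.0), r) for r in states),
                key=lambda pair: pair[0],
            )
            column[s] = (prob * emit_p[s].get(obs, 0.0), back)
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: hidden states are output fields, observations are
# lexicon tags (NU = number, UN = unknown word, WT = street-type word).
STATES = ["house_number", "street_name", "street_type"]
START_P = {"house_number": 0.8, "street_name": 0.2, "street_type": 0.0}
TRANS_P = {
    "house_number": {"street_name": 0.9, "street_type": 0.1},
    "street_name": {"street_name": 0.3, "street_type": 0.7},
    "street_type": {"street_type": 1.0},
}
EMIT_P = {
    "house_number": {"NU": 0.9, "UN": 0.1},
    "street_name": {"UN": 0.8, "WT": 0.1, "NU": 0.1},
    "street_type": {"WT": 0.9, "UN": 0.1},
}

# Tags for a tokenised input such as "42 main road".
print(viterbi(["NU", "UN", "WT"], STATES, START_P, TRANS_P, EMIT_P))
```

In a trained system the probability tables would be estimated from annotated records rather than written by hand.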

    Collectivity and configuration mixing in 186,188Pb and 194Po

    Lifetimes of prolate intruder states in 186Pb and oblate intruder states in 194Po have been determined by employing, for the first time, the recoil-decay tagging technique in recoil distance Doppler-shift lifetime measurements. In addition, lifetime measurements of prolate states in 188Pb up to the 8+ state were carried out using the recoil-gating method. The B(E2) values have been deduced from which deformation parameters |β2|=0.29(5) and |β2|=0.17(3) for the prolate and the oblate bands, respectively, have been extracted. The results also shed new light on the mixing between different shapes

    Functional Diversity and Structural Disorder in the Human Ubiquitination Pathway

    The ubiquitin-proteasome system plays a central role in cellular regulation and protein quality control (PQC). The system is built as a pyramid of increasing complexity, with two E1 (ubiquitin activating), few dozen E2 (ubiquitin conjugating) and several hundred E3 (ubiquitin ligase) enzymes. By collecting and analyzing E3 sequences from the KEGG BRITE database and literature, we assembled a coherent dataset of 563 human E3s and analyzed their various physical features. We found an increase in structural disorder of the system with multiple disorder predictors (IUPred - E1: 5.97%, E2: 17.74%, E3: 20.03%). E3s that can bind E2 and substrate simultaneously (single subunit E3, ssE3) have significantly higher disorder (22.98%) than E3s in which E2 binding (multi RING-finger, mRF, 0.62%), scaffolding (6.01%) and substrate binding (adaptor/substrate recognition subunits, 17.33%) functions are separated. In ssE3s, the disorder was localized in the substrate/adaptor binding domains, whereas the E2-binding RING/HECT-domains were structured. To demonstrate the involvement of disorder in E3 function, we applied normal modes and molecular dynamics analyses to show how a disordered and highly flexible linker in human CBL (an E3 that acts as a regulator of several tyrosine kinase-mediated signalling pathways) facilitates long-range conformational changes bringing substrate and E2-binding domains towards each other and thus assisting in ubiquitin transfer. E3s with multiple interaction partners (as evidenced by data in STRING) also possess elevated levels of disorder (hubs, 22.90% vs. non-hubs, 18.36%). Furthermore, a search in PDB uncovered 21 distinct human E3 interactions, in 7 of which the disordered region of E3s undergoes induced folding (or mutual induced folding) in the presence of the partner. 
In conclusion, our data highlight the primary role of structural disorder in the functions of E3 ligases, which manifests itself in the substrate/adaptor binding functions as well as in the mechanism of ubiquitin transfer by long-range conformational transitions. © 2013 Bhowmick et al.

    A c-di-GMP Effector System Controls Cell Adhesion by Inside-Out Signaling and Surface Protein Cleavage

    In Pseudomonas fluorescens Pf0-1 the availability of inorganic phosphate (Pi) is an environmental signal that controls biofilm formation through a cyclic dimeric GMP (c-di-GMP) signaling pathway. In low Pi conditions, a c-di-GMP phosphodiesterase (PDE) RapA is expressed, depleting cellular c-di-GMP and causing the loss of a critical outer-membrane adhesin LapA from the cell surface. This response involves an inner membrane protein LapD, which binds c-di-GMP in the cytoplasm and exerts a periplasmic output promoting LapA maintenance on the cell surface. Here we report how LapD differentially controls maintenance and release of LapA: c-di-GMP binding to LapD promotes interaction with and inhibition of the periplasmic protease LapG, which targets the N-terminus of LapA. We identify conserved amino acids in LapA required for cleavage by LapG. Mutating these residues in chromosomal lapA inhibits LapG activity in vivo, leading to retention of the adhesin on the cell surface. Mutations with defined effects on LapD's ability to control LapA localization in vivo show concomitant effects on c-di-GMP-dependent LapG inhibition in vitro. To establish the physiological importance of the LapD-LapG effector system, we track cell attachment and LapA protein localization during Pi starvation. Under this condition, the LapA adhesin is released from the surface of cells and biofilms detach from the substratum. This response requires c-di-GMP depletion by RapA, signaling through LapD, and proteolytic cleavage of LapA by LapG. These data, in combination with the companion study by Navarro et al. presenting a structural analysis of LapD's signaling mechanism, give a detailed description of a complete c-di-GMP control circuit—from environmental signal to molecular output. They describe a novel paradigm in bacterial signal transduction: regulation of a periplasmic enzyme by an inner membrane signaling protein that binds a cytoplasmic second messenger

    Some methods for blindfolded record linkage

    BACKGROUND: The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality. Dusserre, Quantin, Bouzelat and colleagues have demonstrated that it is possible to use secure one-way hash transformations to carry out follow-up epidemiological studies without any party having to reveal identifying information about any of the subjects – a technique which we refer to as "blindfolded record linkage". A limitation of their method is that only exact comparisons of values are possible, although phonetic encoding of names and other strings can be used to allow for some types of typographical variation and data errors. METHODS: A method is described which permits the calculation of a general similarity measure, the n-gram score, without having to reveal the data being compared, albeit at some cost in computation and data communication. This method can be combined with public key cryptography and automatic estimation of linkage model parameters to create an overall system for blindfolded record linkage. RESULTS: The system described offers good protection against misdeeds or security failures by any one party, but remains vulnerable to collusion between or simultaneous compromise of two or more parties involved in the linkage operation. In order to reduce the likelihood of this, the use of last-minute allocation of tasks to substitutable servers is proposed. Proof-of-concept computer programmes written in the Python programming language are provided to illustrate the similarity comparison protocol. 
CONCLUSION: Although the protocols described in this paper are not unconditionally secure, they do suggest the feasibility, with the aid of modern cryptographic techniques and high speed communication networks, of a general purpose probabilistic record linkage system which permits record linkage studies to be carried out with negligible risk of invasion of personal privacy
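The following Python sketch illustrates the general idea that an n-gram overlap score can be computed from keyed one-way hashes instead of plaintext values. It is a simplified illustration under stated assumptions, not the protocol from the paper: it uses bigrams, HMAC-SHA256 with a shared key, and the Dice coefficient as the score, and the helper names are hypothetical. It offers none of the paper's multi-party protections and should not be treated as secure.

```python
import hashlib
import hmac


def bigrams(value):
    """Return the set of padded character bigrams of a string."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def hashed_bigrams(value, key):
    """Return keyed hashes of the bigrams, so the plaintext is not exchanged."""
    return {
        hmac.new(key, gram.encode("utf-8"), hashlib.sha256).hexdigest()
        for gram in bigrams(value)
    }


def dice_score(a, b):
    """Dice coefficient of two sets: 2 * |a & b| / (|a| + |b|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical shared key and example names.
shared_key = b"example-shared-secret"
left = hashed_bigrams("smith", shared_key)
right = hashed_bigrams("smyth", shared_key)
print(dice_score(left, right))
```

Because identical bigrams hash to identical digests under the same key, the score computed on the hashed sets equals the score computed on the plaintext bigram sets.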

    A Ligand Channel through the G Protein Coupled Receptor Opsin

    The G protein coupled receptor rhodopsin contains a pocket within its seven-transmembrane helix (TM) structure, which bears the inactivating 11-cis-retinal bound by a protonated Schiff-base to Lys296 in TM7. Light-induced 11-cis-/all-trans-isomerization leads to the Schiff-base deprotonated active Meta II intermediate. With Meta II decay, the Schiff-base bond is hydrolyzed, all-trans-retinal is released from the pocket, and the apoprotein opsin reloaded with new 11-cis-retinal. The crystal structure of opsin in its active Ops* conformation provides the basis for computational modeling of retinal release and uptake. The ligand-free 7TM bundle of opsin opens into the hydrophobic membrane layer through openings A (between TM1 and 7), and B (between TM5 and 6), respectively. Using skeleton search and molecular docking, we find a continuous channel through the protein that connects these two openings and comprises in its central part the retinal binding pocket. The channel traverses the receptor over a distance of ca. 70 Å and is between 11.6 and 3.2 Å wide. Both openings are lined with aromatic residues, while the central part is highly polar. Four constrictions within the channel are so narrow that they must stretch to allow passage of the retinal β-ionone-ring. Constrictions are at openings A and B, respectively, and at Trp265 and Lys296 within the retinal pocket. The lysine enforces a 90° elbow-like kink in the channel which limits retinal passage. With a favorable Lys side chain conformation, 11-cis-retinal can take the turn, whereas passage of the all-trans isomer would require more global conformational changes. We discuss possible scenarios for the uptake of 11-cis- and release of all-trans-retinal. If the uptake gate of 11-cis-retinal is assigned to opening B, all-trans is likely to leave through the same gate. The unidirectional passage proposed previously requires uptake of 11-cis-retinal through A and release of photolyzed all-trans-retinal through B

    Molecular dynamics simulation of biomolecular systems

    The group for computer-aided chemistry at the ETH Zurich focuses its research on the development of methodology to simulate the behavior of biomolecular systems and the use of simulation techniques to analyze and understand biomolecular processes at the atomic level. Here, the current research directions are briefly reviewed and illustrated with a few examples