
    An effective Chinese indexing method based on partitioned signature files.

    Wong Chi Yin. Thesis (M.Phil.), Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 107-114). Abstract also in Chinese.
    Contents:
    Abstract; Acknowledgements
    Chapter 1. Introduction: 1.1 Introduction to Chinese IR; 1.2 Contributions; 1.3 Organization of this Thesis
    Chapter 2. Background: 2.1 Indexing methods (2.1.1 Full-text scanning; 2.1.2 Inverted files; 2.1.3 Signature files; 2.1.4 Clustering); 2.2 Information Retrieval Models (2.2.1 Boolean model; 2.2.2 Vector space model; 2.2.3 Probabilistic model; 2.2.4 Logical model)
    Chapter 3. Investigation of Segmentation on the Vector Space Retrieval Model: 3.1 Segmentation of Chinese Texts (3.1.1 Character-based segmentation; 3.1.2 Word-based segmentation; 3.1.3 N-Gram segmentation); 3.2 Performance Evaluation of Three Segmentation Approaches (3.2.1 Experimental Setup; 3.2.2 Experimental Results; 3.2.3 Discussion)
    Chapter 4. Signature File Background: 4.1 Superimposed coding; 4.2 False drop probability
    Chapter 5. Partitioned Signature File Based On Chinese Word Length: 5.1 Fixed Weight Block (FWB) Signature File; 5.2 Overview of PSFC; 5.3 Design Considerations
    Chapter 6. New Hashing Techniques for Partitioned Signature Files: 6.1 Direct Division Method; 6.2 Random Number Assisted Division Method; 6.3 Frequency-based hashing method; 6.4 Chinese character-based hashing method
    Chapter 7. Experiments and Results: 7.1 Performance evaluation of partitioned signature file based on Chinese word length (7.1.1 Retrieval Performance; 7.1.2 Signature Reduction Ratio; 7.1.3 Storage Requirement; 7.1.4 Discussion); 7.2 Performance evaluation of different dynamic signature generation methods (7.2.1 Collision; 7.2.2 Retrieval Performance; 7.2.3 Discussion)
    Chapter 8. Conclusions and Future Work: 8.1 Conclusions; 8.2 Future work
    Appendix A. Notations of Signature Files
    Appendix B. False Drop Probability
    Appendix C. Experimental Results
    Bibliography
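The outline above names superimposed coding and false drops as the background for the thesis. As a rough illustration of that general technique only (not the thesis's partitioned scheme or its hashing methods), the sketch below builds block signatures by OR-ing word signatures. The signature length `F`, the bits per word `M`, the SHA-256 based hash, and all function names are assumptions made for this example.

```python
import hashlib

F = 64  # signature length in bits (assumed value)
M = 3   # hash positions per word (assumed value)

def word_signature(word, f=F, m=M):
    """Set up to m bit positions in an f-bit signature for one word."""
    sig = 0
    for i in range(m):
        digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
        pos = int.from_bytes(digest[:4], "big") % f
        sig |= 1 << pos  # positions may coincide, so fewer than m bits can be set
    return sig

def block_signature(words):
    """Superimpose (bitwise OR) the signatures of all words in a block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def search(blocks, query):
    """Return indices of blocks that really contain the query word."""
    q = word_signature(query)
    # Filter step: a block qualifies if every query bit is set in its signature.
    candidates = [i for i, b in enumerate(blocks) if (block_signature(b) & q) == q]
    # A qualifying signature can be a false drop, so check the actual words.
    return [i for i in candidates if query in blocks[i]]

# Words can be any strings, for example segmented Chinese words.
blocks = [["information", "retrieval"], ["signature", "file"], ["chinese", "index"]]
hits = search(blocks, "signature")
```

The filter never misses a block that contains the query word; it can only admit extra blocks, which is why the final membership check is needed.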

    "Influence Sketching": Finding Influential Samples In Large-Scale Regressions

    There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware.
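The idea of computing Cook's distance on randomly projected data can be sketched in a few lines. The example below is a simplification, not the paper's algorithm: it uses ordinary least squares in place of a GLM, a dense Gaussian projection, and a tiny synthetic dataset. The function names, the projection dimension `k`, and the planted outlier at row 17 are all assumptions made for illustration.

```python
import random

def random_projection(X, k, rng):
    """Project each row of X from d to k dimensions with a Gaussian matrix."""
    d = len(X[0])
    R = [[rng.gauss(0.0, 1.0) / k ** 0.5 for _ in range(k)] for _ in range(d)]
    return [[sum(row[j] * R[j][c] for j in range(d)) for c in range(k)] for row in X]

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def cooks_distance_sketch(X, y, k, seed=0):
    """Cook's distance for a least-squares fit on the projected data Z = X R."""
    Z = random_projection(X, k, random.Random(seed))
    n = len(Z)
    # Normal equations on the projected data: (Z^T Z) beta = Z^T y
    ZtZ = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    Zty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    inv = invert(ZtZ)
    beta = [sum(inv[a][b] * Zty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, z)) for z, yi in zip(Z, y)]
    s2 = sum(r * r for r in resid) / (n - k)  # residual variance estimate
    scores = []
    for z, r in zip(Z, resid):
        # Leverage of this sample: h = z^T (Z^T Z)^{-1} z
        h = sum(z[a] * inv[a][b] * z[b] for a in range(k) for b in range(k))
        scores.append((r * r / (k * s2)) * h / (1.0 - h) ** 2)
    return scores

def make_demo_data(n=200, d=50, seed=1):
    """Synthetic regression data with one planted outlier at row 17."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(row[:3]) + rng.gauss(0.0, 0.1) for row in X]
    X[17] = [20.0] * d   # unusually large features
    y[17] = -1000.0      # wildly wrong label
    return X, y

X, y = make_demo_data()
scores = cooks_distance_sketch(X, y, k=5)
top = max(range(len(scores)), key=scores.__getitem__)
```

Ranking samples by `scores` and inspecting the highest ones mirrors the prioritization step described in the abstract; at the scale reported there, the projection and the fit would need sparse and distributed implementations rather than plain lists.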

    Communication Cost for Updating Linear Functions when Message Updates are Sparse: Connections to Maximally Recoverable Codes

    We consider a communication problem in which an update of the source message needs to be conveyed to one or more distant receivers that are interested in maintaining specific linear functions of the source message. The setting is one in which the updates are sparse in nature, and where neither the source nor the receiver(s) is aware of the exact difference vector, but only know the amount of sparsity that is present in the difference vector. Under this setting, we are interested in devising linear encoding and decoding schemes that minimize the communication cost involved. We show that the optimal solution to this problem is closely related to the notion of maximally recoverable codes (MRCs), which were originally introduced in the context of coding for storage systems. In the context of storage, MRCs guarantee optimal erasure protection when the system is partially constrained to have local parity relations among the storage nodes. In our problem, we show that optimal solutions exist if and only if MRCs of a certain kind (identified by the desired linear functions) exist. We consider point-to-point and broadcast versions of the problem, and identify connections to MRCs under both these settings. For the point-to-point setting, we show that our linear-encoder based achievable scheme is optimal even when non-linear encoding is permitted. The theory is illustrated in the context of updating erasure coded storage nodes. We present examples based on modern storage codes such as the minimum bandwidth regenerating codes. Comment: To appear in IEEE Transactions on Information Theor

    Technology challenges of stealth unmanned combat aerial vehicles

    The ever-changing battlefield environment, as well as the emergence of global command and control architectures currently used by armed forces around the globe, requires the use of robust and adaptive technologies integrated into a reliable platform. Unmanned Combat Aerial Vehicles (UCAVs) aim to integrate such advanced technologies while also increasing the tactical capabilities of combat aircraft. This paper provides a summary of the technical and operational design challenges specific to UCAVs, focusing on high-performance and stealth designs. After a brief historical overview, the main technology demonstrator programmes currently under development are presented. The key technologies affecting UCAV design are identified and discussed. Finally, this paper briefly presents the main issues related to airworthiness, navigation, and ethical concerns behind UAV/UCAV operations.

    Security and Privacy for Green IoT-based Agriculture: Review, Blockchain solutions, and Challenges

    This paper presents research challenges on security and privacy issues in the field of green IoT-based agriculture. We start by describing a four-tier green IoT-based agriculture architecture and summarizing the existing surveys that deal with smart agriculture. Then, we classify threat models against green IoT-based agriculture into five categories: attacks against privacy, authentication, confidentiality, availability, and integrity properties. Moreover, we provide a taxonomy and a side-by-side comparison of the state-of-the-art methods toward secure and privacy-preserving technologies for IoT applications and how they will be adapted for green IoT-based agriculture. In addition, we analyze the privacy-oriented blockchain-based solutions as well as consensus algorithms for IoT applications and how they will be adapted for green IoT-based agriculture. Based on the current survey, we highlight open research challenges and discuss possible future research directions in the security and privacy of green IoT-based agriculture.

    Genetic diversity and population structure of six autochthonous pig breeds from Croatia, Serbia, and Slovenia

    Background: The importance of local breeds as genetic reservoirs of valuable genetic variation is well established. Pig breeding in Central and South-Eastern Europe has a long tradition that led to the formation of several local pig breeds. In the present study, genetic diversity parameters were analysed in six autochthonous pig breeds from Slovenia, Croatia and Serbia (Banija spotted, Black Slavonian, Turopolje pig, Swallow-bellied Mangalitsa, Moravka and Krskopolje pig). Animals from each of these breeds were genotyped using microsatellites and single nucleotide polymorphisms (SNPs). The results obtained with these two marker systems and those based on pedigree data were compared. In addition, we estimated inbreeding levels based on the distribution of runs of homozygosity (ROH) and identified genomic regions under selection pressure using ROH islands and the integrated haplotype score (iHS).
    Results: The lowest heterozygosity values calculated from microsatellite and SNP data were observed in the Turopolje pig. The observed heterozygosity was higher than the expected heterozygosity in the Black Slavonian, Moravka and Turopolje pig. Both types of markers allowed us to distinguish clusters of individuals belonging to each breed. The analysis of admixture between breeds revealed potential gene flow between the Mangalitsa and Moravka, and between the Mangalitsa and Black Slavonian, but no introgression events were detected in the Banija spotted and Turopolje pig. The distribution of ROH across the genome was not uniform. Analysis of the ROH islands identified genomic regions with an extremely high frequency of shared ROH within the Swallow-bellied Mangalitsa, which harboured genes associated with cholesterol biosynthesis, fatty acid metabolism and daily weight gain. The iHS approach to detect signatures of selection revealed candidate regions containing genes with potential roles in reproduction traits and disease resistance.
    Conclusions: Based on the estimation of population parameters obtained from three data sets, we showed the existence of relationships among the six pig breeds analysed here. Analysis of the distribution of ROH allowed us to estimate the level of inbreeding and the extent of homozygous regions in these breeds. The iHS analysis revealed genomic regions potentially associated with phenotypic traits and allowed the detection of genomic regions under selection pressure.

    Road Graph Simplification for Minimum Cost Flow Problem

    In this work we consider the Minimum Cost Multicommodity Network Flow (MCMNF) problem as a key problem for traffic routing. The routing problem is recurring: it is solved regularly throughout the day, every day, so the task is not to find a solution once but to keep solving the same problem with different inputs over a long period. We therefore present a solution that can be used successfully in the long term. We exploit a periodic demand pattern, i.e. vehicles' directions generally recur daily. Our improvement is based on the column generation method, which allows us to reuse vehicle paths from previous days in the solution process. We achieved a 40% reduction in computational time while preserving the optimality of the solution.

    Evaluating diabetes and hypertension disease causality using mouse phenotypes

    Background: Genome-wide association studies (GWAS) have found hundreds of single nucleotide polymorphisms (SNPs) associated with common diseases. However, it is largely unknown which genes linked with these SNPs are actually causal for disease. Definitive proof of disease causality can come from demonstrating disease-like phenotypes through genetic perturbation of the genes or alleles, which is a daunting task for complex diseases where only mammalian models can be used.
    Results: Here we tapped the rich resource of mouse phenotype data and developed a method to quantify the probability that a gene perturbation causes the phenotypes of a disease. Using type II diabetes (T2D) and hypertension (HT) as study cases, we found that genes with a high probability of causing T2D and HT phenotypes when perturbed tend to be hubs in the interactome networks. These genes are enriched for signaling pathways that regulate metabolism but not for metabolic pathways, even though the genes in these metabolic pathways are often the most significantly changed in expression levels in these diseases.
    Conclusions: Compared to human genetic disease-based predictions, our mouse phenotype based predictors greatly increased the coverage while keeping a similarly high specificity. The disease phenotype probabilities given by our approach can be used to evaluate the likelihood of disease causality of disease-associated genes and genes surrounding disease-associated SNPs.