1,162 research outputs found

    A kernel-based framework for learning graded relations from data

    Driven by a large number of potential applications in areas like bioinformatics, information retrieval and social network analysis, the problem setting of inferring relations between pairs of data objects has recently been investigated quite intensively in the machine learning community. To this end, current approaches typically consider datasets containing crisp relations, so that standard classification methods can be adopted. However, relations between objects like similarities and preferences are often expressed in a graded manner in real-world applications. A general kernel-based framework for learning relations from data is introduced here. It extends existing approaches because both crisp and graded relations are considered, and it unifies existing approaches because different types of graded relations can be modeled, including symmetric and reciprocal relations. This framework establishes important links between recent developments in fuzzy set theory and machine learning. Its usefulness is demonstrated through various experiments on synthetic and real-world data. Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

    A heuristic information retrieval study : an investigation of methods for enhanced searching of distributed data objects exploiting bidirectional relevance feedback

    A thesis submitted for the degree of Doctor of Philosophy of the University of Luton. The primary aim of this research is to investigate methods of improving the effectiveness of current information retrieval systems. This aim is pursued through several supporting objectives. A foundational objective is to introduce a novel bidirectional, symmetrical fuzzy logic theory which may prove valuable to information retrieval, including internet searches of distributed data objects. A further objective is to design, implement and apply the novel theory in an experimental information retrieval system called ANACALYPSE, which automatically computes the relevance of a large number of unseen documents from expert relevance feedback on a small number of documents read. A further objective is to define the methodology used in this work as an experimental information retrieval framework consisting of multiple tables of formulae, which allow many combinations of similarity functions, term weights, relative term frequencies, document weights, bidirectional relevance feedback and history-adjusted term weights. The evaluation of bidirectional relevance feedback reveals a better correspondence between system ranking of documents and users' preferences than feedback-free system ranking. The assessment of similarity functions reveals that the Cosine and Jaccard functions perform significantly better than the DotProduct and Overlap functions. The evaluation of history tracking of the documents visited from a root page reveals better system ranking of documents than tracking-free information retrieval. The assessment of stemming reveals that system information retrieval performance remains unaffected, while stop word removal does not appear to be beneficial and can sometimes be harmful.
The overall evaluation of the experimental information retrieval system, in comparison to a leading-edge commercial information retrieval system and to the expert's gold standard of judged relevance, using established statistical correlation methods, reveals enhanced information retrieval effectiveness.
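For reference, the four similarity functions named in this abstract can be sketched on binary term sets. The definitions below are the common textbook set-based forms, assumed here for illustration only; the abstract does not give the thesis's own formulae or term weighting, which may differ.

```python
import math

def dot_product(a: set, b: set) -> float:
    """Number of shared terms (dot product of binary term vectors)."""
    return float(len(a & b))

def cosine(a: set, b: set) -> float:
    """Shared terms normalised by the geometric mean of the set sizes."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def jaccard(a: set, b: set) -> float:
    """Shared terms normalised by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap(a: set, b: set) -> float:
    """Shared terms normalised by the size of the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Illustrative (made-up) query and document term sets.
query = {"fuzzy", "logic", "retrieval"}
document = {"fuzzy", "retrieval", "feedback", "ranking"}
scores = {f.__name__: f(query, document)
          for f in (dot_product, cosine, jaccard, overlap)}
```

Note that the dot product is unnormalised, so it grows with document length, whereas the other three are bounded by 1 on binary sets.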

    A Survey on Important Aspects of Information Retrieval

    Information retrieval has become an important field of study and research within computer science due to the explosive growth of information available in the form of full text, hypertext, administrative text, directory, numeric or bibliographic text. Research is ongoing on various aspects of information retrieval systems so as to improve their efficiency and reliability. This paper presents a comprehensive survey that discusses not only the emergence and evolution of information retrieval but also different information retrieval models and some important aspects such as document representation, similarity measures and query expansion.

    The generalized dice similarity measures for multiple attribute decision making with hesitant fuzzy linguistic information

    In this paper, we shall present some novel Dice similarity measures of hesitant fuzzy linguistic term sets and the generalized Dice similarity measures of hesitant fuzzy linguistic term sets and indicate that the Dice similarity measures and asymmetric measures (projection measures) are the special cases of the generalized Dice similarity measures in some parameter values. Then, we propose the generalized Dice similarity measures-based multiple attribute decision making models with hesitant fuzzy linguistic term sets. Finally, a practical example concerning the evaluation of the quality of movies is given to illustrate the applicability and advantage of the proposed generalized Dice similarity measure
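The relationship between the Dice, projection and generalized Dice measures mentioned in this abstract can be illustrated on plain real-valued vectors. This is a sketch under the assumption of the usual parameterised form, x·y / (λ‖x‖² + (1−λ)‖y‖²); the paper's own definitions act on hesitant fuzzy linguistic term sets and are not reproduced in the abstract, so the function name and signature below are hypothetical.

```python
def generalized_dice(x, y, lam=0.5):
    """Generalized Dice similarity on real vectors (illustrative form).

    lam = 0.5 gives the ordinary Dice measure 2*x.y / (|x|^2 + |y|^2);
    lam = 0 or lam = 1 gives the asymmetric projection-style measures
    x.y / |y|^2 and x.y / |x|^2 respectively.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    dot = sum(a * b for a, b in zip(x, y))
    denom = lam * sum(a * a for a in x) + (1.0 - lam) * sum(b * b for b in y)
    return dot / denom if denom else 0.0

# Illustrative (made-up) attribute vectors.
x, y = [1.0, 2.0], [2.0, 2.0]
dice = generalized_dice(x, y)           # symmetric Dice measure
proj_on_y = generalized_dice(x, y, 0.0)  # normalised by |y|^2 only
proj_on_x = generalized_dice(x, y, 1.0)  # normalised by |x|^2 only
```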

    Data-informed fuzzy measures for fuzzy integration of intervals and fuzzy numbers

    The fuzzy integral (FI) with respect to a fuzzy measure (FM) is a powerful means of aggregating information. The most popular FIs are the Choquet and Sugeno, and most research focuses on these two variants. The arena of the FM is much more populated, including numerically derived FMs such as the Sugeno λ-measure and decomposable measure, expert-defined FMs, and data-informed FMs. The drawback of numerically derived and expert-defined FMs is that one must know something about the relative values of the input sources. However, there are many problems where this information is unavailable, such as crowdsourcing. This paper focuses on data-informed FMs, that is, FMs computed by an algorithm that analyzes some property of the input data itself, gleaning the importance of each input source from the data it provides. The original instantiation of a data-informed FM is the agreement FM, which assigns high confidence to combinations of sources that numerically agree with one another. This paper extends our previous work in data-informed FMs by proposing the uniqueness measure and additive measure of agreement for interval-valued evidence. We then extend data-informed FMs to fuzzy number (FN)-valued inputs. We demonstrate the proposed FMs by aggregating interval and FN evidence with the Choquet and Sugeno FIs for both synthetic and real-world data.
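As background for this abstract, the discrete Choquet integral of scalar inputs with respect to a fuzzy measure can be sketched as follows. This shows only the standard real-valued case; the interval-valued and fuzzy-number extensions and the data-informed measures proposed in the paper are not covered, and the function and argument names are assumptions made for illustration.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of `values` w.r.t. a fuzzy measure.

    values  : dict mapping source name -> real-valued evidence
    measure : callable taking a frozenset of source names and returning
              its measure; assumed monotone with measure(all sources) = 1
    """
    # Visit sources from the largest value to the smallest.
    order = sorted(values, key=values.get, reverse=True)
    total, previous, subset = 0.0, 0.0, set()
    for source in order:
        subset.add(source)
        g = measure(frozenset(subset))
        # Each value is weighted by the gain in measure from adding its source.
        total += values[source] * (g - previous)
        previous = g
    return total

# Illustrative (made-up) evidence from three sources.
evidence = {"a": 0.2, "b": 0.8, "c": 0.5}
n = len(evidence)

as_mean = choquet_integral(evidence, lambda s: len(s) / n)
as_max = choquet_integral(evidence, lambda s: 1.0 if s else 0.0)
as_min = choquet_integral(evidence, lambda s: 1.0 if len(s) == n else 0.0)
```

The three measures in the usage lines recover the arithmetic mean, maximum and minimum, which is the usual sanity check that the choice of measure controls the aggregation behaviour.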

    Data Fingerprinting -- Identifying Files and Tables with Hashing Schemes

    Master's thesis in Computer science. INTRODUCTION: Although hash functions are nothing new, they are not limited to cryptographic purposes. One important field is data fingerprinting. Here, the purpose is to generate a digest which serves as a fingerprint (or a license plate) that uniquely identifies a file. More recently, fuzzy fingerprinting schemes, which scrap the avalanche effect in favour of detecting local changes, have come into the spotlight. The main purpose of this project is to find ways to classify text tables, and discover where potential changes or inconsistencies have happened. METHODS: Large parts of this report can be considered applied discrete mathematics, and finite fields and combinatorics have played an important part. Rabin’s fingerprinting scheme was tested extensively and compared against existing cryptographic algorithms, CRC and FNV. Moreover, a self-designed fuzzy hashing algorithm with the preliminary name No-Frills Hash has been created and tested against Nilsimsa and Spamsum. NFHash is based on Mersenne primes, and uses a sliding window to create a fuzzy hash. Furthermore, the usefulness of lookup tables (with partial seeds) was also explored. The fuzzy hashing algorithm has also been combined with a k-NN classifier to assess its ability to classify files. In addition to NFHash, Bloom filters combined with Merkle Trees have been the most important part of this report. This combination allows a user to see where a change was made, despite the fact that hash functions are one-way. Large parts of this project have dealt with the study of other open-source libraries and applications, such as Cassandra and SSDeep, as well as how bitcoins work. Optimizations have played a crucial role as well; different approaches to a problem might lead to the same solution, but resource consumption can be very different.
RESULTS: The results have shown that the Merkle Tree-based approach can track changes to a table very quickly and efficiently, because it is conservative with CPU resources. Moreover, the self-designed algorithm NFHash also does well in terms of file classification when it is coupled with a k-NN classifier. CONCLUSION: Hash functions refer to a very diverse set of algorithms, and not just algorithms that serve a limited purpose. Fuzzy fingerprinting schemes can still be considered to be in their infancy, but a lot has happened in the last ten years. This project has introduced two new ways to create and compare hashes, either to match similar yet not necessarily identical files or to detect whether (and to what extent) a file was changed. Note that the algorithms presented here should be considered prototypes, and might still need some large-scale testing to sort out potential flaws.
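The idea of locating a change with a Merkle tree, even though the underlying hash is one-way, can be sketched as below. This is a minimal illustration rather than the thesis's implementation: SHA-256 from the Python standard library stands in for the hashing schemes studied there, the Bloom filter component is omitted, and both tables are assumed to have the same number of rows.

```python
import hashlib

def _digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(rows):
    """Return the Merkle tree as a list of levels, leaves first, root last."""
    level = [_digest(row.encode("utf-8")) for row in rows]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                  # odd count: pair the last node with itself
            level = level + [level[-1]]
        level = [_digest(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def changed_rows(old_rows, new_rows):
    """Indices of rows that differ, found by descending only into
    subtrees whose hashes disagree."""
    if len(old_rows) != len(new_rows):
        raise ValueError("this sketch assumes tables of equal length")
    if not old_rows:
        return []
    old, new = build_levels(old_rows), build_levels(new_rows)
    suspects = [0]                          # start at the root
    for depth in range(len(old) - 1, 0, -1):
        children = []
        for i in suspects:
            if old[depth][i] != new[depth][i]:
                children += [c for c in (2 * i, 2 * i + 1)
                             if c < len(old[depth - 1])]
        suspects = children
    return [i for i in suspects if old[0][i] != new[0][i]]

# Illustrative (made-up) table rows.
before = ["a,1", "b,2", "c,3", "d,4", "e,5"]
after = ["a,1", "b,2", "c,3", "d,9", "e,5"]
diff = changed_rows(before, after)
```

When the two roots match, no further hashes are compared, which is why an unchanged table can be confirmed with a single comparison.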

    Inventor mobility index : a method to disambiguate inventor careers

    Usually patent data does not contain any unique identifiers for the patenting assignees or the inventors, as the main task of patent authorities is the examination of applications and the administration of the patent documents as public contracts, not the support of empirical analysis of their data. An inventor in a patent document is identified by his or her name. Depending on the patent authority, the full address or parts of it may be included to further identify this inventor. The goal is to define an inventor mobility index that traces the career of an inventor as an individual, with all the job switches and relocations approximated by the patents as potential milestones. The inventor name is the main criterion for this identifier. The inventor address information, on the other hand, is only of limited use for the definition of a mobility index. The name alone can work for exotic name variants, but for more common names the problem of namesakes gets in the way of identifying individuals. The solution discussed here consists of constructing a relationship network between inventors with the same name. This network is created using all the other information available in the patent data. The connections range from simple ones, like the same applicant or the same home address, to more complex ones created by overlapping colleagues and co-inventors, similar technology fields or shared citations. Traversing these heuristically weighted networks with graph-theoretic methods leads to clusters, each representing a person. The applied methodology gives uncommon names a higher degree of freedom regarding the heuristic limitations than more common names.
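One simple way to turn such a weighted same-name network into person clusters is to keep only links above a threshold and take connected components. The sketch below assumes this approach for illustration; the abstract does not specify the actual graph method, weights or thresholds, so the function name, the link weights and the threshold value are all hypothetical.

```python
def cluster_records(n_records, weighted_links, threshold):
    """Group patent records that share an inventor name into clusters.

    n_records      : number of records, identified by 0..n_records-1
    weighted_links : iterable of (record_a, record_b, weight) tuples
    threshold      : minimum weight for a link to join two records
    """
    parent = list(range(n_records))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, weight in weighted_links:
        if weight >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Made-up example: five records under one name, with heuristic link weights
# (e.g. same applicant = strong, similar technology field = weak).
links = [(0, 1, 0.9), (1, 2, 0.7), (3, 4, 0.2), (2, 3, 0.1)]
people = cluster_records(5, links, threshold=0.5)
```

A lower threshold for uncommon names and a higher one for common names would be one way to reflect the different degrees of freedom the abstract describes.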