7 research outputs found

    B-splines in EMD and Graph Theory in Pattern Recognition

    With the development of science and technology, a large amount of data is waiting for further scientific exploration. Good mathematical models built on the given data can be used to analyze and solve real-life problems. In this work, we propose three types of mathematical models for different applications. In chapter 1, we use B-spline-based EMD to analyze nonlinear and non-stationary signal data. A new idea for boundary extension is introduced and applied to the Empirical Mode Decomposition (EMD) algorithm: instead of the traditional mirror extension on the boundary, we propose a ratio extension. In chapter 2, we propose a weighted directed multigraph for text pattern recognition. We set up a weighted directed multigraph model using the distances between keywords as the weights of arcs. We then develop a keyword-frequency-distance-based algorithm that uses not only the frequency information of keywords but also their ordering information. In chapter 3, we propose a centrality-guided clustering method. Unlike traditional methods, which choose the center of a cluster randomly, we start clustering from a LEADER, a vertex with the highest centrality score, and a new member is added to an existing community if the new vertex meets certain criteria and the community including the new vertex maintains a certain density. In chapter 4, we define a new graph optimization problem, called postman tour with minimum route-pair cost, and we model the DNA sequence assembly problem as an instance of it.
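The weighted directed multigraph of chapter 2 can be pictured with a small sketch. This is only an illustration under stated assumptions, not the thesis's algorithm: the function name `build_keyword_multigraph`, the `max_distance` window, and the choice to measure distance in token positions are all hypothetical. Each ordered pair of nearby keyword occurrences contributes one arc, so repeated pairs yield parallel arcs, and arc direction preserves word order.

```python
from collections import defaultdict


def build_keyword_multigraph(tokens, keywords, max_distance=5):
    """Return {(source, target): [distance, ...]} for keyword pairs in text order.

    Hypothetical sketch: one arc per ordered pair of keyword occurrences
    that lie at most `max_distance` tokens apart; the arc weight is the
    token distance between the two occurrences.
    """
    # Positions of keyword occurrences, in reading order.
    hits = [(i, t) for i, t in enumerate(tokens) if t in keywords]
    arcs = defaultdict(list)
    for a, (i, src) in enumerate(hits):
        for j, dst in hits[a + 1:]:
            if j - i > max_distance:
                break  # hits are sorted by position, so later ones are farther
            if src != dst:
                arcs[(src, dst)].append(j - i)  # parallel arcs accumulate
    return dict(arcs)


tokens = "the bank approved the loan and the bank raised the loan rate".split()
graph = build_keyword_multigraph(tokens, {"bank", "loan", "rate"})
for (src, dst), weights in sorted(graph.items()):
    print(src, "->", dst, weights)
```

The number of parallel arcs between two keywords carries the frequency information, while the arc direction and weights carry the ordering and distance information mentioned in the abstract.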

    Reimagining the SSMinT Software Package

    We examine two proposed indexing algorithms taking advantage of the new SSMinT libraries. The two algorithms primarily differ in their selection of documents for learning. The batch indexing method selects some random number of documents for learning. The iterative indexing method uses a single randomly selected document to discover semantic signatures, which are then used to find additional related documents. The batch indexing method discovers one to three semantic signatures per document, resulting in poor clustering performance as evaluated by human cross-validation of clusters using the Adjusted Rand Index. The iterative indexing method discovers more semantic signatures per document, resulting in far better clustering performance using the same cross-validation method. Our new tools enable faster development of new experiments, forensic applications, and more. The experiments show that SSMinT can provide effective indexing for text data such as e-mail or web pages. We conclude with areas of future research which may benefit from utilizing SSMinT. (Abstract shortened by ProQuest.)
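For readers unfamiliar with the Adjusted Rand Index used in the evaluation above, here is a minimal sketch of the standard pair-counting formula. The function name and the toy labels are assumptions for illustration only; they are not taken from the SSMinT experiments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree
    # Pairs of items placed together in both labelings (contingency cells).
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both labelings are trivial (one cluster or all singletons)
    return (together_in_both - expected) / (maximum - expected)


human = [0, 0, 1, 1]          # hypothetical human grouping
same_up_to_names = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]
print(adjusted_rand_index(human, same_up_to_names))
print(adjusted_rand_index(human, crossed))
```

The index is 1 when two groupings agree up to a renaming of the clusters, is close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.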

    Automated Discovery of Relevant Features for Text Mining

    Text mining refers to the process of extracting information from text. There are massive amounts of data available today due to enhanced data collection capabilities, inexpensive high-capacity storage, and the proliferation of World Wide Web pages. A substantial portion of this data is in text format. The main goal of text data mining software tools is to help us learn and benefit from this wealth of text data. Humans cannot cope with the overwhelming text data resources. The information in text data needs to be filtered, summarized, analyzed, and refined for human analysts. A semantic signature is the concept that semantic content in text has characteristic word patterns, such as frequency of words and proximity between words, which can be identified and quantified. A type of quantitative semantic signature was developed by Barnes, Eschen, Para, and Peddada in 2010. The utility and sensitivity of semantic signatures of this type in capturing semantic content in text data was demonstrated by this group via the development of a software package named Semantic Signature Mining Tool (SSMinT). SSMinT is a suite of software tools that assist a data analyst in developing semantic signatures that capture targeted content and then using these semantic signatures to categorize a corpus of text documents with unknown content or to retrieve text documents with the targeted content from a corpus of documents with arbitrary content. Key features of SSMinT are the expert input from the human analyst and the interaction between the analyst and the software; the tool is designed to assist the analyst and does not work independently. This is a strong feature in the sense that the resulting semantic signatures are tailored by the analyst's expert knowledge of the domain. This was demonstrated by Barnes, Eschen, Para, and Peddada to be a powerful approach to text data mining. This thesis develops an automated version of the SSMinT software package that requires minimal input from an analyst. This work includes an automated keyword group generation and refinement algorithm, automated generation of candidate semantic signatures, and methods to prune irrelevant and redundant relevant semantic signatures from the semantic signature set. Relieving the analyst of the tedious and time-consuming task of developing semantic signatures is not the only motivation for an automated tool. The automation is designed to discover semantic signatures in text data without human input, except for the choice of training documents. The advantage of automated semantic signature discovery is the ability to identify patterns an analyst may not recognize due to the large volume of data or the analyst's point-of-view bias. The effectiveness of Automated SSMinT in categorizing text documents into groups with closely related content and retrieving documents with content similar to those in its training set is demonstrated in experiments on various corpora. These experiments prove Automated SSMinT to be an efficient, convenient, and powerful text mining tool.

    Enhanced Automated Discovery of Relevant Features in Text Mining

    Semantic Signature Mining Tool (SSMinT) is a suite of software tools that aid a data analyst in developing semantic signatures that capture targeted content, and in using these semantic signatures to categorize text documents with unknown content or retrieve documents of a specific type or interest. It was developed by Barnes, Eschen, Para, and Peddada in 2010. These tools require expert input. An automated version of the SSMinT software package was developed by Barnes, Eschen, and Kota in 2011 with the aim of reducing manual input and using machine learning techniques to discover semantic signatures. Its key features include automated keyword group generation, automated generation of candidate semantic signatures, and methods to prune redundant relevant semantic signatures from the semantic signature set. Human input is required only when choosing the training documents. This thesis develops an enhanced version of the Automated Semantic Signature Mining Tool which increases the scope for capturing semantic content from the training documents. In particular, problems with analyzing very short documents are addressed. Improvements made in the tools minimize the unnecessary keyword groups in the early stages of the learning phase, and thereby maximize the number of significant semantic signatures generated in the later stages of the learning phase. As a result, a larger number of documents that are similar to the training documents are retrieved. The resulting fine-tuned semantic signatures also yield effective categorization of text documents into groups with closely related content. Tools are developed to automate the tedious process of measuring the document retrieval rates. A statistical method is also employed to estimate the precision of document retrieval.

    Identificación de relaciones entre los nodos de una red social

    This paper reviews the representation and classification of membership relations among the nodes of a social network. To that end, it considers Natural Language Processing, Text Mining, Information Retrieval, and Named Entities, describing each topic and surveying and discussing outstanding academic work in it.

    Computer-aided Semantic Signature Identification and Document Classification via Semantic Signatures

    In this era of textual data explosion on the World Wide Web, it may be very hard to find documents that are similar to the documents that are of interest to us. To overcome this problem we have developed a type of semantic signature that captures the semantics of target content (text). Semantic signatures from a text/document of interest are derived using the software package semantic signature mining tool (SSMinT). This software package has been developed as a part of this thesis work in collaboration with Sri Ramya Peddada. These semantic signatures are used to search and retrieve documents with similar semantic patterns. Effects of different representations of semantic signatures on the document classification outcomes are illustrated. Retrieved document classification accuracies of Euclidean and Spherical K-means clustering algorithms are compared. A Chi-square test is presented to prove that the observed and expected numbers of documents retrieved (from a corpus) are not significantly different. From this Chi-square test it is proved that the semantic signature concept is capable of retrieving documents of interest with high probability. Our findings indicate that this concept has potential for use in commercial text/document searching applications
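The Chi-square comparison of observed and expected retrieval counts mentioned above can be illustrated as follows. This is a sketch with made-up numbers: the counts, the three-category setup, and the function name are assumptions and do not come from the thesis. The critical values are the standard 5% points of the chi-square distribution.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Standard chi-square critical values at the 0.05 significance level.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}

# Hypothetical counts of documents retrieved per category (not real data).
observed = [18, 22, 20]
expected = [20.0, 20.0, 20.0]

stat = chi_square_statistic(observed, expected)
df = len(observed) - 1
reject = stat > CRITICAL_05[df]
print(f"chi-square = {stat:.3f}, df = {df}, reject at 5%: {reject}")
```

A statistic below the critical value means the observed counts are statistically consistent with the expected ones at the chosen significance level.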