
    Semantic image retrieval using relevance feedback and transaction logs

    Recent improvements in digital photography and storage capacity have made it possible to store large numbers of images, creating a need for efficient means of retrieving images that match a user's query. Content-based Image Retrieval (CBIR) systems automatically extract image contents based on image features, i.e. color, texture, and shape. Relevance feedback methods are applied to CBIR to integrate users' perceptions and reduce the gap between high-level image semantics and low-level image features. This dissertation improves the precision of a CBIR system in retrieving semantically rich (complex) images by making advancements in three areas of a CBIR system: input, process, and output. The input of the system includes a mechanism that provides the user with the tools required to build and modify her query through feedback. User behavior in CBIR environments is studied, and a new feedback methodology is presented to efficiently capture users' image perceptions. The process element includes image learning and retrieval algorithms. A long-term learner (LTL), which learns image semantics from prior search results available in the system's transaction history, is developed using Factor Analysis. Another algorithm, a short-term learner (STL) that captures the user's image perceptions based on image features and the user's feedback in the ongoing transaction, is developed based on Linear Discriminant Analysis. A mechanism is then introduced to integrate these two algorithms into one retrieval procedure. Finally, a retrieval strategy that includes learning and searching phases is defined for arranging images in the output of the system. The developed relevance feedback methodology proved to reduce the effect of human subjectivity in providing feedback for complex images. The retrieval algorithms were applied to images with different degrees of complexity. LTL is efficient in extracting the semantics of complex images that have a history in the system; STL is suitable for queries and images that can be effectively represented by their image features. The performance of the system in retrieving images with visual and conceptual complexities was therefore improved when both algorithms were applied simultaneously. Finally, the strategy of retrieval phases demonstrated promising results as query complexity increases.
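    A minimal sketch of the STL/LTL integration idea described above, not the dissertation's exact formulation: Linear Discriminant Analysis scores images from the current session's feedback, Factor Analysis extracts long-term semantics from a transaction-history matrix, and the two scores are blended. All shapes, the synthetic data, and the weight `alpha` are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 32))      # low-level features for 500 images
history = rng.integers(0, 2, (200, 500))   # past sessions x images relevance matrix

# Short-term learner (STL): discriminate relevant vs. non-relevant feedback images.
fb_idx = rng.choice(500, size=20, replace=False)
fb_labels = np.arange(20) % 2              # the user's feedback in this session
stl = LinearDiscriminantAnalysis().fit(features[fb_idx], fb_labels)
stl_score = stl.decision_function(features)  # higher = more likely relevant

# Long-term learner (LTL): latent semantic factors from the transaction history.
fa = FactorAnalysis(n_components=8).fit(history)
img_semantics = fa.components_.T             # one semantic vector per image
query_sem = img_semantics[fb_idx[fb_labels == 1]].mean(axis=0)
ltl_score = img_semantics @ query_sem

# Integrate the two scores into a single ranking for the output phase.
alpha = 0.5
ranking = np.argsort(-(alpha * stl_score + (1 - alpha) * ltl_score))
print(ranking[:10])  # top-10 images to display
```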

    Use of data mining for investigation of crime patterns

    A great deal of research is being done to improve the utilization of crime data. This thesis deals with the design and implementation of a crime database and associated search methods to identify crime patterns in the database. The database was created in Microsoft SQL Server (back end); the user interface (front end) and the crime pattern identification software (middle tier) were implemented in ASP.NET. This web-based approach enables the user to work with the database from anywhere and at any time. A general ARFF file can also be generated for the user, so that other data mining software such as WEKA can be used for detailed analysis. Further, effective navigation was provided to make the software user friendly.
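    A minimal sketch of the ARFF export step the thesis describes, written in Python for illustration (the thesis' middle tier is ASP.NET). The column names and nominal value sets are hypothetical; the ARFF structure itself follows WEKA's documented format.

```python
# Write crime records to WEKA's ARFF format: a @relation header, one
# @attribute line per column, then comma-separated @data rows.
def write_arff(path, relation, attributes, rows):
    """attributes: list of (name, type) where type is 'numeric' or a list of nominal values."""
    with open(path, "w") as f:
        f.write(f"@relation {relation}\n\n")
        for name, typ in attributes:
            if typ == "numeric":
                f.write(f"@attribute {name} numeric\n")
            else:
                f.write("@attribute %s {%s}\n" % (name, ",".join(typ)))
        f.write("\n@data\n")
        for row in rows:
            f.write(",".join(str(v) for v in row) + "\n")

# Hypothetical schema and records, for illustration only.
write_arff(
    "crimes.arff",
    "crime_records",
    [("hour", "numeric"),
     ("district", ["north", "south", "east", "west"]),
     ("offense", ["burglary", "assault", "theft"])],
    [(23, "north", "burglary"), (14, "east", "theft")],
)
```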

    An analytical study on image databases

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (leaves 87-88). By Francine Ming Fang.

    Consortia optimization for European Space Agency proposals based on cognitive computing

    Master's project report, Mathematics Applied to Economics and Management, Universidade de Lisboa, Faculdade de Ciências, 2019. This master's thesis studies the relations between the words written in European Space Agency (ESA) Invitation to Tender (ITT) abstracts and, in particular, whether there is any correlation between the words and the chance of a given country winning a bid. An intermediate task was to compile and organize a proper dataset, created from the ESA tender-status dashboards and the ESA Emits site covering 2013 to 2016. The code needed to analyze this dataset was then developed in R. We constructed matrices and graphical representations of the relations between winning countries, the ESA offices, and the different ESA programs; based on these, our first observations were raised and analyzed.
    Five countries were selected based on the number of awarded ITTs and their representation across the ESA offices: Germany, France, Great Britain (United Kingdom), Italy, and Belgium. These countries were scrutinized using text mining techniques and statistical models. Using R text mining packages such as TM, the original abstracts were cleaned: numbers, white space, and the most frequent words were removed, and all text was lower-cased. After these steps, a document-term matrix (DTM) was constructed, in which each row is a document (an ITT abstract) and each column a variable (one of the most frequent words in the corpus). The DTM was the basis for the entire textual analysis. Logistic regression models were created for each of the five countries, with stepwise methods used for variable selection. The resulting models relate words to the chance of a given country winning an ITT. The validity of the models was assessed using statistical measures such as the sensitivity-specificity curve (cut-off point), the area under the ROC curve, odds ratios, and fitted values. Afterwards, we investigated whether the ITTs cluster in the space defined by the DTM. Several clustering methods were tried, both in the word-frequency space and in a space transformed by principal component analysis (PCA); however, cluster validation with the silhouette measure gave unsatisfactory results, and the PCA projections showed no clear agglomeration, suggesting that more advanced techniques are needed to settle the question. Overall, we can conclude that there appear to be relations between abstract words and winning countries, the reasons for which remain to be studied in future work.
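    A minimal sketch, under assumed data, of the pipeline described above: build a document-term matrix from tender abstracts, then fit a per-country logistic regression relating word occurrences to the chance of winning. The thesis uses R's TM package; this illustration uses Python's scikit-learn, and the toy abstracts and labels are placeholders, not ESA data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

abstracts = [
    "satellite propulsion subsystem design study",
    "ground segment software maintenance services",
    "earth observation data processing prototype",
    "launcher structural testing campaign support",
]
won_by_germany = [1, 0, 1, 0]  # toy label: did Germany win this ITT?

# DTM: rows = documents (ITT abstracts), columns = word counts.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(abstracts)

# One model per country, relating words to the chance of that country winning.
model = LogisticRegression(max_iter=1000).fit(dtm, won_by_germany)
probs = model.predict_proba(dtm)[:, 1]
print("in-sample AUC:", roc_auc_score(won_by_germany, probs))
```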

    Spam Classification Using Machine Learning Techniques - Sinespam

    Most e-mail readers spend a non-trivial amount of time regularly deleting junk e-mail (spam) messages, even as an expanding volume of such e-mail occupies server storage space and consumes network bandwidth. An ongoing challenge, therefore, rests within the development and refinement of automatic classifiers that can distinguish legitimate e-mail from spam. Some published studies have examined spam detectors using Naïve Bayesian approaches and large feature sets of binary attributes that determine the existence of common keywords in spam, and many commercial applications also use Naïve Bayesian techniques. Spammers recognize these attempts to block their messages and have developed tactics to circumvent these filters, but these evasive tactics are themselves patterns that human readers can often identify quickly. This work had the objective of developing an alternative approach using a neural network (NN) classifier trained on a corpus of e-mail messages from several users. The feature selection used in this work is one of its major improvements: the feature set uses descriptive characteristics of words and messages similar to those a human reader would use to identify spam, and the best feature set was chosen using forward feature selection. Another objective was to push spam detection accuracy toward 95% using artificial neural networks; at the time, no reported ANN approach had exceeded 89% accuracy.
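    A minimal sketch of the approach described above: forward feature selection over descriptive message attributes, feeding a small neural network classifier. The synthetic features and their meanings are assumptions standing in for the work's e-mail corpus.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# Descriptive, human-style attributes per message, e.g. fraction of capital
# letters, count of '!' characters, digit ratio, HTML-tag count (assumed names).
X = rng.random((300, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(300) > 0.8).astype(int)

# Forward feature selection: greedily add the feature that most improves
# cross-validated performance of the NN, up to a fixed budget.
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
selector = SequentialFeatureSelector(nn, n_features_to_select=3,
                                     direction="forward", cv=3)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))

# Retrain the classifier on the selected subset only.
nn.fit(selector.transform(X), y)
print("training accuracy:", nn.score(selector.transform(X), y))
```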

    The Catalog Problem: Deep Learning Methods for Transforming Sets into Sequences of Clusters

    The titular Catalog Problem refers to predicting a varying number of ordered clusters from sets of any cardinality. This task arises in many diverse areas, ranging from medical triage, through multi-channel signal analysis for petroleum exploration, to product catalog structure prediction. This thesis focuses on the latter, which exemplifies a number of challenges inherent to ordered clustering, including learning variable cluster constraints, exhibiting relational reasoning, and managing combinatorial complexity. All of these present unique challenges for neural networks, combining elements of set representation, neural clustering, and permutation learning. In order to approach the Catalog Problem, a curated dataset of over ten thousand real-world product catalogs consisting of more than one million product offers is provided. Additionally, a library for generating simpler, synthetic catalog structures is presented. These and other datasets form the foundation of the included work, allowing for a quantitative comparison of the proposed methods' ability to address the underlying challenge. In particular, the synthetic datasets enable assessing the models' capacity to learn higher-order compositional and structural rules. Two novel neural methods are proposed to tackle the Catalog Problem: a set encoding module designed to enhance the network's ability to condition the prediction on the entirety of the input set, and a larger architecture for inferring an input-dependent number of diverse, ordered partitional clusters with an added cardinality prediction module. Both result in improved performance on the presented datasets, with the latter being the only neural method fulfilling all requirements inherent to addressing the Catalog Problem.
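    A minimal, generic sketch of permutation-invariant set encoding (Deep Sets-style sum pooling), illustrating the kind of set representation such architectures build on; it is not the thesis' proposed module, and the weights here are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

# phi: per-element encoder, rho: post-pooling encoder (randomly initialized).
W_phi = rng.normal(size=(4, 16))   # element dim 4 -> hidden dim 16
W_rho = rng.normal(size=(16, 8))   # pooled dim 16 -> set embedding dim 8

def encode_set(elements):
    """elements: (n, 4) array; returns an order-independent (8,) embedding."""
    h = relu(elements @ W_phi)     # encode each element independently
    pooled = h.sum(axis=0)         # sum pooling makes the result permutation-invariant
    return relu(pooled @ W_rho)

offers = rng.normal(size=(5, 4))            # a set of 5 product-offer feature vectors
shuffled = offers[rng.permutation(5)]
print(np.allclose(encode_set(offers), encode_set(shuffled)))  # True: order does not matter
```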

    Computational fluids domain reduction to a simplified fluid network

    The primary goal of this project is to demonstrate the practical use of data mining algorithms to cluster a solved steady-state computational fluid dynamics (CFD) flow domain into a simplified lumped-parameter network. A commercial-quality code, “cfdMine”, was created using volume-weighted k-means clustering that can cluster a 20-million-cell CFD domain on a single CPU in several hours or less. Additionally, agglomeration and k-means with the Mahalanobis distance were added as optional post-processing steps to further enhance the separation of the clusters. The resulting nodal network is considered a reduced-order model and can be solved transiently at very minimal computational cost. The reduced-order network is then instantiated in the commercial thermal solver MuSES to perform transient conjugate heat transfer using convection predicted by the lumped network (based on steady-state CFD). When inserting the lumped nodal network into a MuSES model, the potential for developing a “localized heat transfer coefficient” is shown to be an improvement over existing techniques. It was also found that the clustering yields a new flow visualization technique. Finally, fixing clusters near equipment demonstrates a new capability to track temperatures near specific objects (such as equipment in vehicles).
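    A minimal sketch of the volume-weighted k-means step described above: each cell's flow-state vector is weighted by its cell volume so that large cells dominate cluster placement, and each cluster becomes one node of the reduced-order network. The synthetic data and column meanings are assumptions; this is not the cfdMine implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_cells = 10_000
# Per-cell state: velocity components (3), pressure, temperature (assumed columns).
state = rng.normal(size=(n_cells, 5))
volume = rng.lognormal(mean=0.0, sigma=1.0, size=n_cells)  # cell volumes

# Volume-weighted k-means: sample_weight biases centroids toward large cells.
km = KMeans(n_clusters=50, n_init=10, random_state=0)
labels = km.fit_predict(state, sample_weight=volume)

# Each cluster becomes one node of the lumped-parameter network; a node's
# properties are the volume-weighted averages of its member cells.
node_temp = (np.bincount(labels, weights=volume * state[:, 4])
             / np.bincount(labels, weights=volume))
print(node_temp.shape)  # (50,): one lumped temperature per network node
```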

    Analytical study and computational modeling of statistical methods for data mining

    Today there is a tremendous increase in the amount of information available in electronic form, and it is growing massively day by day. This creates ample research opportunities for retrieving knowledge from such data. Data mining and app

    Semi-automated co-reference identification in digital humanities collections

    Locating specific information within museum collections represents a significant challenge for collection users. Even when the collections and catalogues exist in a searchable digital format, formatting differences and the imprecise nature of the information to be searched mean that information can be recorded in a large number of different ways. This variation exists not just between different collections, but also within individual ones. Traditional information retrieval techniques are therefore badly suited to the challenges of locating particular information in digital humanities collections, and searching takes an excessive amount of time and resources. This thesis focuses on a particular search problem, that of co-reference identification: the process of identifying when the same real-world item is recorded in multiple digital locations. A real-world example of a co-reference identification problem for digital humanities collections is identified and explored, in particular the time-consuming nature of identifying co-referent records. To address this problem, the thesis presents a novel method for co-reference identification between digitised records in humanities collections. While the specific focus of this thesis is co-reference identification, elements of the method described also have applications for general information retrieval. The new co-reference method uses elements from a broad range of areas, including query expansion, co-reference identification, short text semantic similarity, and fuzzy logic. The new method was tested against real-world collections information; the results suggest that, in terms of the quality of the co-referent matches found, the new co-reference identification method is at least as effective as a manual search, while the number of co-referent matches found is higher. The approach presented here is capable of searching collections stored using differing metadata schemas. More significantly, the approach is capable of identifying potential co-reference matches despite the highly heterogeneous and syntax-independent nature of the Gallery, Library, Archive and Museum (GLAM) search space, and the photo-history domain in particular. The most significant benefit of the new method, however, is that it requires comparatively little manual intervention; a co-reference search using it therefore has significantly lower person-hour requirements than a manually conducted search. In addition to the overall co-reference identification method, this thesis also presents:
    • A novel and computationally lightweight short text semantic similarity metric. This new metric has a significantly higher throughput than the current prominent techniques but a negligible drop in accuracy (a generic sketch follows this list).
    • A novel method for comparing photographic processes in the presence of variable terminology and inaccurate field information. This is the first computational approach to do so.
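    A minimal, generic sketch in the spirit of the lightweight short-text similarity metric described above: token overlap softened by fuzzy string matching, so that differently formatted catalogue entries for the same item still score highly. This is an illustration only, not the thesis' metric.

```python
from difflib import SequenceMatcher

def fuzzy_token_similarity(a: str, b: str) -> float:
    """Average best fuzzy match of each token in one string against the other, symmetrized."""
    def one_way(src, dst):
        scores = [max(SequenceMatcher(None, s, d).ratio() for d in dst) for s in src]
        return sum(scores) / len(scores)
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    return (one_way(ta, tb) + one_way(tb, ta)) / 2

# Two hypothetical catalogue records for the same photograph, written differently:
print(fuzzy_token_similarity("albumen print, J. Smith 1871",
                             "J Smith albumen photograph (1871)"))
```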