    Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents

    Anyone maintaining a set of text documents in an information retrieval system will struggle to keep it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to integrate such clustering into an existing information retrieval system using current programming libraries, and in what practical ways this can be useful. The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine (SVM) trained with minimal user data provides another. The two are then evaluated and compared using a set of metrics. The project takes a practical approach, focusing on what can be implemented using existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and finding similar documents is implemented as well. The supervised SVM greatly surpasses the unsupervised K-Means in replicating the given ground truth, but both models are useful in themselves. With a relatively simple understanding of machine learning, any company could set up a similar system. It does, however, take deeper mathematical knowledge and fine-tuning to get the most out of it and tailor it to the dataset.
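
The TF-IDF representation and the similar-document lookup mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `tfidf_vectors`, `cosine`, and `most_similar` are hypothetical, tokenization is a plain whitespace split, and cosine similarity is an assumed choice of similarity measure.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency (unsmoothed).
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, vectors):
    """Index of the document most similar to vectors[index], excluding itself."""
    scores = [
        (cosine(vectors[index], vec), j)
        for j, vec in enumerate(vectors)
        if j != index
    ]
    return max(scores)[1]


# Hypothetical toy corpus, for illustration only.
docs = [
    "ship port traffic",
    "port traffic anchorage",
    "document clustering text",
    "text document retrieval",
]
vectors = tfidf_vectors(docs)
print(most_similar(0, vectors))
```

In a production setting, a library vectorizer with smoothing and proper tokenization would likely replace the hand-rolled weighting shown here.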

    Hyperparameter optimization of the DBSCAN algorithm using a novel genetic-algorithm-based method

    Ship traffic is a major source of global greenhouse gas emissions, and the pressure on the maritime industry to lower its carbon footprint is constantly growing. One easy way for ships to lower their emissions would be to reduce their sailing speed. Global ship traffic has long followed a practice called "sail fast, then wait": ships try to reach their destination in the shortest possible time and then wait at an anchorage near the harbor for a mooring place to become available. This practice is easy to execute logistically, but it does not optimize sailing speeds with emissions in mind. An alternative would be to calculate traffic patterns at the destination and use this information to plan the voyage so that the time at anchorage is minimized. This would allow ships to sail at lower speeds without increasing the total length of the journey. To create a model that schedules arrivals at ports, traffic patterns describing how ships interact with port infrastructure need to be formed. However, data on port infrastructure is not widely available in an easy-to-use form, which makes it difficult to develop models capable of predicting traffic patterns. Ship voyage information, by contrast, is readily available from commercial Automatic Identification System (AIS) data. In this thesis, I present a novel implementation that extracts information on port infrastructure from AIS data using the DBSCAN clustering algorithm. In addition to clustering the AIS data, the implementation uses a novel optimization method to search for optimal hyperparameters for the DBSCAN algorithm. The optimization process evaluates candidate solutions using cluster validity indices (CVIs), which are metrics that represent the goodness of a clustering. Different CVIs are compared to identify the most effective way to cluster AIS data for finding information on port infrastructure.
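
The idea of scoring DBSCAN hyperparameters with a cluster validity index can be sketched as follows. This is an illustration under stated assumptions, not the thesis's method: a plain grid search over `eps` stands in for the genetic-algorithm-based optimizer, the silhouette coefficient is assumed as one example CVI, and the names `dbscan`, `silhouette`, and `best_eps` as well as the toy coordinates are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def silhouette(points, labels):
    """Mean silhouette over non-noise points; -1.0 if fewer than two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != NOISE:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total, count = 0.0, 0
    for lab, members in clusters.items():
        for p in members:
            count += 1
            if len(members) == 1:
                continue  # singleton clusters contribute a silhouette of 0
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items()
                if other_lab != lab
            )
            total += (b - a) / max(a, b)
    return total / count


def best_eps(points, candidates, min_pts):
    """Return (score, eps) for the candidate eps with the highest silhouette."""
    return max(
        (silhouette(points, dbscan(points, eps, min_pts)), eps)
        for eps in candidates
    )


# Hypothetical 2-D positions: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
score, eps = best_eps(points, [0.5, 1.5, 100.0], min_pts=3)
print(eps, round(score, 3))
```

Note that this silhouette variant ignores noise points entirely, so a parameter setting that discards many points as noise is not penalized; a real comparison of CVIs would need to account for that.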