7 research outputs found

    Klasterisasi, Klasifikasi dan Peringkasan Teks Berbahasa Indonesia

    A literature survey of research on clustering and classification of Indonesian-language text documents shows that research on document processing began in 2000. Various data mining methods have been used to group documents, such as single-pass filtering, Naive Bayes, hierarchical clustering, and others. This study surveys research papers on Indonesian-language text data mining. The papers collected show that most data mining research aims to group online or printed news according to particular criteria, while other studies process text from social media such as Twitter. This article presents the methods used and the goals of the papers in the fields of clustering, classification, and summarization of Indonesian-language documents

    A new weighting scheme and discriminative approach for information retrieval in static and dynamic document collections

    This paper introduces a new weighting scheme in information retrieval. It also proposes using the document centroid as a threshold for normalizing documents in a document collection. Document centroid normalization helps to achieve more effective information retrieval as it enables good discrimination between documents. In the context of a machine learning application, namely unsupervised document indexing and retrieval, we compared the effectiveness of the proposed weighting scheme to 'Term Frequency - Inverse Document Frequency' or TF-IDF, which is commonly used and considered one of the best existing weighting schemes. The paper shows how the document centroid is used to remove less significant weights from documents and how this helps to achieve better retrieval effectiveness. Most of the existing weighting schemes in information retrieval research assume that the whole document collection is static. The results presented in this paper show that the proposed weighting scheme can produce higher retrieval effectiveness than the TF-IDF weighting scheme, in both static and dynamic document collections. The results also show the variation in information retrieval effectiveness that is achieved for static and dynamic document collections by using a specific weighting scheme. This type of comparison has not been presented in the literature before.
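The core idea, pruning a document's less significant TF-IDF weights against a collection centroid, can be sketched as follows (a minimal illustration with invented helper names and toy documents; the paper's actual weighting scheme and normalization details differ):

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

def centroid_prune(weights):
    """Zero out weights below the collection centroid (the mean weight of
    each term across all documents), mimicking the paper's idea of
    removing less significant weights."""
    terms = {t for w in weights for t in w}
    centroid = {t: sum(w.get(t, 0.0) for w in weights) / len(weights)
                for t in terms}
    return [{t: v for t, v in w.items() if v >= centroid[t]}
            for w in weights]

# Toy collection: "ir" is weak in the first document and strong in the
# second, so the first document's "ir" weight falls below the centroid.
docs = [["ir", "ann", "ann", "ann"], ["ir", "ir"], ["ml"]]
pruned = centroid_prune(tf_idf(docs))
```

Here the first document loses its low "ir" weight while keeping the dominant "ann" weight, which is the discrimination effect the abstract describes.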

    A framework for the Comparative analysis of text summarization techniques

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. With the boom of information technology and the IoT (Internet of Things), the volume of information, which is fundamentally data, is increasing at an alarming rate. This information can be harnessed and, if channelled in the right direction, can yield meaningful insight. The problem is that this data is not always numerical; in many cases the data is entirely textual, and some meaning has to be derived from it. Going through these texts manually would take hours or even days to extract concise and meaningful information. This is where the need for an automatic summarizer arises: it eases manual intervention and reduces time and cost while retaining the key information held by these texts. In recent years, new methods and approaches have been developed to do so. These approaches are applied in many domains; for example, search engines provide snippets as document previews, while news websites produce shortened descriptions of news subjects, usually as headlines, to make browsing easier. Broadly speaking, there are two main approaches to text summarization: extractive and abstractive. Extractive summarization filters out the important sections of the whole text to form a condensed version of the text. Abstractive summarization interprets and examines the text as a whole and, after discerning its meaning, the model itself generates sentences describing the important points in a concise way
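The extractive approach described above can be illustrated with a simple frequency-based sentence scorer (a toy sketch with hypothetical helper names, not the dissertation's actual framework): each sentence is scored by how frequent its words are in the whole text, and the top-scoring sentences are kept in their original order.

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    """Frequency-based extractive summarizer: score each sentence by the
    corpus-wide frequency of its words, keep the top-k sentences in
    their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:k]
    return " ".join(s for _, i, s in sorted(top, key=lambda t: t[1]))

text = ("Summarization condenses text. "
        "Extractive summarization selects important sentences from the text. "
        "Abstractive methods generate new sentences.")
summary = extractive_summary(text, k=1)
```

The middle sentence wins because it reuses the most frequent words; an abstractive system would instead generate a new sentence of its own.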

    Feature engineered embeddings for machine learning on molecular data

    Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022. The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural and sequence/text data. Several techniques such as deep learning are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, the use of a molecule's structural information typically requires large amounts of computational power, with deep learning models that take a long time to train. In this study, we present a different approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques in the form of count vectorisation, term frequency-inverse document frequency, word2vec and latent Dirichlet allocation to feature-engineer molecular text data. Through this approach we aim to make a robust and explainable embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these explainable embeddings for machine learning models, for representing a corpus of data in vector space, and for protein-protein interaction prediction using embedding similarity. We apply the techniques to three different types of molecular text data: FASTA sequence data, Simplified Molecular Input Line Entry Specification data and Protein Data Bank data. We show that these embeddings provide excellent performance for classification and protein-protein binding prediction
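Applying count vectorisation to sequence data usually means treating overlapping k-mers as the "words" of a molecular document. A minimal sketch of that idea (toy sequence fragments and invented helper names, assuming 3-mer tokens; the thesis's actual pipeline and parameters may differ):

```python
from collections import Counter

def kmers(seq, k=3):
    """Split a sequence string into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_vectorise(seqs, k=3):
    """Count-vectorise sequences over the vocabulary of all observed
    k-mers, returning (sorted vocabulary, one count vector per sequence)."""
    counts = [Counter(kmers(s, k)) for s in seqs]
    vocab = sorted({km for c in counts for km in c})
    vectors = [[c.get(km, 0) for km in vocab] for c in counts]
    return vocab, vectors

# Toy FASTA-style protein fragments (illustrative, not real sequences).
vocab, vecs = count_vectorise(["MKTAYIA", "MKTAAYI"], k=3)
```

The resulting fixed-length vectors can feed any standard classifier, which is what makes this representation cheap compared with structure-based deep models.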

    Vers une méthode de classification de fichiers sonores


    Semantically enhanced document clustering

    This thesis advocates the view that traditional document clustering could be significantly improved by representing documents at different levels of abstraction at which the similarity between documents is considered. The improvement is with regard to the alignment of the clustering solutions to human judgement. The proposed methodology employs semantics with which the conceptual similarity between documents is measured. The goal is to design algorithms which implement the methodology, in order to solve the following research problems: (i) how to obtain multiple deterministic clustering solutions; (ii) how to produce coherent large-scale clustering solutions across domains, regardless of the number of clusters; (iii) how to obtain clustering solutions which align well with human judgement; and (iv) how to produce specific clustering solutions from the perspective of the user's understanding for the domain of interest. The developed clustering methodology enhances separation between and improves coherence within clusters generated across several domains by using levels of abstraction. The methodology employs a semantically enhanced text stemmer, which is developed for the purpose of producing coherent clustering, and a concept index that provides generic document representation and reduced dimensionality of document representation. These characteristics of the methodology enable addressing the limitations of traditional text document clustering by employing computationally expensive similarity measures such as Earth Mover's Distance (EMD), which theoretically aligns the clustering solutions closer to human judgement. A threshold for similarity between documents that employs many-to-many similarity matching is proposed and experimentally proven to benefit the traditional clustering algorithms in producing clustering solutions aligned closer to human judgement.
    The experimental validation demonstrates the scalability of the semantically enhanced document clustering methodology and supports the contributions: (i) multiple deterministic clustering solutions and different viewpoints to a document collection are obtained; (ii) the use of concept indexing as a document representation technique in the domain of document clustering is beneficial for producing coherent clusters across domains; (iii) the SETS algorithm provides improved text normalisation by using external knowledge; (iv) a method for measuring similarity between documents on a large scale by using many-to-many matching; (v) a semantically enhanced methodology that employs levels of abstraction that correspond to a user's background, understanding and motivation. The achieved results will benefit the research community working in the areas of document management, information retrieval, data mining and knowledge management
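EMD has a simple closed form in one dimension, which gives a feel for why it captures "how far weight must move" to turn one document representation into another (a toy sketch over hypothetical concept histograms; the thesis applies EMD with many-to-many concept matching, not this 1-D simplification):

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms on the same
    bins with unit spacing: the cumulative absolute difference of
    their running sums (CDFs)."""
    assert len(p) == len(q)
    total, carry = 0.0, 0.0
    for a, b in zip(p, q):
        carry += a - b          # mass that must be carried to later bins
        total += abs(carry)     # cost of moving it one bin along
    return total

# Two toy concept-weight histograms (hypothetical, for illustration only):
# all the difference is half a unit of weight shifted two bins over.
d1 = [0.5, 0.5, 0.0]
d2 = [0.0, 0.5, 0.5]
dist = emd_1d(d1, d2)
```

Identical histograms give a distance of zero, and the distance grows with how far, not just how much, weight differs, which is what makes EMD align better with human judgements of topical similarity than a bin-by-bin comparison.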