    Final Project Repository System for Students Using the Okapi BM25 Ranking Function

    Currently, the Department of Informatics Engineering at University 'X' requires students who have completed their final project to submit their work as a soft copy (a CD containing the application program and documentation) and as a hard copy (a bound report and a journal article). The work is stored physically in the library, and some of the data is stored in the University 'X' Digital Library. However, the limitations of the current system make final projects difficult to search, because search is implemented as simple queries with limited criteria and no ranked ordering of results. In addition, the heads of the department's laboratories have difficulty mapping the areas of expertise of the final projects carried out by students in each lab. These problems motivate this research: a system is needed that helps the department store students' final projects, makes them easier to search, and displays them. In this research, final projects are searched based on a query entered by the user, using the Okapi BM25 function. With the Okapi BM25 ranking function, the works can be displayed in ranked order according to their relevance
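The ranking step described above can be sketched as follows. This is a minimal illustration of the standard Okapi BM25 scoring formula, not the system's actual implementation: the function name `bm25_rank`, the whitespace tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative idf variant are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score.

    Tokenization is plain lowercase whitespace splitting (an assumption).
    """
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for i, d in enumerate(tokenized):
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative idf variant: log(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Documents that share no term with the query receive a score of zero and fall to the bottom of the ranking, which is what distinguishes ranked output from the unordered results of a simple query.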

    Investigation of the applicability of natural language processing methods to problems of searching and matching of machinery drawing images

    The studies carried out in this work show that applying keypoint descriptor technology in its pure form to the task of comparing and searching drawings is ineffective. The main reason was found to be the large number of identical elements in drawings (frames, title blocks, extension lines, font elements, etc.). To solve this problem, we propose using the tf-idf (term frequency-inverse document frequency) method, widely known in natural language processing. In the study, the word vectors used in the original tf-idf technique were replaced with descriptors of image keypoints computed with the ORB and BRISK algorithms. The study led to the following conclusions: 1) The proposed approach is highly effective for finding a copy of the query image in the database: copies were detected for all query images that had exact counterparts in the database. 2) The number of detected images that are modifications of the query image varies and depends on the keypoint detection and descriptor algorithm. With ORB, the maximum share of detected modified analogues was 60% of all analogues of the image in the database; with BRISK, it was 80%. 3) The proposed approach shows limited effectiveness for finding images that belong to the same class as the query image (for example, a drawing of an excavator, a bulldozer, or a truck crane). Here the maximum share of false detections reached 60%
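The effect of tf-idf weighting on shared drawing elements can be illustrated with a small sketch. It assumes that each image's keypoint descriptors have already been mapped to integer "visual word" ids; that quantization step, and the function names `tfidf_vectors` and `cosine`, are hypothetical and not taken from the abstract. Descriptor extraction with ORB or BRISK is outside the scope of the example.

```python
import math
from collections import Counter


def tfidf_vectors(images):
    """Build one sparse tf-idf vector (dict) per image.

    `images` is a list of lists of visual-word ids (assumed to be
    precomputed from keypoint descriptors).
    """
    n_images = len(images)
    # Number of images in which each visual word occurs.
    df = Counter()
    for words in images:
        df.update(set(words))

    vectors = []
    for words in images:
        tf = Counter(words)
        total = len(words)
        vectors.append(
            {w: (count / total) * math.log(n_images / df[w])
             for w, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A visual word that occurs in every image, such as one produced by a standard frame or title block, gets an idf of log(1) = 0 and so contributes nothing to the similarity, while identical drawings still match with a similarity of 1.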

    Feature selection, optimization and clustering strategies of text documents

    Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The clustering task needs to be studied before it is adapted to the text environment. Conventional approaches have typically emphasized quantitative information, where the selected features are numbers. Efforts have also been made to achieve efficient clustering of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. It also details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it covers recent developments in the clustering task in social network and associated environments


    The purpose of this work is to develop an AI teaching assistant that can find answers to online course participants' questions among the answers previously published on the course forum. There are already successful experiments on the use of artificial intelligence systems (IBM WATSON) in online training. In this paper, we investigate the possibility of constructing such a system using word2vec technology. A two-stage method for finding an answer to a question is constructed; it uses word2vec for the vector representation of questions and answers. In the first stage, the subject matter of the question is determined and, if it corresponds to the theme of the forum, the forum articles most relevant to the question are then searched for. A real situation was simulated with 16 themes and 80 answers to possible questions within the section of the online course “Linear Algebra and Geometry”. A question-answer system was built on the resulting vector model of the subject area, and its performance was evaluated. The parameters were chosen to achieve the best result in classifying questions and finding relevant answers. In 83% of cases, the relevant answer to the formulated question was among the top 3 responses that the system offered. Issues of further developing the applied approaches and increasing the utility of the constructed question-answer system are considered.
    Purpose: developing an AI teaching assistant that can find answers to online course participants' questions among answers previously published on the course forum.
    Methodology: vectorization of questions and answers, neural network classification of the subject matter, construction of the answer ranking.
    Results: acceptable accuracy in finding a relevant answer to a question among the available answers was achieved.
    Practical implications: the results of the research can be used as a basis for designing an AI teaching assistant in online courses
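The two-stage search described above can be sketched roughly as follows. This is an illustration under several assumptions, not the authors' system: texts are embedded by averaging word vectors, the topic is chosen by cosine similarity to a per-topic centroid (a stand-in for the neural network classifier mentioned in the methodology), and the names `embed`, `cosine`, `answer`, the `threshold` value, and the toy vectors in the usage note are all hypothetical. In practice the `vectors` dictionary would come from a trained word2vec model.

```python
import math


def embed(text, vectors):
    """Average the vectors of known words; return None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def answer(question, topics, vectors, threshold=0.5, top_k=3):
    """Return (topic, top_k answers), or (None, []) if the question is off-topic.

    `topics` maps a topic name to its list of answer texts.
    """
    q = embed(question, vectors)
    if q is None:
        return None, []

    # Stage 1: choose the topic whose centroid is closest to the question.
    best_topic, best_sim = None, -1.0
    for name, answers in topics.items():
        centroid = embed(" ".join(answers), vectors)
        if centroid is None:
            continue
        sim = cosine(q, centroid)
        if sim > best_sim:
            best_topic, best_sim = name, sim
    if best_topic is None or best_sim < threshold:
        return None, []

    # Stage 2: rank the answers of that topic by similarity to the question.
    scored = []
    for text in topics[best_topic]:
        v = embed(text, vectors)
        if v is not None:
            scored.append((cosine(q, v), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return best_topic, [text for _, text in scored[:top_k]]
```

With toy two-dimensional vectors such as `{"matrix": [1.0, 0.0], "vector": [0.0, 1.0]}`, a question made up only of unknown words is rejected in stage 1, which mirrors the check that the question corresponds to the theme of the forum.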

    Are Scopus journal field classifications ever misleading?

    Journal field classifications in Scopus are used for citation-based indicators and by authors choosing appropriate journals to submit to. Whilst prior research has found that Scopus categories are occasionally misleading, it is not known how this varies for different journal types. In response, we assessed whether specialist, cross-field and general academic journals sometimes have publication practices that do not match their Scopus classifications. For this, we compared the Scopus narrow fields of journals with the fields that best fit their articles' titles and abstracts. We also conducted qualitative follow-up to distinguish between Scopus classification errors and misleading journal aims. The results show sharp field differences in the extent to which both cross-field and apparently specialist journals publish articles that match their Scopus narrow fields, and the same for general journals. The results also suggest that a few journals have titles and aims that do not match their contents well, and that some large topics spread themselves across many relevant disciplines. Thus, the likelihood that a journal's Scopus narrow fields reflect its contents varies substantially by field (although without systematic field trends) and some cross-field topics seem to cause difficulties in appropriately classifying relevant journals. These issues undermine citation-based indicators that rely on journal-level classification and may confuse scholars seeking publishing venues

    Leveraging semantic resources in diversified query expansion

    A search query, being a very concise grounding of user intent, could potentially have many possible interpretations. Search engines hedge their bets by diversifying top results to cover multiple such possibilities so that the user is likely to be satisfied, whatever her intended interpretation. Diversified Query Expansion is the problem of diversifying query expansion suggestions, so that the user can specialize the query to better suit her intent, even before perusing search results. In this paper, we consider the usage of semantic resources and tools to arrive at improved methods for diversified query expansion. In particular, we develop two methods, which leverage Wikipedia and pre-learnt distributional word embeddings respectively. Both approaches operate on a common three-phase framework: first taking a set of informative terms from the search results of the initial query, then building a graph, followed by using a diversity-conscious node ranking to prioritize candidate terms for diversified query expansion. Our methods differ in the second phase: the first method, Select-Link-Rank (SLR), links terms with Wikipedia entities to accomplish graph construction, whereas the second method, Select-Embed-Rank (SER), constructs the graph using similarities between distributional word embeddings. Through an empirical analysis and user study, we show that SLR outperforms state-of-the-art diversified query expansion methods, thus establishing that Wikipedia is an effective resource to aid diversified query expansion. Our empirical analysis also illustrates that SER outperforms the baselines convincingly, asserting that it is the best available method for those cases where SLR is not applicable; these include narrow-focus search systems where a relevant knowledge base is unavailable. Our SLR method is also seen to outperform a state-of-the-art method in the task of diversified entity ranking

    Evaluating Clusterings by Estimating Clarity

    In this thesis I examine clustering evaluation, with a subfocus on text clusterings specifically. The principal work of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness. I begin by reviewing clustering in general. I then review current clustering quality measures, accompanying this with an in-depth discussion of many of the important properties one needs to understand about such measures. This is followed by extensive document clustering experiments that show problems with standard clustering evaluation practices. I then develop informativeness, my new internal clustering quality measure for estimating the clarity of clusterings. I show that informativeness, which uses classification accuracy as a proxy for human assessment of clusterings, is both theoretically sensible and works empirically. I present a generalization of informativeness that leverages external clustering quality measures. I also show its use in a realistic application: email spam filtering. I show that informativeness can be used to select clusterings which lead to superior spam filters when few true labels are available. I conclude this thesis with a discussion of clustering evaluation in general, informativeness, and the directions I believe clustering evaluation research should take in the future

    Cyber Security

    This open access book constitutes the refereed proceedings of the 16th International Annual Conference on Cyber Security, CNCERT 2020, held in Beijing, China, in August 2020. The 17 papers presented were carefully reviewed and selected from 58 submissions. The papers are organized according to the following topical sections: access control; cryptography; denial-of-service attacks; hardware security implementation; intrusion/anomaly detection and malware mitigation; social network security and privacy; systems security