4 research outputs found

Applications and Advances in Similarity-based Machine Learning

Author: Spaen Quico Pepijn
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Similarity-based machine learning methods differ from traditional machine learning methods in that they also use pairwise similarity relations between objects to infer the labels of unlabeled objects. A recent comparative study for classification problems by Baumann et al. [2019] demonstrated that similarity-based techniques have superior performance and robustness when compared to well-established machine learning techniques. Similarity-based machine learning methods benefit from two advantages that could explain their superior performance: they can make use of the pairwise relations between unlabeled objects, and they are robust due to the transitive property of pairwise similarities. A challenge for similarity-based machine learning methods on large datasets is that the number of pairwise similarities grows quadratically in the size of the dataset. For large datasets, it thus becomes practically impossible to compute all possible pairwise similarities. In 2016, Hochbaum and Baumann proposed the technique of sparse computation to address this growth by computing only those pairwise similarities that are relevant. Their proposed implementation of sparse computation is still difficult to scale to millions of objects. This dissertation focuses on advancing the practical implementations of sparse computation to larger datasets and on two applications for which similarity-based machine learning was particularly effective. The applications that are studied here are cell identification in calcium-imaging movies and detecting aberrant linking behavior in directed networks. For sparse computation we present faster, geometric algorithms and a technique, named sparse-reduced computation, that combines sparse computation with compression.
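To make the quadratic growth concrete, and to illustrate what computing only the relevant similarities can look like, the following is a minimal sketch and not the dissertation's implementation. It assumes the objects are already low-dimensional points in the unit cube and that a pair counts as relevant when the two points fall in the same or in adjacent grid blocks. The function names and the `resolution` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def all_pairs_count(n):
    """Number of unordered pairs among n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def block_pairs(points, resolution):
    """Return the unordered index pairs (i, j), i < j, whose points share a
    grid block or lie in adjacent blocks.

    Assumption: every point lies in the unit cube [0, 1]^d, and each axis is
    split into `resolution` equal intervals.
    """
    # Assign each point to its grid block.
    blocks = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(min(int(x * resolution), resolution - 1) for x in point)
        blocks[key].append(idx)

    dims = len(points[0])
    offsets = list(product((-1, 0, 1), repeat=dims))
    pairs = set()
    for key, members in blocks.items():
        # Pairs within the same block.
        pairs.update(combinations(members, 2))
        # Pairs across adjacent blocks; each block pair is visited once,
        # from the lexicographically smaller block key.
        for offset in offsets:
            neighbor = tuple(k + o for k, o in zip(key, offset))
            if neighbor <= key:
                continue
            for i in members:
                for j in blocks.get(neighbor, ()):
                    pairs.add((min(i, j), max(i, j)))
    return pairs
```

For 10 million objects, `all_pairs_count` gives about 5 × 10^13 pairs, which is why restricting the computation to nearby blocks matters. The sketch enumerates neighbors naively, so it only illustrates the selection rule and not the faster geometric algorithms.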
The geometric algorithms compute the exact same output as the original implementation of sparse computation, but identify the relevant pairwise similarities faster by using the concept of data shifting for identifying objects in the same or neighboring blocks. Empirical results on datasets with up to 10 million objects show a significant reduction in running time. Sparse-reduced computation combines sparse computation with a technique for compressing highly similar or identical objects, enabling the use of similarity-based machine learning on massively large datasets. The computational results demonstrate that sparse-reduced computation provides a significant reduction in running time with a minute loss in accuracy.

A major problem facing neuroscientists today is cell identification in calcium-imaging movies. These movies are in-vivo recordings of thousands of neurons at cellular resolution. There is a great need for automated approaches to extract the activity of single neurons from these movies since manual post-processing takes tens of hours per dataset. We present the HNCcorr algorithm for cell identification in calcium-imaging movies. The name HNCcorr is derived from its use of the similarity-based Hochbaum's Normalized Cut (HNC) model with pairwise similarities derived from correlation. In HNCcorr, the task of cell detection is approached as a clustering problem. HNCcorr utilizes HNC to detect cells in these movies as coherent clusters of pixels that are highly distinct from the remaining pixels. Unlike existing methodologies for cell identification, HNCcorr guarantees a globally optimal solution to the underlying optimization problem. Of independent interest is a novel method, named similarity-squared, that we devised for measuring similarity between pixels.
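As a rough illustration of pairwise similarities derived from correlation, the sketch below computes the Pearson correlation between the intensity traces of every pair of pixels. It is an assumption-laden example only: the abstract does not specify how HNCcorr transforms correlations into similarities, and the similarity-squared method is not reproduced here. The function names and the `traces` input format are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant, a convention chosen for
    this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_similarities(traces):
    """Map each unordered pixel pair to the correlation of their traces.

    `traces` is a dict from a pixel identifier to its list of intensities
    over time (hypothetical input format).
    """
    pixels = sorted(traces)
    return {
        (p, q): pearson(traces[p], traces[q])
        for i, p in enumerate(pixels)
        for q in pixels[i + 1:]
    }
```

Pixels belonging to the same cell would be expected to have strongly correlated traces, which is what makes correlation a natural starting point for a clustering-based detector.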
We provide an experimental study and demonstrate that HNCcorr is a top performer on the Neurofinder cell identification benchmark and that it improves over algorithms based on matrix factorization.

The second application is detecting aberrant agents, such as fake news sources or spam websites, based on their link behavior in networks. Across contexts, a distinguishing characteristic between normal and aberrant agents is that normal agents rarely link to aberrant ones. We refer to this phenomenon as aberrant linking behavior. We present a Markov Random Field (MRF) formulation, with links as the pairwise similarities, that detects aberrant agents based on aberrant linking behavior and any prior information (if given). This MRF formulation is solved optimally and in polynomial time. We compare the optimal solution for the MRF formulation to well-known algorithms based on random walks. In our empirical experiment with twenty-three different datasets, the MRF method outperforms the other detection algorithms. This work represents the first use of optimization methods for detecting aberrant agents as well as the first time that MRF is applied to directed graphs.

eScholarship - University of California

Web page content adjustment for search engines using machine learning and natural language processing

Author: Matošević Goran
Publication venue: University of Zagreb. Faculty of Organization and Informatics.
Publication date: 03/10/2019

Search engine optimization (SEO) comprises techniques by which the author of a website adapts its pages so that they rank higher in the organic (natural) results of Internet search engines for selected keywords. This process includes, among other things, optimizing the content (text) to fit SEO recommendations. This study examines whether machine learning techniques can classify web pages into three predefined classes according to the degree to which their content is adjusted to SEO recommendations. Using machine learning algorithms, classifiers are built and trained to assign an unknown sample (web page) to the predefined classes and to identify the important factors (variables) that affect the degree of adjustment. In addition, a correction system is built and tested using algorithms from the domain of natural language processing. The results show that machine learning can be used to predict the degree of adjustment of web pages to SEO recommendations and to identify important SEO factors, and that the proposed correction system can be used to correct pages that were classified as "misfits" in prior stages.

University of Zagreb Repository

Web page content adjustment for search engines using machine learning and natural language processing

Author: Matošević Goran
Publication venue: University of Zagreb. Faculty of Organization and Informatics.
Publication date: 03/10/2019

Croatian Digital Dissertations Repository

Web page content adjustment for search engines using machine learning and natural language processing

Author: Matošević Goran
Publication venue: University of Zagreb. Faculty of Organization and Informatics.
Publication date: 03/10/2019

Faculty of Organization and Informatics - Digital Repository
