924 research outputs found

Towards the Automatic Classification of Documents in User-generated Classifications

Author: Morshed Ahsan-Ul
Publication date: 01/01/2006

A huge amount of information is scattered across the World Wide Web. Because information flows through the Web at high speed, it must be organized so that users can access it easily. Previously, the organization of information was generally done manually, by matching document contents to pre-defined categories. There are two approaches to this text-based categorization: manual and automatic. In the manual approach, a human expert performs the classification task; in the automatic approach, supervised classifiers classify resources automatically. Supervised classification still requires manual effort to create training data before the automatic classification can take place. In our new approach, we propose to classify documents automatically using semantic keywords and formulas generated from these keywords. Thus we can reduce this human participation by combining the knowledge of a given classification with the knowledge extracted from the data. The main focus of this PhD thesis, supervised by Prof. Fausto Giunchiglia, is the automatic classification of documents into user-generated classifications. The benefits foreseen from this automatic document classification relate not only to search engines but also to many other fields, such as document organization, text filtering, and semantic index management.

Unitn-eprints Research

Top-Down Skiplists

Authors: Barba Luis; Morin Pat
Publication date: 29/07/2014

We describe todolists (top-down skiplists), a variant of skiplists (Pugh 1990) that can execute searches using at most $\log_{2-\varepsilon} n + O(1)$ binary comparisons per search and that have amortized update time $O(\varepsilon^{-1}\log n)$. A variant of todolists, called working-todolists, can execute a search for any element $x$ using $\log_{2-\varepsilon} w(x) + o(\log w(x))$ binary comparisons and have amortized search time $O(\varepsilon^{-1}\log w(x))$. Here, $w(x)$ is the "working-set number" of $x$. No previous data structure is known to achieve a bound better than $4\log_2 w(x)$ comparisons. We show through experiments that, if implemented carefully, todolists are comparable to other common dictionary implementations in terms of insertion times and outperform them in terms of search times.

Comment: 18 pages, 5 figures
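To make the working-set number concrete, here is a minimal Python sketch. It assumes one common convention that the abstract does not spell out: w(x) is the number of distinct keys accessed since the previous access to x, counting x itself. The names `working_set_numbers` and `leading_comparison_term` are hypothetical, and the code only illustrates the quantities in the bound; it is not the todolist structure.

```python
import math


def working_set_numbers(accesses):
    """Return w(x) for every access in the sequence.

    Assumed convention: w(x) is the number of distinct keys in the window
    that starts at the previous access to x and ends just before the
    current one (so x itself is counted). First accesses yield None.
    """
    last_seen = {}
    result = []
    for i, key in enumerate(accesses):
        if key in last_seen:
            window = accesses[last_seen[key]:i]
            result.append(len(set(window)))
        else:
            result.append(None)
        last_seen[key] = i
    return result


def leading_comparison_term(w, eps):
    """Leading term log_{2-eps}(w) of the comparison bound; needs eps < 1."""
    return math.log(w, 2 - eps)


if __name__ == "__main__":
    print(working_set_numbers(list("abcab")))
    print(leading_comparison_term(16, 0.5), 4 * math.log2(16))
```

Under this convention, a key that is accessed twice in a row has w(x) = 1, so the leading term of the bound is zero for that search.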

arXiv.org e-Print Archive

CiteSeerX

Improved Bounds for Multipass Pairing Heaps and Path-Balanced Binary Search Trees

Authors: Dorfman Dani; Kaplan Haim; Pettie Seth; Zwick Uri
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 26th Annual European Symposium on Algorithms (ESA 2018)
Publication date: 01/01/2018

We revisit multipass pairing heaps and path-balanced binary search trees (BSTs), two classical algorithms for data structure maintenance. The pairing heap is a simple and efficient "self-adjusting" heap, introduced in 1986 by Fredman, Sedgewick, Sleator, and Tarjan. In the multipass variant (one of the original pairing heap variants described by Fredman et al.) the minimum item is extracted via repeated pairing rounds in which neighboring siblings are linked. Path-balanced BSTs, proposed by Sleator (cf. Subramanian, 1996), are a natural alternative to Splay trees (Sleator and Tarjan, 1983). In a path-balanced BST, whenever an item is accessed, the search path leading to that item is re-arranged into a balanced tree. Despite their simplicity, both algorithms turned out to be difficult to analyse. Fredman et al. showed that operations in multipass pairing heaps take amortized O(log n * log log n / log log log n) time. For searching in path-balanced BSTs, Balasubramanian and Raman showed in 1995 the same amortized time bound of O(log n * log log n / log log log n), using a different argument. In this paper we show an explicit connection between the two algorithms and improve both bounds to O(log n * 2^{log^* n} * log^* n), respectively O(log n * 2^{log^* n} * (log^* n)^2), where log^* denotes the slowly growing iterated logarithm function. These are the first improvements in more than three, resp. two decades, approaching the information-theoretic lower bound of Omega(log n)
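The multipass extraction described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the names `Node`, `link`, `insert`, `delete_min`, `heap_sort`, and `log_star` are hypothetical, and appending a new child at the end of the child list is an assumed convention (it affects the analysis, not correctness). The `log_star` helper computes the iterated logarithm that appears in the improved bounds.

```python
import math


class Node:
    def __init__(self, key):
        self.key = key
        self.children = []


def link(a, b):
    """Link two heap-ordered trees: the larger root becomes a child."""
    if b.key < a.key:
        a, b = b, a
    a.children.append(b)
    return a


def insert(root, key):
    node = Node(key)
    return node if root is None else link(root, node)


def delete_min(root):
    """Remove the root, then combine its children in pairing rounds."""
    trees = root.children
    while len(trees) > 1:
        # One round: link neighbours (0,1), (2,3), ...
        next_round = [link(trees[i], trees[i + 1])
                      for i in range(0, len(trees) - 1, 2)]
        if len(trees) % 2 == 1:
            next_round.append(trees[-1])  # odd tree out waits for next round
        trees = next_round
    return root.key, (trees[0] if trees else None)


def heap_sort(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    out = []
    while root is not None:
        smallest, root = delete_min(root)
        out.append(smallest)
    return out


def log_star(n):
    """Iterated logarithm: how often log2 is applied until the value is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

Because `log_star(65536)` is only 4, the factors 2^{log^* n} and log^* n in the new bounds grow extremely slowly.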

arXiv.org e-Print Archive

Repository TU/e

Pure OAI Repository

Dagstuhl Research Online Publication Server

08081 Abstracts Collection -- Data Structures

Authors: Arge Lars; Sedgewick Robert; Seidel Raimund
Publication venue: Dagstuhl Seminar Proceedings. 08081 - Data Structures
Publication date: 01/01/2008

From February 17th to 22nd 2008, the Dagstuhl Seminar 08081 "Data Structures" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. It brought together 49 researchers from four continents to discuss recent developments concerning data structures in terms of research but also in terms of new technologies that impact how data can be stored, updated, and retrieved. During the seminar a fair number of participants presented their current research. There was discussion of ongoing work, and in addition an open problem session was held. This paper first describes the seminar topics and goals in general, then gives the minutes of the open problem session, and concludes with abstracts of the presentations given during the seminar. Where appropriate and available, links to extended abstracts or full papers are provided.

Dagstuhl Research Online Publication Server

Control of Humanoid Robots for Use in Unstructured Environments

Author: Franklin Perry Carl
Publication venue: Digital WPI
Publication date: 28/04/2016

Humanoid robots have the potential to replace human beings for dangerous tasks, such as disaster relief. One of the most important abilities for a humanoid robot is the ability to manipulate its surroundings. We developed human-in-the-loop techniques for the Atlas platform to compete in the DARPA Robotics Challenge. Many of the tasks in the event required actuation of the environment, such as turning a valve, pulling a lever, and opening a door. This paper will detail our work on manipulation for humanoid robots. In particular, we will discuss our approaches to effective operator interface design, manipulation techniques, and motion planning.

DigitalCommons@WPI

Approximate resilience, monotonicity, and the complexity of agnostic learning

Authors: Dachman-Soled Dana; Feldman Vitaly; Tan Li-Yang; Wan Andrew; Wimmer Karl
Publication date: 09/07/2014

A function $f$ is $d$-resilient if all its Fourier coefficients of degree at most $d$ are zero, i.e., $f$ is uncorrelated with all low-degree parities. We study the notion of approximate resilience of Boolean functions, where we say that $f$ is $\alpha$-approximately $d$-resilient if $f$ is $\alpha$-close to a $[-1,1]$-valued $d$-resilient function in $\ell_1$ distance. We show that approximate resilience essentially characterizes the complexity of agnostic learning of a concept class $C$ over the uniform distribution. Roughly speaking, if all functions in a class $C$ are far from being $d$-resilient then $C$ can be learned agnostically in time $n^{O(d)}$ and conversely, if $C$ contains a function close to being $d$-resilient then agnostic learning of $C$ in the statistical query (SQ) framework of Kearns has complexity of at least $n^{\Omega(d)}$. This characterization is based on the duality between $\ell_1$ approximation by degree-$d$ polynomials and approximate $d$-resilience that we establish. In particular, it implies that $\ell_1$ approximation by low-degree polynomials, known to be sufficient for agnostic learning over product distributions, is in fact necessary. Focusing on monotone Boolean functions, we exhibit the existence of near-optimal $\alpha$-approximately $\widetilde{\Omega}(\alpha\sqrt{n})$-resilient monotone functions for all $\alpha>0$. Prior to our work, it was conceivable even that every monotone function is $\Omega(1)$-far from any $1$-resilient function. Furthermore, we construct simple, explicit monotone functions based on $\mathsf{Tribes}$ and $\mathsf{CycleRun}$ that are close to highly resilient functions. Our constructions are based on a fairly general resilience analysis and amplification. These structural results, together with the characterization, imply nearly optimal lower bounds for agnostic learning of monotone juntas.
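The definition of exact d-resilience can be checked by brute force for small n. The following Python sketch assumes functions on {-1,1}^n with the uniform distribution and uses the standard Fourier coefficient, the expectation of f(x) times the parity of the bits in S. The names `fourier_coefficient`, `is_resilient`, `parity`, and `majority` are hypothetical, and the enumeration over all 2^n inputs is only practical for tiny examples.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficient(f, n, subset):
    """E_x[f(x) * prod_{i in subset} x_i] over uniform x in {-1,1}^n."""
    total = 0
    for x in product((-1, 1), repeat=n):
        total += f(x) * prod(x[i] for i in subset)
    return total / 2 ** n


def is_resilient(f, n, d, tol=1e-12):
    """True if every Fourier coefficient of degree at most d is zero."""
    for k in range(d + 1):
        for subset in combinations(range(n), k):
            if abs(fourier_coefficient(f, n, subset)) > tol:
                return False
    return True


def parity(x):
    return prod(x)


def majority(x):
    return 1 if sum(x) > 0 else -1


if __name__ == "__main__":
    print(is_resilient(parity, 3, 2))    # parity has weight only on the full set
    print(is_resilient(majority, 3, 1))  # majority correlates with single bits
```

Parity on three bits is 2-resilient because its only nonzero coefficient sits on the full set, whereas three-bit majority is balanced (0-resilient) but has nonzero degree-1 coefficients.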

arXiv.org e-Print Archive

CiteSeerX

Crossref

Statistical Learning Methods for Personalized Medicine

Author: Qiu Xin
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2018

The theme of this dissertation is to develop simple and interpretable individualized treatment rules (ITRs) using statistical learning methods to assist personalized decision making in clinical practice. Considerable heterogeneity in treatment response is observed among individuals with mental disorders. Administering an individualized treatment rule according to patient-specific characteristics offers an opportunity to tailor treatment strategies to improve response. Black-box machine learning methods for estimating ITRs may produce treatment rules that have optimal benefit but lack transparency and interpretability. Barriers to implementing personalized treatments in clinical psychiatry include a lack of evidence-based, clinically interpretable, individualized treatment rules, a lack of diagnostic measures to evaluate candidate ITRs, a lack of power to detect treatment modifiers from a single study, and a lack of reproducibility of treatment rules estimated from single studies. This dissertation contains three parts to tackle these barriers: (1) methods to estimate the best linear ITR with guaranteed performance among the class of linear rules; (2) a tree-based method to improve the performance of a linear ITR fitted from the overall sample and identify subgroups with a large benefit; and (3) an integrative learning method that combines information across trials to provide an integrative ITR with improved efficiency and reproducibility. In the first part of the dissertation, we propose a machine learning method to estimate optimal linear individualized treatment rules for data collected from single-stage randomized controlled trials (RCTs). In clinical practice, an informative and practically useful treatment rule should be simple and transparent.
However, because simple rules are likely to be far from optimal, effective methods to construct such rules must guarantee performance, in terms of yielding the best clinical outcome (highest reward) among the class of simple rules under consideration. Furthermore, it is important to evaluate the benefit of the derived rules on the whole sample and in pre-specified subgroups (e.g., vulnerable patients). To achieve both goals, we propose a robust machine learning algorithm replacing zero-one loss with an authentic approximation loss (ramp loss) for value maximization, referred to as the asymptotically best linear O-learning (ABLO), which estimates a linear treatment rule that is guaranteed to achieve optimal reward among the class of all linear rules. We then develop a diagnostic measure and inference procedure to evaluate the benefit of the obtained rule and compare it with the rules estimated by other methods. We provide theoretical justification for the proposed method and its inference procedure, and we demonstrate via simulations its superior performance when compared to existing methods. Lastly, we apply the proposed method to the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial on major depressive disorder (MDD) and show that the estimated optimal linear rule provides a large benefit for mildly depressed and severely depressed patients but manifests a lack-of-fit for moderately depressed patients. The second part of the dissertation is motivated by the results of real data analysis in the first part, where the global linear rule estimated by ABLO from the overall sample performs inadequately on the subgroup of moderately depressed patients. Therefore, we aim to derive a simple and interpretable piece-wise linear ITR to maintain certain optimality that leads to improved benefit in subgroups of patients, as well as the overall sample.
In this work, we propose a tree-based robust learning method to estimate optimal piece-wise linear ITRs and identify subgroups of patients with a large benefit. We achieve these goals by simultaneously identifying qualitative and quantitative interactions through a tree model, referred to as the composite interaction tree (CITree). We show that it has improved performance compared to existing methods on both the overall sample and subgroups via extensive simulation studies. Lastly, we fit CITree to the Research Evaluating the Value of Augmenting Medication with Psychotherapy (REVAMP) trial for treating major depressive disorders, where we identified both qualitative and quantitative interactions and subgroups of patients with a large benefit. The third part deals with the low power for identifying and replicating ITRs that results from the small sample sizes of single randomized controlled trials. In this work, a novel integrative learning method is developed to synthesize evidence across trials and provide an integrative ITR that improves efficiency and reproducibility. Our method does not require all studies to collect a common set of variables and thus allows information to be combined from ITRs identified from randomized controlled trials with heterogeneous sets of baseline covariates collected from different domains with different resolutions. Based on the research goal, the integrative learning can be used to enhance a high-resolution ITR by borrowing information from coarsened ITRs or to improve a coarsened ITR using a high-resolution ITR. With a simple modification, the proposed integrative learning can also be applied to improve the estimation of ITRs for studies with blockwise missing feature variables. We conduct extensive simulation studies to show that our method has improved performance compared to existing methods where only single-trial ITRs are used to learn personalized treatment rules.
Lastly, we apply the proposed method to RCTs of major depressive disorder and other comorbid mental disorders. We found that by combining information from two studies, the integrated ITR has a greater benefit and improved efficiency compared to single-trial rules or a universal non-personalized treatment rule.
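As a rough illustration of what evaluating a linear treatment rule by its reward can look like, the Python sketch below applies a linear rule and estimates its value from randomized-trial data with the standard inverse-probability-weighted estimator. This estimator is an assumption brought in for illustration and is not taken from the abstract; it is not the ABLO algorithm or its ramp-loss objective. The names `linear_rule` and `ipw_value`, the treatment coding in {-1, 1}, and the constant propensity of 0.5 are all hypothetical choices.

```python
def linear_rule(beta0, beta):
    """Return a rule x -> 1 or -1 based on the sign of beta0 + beta . x."""
    def rule(x):
        score = beta0 + sum(b * xi for b, xi in zip(beta, x))
        return 1 if score > 0 else -1
    return rule


def ipw_value(data, rule, propensity=0.5):
    """Inverse-probability-weighted estimate of the mean reward under `rule`.

    data: list of (x, a, r) with covariates x, assigned treatment a in
    {-1, 1}, and observed reward r. `propensity` is the (assumed constant)
    probability of receiving the treatment that was actually assigned.
    """
    total = 0.0
    for x, a, r in data:
        if rule(x) == a:  # only patients treated as the rule recommends count
            total += r / propensity
    return total / len(data)


if __name__ == "__main__":
    trial = [((1.0,), 1, 2.0), ((1.0,), -1, 0.0),
             ((-1.0,), 1, 0.0), ((-1.0,), -1, 4.0)]
    personalized = linear_rule(0.0, (1.0,))   # treat with 1 when x > 0
    always_treat = linear_rule(1.0, (0.0,))   # recommend 1 for everyone
    print(ipw_value(trial, personalized), ipw_value(trial, always_treat))
```

In this toy data set the personalized rule has a higher estimated value than the universal rule, which is the kind of comparison the abstract describes between ITRs and non-personalized rules.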

Columbia University Academic Commons