Self-supervised automated wrapper generation for weblog data extraction
Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. The model is evaluated on a dataset of 2,393 posts, and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
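The feed-matching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the class `PathTextCollector`, the function `derive_rule`, the path notation, and the simple frequency score are all hypothetical choices made for illustration, and the sketch assumes well-nested HTML without void elements such as `<br>`.

```python
from collections import Counter
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Collect (element path, text) pairs, e.g. ('html/body/h1.title', 'Hi')."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements from the root to the current node
        self.pairs = []   # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def derive_rule(posts, field):
    """Vote for the element path whose text equals the feed value.

    posts: list of (html, feed_values) where feed_values maps a field
    name (hypothetical, e.g. "title") to the value found in the web feed.
    Returns (best_path, fraction_of_posts_supporting_it).
    """
    votes = Counter()
    for html, feed_values in posts:
        collector = PathTextCollector()
        collector.feed(html)
        # Count each path at most once per post.
        matched = {p for p, text in collector.pairs if text == feed_values[field]}
        votes.update(matched)
    if not votes:
        return None, 0.0
    path, count = votes.most_common(1)[0]
    return path, count / len(posts)


posts = [
    ('<html><body><h1 class="title">First post</h1><p>First post</p></body></html>',
     {"title": "First post"}),
    ('<html><body><h1 class="title">Second</h1><p>other</p></body></html>',
     {"title": "Second"}),
]
best_path, support = derive_rule(posts, "title")
```

Aggregating votes over several posts, rather than comparing posts pairwise, is what lets the spurious `p` match in the first post lose to the path that matches in every post.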
Harvesting Entities from the Web Using Unique Identifiers -- IBEX
In this paper we study the prevalence of unique entity identifiers on the
Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs
(for documents), email addresses, and others. We show how these identifiers can
be harvested systematically from Web pages, and how they can be associated with
human-readable names for the entities at large scale.
Starting with a simple extraction of identifiers and names from Web pages, we
show how we can use the properties of unique identifiers to filter out noise
and clean up the extraction result on the entire corpus. The end result is a
database of millions of uniquely identified entities of different types, with
an accuracy of 73--96% and a very high coverage compared to existing knowledge
bases. We use this database to compute novel statistics on the presence of
products, people, and other entities on the Web.
Comment: 30 pages, 5 figures, 9 tables. Complete technical report for A.
Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting
Entities from the Web Using Unique Identifiers. WebDB workshop, 201
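One property of identifiers such as GTINs and ISBN-13s that can help filter out noise is their check digit. The sketch below validates that check digit; it is an illustrative assumption about what such a filter might look like, not the filtering procedure from the report. The function name `gtin_checksum_ok` is hypothetical, and ISBN-10 (which may end in `X`) is not handled.

```python
def gtin_checksum_ok(code):
    """Return True if `code` has a valid GTIN-8/12/13/14 check digit.

    ISBN-13 numbers are GTIN-13 codes, so they are covered as well.
    Hyphens and spaces are ignored.
    """
    digits = [int(ch) for ch in code if ch in "0123456789"]
    if len(digits) not in (8, 12, 13, 14):
        return False
    # Counting from the right, weights alternate 1, 3, 1, 3, ...
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(reversed(digits)))
    return total % 10 == 0


candidates = ["978-0-306-40615-7", "978-0-306-40615-8", "12345"]
kept = [c for c in candidates if gtin_checksum_ok(c)]
```

A candidate string that merely looks like an identifier, such as a 13-digit phone or order number, fails this test about nine times out of ten, which makes it a cheap first-pass noise filter.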
HIV as a Chronic Illness: Identity Incorporation and Learning
Abstract: The purpose of this session is twofold: (1) to review tentative findings of a study-in-progress concerning the identity incorporation process and learning of people living with HIV as a chronic illness and (2) to explore issues encountered in conducting research with the chronically ill.
“HIV is Only One Part of Me”: HIV and Its Effect on Other Identities
The purpose of this study was to investigate the effect of the HIV identity on other identities. The spiritual and advocate identities increased in salience, whereas work and sexual identities decreased. Younger participants fretted about physical appearance. Older participants focused on health. There are implications for adult educators.
PoZitively Transformative: The Transformative Learning of People Living with HIV
The purpose of this study was to investigate meaning making in People Living with HIV (PLWH) as a chronic illness. Findings confirm those of Courtenay, Merriam and Reeves (1998) who examined meaning making in PLWHAs when HIV/AIDS was a terminal illness. Contextual factors that mediate meaning making were uncovered.
Effectiveness of Hindman's theorem for bounded sums
We consider the strength and effective content of restricted versions of
Hindman's Theorem in which the number of colors is specified and the length of
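the sums has a specified finite bound. Let $\mathsf{HT}^{\leq n}_k$ denote the
assertion that for each $k$-coloring $c$ of $\mathbb{N}$ there is an infinite
set $X \subseteq \mathbb{N}$ such that all sums $\sum_{x \in F} x$ for
$F \subseteq X$ and $0 < |F| \leq n$ have the same color. We prove that there is a
computable $2$-coloring $c$ of $\mathbb{N}$ such that there is no infinite
computable set $X$ such that all nonempty sums of at most $2$ elements of $X$
have the same color. It follows that $\mathsf{HT}^{\leq 2}_2$ is not provable
in $\mathsf{RCA}_0$ and in fact we show that it implies $\mathsf{SRT}^2_2$ in
$\mathsf{RCA}_0 + \mathsf{B}\Pi^0_1$. We also show that there is a computable instance of
$\mathsf{HT}^{\leq 3}_3$ with all solutions computing $0'$. The proof of this
result shows that $\mathsf{HT}^{\leq 3}_3$ implies $\mathsf{ACA}_0$ in $\mathsf{RCA}_0$.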
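The bounded-sum condition in this abstract (all nonempty sums of at most a fixed number of elements of a set receive the same color) can be checked directly for a finite set. The sketch below is only a finite illustration: the theorem concerns infinite sets, and the function name `bounded_sums_monochromatic`, the parameter names, and the parity coloring are hypothetical choices.

```python
from itertools import combinations


def bounded_sums_monochromatic(coloring, elements, max_terms):
    """Check that every sum of 1..max_terms distinct elements gets one color.

    coloring: function from positive integers to colors.
    elements: finite collection of distinct positive integers.
    """
    colors = {
        coloring(sum(subset))
        for size in range(1, max_terms + 1)
        for subset in combinations(elements, size)
    }
    return len(colors) <= 1


def parity(m):
    """A 2-coloring of the positive integers: color = m mod 2."""
    return m % 2


even_case = bounded_sums_monochromatic(parity, [2, 4, 8], max_terms=2)
mixed_case = bounded_sums_monochromatic(parity, [1, 2], max_terms=2)
```

For the parity coloring any set of even numbers works, so this coloring is an easy instance; the abstract's results concern computable colorings built so that no computable infinite set passes this check.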
Consanguinity and rare mutations outside of MCCC genes underlie nonspecific phenotypes of MCCD.
Purpose: 3-Methylcrotonyl-CoA carboxylase deficiency (MCCD) is an autosomal recessive disorder of leucine catabolism that has a highly variable clinical phenotype, ranging from acute metabolic acidosis to nonspecific symptoms such as developmental delay, failure to thrive, hemiparesis, muscular hypotonia, and multiple sclerosis. Implementation of newborn screening for MCCD has resulted in broadening the range of phenotypic expression to include asymptomatic adults. The purpose of this study was to identify factors underlying the varying phenotypes of MCCD. Methods: We performed exome sequencing on DNA from 33 cases and 108 healthy controls. We examined these data for associations between either MCC mutational status, genetic ancestry, or consanguinity and the absence or presence/specificity of clinical symptoms in MCCD cases. Results: We determined that individuals with nonspecific clinical phenotypes are highly inbred compared with cases that are asymptomatic and healthy controls. For 5 of these 10 individuals, we discovered a homozygous damaging mutation in a disease gene that is likely to underlie their nonspecific clinical phenotypes previously attributed to MCCD. Conclusion: Our study shows that nonspecific phenotypes attributed to MCCD are associated with consanguinity and are likely not due to mutations in the MCC enzyme but result from rare homozygous mutations in other disease genes. Genet Med 17 8, 660-667
A UIMA wrapper for the NCBO annotator
Summary: The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator—an ontology-based annotation service—to make it available as a component in UIMA workflows
Two-qutrit Entanglement Witnesses and Gell-Mann Matrices
The Gell-Mann matrices for Lie algebra su(3) are the natural basis
for the Hilbert space of Hermitian operators acting on the states of a
three-level system (qutrit). So the construction of entanglement witnesses (EWs) for two-qutrit states by
using these matrices may be an interesting problem. In this paper, several
two-qutrit EWs are constructed based on the Gell-Mann matrices by using the
linear programming (LP) method exactly or approximately. The decomposability
and non-decomposability of constructed EWs are also discussed and it is shown
that the $\lambda$-diagonal EWs presented in this paper are all decomposable
but there exist non-decomposable ones among $\lambda$-non-diagonal EWs.
Comment: 25 page
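As background for the basis this abstract refers to, the sketch below writes out the eight standard Gell-Mann matrices and checks the properties that make them a basis for traceless Hermitian operators on a qutrit: each is Hermitian and traceless, and they satisfy Tr(λ_a λ_b) = 2δ_ab. It does not construct any entanglement witness from the paper; the helper names `matmul`, `trace`, and `dagger` are hypothetical, and plain Python lists are used instead of a linear algebra library.

```python
import math

I = 1j
S = 1 / math.sqrt(3)

# The eight standard Gell-Mann matrices, lambda_1 .. lambda_8.
GELL_MANN = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, -I, 0], [I, 0, 0], [0, 0, 0]],
    [[1, 0, 0], [0, -1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [1, 0, 0]],
    [[0, 0, -I], [0, 0, 0], [I, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, -I], [0, I, 0]],
    [[S, 0, 0], [0, S, 0], [0, 0, -2 * S]],
]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def trace(m):
    return sum(m[i][i] for i in range(3))


def dagger(m):
    """Conjugate transpose of a 3x3 matrix."""
    return [[complex(m[j][i]).conjugate() for j in range(3)] for i in range(3)]


all_hermitian = all(dagger(m) == m for m in GELL_MANN)
all_traceless = all(abs(trace(m)) < 1e-12 for m in GELL_MANN)
orthogonal = all(
    abs(trace(matmul(a, b)) - (2 if i == j else 0)) < 1e-12
    for i, a in enumerate(GELL_MANN)
    for j, b in enumerate(GELL_MANN)
)
```

Together with the identity matrix, these eight matrices span all Hermitian operators on a three-level system, so tensor products of them can be used to expand any two-qutrit observable.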