
    UniversityIE: Information Extraction From University Web Pages

    The amount of information available on the web is growing constantly. As a result, the problem of retrieving any desired information is getting more difficult by the day. To alleviate this problem, several techniques are currently being used, both for locating pages of interest and for extracting meaningful information from the retrieved pages. Information extraction (IE) is one such technology, used for summarizing unrestricted natural language text into a structured set of facts. IE is already being applied within several domains such as news transcripts, insurance information, and weather reports. Various approaches to IE have been taken, and a number of significant results have been reported. In this thesis, we describe the application of IE techniques to the domain of university web pages. This domain is broader than previously evaluated domains and has a variety of idiosyncratic problems to address. We present an analysis of the domain of university web pages and the consequences of having them input to IE systems. We then present UniversityIE, a system that can search a web site, extract relevant pages, and process them for information such as admission requirements or general information. The UniversityIE system, developed as part of this research, contributes three IE methods and a web-crawling heuristic that worked relatively well and predictably over a test set of university web sites. We designed UniversityIE as a generic framework for plugging in and executing IE methods over pages acquired from the web. We also integrated into the system a generic web crawler (built at the University of Kentucky and ported to Java), along with an external word lexicon (WordNet) and a syntax parser (the Link Grammar Parser).
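As a rough illustration of the kind of plug-in framework the abstract describes (IE methods registered with a framework and run over crawled pages), here is a minimal sketch in Python; all class, function, and pattern names are invented for illustration and are not taken from the thesis.

```python
# Hypothetical sketch of a plug-in IE framework: a crawler yields page text,
# and each registered extraction method scans a page for (field, value) facts.
import re
from typing import Callable, Dict, Iterable, List, Tuple

# An IE method maps raw page text to a list of (field, value) facts.
IEMethod = Callable[[str], List[Tuple[str, str]]]

class IEFramework:
    def __init__(self) -> None:
        self.methods: Dict[str, IEMethod] = {}

    def register(self, name: str, method: IEMethod) -> None:
        self.methods[name] = method

    def process(self, pages: Iterable[str]) -> List[Tuple[str, str]]:
        # Run every registered method over every page and pool the facts.
        facts: List[Tuple[str, str]] = []
        for page in pages:
            for method in self.methods.values():
                facts.extend(method(page))
        return facts

def admission_gpa(text: str) -> List[Tuple[str, str]]:
    # Toy pattern-based extractor for one admission requirement.
    return [("min_gpa", m) for m in re.findall(r"GPA of (\d\.\d)", text)]

framework = IEFramework()
framework.register("gpa", admission_gpa)
pages = ["Applicants need a GPA of 3.0 or higher."]
print(framework.process(pages))  # [('min_gpa', '3.0')]
```

The point of the design is that new extraction methods can be added without touching the crawling or dispatch logic.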

    Abstract 70: A role for the chromatin remodeling protein CHD3 in ovarian cancer therapy response

    Carboplatin and cisplatin are chemotherapeutic agents that are used extensively for treating epithelial ovarian cancer. These drugs can be highly effective, yet tumors are frequently refractory to treatment or become resistant upon tumor relapse. Epigenetic silencing, particularly at promoter regions of genes, regulates important cell functions, has been associated with all stages of tumor formation and progression, and may contribute to therapy response. We analyzed the epigenome of 50 primary ovarian tumors and 12 normal ovarian samples using an array-based method previously developed in our lab and associated Affymetrix U133 expression data. We then identified gene candidates that segregate patients based on platinum sensitivity and patient survival. These candidates were then pooled into a genome-wide RNAi-based screen, in which we validated a gene encoding a chromatin remodeling protein, CHD3, a member of the Mi-2 NuRD complex, and show that it is linked to chemoresistance. CHD3 is silenced through an epigenetic mechanism in both ovarian cancer cell lines and primary ovarian tumors. When ovarian cancer cell lines that are transcriptionally silenced for CHD3 are challenged with carboplatin, they display a striking slow-growth phenotype as well as increased resistance to the chemotherapy drugs carboplatin and cisplatin. Taken together, we provide the first evidence for a role for CHD3 as an important mediator of chemoresistance in ovarian cancer. Furthermore, CHD3 might serve as a predictor of chemoresistance and a potential therapeutic target in this disease.

    Who’s That Actor? The InfoSip TV Agent

    We present a content augmentation application that enhances the primary video watching experience by providing related supplemental information. Our goal is to explore the value of cross-referencing related content among different media such as TV and Web. The cross-referencing is based on metadata, which is automatically extracted from the content. Metadata extraction can make a great difference for personalized user experience. In addition, annotations that provide title, genre, description, and cast can be greatly enriched with detailed information at the subprogram level, i.e., at the clip and scene level. We developed a system called InfoSip that performs person identification and scene annotation based on actor presence. The system links this information with actors’ filmographies and biographies and produces an enriched viewing experience.
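The scene-level linking idea can be sketched as follows; this is a toy illustration, not the InfoSip implementation, and the data structures and lookup table are invented.

```python
# Toy sketch: link scene-level actor-presence annotations to filmography
# records so supplemental info can be shown for whoever is on screen.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Scene:
    start_s: int       # scene start time, seconds
    end_s: int         # scene end time, seconds
    actors: List[str]  # actors identified as present in the scene

# Hypothetical filmography lookup keyed by actor name.
FILMOGRAPHY: Dict[str, List[str]] = {
    "Ingrid Bergman": ["Casablanca", "Notorious"],
}

def info_for_time(scenes: List[Scene], t: int) -> Dict[str, List[str]]:
    """Return filmographies for the actors on screen at time t."""
    for scene in scenes:
        if scene.start_s <= t < scene.end_s:
            return {a: FILMOGRAPHY.get(a, []) for a in scene.actors}
    return {}

scenes = [Scene(0, 90, ["Ingrid Bergman"])]
print(info_for_time(scenes, 42))
# {'Ingrid Bergman': ['Casablanca', 'Notorious']}
```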

    Querying faceted databases

    Faceted classification allows one to model applications with complex classification hierarchies using orthogonal dimensions. Recent work has examined the use of faceted classification for browsing and search. In this paper, we go further by developing a general query language, called the entity algebra, for hierarchically classified data. The entity algebra is compositional, with query inputs and outputs being sets of entities. Our language has linear data complexity in terms of space and quadratic data complexity in terms of time. We compare the expressive power of the entity algebra with relational algebra. We also describe an end-to-end query system based on the language in the context of an archeological database.
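The core idea, queries whose inputs and outputs are entity sets and therefore compose freely, can be sketched as follows; the facet encoding and helper names are illustrative and are not the paper's actual syntax.

```python
# Minimal sketch of an entity-algebra-style query over faceted data.
# Each entity is classified under one node per orthogonal facet; hierarchy
# is encoded here as slash-separated paths for simplicity.
from typing import Dict, Set

ENTITIES: Dict[str, Dict[str, str]] = {
    "pot1":  {"material": "ceramic/terracotta", "period": "bronze/early"},
    "pot2":  {"material": "ceramic/faience",    "period": "bronze/late"},
    "blade": {"material": "metal/bronze",       "period": "bronze/early"},
}

def below(facet: str, node: str) -> Set[str]:
    """Select entities classified at or under `node` in `facet`."""
    return {e for e, f in ENTITIES.items() if f[facet].startswith(node)}

# Compositionality: because inputs and outputs are entity sets, plain set
# operations give conjunction, disjunction, and difference of queries.
ceramic_early = below("material", "ceramic") & below("period", "bronze/early")
print(sorted(ceramic_early))  # ['pot1']
```

Selection by a hierarchy node is linear in the number of entities, which is consistent with the complexity bounds the abstract claims for the full language.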

    A Faceted Query Engine Applied to Archaeology

    In this demonstration, we describe a system for storing and querying faceted hierarchies. We have developed a general faceted domain model and a query language for hierarchically classified data. We present here the use of our system on two real archaeological datasets containing thousands of artifacts. Our system is a sharable, evolvable resource that can provide global access to sizeable datasets in queriable format, and can serve as a valuable tool for data analysis and research in many application domains.

    Querying Faceted Databases

    Faceted classification allows one to model applications with complex classification hierarchies using orthogonal dimensions. Recent work has examined the use of faceted classification for browsing and search. In this paper, we go further by developing a general query language, called the entity algebra, for hierarchically classified data. The entity algebra is compositional, with query inputs and outputs being sets of entities. Our language has linear data complexity in terms of space and quadratic data complexity in terms of time. We compare the entity algebra with the relational algebra in terms of expressiveness. We also describe an implementation of the language in the context of two application domains, one for an archeological database and another for a human anatomy database.

    Effective normalization for copy number variation detection from whole genome sequencing

    Background: Whole genome sequencing enables a high-resolution view of the human genome and provides unique insights into genome structure at an unprecedented scale. A number of tools exist to infer copy number variation (CNV) in the genome. These tools, while validated, include a number of parameters that are configurable to the genome data being analyzed. The algorithms allow for normalization to account for individual and population-specific effects on individual genome CNV estimates, but the impact of these choices on the estimated CNVs is not well characterized. We evaluate in detail the effect of normalization methodologies in two CNV algorithms, FREEC and CNV-seq, using whole genome sequencing data from 8 individuals spanning four populations. Methods: We apply FREEC and CNV-seq to a sequencing data set consisting of 8 genomes. We use multiple configurations corresponding to different read-count normalization methodologies in FREEC, and statistically characterize the concordance of the CNV calls between FREEC configurations and the analogous output from CNV-seq. The normalization methodologies evaluated in FREEC are GC content, mappability, and control genome. We further stratify the concordance analysis within genic, non-genic, and a collection of validated variant regions. Results: The GC content normalization methodology generates the highest number of altered copy number regions. Both mappability and control genome normalization reduce the total number and length of copy number regions. Relative to GC content normalization, mappability normalization yields Jaccard indices in the 0.07-0.3 range, whereas control genome normalization yields Jaccard index values around 0.4. The most critical impact of using mappability as a normalization factor is a substantial reduction of deletion CNV calls. The output of another method based on control genome normalization, CNV-seq, resulted in comparable CNV call profiles and substantial agreement in variable gene and CNV region calls. Conclusions: The choice of read-count normalization methodology has a substantial effect on CNV calls, and the use of genomic mappability or an appropriately chosen control genome can optimize the output of CNV analysis.
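The concordance measure reported in this abstract, a Jaccard index between two sets of CNV calls, can be sketched at base-pair resolution as intersection over union of the covered positions. The interval handling below is simplified to a single chromosome with half-open regions, and the example coordinates are invented.

```python
# Sketch of a base-pair-level Jaccard index between two CNV call sets:
# |intersection| / |union| of the genomic positions the calls cover.
from typing import List, Set, Tuple

Region = Tuple[int, int]  # half-open [start, end) on one chromosome

def covered(regions: List[Region]) -> Set[int]:
    """Expand regions into the set of base-pair positions they cover."""
    pos: Set[int] = set()
    for start, end in regions:
        pos.update(range(start, end))
    return pos

def jaccard(a: List[Region], b: List[Region]) -> float:
    ca, cb = covered(a), covered(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0

# Invented example calls from two hypothetical configurations.
calls_a = [(100, 200), (400, 500)]
calls_b = [(150, 250), (400, 450)]
print(round(jaccard(calls_a, calls_b), 3))  # 0.4
```

A production implementation would sweep sorted interval endpoints rather than materialize position sets, but the measure computed is the same.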