23 research outputs found

    Moving towards the semantic web: enabling new technologies through the semantic annotation of social contents.

    Social Web technologies have caused an exponential growth of the documents available through the Web, making enormous amounts of textual electronic resources available. Users may be overwhelmed by such an amount of content and, therefore, the automatic analysis and exploitation of all this information is of interest to the data mining community. Data mining algorithms exploit features of the entities in order to characterise, group or classify them according to their resemblance. Data by itself does not carry any meaning; it needs to be interpreted to convey information. Classical data analysis methods did not aim to "understand" the content: the data were treated as meaningless numbers, and statistics were calculated on them to build models that were interpreted manually by human domain experts. Nowadays, motivated by the Semantic Web, many researchers have proposed semantic-grounded data classification and clustering methods that are able to exploit textual data at a conceptual level. However, they usually rely on pre-annotated inputs to be able to semantically interpret textual data such as the content of Web pages. The usability of all these methods is related to the linkage between data and its meaning. This work focuses on the development of a general methodology able to detect the most relevant features of a particular textual resource, finding out their semantics (associating them with concepts modelled in ontologies) and detecting its main topics. The proposed methods are unsupervised (avoiding the manual annotation bottleneck), domain-independent (applicable to any area of knowledge) and flexible (able to deal with heterogeneous resources: raw text documents, semi-structured user-generated documents such as Wikipedia articles, or short and noisy tweets). The methods have been evaluated in different fields (Tourism, Oncology). This work is a first step towards the automatic semantic annotation of documents, needed to pave the way towards the Semantic Web vision.
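
    The pipeline the abstract outlines (salient-term extraction, linkage to ontology concepts, topic detection) can be illustrated with a minimal sketch. The toy ontology, the example sentence and the annotate helper below are assumptions for illustration, not the thesis's actual implementation.

```python
# Minimal, illustrative sketch of an unsupervised annotation pipeline:
# extract salient terms from raw text, link them to ontology concepts by
# label matching, and derive the main topics from the concepts found.
# The toy ontology and all names here are assumptions.
import re
from collections import Counter

# Toy ontology: concept label -> (concept URI, broader topic)
ONTOLOGY = {
    "hotel":      ("ex:Hotel",      "Accommodation"),
    "beach":      ("ex:Beach",      "Leisure"),
    "museum":     ("ex:Museum",     "Culture"),
    "restaurant": ("ex:Restaurant", "Gastronomy"),
}

def annotate(text: str, top_n: int = 3):
    """Link frequent terms to ontology concepts and rank the main topics."""
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(t for t in tokens if t in ONTOLOGY)
    annotations = {t: ONTOLOGY[t][0] for t in freq}            # term -> concept
    topics = Counter(ONTOLOGY[t][1] for t in freq.elements())  # weighted topics
    return annotations, topics.most_common(top_n)

doc = "The hotel is near the beach; the museum and a restaurant are close."
print(annotate(doc))
```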

    Investigating chromospheric magnetic activity on young stars and the Wide Field CAMera for UKIRT

    Hertzsprung-Russell diagrams, when combined with theoretical evolutionary tracks, are one of the most important tools for understanding pre-main sequence evolution. They are used not only to deduce the properties of the stars they chart but also to estimate the ages of the clusters that house them and to investigate age spreads across episodes of star formation. It is therefore vital that the determination of these diagrams and tracks is built on solid theoretical and observational foundations. However, work in recent years points to a potential problem. It has long been known that pre-main sequence stars exhibit regions of magnetic activity on their stellar surfaces similar to active regions observed on the Sun. What is not yet well known is the extent to which these active regions cover the stellar surfaces. Most spectral classification relies on moderate-resolution optical spectra, which tend to be dominated by the non-active photosphere, which is hotter than the active regions. Resultant effective temperatures are overestimated if a large portion of the pre-main sequence stellar surface is covered in active regions, which in turn can lead to substantial errors in mass and age calculations. This thesis presents a novel approach to measuring the distribution of magnetic regions on T Tauri stars which aims to overcome limitations of other observing techniques such as Doppler imaging or Zeeman measurements. The central line emission from the strong visible Ca II H & K lines is a proxy indicator of surface magnetic fields and is known through observations of the Sun to be enhanced above active plage regions. Simultaneous optical spectroscopic and photometric observations of a significant sample of fast-rotating T Tauri stars in the nearby ρ Ophiuchi and Upper Scorpius clusters have allowed us to ascertain a direct correlation between variations in the Ca II doublet emission and light intensities. Computer simulations which model the surface conditions as understood on T Tauri stars, and which generate correlations that mimic those in the observational data, offer a manipulable tool for estimating how much of the stellar surface is covered. The Wide Field CAMera for the UKIRT telescope on Mauna Kea is currently the most capable wide-field infrared imaging survey camera in the world. The instrument focal plane consists of four Hawaii-II 2048 × 2048 IR detectors; to facilitate the best operating conditions and practices for the camera, these detectors must be carefully characterised so that inherent qualities can be either corrected or accounted for. The second part of this thesis details the detector characterisation work carried out prior to the instrument's delivery to the telescope. Obtaining a correct and stable operating temperature, regardless of the ambient temperature in the dome enclosure, is key to the camera functioning optimally and carrying out highly successful surveys. Presented here is a full model of the camera's thermal behaviour for the main instrument and the infrared detectors.
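
    A minimal sketch of the kind of correlation analysis the abstract describes, under the assumption that a dark active region dims the broadband light curve while enhancing Ca II H & K emission above its plage. The synthetic light curve, emission index and spot amplitude are invented for illustration.

```python
# Illustrative sketch: correlate simultaneous Ca II H & K emission indices
# with photometric brightness for a spotted star. Synthetic data only; the
# thesis's actual reduction and simulation pipeline is not reproduced here.
import numpy as np

rng = np.random.default_rng(42)
phase = np.linspace(0.0, 1.0, 50)            # rotational phase of the star

# A dark, magnetically active region rotating into view dims the star
# while enhancing Ca II emission above the associated plage.
spot_visibility = np.clip(np.cos(2 * np.pi * phase), 0.0, None)
brightness = 1.0 - 0.05 * spot_visibility + rng.normal(0, 0.005, phase.size)
caii_index = 0.30 + 0.10 * spot_visibility + rng.normal(0, 0.01, phase.size)

# An anti-correlation between light intensity and Ca II emission points to
# dark active regions; its strength constrains the covering fraction.
r = np.corrcoef(brightness, caii_index)[0, 1]
print(f"Pearson r between brightness and Ca II index: {r:.2f}")
```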

    Oncogenomic screening strategies to identify driver genes in acute myeloid leukaemia

    Deletions of the short arm of chromosome 12 (12p) are found in around 6% of acute myeloid leukaemia (AML) cases. Particularly in paediatric AML, they often occur as the sole cytogenetic change and are associated with a poor prognosis. Amplifications of the long arm of chromosome 11 (11q) are rarer, found in around 1% of AML cases, but are particularly prevalent in the poor-prognosis complex karyotype group. Despite multiple deletion mapping studies, single genes responsible for driving leukaemic progression have not been identified from these regions, so it is clear that a functional approach is required. This study aimed to functionally implicate significant genes through the use of competitive selection assays. Publicly available array data from eight studies were used to determine copy number from AML patient samples and to define minimal regions of copy number alteration. Data were obtained from 866 samples which had been analysed on Affymetrix SNP array platforms. In total, 58 samples (6.7%) were found to have deletions of 12p and were used to determine a minimally deleted region (MDR), whilst five samples (0.6%) were identified with amplifications of 11q and used to define a minimally amplified region (MAR). AML cell lines with deletions of 12p (NKM-1 and GDM-1) and amplification of 11q (UoC-M1) were selected to investigate the functional relevance of the genes within the regions of interest. Cell lines were used in vitro and in immunocompromised mouse models, where leukaemia was established by intrafemoral injection and monitored by luminescent imaging of luciferase expression constructs. Taking a pooled approach, 11 genes within the 12p MDR were overexpressed and 30 genes from the 11q MAR were knocked down using integrated lentiviral vectors. To evaluate the effects of expression on leukaemia growth or survival, changes in construct copy number after cell line expansion in vitro and in vivo were determined by targeted high-throughput sequencing on the Illumina MiSeq platform. A construct for cyclin-dependent kinase inhibitor 1B (CDKN1B) was rapidly selected against in the 12p assay, demonstrating a strong anti-leukaemic effect for expression of this gene, whilst knockdown constructs for MPZL3 and UBE4A were selected against in the 11q assay. Functional follow-up work focussed mainly on CDKN1B: its expression was measured in a range of cell lines and patient samples, its effects on the cell cycle were assessed, and a correlation between CDKN1B expression and response to treatment with a cyclin-dependent kinase inhibitor (flavopiridol) was established.
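
    The pooled competitive-selection readout can be sketched as a simple log fold-change calculation over construct read counts before and after expansion. The counts, pseudocount and drop-out threshold below are illustrative assumptions, not data from the study.

```python
# Illustrative sketch of the pooled competitive-selection readout: compare
# per-construct read counts before and after expansion; constructs whose
# abundance collapses (e.g. CDKN1B overexpression) are "selected against".
# All counts and the threshold here are assumptions for illustration.
import math

baseline = {"CDKN1B": 950, "GENE_A": 1010, "GENE_B": 880}   # day-0 counts
endpoint = {"CDKN1B": 40,  "GENE_A": 1120, "GENE_B": 905}   # post-expansion

def log2_fold_changes(before, after, pseudocount=1):
    """Normalise counts to frequencies and compute per-construct log2 FC."""
    total_b, total_a = sum(before.values()), sum(after.values())
    lfc = {}
    for gene in before:
        f_b = (before[gene] + pseudocount) / total_b
        f_a = (after[gene] + pseudocount) / total_a
        lfc[gene] = math.log2(f_a / f_b)
    return lfc

for gene, fc in sorted(log2_fold_changes(baseline, endpoint).items(),
                       key=lambda kv: kv[1]):
    flag = "selected against" if fc < -2 else ""
    print(f"{gene:8s} log2FC = {fc:+.2f} {flag}")
```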

    What Controls the Controller: Structure and Function Characterizations of Transcription Factor PU.1 Uncover Its Regulatory Mechanism

    The ETS-family transcription factor PU.1/Spi-1 is a master regulator of the self-renewal of hematopoietic stem cells and of their differentiation along both major lymphoid and myeloid branches. PU.1 activity is determined in a dosage-dependent manner as a function of both its expression and its real-time regulation at the DNA level. While control of PU.1 expression is well established, the molecular mechanisms of its real-time regulation remain elusive. Our work is focused on discovering a complete regulatory mechanism that governs the molecular interactions of PU.1. Structurally, PU.1 exhibits a classic transcription factor architecture in which intrinsically disordered regions (IDRs), comprising 66% of its primary structure, are tethered to a well-structured DNA-binding domain. The transcriptionally active form of PU.1 is a monomer that binds target DNA sites as a 1:1 complex. Our investigations show that the IDRs of PU.1 reciprocally control two separate inactive dimeric forms, with and without DNA. At high concentrations, PU.1 forms a non-canonical 2:1 complex at a single specific DNA site. In the absence of DNA, PU.1 also forms a dimer, but one that is incompatible with DNA binding. The DNA-free PU.1 dimer is further promoted by phosphomimetic mutants of IDR residues that are phosphorylated in B-lymphocytic activation. These results lead us to postulate a model of real-time PU.1 regulation, unknown in the ETS family, in which independent dimeric forms antagonize each other to control the dosage of active PU.1 monomer at its target DNA sites. To demonstrate the biological relevance of our model, cellular assays probing PU.1-specific reporters and native target genes show that PU.1 transactivation exhibits a distinct dose response consistent with negative feedback. In summary, we have established the first model for the general real-time regulation of PU.1 at the DNA/protein level, without the need to recruit specific binding partners. These novel interactions present potential therapeutic targets for correcting deregulated PU.1 dosage in hematologic disorders, including leukemia, lymphoma, and myeloma.
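
    The antagonism between active monomer and inactive dimer implies a simple mass-action picture of dosage control. A minimal sketch follows, assuming a 2M ⇌ D equilibrium and an invented dissociation constant rather than the study's fitted values; it shows how the active monomer fraction shrinks as total PU.1 rises.

```python
# Illustrative mass-action sketch of the dosage model described above: a
# 2M <-> D equilibrium sequesters PU.1 into an inactive dimer, so the active
# monomer concentration grows sub-linearly with total protein, consistent
# with negative feedback. The Kd below is an assumed, not fitted, value.
import math

def active_monomer(total_uM: float, kd_uM: float) -> float:
    """Solve M + 2*M**2/Kd = total for the free monomer M (in uM)."""
    # Mass balance gives the quadratic (2/Kd)*M^2 + M - total = 0;
    # return its positive root.
    return kd_uM * (math.sqrt(1.0 + 8.0 * total_uM / kd_uM) - 1.0) / 4.0

for total in (0.1, 1.0, 10.0):
    m = active_monomer(total, kd_uM=1.0)
    print(f"total PU.1 = {total:5.1f} uM -> active monomer = {m:.2f} uM "
          f"({100 * m / total:.0f}% of pool)")
```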

    A Ranking Approach to Summarising Twitter Home Timelines

    The rise of social media services has changed the ways in which users communicate and consume content online. Whilst online social networks allow for fast and convenient delivery of knowledge, users are prone to information overload when too much information is presented for them to read and process. Automatic text summarisation is a tool to help mitigate information overload. In automatic text summarisation, short summaries are generated algorithmically from extended text, such as news articles or scientific papers. This thesis addresses the challenges in applying text summarisation to the Twitter social network. It also goes beyond text, exploiting additional information that is unique to social networks to create summaries which are personal to an intended reader. Unlike previous work in tweet summarisation, the experiments here address the home timelines of readers, which contain the incoming posts from authors to whom they have explicitly subscribed. A novel contribution is made in this work in the form of a large gold standard (19,350 tweets), the majority of which will be shared with the research community. The gold standard is a collection of timelines that have been subjectively annotated by the readers to whom they belong, allowing fair evaluation of summaries which are not limited to tweets of general interest, but which are specific to the reader. Where the home timeline is used by professional users for social media analysis, automatic text summarisation can be applied to give results which beat all baselines. In the general case, where no limitation is placed on the types of readers, personalisation features, which exploit the relationship between author and reader and the reader's own previous posts, were shown to outperform both automatic text summarisation and all baselines.
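
    A hedged sketch of the ranking approach: each tweet is scored with simple content and personalisation features, and the top-k are kept as the extractive summary. The feature set, weights and toy timeline below are assumptions, not the thesis's model.

```python
# Illustrative sketch of ranking-based extractive summarisation of a home
# timeline: score each tweet with simple content and personalisation
# features, then keep the top-k as the summary. Features and weights are
# assumptions chosen only to make the ranking idea concrete.
def score(tweet: dict, reader_interests: set[str]) -> float:
    words = set(tweet["text"].lower().split())
    overlap = len(words & reader_interests)      # personalisation signal
    length_bonus = min(len(words), 20) / 20      # prefer informative tweets
    social = tweet.get("retweets", 0) ** 0.5     # dampened popularity
    return 2.0 * overlap + 1.0 * length_bonus + 0.5 * social

def summarise(timeline: list[dict], reader_interests: set[str], k: int = 2):
    """Rank the timeline by score and return the top-k tweets."""
    return sorted(timeline, key=lambda t: score(t, reader_interests),
                  reverse=True)[:k]

timeline = [
    {"text": "New summarisation dataset released for Twitter", "retweets": 12},
    {"text": "Lunch was great today", "retweets": 1},
    {"text": "Our ranking model beats all baselines on timelines", "retweets": 7},
]
print(summarise(timeline, reader_interests={"summarisation", "ranking", "twitter"}))
```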

    Early steps towards web scale information extraction with LODIE

    Information extraction (IE) is the technique for transforming unstructured textual data into a structured representation that can be understood by machines. The exponential growth of the Web generates an exceptional quantity of data for which automatic knowledge capture is essential. This work describes the methodology for web-scale information extraction in the LODIE project (linked open data information extraction) and highlights results from the early experiments carried out in the initial phase of the project. LODIE aims to develop information extraction techniques able to scale to web level and adapt to user information needs. The core idea behind LODIE is the use of linked open data, a very large-scale information resource, as a ground-breaking solution for IE, one which provides invaluable annotated data on a growing number of domains. This article has two objectives: first, to describe the LODIE project as a whole, depicting its general challenges and directions; second, to describe some initial steps taken towards the general solution, focusing on a specific IE subtask, wrapper induction.
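
    Wrapper induction can be sketched in its classic left/right-context form, with values already known from linked open data standing in for the supervision LODIE derives from LOD. The toy pages and helper names below are assumptions; LODIE's actual induction is considerably more sophisticated.

```python
# Illustrative sketch of LOD-supervised wrapper induction: given pages in
# which values already known from linked open data (e.g. DBpedia) appear,
# induce a left/right string wrapper and apply it to an unseen page.
# Pages and names are toy assumptions for illustration only.
import os

def induce_lr_wrapper(pages, known_values):
    """Find the common left/right contexts around the known values."""
    lefts, rights = [], []
    for page, value in zip(pages, known_values):
        i = page.index(value)
        lefts.append(page[:i])
        rights.append(page[i + len(value):])
    # Longest common suffix of the left contexts (via reversal) and
    # longest common prefix of the right contexts.
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    right = os.path.commonprefix(rights)
    return left, right

def apply_wrapper(page, left, right):
    """Extract the value delimited by the induced contexts."""
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

pages = ["<li>Director: <b>Ridley Scott</b></li>",
         "<li>Director: <b>Jane Campion</b></li>"]
left, right = induce_lr_wrapper(pages, ["Ridley Scott", "Jane Campion"])
print(apply_wrapper("<li>Director: <b>Greta Gerwig</b></li>", left, right))
```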