Dynamic hyperparameter optimization for bayesian topical trend analysis
This paper presents a new approach to Bayesian topical trend analysis. We regard the parameters of the topic Dirichlet priors in latent Dirichlet allocation as a function of document timestamps and optimize the parameters with a gradient-based algorithm. Since our method assigns similar hyperparameters to documents with similar timestamps, topic assignment in collapsed Gibbs sampling is affected by timestamp similarities. We compute TFIDF-based document similarities using a result of collapsed Gibbs sampling and evaluate our proposal on the link detection task of Topic Detection and Tracking.
Proceedings of the 18th ACM conference, Hong Kong, China, 2009.11.02-2009.11.0
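The TFIDF-based similarity component can be sketched as follows (a minimal, generic illustration; the paper's version additionally uses the topic assignments from collapsed Gibbs sampling, which this sketch omits):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF weight vectors for tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["plane", "crash", "airport"],
        ["plane", "crash", "sea"],
        ["election", "vote", "president"]]
vecs = tfidf_vectors(docs)
```

With this setup, the two crash reports score higher against each other than against the unrelated election story.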
Applying semantic classes in automatic news tracking
Topic detection and tracking (TDT) is an area of information retrieval research focused on news events. The problems TDT deals with include segmenting news text into cohesive stories, detecting something new and previously unreported, tracking the development of a previously reported event, and grouping together news stories that discuss the same event. The performance of traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems. In particular, it has been difficult to distinguish between the same and merely similar events.
In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling.
Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of class-wise similarities.
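The class-wise comparison could be sketched as follows (a minimal illustration; the class names, weights, and the single shared similarity measure below are hypothetical placeholders, whereas the framework allows each class its own measure, e.g. a geographical taxonomy for locations):

```python
def jaccard(a, b):
    """Simple set-overlap similarity, used here for every class."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def class_wise_similarity(doc1, doc2, weights, sim=jaccard):
    """Weighted combination of per-class similarities.

    doc1/doc2 map class names (e.g. 'terms', 'people', 'locations')
    to the terms of that class; weights maps class names to weights.
    """
    total = sum(weights.values())
    return sum(w * sim(doc1.get(c, []), doc2.get(c, []))
               for c, w in weights.items()) / total

d1 = {"terms": ["crash", "plane"], "locations": ["Helsinki"], "people": ["Smith"]}
d2 = {"terms": ["crash", "plane"], "locations": ["Madrid"], "people": ["Jones"]}
weights = {"terms": 1.0, "locations": 1.0, "people": 1.0}
```

Here two crash stories agree fully on general terms but not on locations or people, so the combined score stays well below 1.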
Third, we incorporate temporal information into document similarity. We formalize the natural language temporal expressions occurring in the text, and use them to anchor the rest of the terms onto the time-line. Upon comparing documents for event-based similarity, we look not only at matching terms, but also at how near their anchors are on the time-line.
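Anchor-aware matching of this kind could be sketched as follows (a simplified illustration with hypothetical day-offset anchors and an exponential decay; the formalization of natural-language temporal expressions is assumed to happen upstream):

```python
import math

def anchored_similarity(doc1, doc2, decay=7.0):
    """Compare two documents by their shared terms, discounting each
    match by the distance (in days) between the terms' time-line anchors.

    Each document maps a term to the day offset it is anchored to.
    """
    shared = set(doc1) & set(doc2)
    if not shared:
        return 0.0
    score = sum(math.exp(-abs(doc1[t] - doc2[t]) / decay) for t in shared)
    return score / max(len(doc1), len(doc2))

# Two crash reports with identical vocabulary:
a = {"plane": 0, "crash": 0}    # event reported on day 0
b = {"plane": 1, "crash": 0}    # follow-up a day later: high similarity
c = {"plane": 90, "crash": 90}  # similar event three months later: low similarity
```

Identical words anchored far apart on the time-line thus contribute almost nothing, which is exactly what separates two distinct plane crashes.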
Fourth, we experiment with an adaptive variant of the semantic class similarity system. The news reflects changes in the real world, and in order to keep up, the system has to adapt its behavior to the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experimental results.
We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4%. Anchoring the text to a time-line based on the temporal expressions gave a further 10% increase in the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.
Automatic tracking of news events is a research area within computer science and, more specifically, information retrieval, which develops methods for managing a digital news stream. The news stream consists of several, possibly multilingual, news sources that may contain digital online news as well as radio and television news. The research problems of the area comprise detecting new, previously unreported news events, tracking the development of identified news events, grouping news by content, and segmenting the news stream into news stories. This work focuses on the first two of these problems.
Traditional information retrieval methods, which still form the basis of Internet search systems, compare text documents as bags of words and treat words as simple character strings, which enables fast retrieval times and reasonably good results but loses the meanings of the words. Traditional methods have, however, not worked particularly well in event-based news tracking. It has been especially difficult to recognize two news events of the same type, e.g. two plane crashes, as distinct events, because their news coverage contains largely the same words.
This work explores new ways of representing and comparing news. First, words are grouped by their meanings into sets of similar words, i.e., semantic classes. The work uses semantic classes such as general terms, organizations, persons, location expressions, and temporal expressions, which roughly answer the questions what, who, when, and where. Within each class, words can be compared in slightly different ways: for location expressions, two different cities or countries can be recognized as geographically close, and for organization names, two names can be recognized as referring to the same organization. A semantic class can be backed by a taxonomy of words or some other structure through which the relationship between the words of the class can be determined.
In addition, temporal expressions (e.g. 'yesterday', 'in February two years ago') are recognized in the text, and the text is anchored to the time-line with their help. This makes it possible to recognize that, when different news events are discussed, the same word, e.g. 'plane crash', is used in a different temporal context.
News documents are compared one semantic class at a time, and recognition relies on a combination of these class-wise results. Thus two plane-crash stories can be similar with respect to general terms but different with respect to locations and temporal expressions, because they happen in different places at different times.
News events come in many kinds, and neither reality nor the news that reports on it bends fully to neat models. Nevertheless, the results show that the use of semantic classes markedly improves the accuracy of news event tracking compared with the traditional approach, and the detection of new events somewhat less
Entity-based Enrichment for Information Extraction and Retrieval
The goal of this work is to leverage cross-document entity relationships for improved understanding of queries and documents. We define an entity to be a thing or concept that exists in the world, such as a politician, a battle, a film, or a color. Entity-based enrichment (EBE) is a new expansion model for both queries and documents using features from similar entity mentions in the document collection and external knowledge resources. It uses task-specific features from entities beyond words that include: name aliases, fine-grained entity types, categories, and relationships to other entities. EBE addresses the problem of sparse or noisy local evidence due to multiple topics, implicit context, or informal writing. With the ultimate goal of improving information retrieval effectiveness, we start from unstructured text and through information extraction build up rich entity-based representations linked to external knowledge resources. We study the application of entity-based enrichment to each step in the pipeline: 1) Named entity recognition, 2) Entity linking, and 3) Ad hoc document retrieval. The empirical results for EBE in each of these tasks show significant improvements. Our first application of entity-based enrichment is the task of detecting entities in named entity recognition. We enrich the representation of observed words likely to represent entities. We perform weighted feature copying of recognition features from similar tokens in the corpus and external collections. The evaluation shows statistically significant improvements on in-domain newswire accuracy and demonstrates that the models are more robust on out-of-domain data. In the second part of this work, we apply EBE to the task of entity linking. The proposed entity linking method for disambiguating the detected mentions to entries in an external knowledge base is based on information retrieval.
The neighborhood relevance model, an enrichment model, identifies salient associations between an entity mention and other entity mentions in the document. The results show that the enrichment model outperforms other context models and results in a system that is competitive with leading methods. Using the constructed entity representation, the final task is ad hoc document retrieval. We enrich the query representation with entity features. Retrieval is performed over documents annotated with entities linked to Wikipedia and Freebase from our system as well as the publicly available Google FACC1 annotations. To effectively leverage linked entity features, we extend dependency-based retrieval models to include structured attributes. We also define a new query-specific entity context model which builds a model for disambiguated entities from retrieved documents. Our results demonstrate significant improvements over competitive query expansion baselines for several standard test collections.
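As a toy illustration of the query-side enrichment (the knowledge-base entry and feature fields below are hypothetical placeholders, not the paper's actual feature set or weighting):

```python
# Hypothetical miniature knowledge base: entity name -> entity features.
KB = {
    "NASA": {
        "aliases": ["National Aeronautics and Space Administration"],
        "types": ["government agency"],
        "related": ["space shuttle"],
    },
}

def enrich_query(tokens, kb=KB):
    """Expand query tokens with the aliases, types, and related
    entities of any linked entity mention (entity-based enrichment)."""
    expanded = list(tokens)
    for tok in tokens:
        entry = kb.get(tok)
        if entry:
            for field in ("aliases", "types", "related"):
                expanded.extend(entry.get(field, []))
    return expanded

q = enrich_query(["NASA", "budget"])
```

The expanded query now matches documents that spell out the agency's full name even when the original query used only the acronym.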
SRL2003 IJCAI 2003 Workshop on Learning Statistical Models from Relational Data
Knowledge-Driven Harmonization of Sensor Observations: Exploiting Linked Open Data for IoT Data Streams
The rise of the Internet of Things leads to an unprecedented number of continuous sensor observations that are available as IoT data streams. Harmonization of such observations is a labor-intensive task due to heterogeneity in format, syntax, and semantics. We aim to reduce the effort for such harmonization tasks by employing a knowledge-driven approach. To this end, we pursue the idea of exploiting the large body of formalized public knowledge represented as statements in Linked Open Data.
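The knowledge-driven idea can be illustrated with a minimal sketch (the unit facts below are hand-written stand-ins for statements that would, in the actual approach, be retrieved from Linked Open Data rather than coded per source):

```python
# Hypothetical facts about unit labels, as they might be derived from
# Linked Open Data: observed label -> (canonical unit, conversion).
UNIT_FACTS = {
    "degF":    ("degC", lambda v: (v - 32) * 5 / 9),
    "°C":      ("degC", lambda v: v),
    "celsius": ("degC", lambda v: v),
}

def harmonize(observation):
    """Normalize one sensor observation to a canonical unit using
    knowledge-base facts instead of hand-written per-source code."""
    unit, convert = UNIT_FACTS[observation["unit"]]
    return {"value": convert(observation["value"]), "unit": unit}

out = harmonize({"value": 212, "unit": "degF"})
```

Heterogeneous streams reporting `degF`, `°C`, or `celsius` all arrive at the same canonical representation without source-specific glue code.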
QNRs: toward language for intelligent machines
Impoverished syntax and nondifferentiable vocabularies make natural language a poor medium for neural representation learning and applications. Learned, quasilinguistic neural representations (QNRs) can upgrade words to embeddings and syntax to graphs to provide a more expressive and computationally tractable medium. Graph-structured, embedding-based quasilinguistic representations can support formal and informal reasoning, human and inter-agent communication, and the development of scalable quasilinguistic corpora with characteristics of both literatures and associative memory.
To achieve human-like intellectual competence, machines must be fully literate, able not only to read and learn, but to write things worth retaining as contributions to collective knowledge. In support of this goal, QNR-based systems could translate and process natural language corpora to support the aggregation, refinement, integration, extension, and application of knowledge at scale. Incremental development of QNR-based models can build on current methods in neural machine learning, and as systems mature, could potentially complement or replace today's opaque, error-prone "foundation models" with systems that are more capable, interpretable, and epistemically reliable. Potential applications and implications are broad.