431 research outputs found

    Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments

    Search facilitated by agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) whose dimensionality was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clusters were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one by treating as relevant (A) all the relevant documents and (B) only the highly relevant documents. Single linkage (SL) was the worst method. It created many tiny clusters, and, consequently, searches enabled with it had high precision and low recall. The complete linkage (CL), average linkage (AL), and Ward's method (WM) returned reasonably sized clusters, typically of 18-32 documents. Their recall (A: 27-52%, B: 50-82%) was higher than that of the SL clusters, and their precision (A: 83-90%, B: 18-21%) was comparable to that of the SL clusters. The AL and WM clustering had 1-8% better effectiveness than nearest neighbor searching (NN), and SL and CL were 1-9% less efficient than NN. However, the differences were statistically insignificant. When evaluated with the liberal assessment A, the results suggest that the AL and WM clustering offer better retrieval ability than NN. Assessment B renders the AL and WM clustering better than NN when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered in the same way as collections in English, provided that documents are appropriately preprocessed.
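The collapse of a four-level relevance assessment into the binary variants A and B, and the resulting precision and recall of a retrieved cluster, can be sketched as follows. This is a minimal illustration, not the study's evaluation code: the 0-3 grade scale, the document ids, the grades, and the function names `binarize` and `precision_recall` are all assumptions made for the example.

```python
def binarize(grades, threshold):
    """Return the ids of documents whose graded relevance is >= threshold."""
    return {doc for doc, grade in grades.items() if grade >= threshold}


def precision_recall(retrieved, relevant):
    """Compute (precision, recall) of a retrieved cluster against a relevant set."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical grades on an assumed 0-3 scale (0 = irrelevant, 3 = highly relevant).
grades = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 3, "d6": 0}
# Hypothetical best-matching cluster returned for one query.
cluster = ["d1", "d2", "d3", "d4"]

relevant_a = binarize(grades, 1)  # assessment A: any relevant document counts
relevant_b = binarize(grades, 3)  # assessment B: only highly relevant documents count

scores_a = precision_recall(cluster, relevant_a)
scores_b = precision_recall(cluster, relevant_b)
print("A:", scores_a)
print("B:", scores_b)
```

With these made-up grades, the stricter assessment B shrinks the relevant set, so the same cluster scores lower precision under B than under A.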

    Methods and applications for ontology-based recommender systems

    Recommender systems are a specific type of information filtering system used to identify a set of objects that are relevant to a user. Instead of a user actively searching for information, recommender systems provide advice to users about objects they might wish to examine. Content-based recommender systems deal with problems related to analyzing the content, making heterogeneous content interoperable, and retrieving relevant content for the user. This thesis explores ontology-based methods to reduce these problems and evaluates the applicability of the methods in recommender systems. First, the content analysis is improved by developing an automatic annotation method that produces structured ontology-based annotations from text. Second, an event-based method is developed to enable interoperability of heterogeneous content representations. Third, methods for semantic content retrieval are developed to determine relevant objects for the user. The methods are implemented as part of recommender systems in two cultural heritage information systems: CULTURESAMPO and SMARTMUSEUM. The performance of the methods was evaluated through user studies. The results can be divided into five parts. First, the results show improvement in automatic content analysis compared to state-of-the-art methods and achieve performance close to that of human annotators. Second, the results show that the event-based method developed is suitable for bridging heterogeneous content representations. Third, the retrieval methods show accurate performance compared to user opinions. Fourth, semantic distance measures are compared to study the best query expansion strategy. Finally, practical solutions are developed to enable user profiling and result clustering. The results show that ontology-based methods enable interoperability of heterogeneous knowledge representations and result in accurate recommendations. The deployment of the methods in practical recommender systems shows the applicability of the results in real-life settings.

    Nodalida 2005 - proceedings of the 15th NODALIDA conference


    The anatomy of a search and mining system for digital humanities : Search And Mining Tools for Language Archives (SAMTLA)

    Humanities researchers are faced with an overwhelming volume of digitised primary source material, and "born digital" information, of relevance to their research as a result of large-scale digitisation projects. The current digital tools do not provide consistent support for analysing the content of digital archives that are potentially large in scale, multilingual, and come in a range of data formats. The current language-dependent, or project-specific, approach to tool development often puts the tools out of reach for many research disciplines in the humanities. In addition, the tools can be incompatible with the way researchers locate and compare the relevant sources. For instance, researchers are interested in shared structural text patterns, known as "parallel passages", that describe a specific cultural, social, or historical context relevant to their research topic. Identifying these shared structural text patterns is challenging due to their repeated yet highly variable nature, as a result of differences in the domain, author, language, time period, and orthography. The contribution of the thesis is a novel infrastructure that directly addresses the need for generic, flexible, extendable, and sustainable digital tools that are applicable to a wide range of digital archives and research in the humanities. The infrastructure adopts a character-level n-gram Statistical Language Model (SLM), stored in a space-optimised k-truncated suffix tree data structure, as its underlying data model. A character-level n-gram model is a relatively new approach that is competitive with word-level n-gram models, but has the added advantage that it is domain- and language-independent, requiring little or no preprocessing of the document text, unlike word-level models, which require some form of language-dependent tokenisation and stemming.
Character-level n-grams capture word-internal features that are ignored by word-level n-gram models, which provides greater flexibility in addressing the information need of the user through tolerant search, and compensation for erroneous query specification or spelling errors in the document text. Furthermore, the SLM provides a unified approach to information retrieval and text mining, where traditional approaches have tended to adopt separate data models that are often ad hoc or based on heuristic assumptions. In addition, the performance of the character-level n-gram SLM was formally evaluated through crowdsourcing, which demonstrates that the retrieval performance of the SLM is close to human-level performance. The proposed infrastructure supports the development of Samtla (Search And Mining Tools for Language Archives), which provides humanities researchers with digital tools for search, browsing, and text mining of digital archives in any domain or language, within a single system. Samtla supersedes many of the existing tools for humanities researchers by supporting the same or similar functionality, but with a domain-independent and language-independent approach. The functionality includes a browsing tool constructed from the metadata and named entities extracted from the document text, and a hybrid recommendation system for recommending related queries and documents. Some tools, however, are novel and were developed in response to the specific needs of the researchers, such as the document comparison tool for visualising shared sequences between groups of related documents. Furthermore, Samtla is the first practical example of a system with an SLM as its primary data model that supports the real research needs of several case studies covering different areas of research in the humanities.
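How a character-level n-gram language model can tolerate spelling variation in a query may be illustrated with a small sketch. It is not Samtla's implementation: plain `Counter` tables stand in for the k-truncated suffix tree, add-one smoothing is an assumed choice, and the names `char_ngrams`, `CharNgramModel`, and `rank`, as well as the two sample documents, are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Split lowercased text into overlapping character n-grams, left-padded with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model of one document with add-one smoothing (an assumption)."""

    def __init__(self, text, n, vocab_size):
        self.n = n
        self.vocab_size = vocab_size
        grams = char_ngrams(text, n)
        self.ngram_counts = Counter(grams)
        self.context_counts = Counter(g[:-1] for g in grams)

    def log_prob(self, query):
        """Sum of smoothed log P(char | previous n-1 chars) over the query."""
        total = 0.0
        for gram in char_ngrams(query, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def rank(query, docs, n=3):
    """Order document names by how likely each document's model makes the query."""
    alphabet = set(query.lower())
    for text in docs.values():
        alphabet |= set(text.lower())
    models = {name: CharNgramModel(text, n, len(alphabet)) for name, text in docs.items()}
    return sorted(docs, key=lambda name: models[name].log_prob(query), reverse=True)


# Hypothetical two-document archive; the query spells "colour" differently.
docs = {
    "doc_a": "the colour of the sea at dawn",
    "doc_b": "tax records of the parish",
}
ranking = rank("color of sea", docs)
print(ranking)
```

Because "color" and "colour" share most of their character trigrams, the first document still scores higher than the second without any tokenisation, stemming, or spelling normalisation.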

    Content-Based Image Retrieval Using Self-Organizing Maps


    Sustainable Mobility and Transport

    This Special Issue is dedicated to sustainable mobility and transport, with a special focus on technological advancements. Global transport systems are significant sources of air, land, and water emissions. A key motivator for this Special Issue was the diversity and complexity of mitigating transport emissions and of industry adaptation to increasingly strict regulation. Originally, the Special Issue called for papers devoted to all forms of mobility and transport. The papers published in this Special Issue cover a wide range of topics, aiming to increase understanding of the impacts and effects of mobility and transport in working towards sustainability, with most studies placing technological innovations at the heart of the matter. The goal of the Special Issue is to present research that focuses, on the one hand, on the challenges and obstacles of system-level decision making on clean mobility and, on the other, on the indirect effects caused by these changes.

    Meaning in Distributions : A Study on Computational Methods in Lexical Semantics

    This study investigates the connection between lexical items' distributions and their meanings from the perspective of computational distributional operations. When applying computational methods in meaning-related research, it is customary to refer to the so-called distributional hypothesis, according to which differences in distributions and meanings are mutually correlated. However, making use of such a hypothesis requires critical explication of the concept of distribution and plausible arguments for why any particular distributional structure is connected to a particular meaning-related phenomenon. In broad strokes, the present study seeks to chart the major differences in how the concept of distribution is conceived in the structuralist/autonomous and usage-based/functionalist theoretical families of contemporary linguistics. The two theoretical positions on distributions are studied to identify how meanings could enter them as enabling or constraining factors. The empirical part of the study comprises two case studies. In the first one, three pairs of antonymous adjectives (köyhä/rikas, sairas/terve and vanha/nuori) are studied distributionally. Very narrow bag-of-words vector representations of distributions show how the dimensions on which relevant distributional similarities are based already conflate an unexpected and varied range of linguistic phenomena, spanning from syntax-oriented conceptual constraints to connotations, pragmatic patterns and affectivity. Thus, the results simultaneously corroborate the distributional hypothesis and challenge its over-generalized, uncritical applicability. For the study of meaning, distributional and semantic spaces cannot be treated as analogous by default. In the second case study, a distributional operation is purposefully built to answer a research question related to the historical development of Finnish social law terminology in the period 1860–1910.
Using a method based on interlinked collocation networks, the study shows how the term vaivainen ('pauper, beggar, measly') receded from the prestigious legal and administrative registers during the studied period. Corroborating some of the findings of the previous parts of this dissertation, the case study shows how structures found in distributional representations cannot be satisfactorily explained without relying on semantic, pragmatic and discoursal interpretations. The analysis confirms the timeline of the studied word's use in the given register. It also shows how distributional methods based on networked patterns of co-occurrence highlight incomparable structures of very different natures and skew towards the frequent occurrence types prevalent in the data.
[Translated from the Finnish summary:] Using statistical models compiled from large text corpora, modern computational methods perform almost flawlessly many tasks that require understanding the meanings of words. From the standpoint of linguistic methodology, it is therefore interesting how well such methods suit the linguistic study of the meanings of linguistic structures. This doctoral study approaches the question from the perspective of lexical semantics and seeks to describe, both theoretically and empirically, what kinds of meaning computational methods based solely on sequences of words are able to capture. The dissertation consists of two case studies, the first of which examines three antonymous adjective pairs using a vector space model compiled from the Suomi24 corpus. The results show how even very narrow sequential contexts contain information not only about conceptual meanings but also about, among other things, their connotations and affectivity. However, the picture of meaning produced by the sequential context is unpredictable in its coverage, and the patterns of language use that are common in the research data clearly affect which features of meaning become visible. The second case study traces the history of a social law term, vaivainen, in the latter half of the 19th century in the historical digital newspaper collection of Kansalliskirjasto (the National Library of Finland). Co-occurrence networks are used to investigate how the term disappeared from legal language, by identifying in the data a structure corresponding to the administrative-legal register and following the position of vaivainen within it. However, the co-occurrence networks used as the method do not purely represent any particular register; they mix in features from the various categories with which language use has been described, for example, in text research. The densest networks arise from the combined effect of registers, genres, text types, and lexical cohesion. The results of the case study suggest that this is a common feature of many similar methods, including common topic models.