114 research outputs found

Managing Keyword Variation with Frequency Based Generation of Word Forms in IR

Author: Kettunen Kimmo
Publication date: 01/01/2007

Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007. Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit. University of Tartu, Tartu, 2007. ISBN 978-9985-4-0513-0 (online) ISBN 978-9985-4-0514-7 (CD-ROM) pp. 318-323

CiteSeerX

DSpace at Tartu University Library

Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

Authors: Kettunen Kimmo Tapio; Löfberg Laura
Publication venue: Linköping University Electronic Press
Publication date: 01/05/2017

Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system’s performance is genre and domain dependent. The entity categories used also vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a three-part categorization of locations, persons and corporations. In this paper we report evaluation results of NER with two different data sets: the digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. The historical newspaper collection Digi contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 70–75%, and its NER evaluation collection consists of 75 931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper. Our newly evaluated tool for NER tagging is non-conventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. The FST achieves F-scores of up to 55–61 with locations and 51–52 with persons on the historical newspaper data, and its performance is comparable to FiNER’s with locations. With the modern Finnish technology news of Digitoday, FiNER achieves F-scores of up to 79 with locations. Person names show the worst performance; their F-score varies from 33 to 66.
The FST performs as well as FiNER with Digitoday’s location names, but is worse with persons. With corporations, the FST is at its worst, while FiNER performs reasonably well. Overall, our results show that a general semantic tool like the FST is able to perform a restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging does not need to be only a task of dedicated NE taggers; it can be performed equally well with more general multipurpose semantic tools. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Soft detection and decoding in wideband CDMA systems

Author: Kettunen Kimmo
Publication venue: Teknillinen korkeakoulu
Publication date: 01/01/2003

A major shift is taking place in the world of telecommunications towards a communications environment where a range of new data services will be available for mobile users. This shift is already visible in several areas of wireless communications, including cellular systems, wireless LANs, and satellite systems. The provision of flexible high-quality wireless data services requires a new approach on both the radio interface specification and the design and the implementation of the various transceiver algorithms. On the other hand, when the processing power available in the receivers increases, more complex receiver algorithms become feasible. The general problem addressed in this thesis is the application of soft detection and decoding algorithms in the wideband code division multiple access (WCDMA) receivers, both in the base stations and in the mobile terminals, so that good performance is achieved but that the computational complexity remains acceptable. In particular, two applications of soft detection and soft decoding are studied: coded multiuser detection in the CDMA base station and improved RAKE-based reception employing soft detection in the mobile terminal. For coded multiuser detection, we propose a novel receiver structure that utilizes the decoding information for multiuser detection. We analyze the performance and derive lower bounds for the capacity of interference cancellation CDMA receivers when using channel coding to improve the reliability of tentative decisions. For soft decision and decoding techniques in the CDMA downlink, we propose a modified maximal ratio combining (MRC) scheme that is more suitable for RAKE receivers in WCDMA mobile terminals than the conventional MRC scheme. We also introduce an improved soft-output RAKE detector that is especially suitable for low spreading gains and high-order modulation schemes. Finally we analyze the gain obtained through the use of Brennan's MRC scheme and our modified MRC scheme. 
Throughout this thesis, Bayesian networks are utilized to develop algorithms for soft detection and decoding problems. This approach originates from the initial stages of this research, where Bayesian networks and algorithms using such graphical models (e.g. the so-called sum-product algorithm) were used to identify new receiver algorithms. In the end, this viewpoint may not be easily noticeable in the final form of the algorithms, mainly because practical efficiency considerations forced us to select simplified variants of the algorithms. However, this viewpoint is important to emphasize the underlying connection between the apparently different soft detection and decision algorithms described in this thesis. Reviewed

CiteSeerX

Aaltodoc Publication Archive

Sanoja analysoivat ja tuottavat ohjelmat hakutermien vaihtelun hallinnassa tekstitiedonhaussa [Word-analyzing and word-generating programs in managing search term variation in text retrieval]

Author: Kettunen Kimmo
Publication venue: Informaatiotutkimuksen yhdistys ry
Publication date: 01/01/2008

Directory of Open Access Journals

Journal.fi

National Library of Finland DSpace Services

Tavut sananmuotojen vaihtelun hallinnan välineinä tekstitiedonhaussa [Syllables as tools for managing word form variation in text retrieval]

Author: Kimmo Kettunen
Publication venue: Informaatiotutkimuksen yhdistys ITY ry
Publication date: 01/10/2010

Directory of Open Access Journals

Journal.fi

FiST – towards a Free Semantic Tagger of Modern Standard Finnish

Author: Kettunen Kimmo Tapio
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2019

This paper introduces work in progress on implementing a free full-text semantic tagger for Finnish, FiST. The tagger is based on a 46 226 lexeme semantic lexicon of Finnish that was published in 2016. The basis of the semantic lexicon was developed in the early 2000s in the EU-funded project Benedict (Löfberg et al., 2005). Löfberg (2017) describes the compilation of the lexicon and evaluates a proprietary version of the Finnish Semantic Tagger, the FST. The FST and its lexicon were developed using the English Semantic Tagger (the EST) of Lancaster University as a model. This semantic tagger was developed at the University Centre for Corpus Research on Language (UCREL) at Lancaster University as part of the UCREL Semantic Analysis System (USAS) framework. The semantic lexicon of the USAS framework is based on the modified and enriched categories of the Longman Lexicon of Contemporary English (McArthur, 1981). We have implemented a basic working version of a new full-text semantic tagger for Finnish based on freely available components. The implementation uses Omorfi and FinnPos for morphological analysis of Finnish words. After the morphological recognition phase, words from the 46K semantic lexicon are matched against the morphologically unambiguous base forms. In our comprehensive tests, the lexical tagging coverage of the current implementation is around 82–90% with different text types. The present version still needs some enhancements, at least processing of the semantic ambiguity of words and analysis of compounds, and perhaps also treatment of multiword expressions. A semantically marked ground truth evaluation collection should also be established for evaluation of the tagger. Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals – Collected Notes on Quality Improvement

Authors: Kettunen Kimmo Tapio; Koistinen Jani Mika Olavi
Publication venue: CEUR-WS.org
Publication date: 06/03/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Miten menee, markkinointitiede? : professori Rami Olkkosen juhlakirja [How are you doing, marketing science? A Festschrift for Professor Rami Olkkonen]

Authors: Kerttu Kettunen; Kimmo Alajoutsijärvi
Publication venue: Society of Social and Economic Research in the Universities of Turku
Publication date: 28/10/2022

UTUPub

Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals

Authors: Kervinen Jukka; Kettunen Kimmo Tapio; Koistinen Jani Mika Olavi
Publication date: 03/04/2018

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.9 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. The years 1920–1929 were opened in January 2018. This paper presents the ground truth Optical Character Recognition data of about 500 000 Finnish words that has been compiled at the NLF for development of a new OCR process for the collection. We discuss compilation of the data and show basic results of the new OCR process in comparison to the current OCR using the ground truth data. Peer reviewed

Helsingin yliopiston digitaalinen arkisto