
    A random forest system combination approach for error detection in digital dictionaries

    When digitizing a print bilingual dictionary, whether via optical character recognition or manual entry, it is inevitable that errors are introduced into the electronic version that is created. We investigate automating the process of detecting errors in an XML representation of a digitized print dictionary using a hybrid approach that combines rule-based, feature-based, and language model-based methods. We investigate combining methods and show that using random forests is a promising approach. We find that in isolation, unsupervised methods rival the performance of supervised methods. Random forests typically require training data, so we investigate how we can apply random forests to combine individual base methods that are themselves unsupervised without requiring large amounts of training data. Experiments reveal empirically that a relatively small amount of data is sufficient and can potentially be further reduced through specific selection criteria. Comment: 9 pages, 7 figures, 10 tables; appeared in Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, April 201

    Full Carbon Account for Russia.

    The Forestry Project (FOR) at IIASA has produced a full carbon account (FCA) for Russia for 1990, together with scenarios for 2010. Currently, there are rather big question marks regarding the existing carbon accounts for Russia, and Russia is critical to the global carbon balance due to its size. IIASA is in a position to perform solid analysis of Russia because of the databases that the Institute has built over the years. FOR based this work on a comprehensive geographic information system comprising georeferenced descriptions of the environment and land of Russia, which in turn are based on a number of thematic, digitized maps and databases. For the Russian energy sector and other industrial sectors (except the forest industry), the project used emissions estimates from the recent IIASA study "Global Energy Perspectives" (1998). The project carried out a separate substudy for the Russian forest industry sector. According to FOR's estimate, the total fluxes (including energy and industry sectors) in Russia were a net source of 527 teragrams of carbon (Tg C) in 1990. To illustrate the possible development of the carbon pools and fluxes over the next 10 years, FOR developed three different scenarios for the period 1990-2010, reflecting different assumptions regarding Russia's GDP growth. According to these scenarios, Russia will continue to be a net source of carbon to the atmosphere with 156-385 Tg C in 2010, including the emissions from energy and other industrial sectors. However, analysis of the FCA also shows considerable uncertainties involved in the carbon accounting. These uncertainties exceed the calculated changes in the full flux balance for the period 1990-2010. At present, this raises grave questions regarding the reliability of any accounting system used to measure terrestrial ecosystems for compliance with the Kyoto Protocol.

    Machine learning assists the classification of reports by citizens on disease-carrying mosquitoes

    Mosquito Alert (www.mosquitoalert.com/en) is an expert-validated citizen science platform for tracking and controlling disease-carrying mosquitoes. Citizens download a free app and use their phones to send reports of presumed sightings of two world-wide disease vector mosquito species (the Asian Tiger and the Yellow Fever mosquito). These reports are then supervised by a team of entomologists and, once validated, added to a database. As the platform prepares to scale to much larger geographical areas and user bases, the expert validation by entomologists becomes the main bottleneck. In this paper we describe the use of machine learning on the citizen reports to automatically validate a fraction of them, therefore allowing the entomologists either to deal with larger report streams or to concentrate on those that are more strategic, such as reports from new areas (so that early warning protocols are activated) or from areas with high epidemiological risks (so that control actions to reduce mosquito populations are activated). The current prototype flags a third of the reports as “almost certainly positive” with high confidence. It is currently being integrated into the main workflow of the Mosquito Alert platform. Postprint (published version)