6 research outputs found

    語句間の意味構造に基づくニュース記事推薦システムの提案

    Get PDF
    近年,ユーザの行動やソーシャルメディア上での発言を興味関心として分析し,ニュース記事を推薦するキュレーションサービスが普及している.膨大な情報から自分で必要なものを探さなくても,自身の興味に沿った情報が手に入ることで利用者が増加している.既存のコンテンツベースの情報推薦システムに関する研究では記事推薦のために各語句を特徴としているが,頻出する語句を重要視しており語句間の関係を特徴として用いていない.本研究は,ユーザが興味関心を示す記事に表れる語句間の意味構造を用いることで,ユーザが面白いと感じることができるニュース記事を収集,推薦するシステムを提案する.本研究では面白いニュース記事をユーザが興味を示すことができ,意外な情報が得られるものと定義した.語句間の意味構造Linked Dataで表現する.同ニュース記事の同文脈に表れる複数の語句間の意味構造を文構造と定義する.ユーザが興味・関心を示す記事文の文構造の部分グラフを用いることでインターネット上のニュース記事を推薦する手法を提案する.本手法の有効性を確かめるため,20人の被験者に提案手法,ベースライン手法それぞれによるニュース記事推薦をして評価を得る比較実験を行った.ベースライン手法は単語の重要度を出現頻度から計算するtf-idfを用いた.提案手法によるニュース記事推薦での関連度の指標の平均値は4点満点中3.06,興味度は3.30,意外度は2.93という結果であった.ベースライン手法では関連度が3.22,興味度が3.03,意外度が2.79という結果であった.ベースライン手法との比較実験により,提案手法は推薦するニュース記事の関連度は下がるものの,ユーザが興味を持つことができ,また意外と感じることができるニュース記事推薦手法であることがわかった.これによりユーザに面白い記事を推薦できる手法として提案手法は有効であることが明らかになった.電気通信大学201

    Extracting relevant named entities for automated expense reimbursement

    No full text
    Expense reimbursement is a time consuming and labor intensive process across organizations. In this paper, we present a prototype expense reimbursement system that dramatically reduces the elapsed time and costs involved, by eliminating paper from the process life cycle. Our complete solution involves (1) an electronic submission infrastructure that provides multi-channel image capture, secure transport and centralized storage of paper documents; (2) an unconstrained data mining approach to extracting relevant named entities from un-structured document images; (3) automation of auditing procedures that enables automatic expense validation with minimum human interaction. Extracting relevant named entities robustly from document images with unconstrained layouts and diverse formatting is a fundamental technical challenge to image-based data mining, question answering, and other information retrieval tasks. In many applications that require such capability, applying traditional language modeling techniques to the stream of OCR text does not give satisfactory result due to the absence of linguistic context. We present an approach for extracting relevant named entities from document images by combining rich page layout features in the image space with language content in the OCR text using a discriminative conditional random field (CRF) framework. We integrate this named entity extraction engine into our expense reimbursement solution and evaluate the system performance on large collections of real-world receipt image

    Extracting Relevant Named Entities for Automated Expense Reimbursement

    No full text
    Expense reimbursement is a time-consuming and labor-intensive process across organizations. In this paper, we present a prototype expense reimbursement system that dramatically reduces the elapsed time and costs involved, by eliminating paper from the process life cycle. Our complete solution involves (1) an electronic submission infrastructure that provides multi-channel image capture, secure transport and centralized storage of paper documents; (2) an unconstrained data mining approach to extracting relevant named entities from un-structured document images; (3) automation of auditing procedures that enables automatic expense validation with minimum human interaction. Extracting relevant named entities robustly from document images with unconstrained layouts and diverse formatting is a fundamental technical challenge to image-based data mining, question answering, and other information retrieval tasks. In many applications that require such capability, applying traditional language modeling techniques to the stream of OCR text does not give satisfactory result due to the absence of linguistic context. We present an approach for extracting relevant named entities from document images by combining rich page layout features in the image space with language content in the OCR text using a discriminative conditional random field (CRF) framework. We integrate this named entity extraction engine into our expense reimbursement solution and evaluate the system performance on large collections of real-world receipt image

    A teachable semi-automatic web information extraction system based on evolved regular expression patterns

    Get PDF
    This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. This uses a human as a teacher to identify and extract relevant information from the semi-structured HTML webpages. Regular expressions, which have been chosen as the pattern matching tool, are automatically generated based on the training data to provide an improved grammar and lexicon. This particularly benefits the GP system which may need to extend its lexicon in the presence of new tokens in the web pages. These tokens allow the GP method to produce new extraction patterns for new requirements

    A scientific-research activities information system

    No full text
    Cilj - Cilj istraživanja je razvoj modela, implementacija prototipa i verifikacija sistema za ekstrakciju metodologija iz naučnih članaka iz oblasti Informatike. Da bi se, pomoću tog sistema, naučnicima mogao obezbediti bolji uvid u metodologije u svojim oblastima potrebno je ekstrahovane metodolgije povezati sa metapodacima vezanim za publikaciju iz koje su ekstrahovani. Iz tih razloga istraživanje takoñe za cilj ima i razvoj modela sistema za automatsku ekstrakciju metapodataka iz naučnih članaka. Metodologija - Ekstrahovane metodologije se kategorizuju u četiri kategorije: kategorizuju se u četiri semantičke kategorije: zadatak (Task), metoda (Method), resurs/osobina (Resource/Feature) i implementacija (Implementation). Sistem se sastoji od dva nivoa: prvi je automatska identifikacija metodoloških rečenica; drugi nivo vrši prepoznavanje metodoloških fraza (segmenata). Zadatak ekstrakcije i kategorizacije formalizovan je kao problem označavanja sekvenci i upotrebljena su četiri zasebna Conditional Random Fields modela koji su zasnovani na sintaktičkim frazama. Sistem je evaluiran na ručno anotiranom korpusu iz oblasti Automatske Ekstrakcije Termina koji se sastoji od 45 naučnih članaka. Sistem za automatsku ekstrakciju metapodataka zasnovan je na klasifikaciji. Klasifikacija metapodataka vrši se u osam unapred definisanih sematičkih kategorija: Naslov, Autori, Pripadnost, Adresa, Email, Apstrakt, Ključne reči i Mesto publikacije. Izvršeni su eksperimenti sa svim standardnim modelima za klasifikaciju: naivni bayes, stablo odlučivanja, k-najbližih suseda i mašine potpornih vektora. Rezultati - Sistem za ekstrakciju metodologija postigao je sledeće rezultate: F-mera od 53% za identifikaciju Task i Method kategorija (sa preciznošću od 70%) dok su vrednosti za F-mere za Resource/Feature i Implementation kategorije bile 60% (sa preciznošću od 67%) i 75% (sa preciznošću od 85%) respektivno. Nakon izvršenih klasifikacionih eksperimenata, za sistem za ekstrakciju metapodataka, utvrñeno je da mašine potpornih vektora (SVM) pružaju najbolje performanse. Dobijeni rezultati SVM modela su generalno dobri, F-mera preko 85% kod skoro svih kategorija, a preko 90% kod većine. Ograničenja istraživanja/implikacije - Sistem za ekstrakciju metodologija, kao i sistem za esktrakciju metapodataka primenljivi su samo na naučne članke na engleskom jeziku. Praktične implikacije - Predloženi modeli mogu se, pre svega, koristiti za analizu i pregled razvoja naučnih oblasti kao i za kreiranje sematički bogatijih informacionih sistema naučno-istraživačke delatnosti. Originalnost/vrednost - Originalni doprinosi su sledeći: razvijen je model za ekstrakciju i semantičku kategorijzaciju metodologija iz naučnih članaka iz oblasti Informatike, koji nije opisan u postojećoj literaturi. Izvršena je analiza uticaja različitih vrsta osobina na ekstrakciju metodoloških fraza. Razvijen je u potpunosti automatizovan sistem za ekstrakciju metapodataka u informacionim sistemima naučno-istraživačke delatnosti.Purpose - The purpose of this research is model development, software prototype implementation and verification of the system for the identification of methodology mentions in scientific publications in a subdomain of automatic terminology extraction. In order to provide a better insight for scientists into the methodologies in their fields extracted methodologies should be connected with the metadata associated with the publication from which they are extracted. For this reason the purpose of this research was also a development of a system for the automatic extraction of metadata from scientific publications. Design/methodology/approach - Methodology mentions are categorized in four semantic categories: Task, Method, Resource/Feature and Implementation. The system comprises two major layers: the first layer is an automatic identification of methodological sentences; the second layer highlights methodological phrases (segments). Extraction and classification of the segments was 171 formalized as a sequence tagging problem and four separate phrase-based Conditional Random Fields were used to accomplish the task. The system has been evaluated on a manually annotated corpus comprising 45 full text articles. The system for the automatic extraction of metadata from scientific publications is based on classification. The metadata are classified eight pre-defined categories: Title, Authors, Affiliation, Address, Email, Abstract, Keywords and Publication Note. Experiments were performed with standard classification models: Decision Tree, Naive Bayes, K-nearest Neighbours and Support Vector Machines. Findings - The results of the system for methodology extraction show an Fmeasure of 53% for identification of both Task and Method mentions (with 70% precision), whereas the Fmeasures for Resource/Feature and Implementation identification was 60% (with 67% precision) and 75% (with 85% precision) respectively. As for the system for the automatic extraction of metadata Support Vector Machines provided the best performance. The Fmeasure was over 85% for almost all of the categories and over 90% for the most of them. Research limitations/implications - Both the system for the extractions of methodologies and the system for the extraction of metadata are only applicable to the scientific papers in English language. 172 Practical implications - The proposed models can be used in order to gain insight into a development of a scientific discipline and also to create semantically rich research activity information systems. Originality/Value - The main original contributions are: a novel model for the extraction of methodology mentions from scientific publications. The impact of the various types of features on the performance of the system was determined and presented. A fully automated system for the extraction of metadata for the rich research activity information systems was developed
    corecore