    Designing a System Prototype for Construction Document Management Using Automated Tagging and Visualization

    Thesis (Master's) -- Seoul National University Graduate School: Department of Civil and Environmental Engineering, August 2015. 지석호. A large amount of text data has accumulated over time in the construction industry. Important and useful experience from previous construction projects is mainly recorded in document form. Such information can serve as best practice for upcoming projects by delivering lessons learned for better risk management and project control. Thus, text-based information plays an important role in business strategy development in the highly competitive construction industry. To benefit from this text-based information, practical and usable text data management systems are vital. A significant amount of construction text data is rarely used for new construction projects because it is difficult to access. As technology for handling text data has developed, a number of document management systems based on text mining techniques have been proposed. However, most of them focus on classifying documents, and they are unable to deal with construction documents' complex and diverse features. In addition, unnecessary time and energy are still wasted skimming the whole database to uncover data of interest. Lastly, because the majority of research has focused on English data, with only a few studies using Korean data, there are many constraints on applying existing English-based systems to Korea's domestic construction industry. Thus, a construction document management system was designed to manage text data effectively and efficiently, and to promote data and information transfer among system users in the domestic construction industry. To achieve this, a system prototype was developed. The proposed construction document management system comprises data collection, data processing, automated document tagging, and dataset visualization. About 25,000 Korean Internet documents were collected with a web crawler to develop the system prototype.
Collected data were processed using text mining techniques, including POS tagging, to calculate the weight of each term in a document. Each term was clarified using a construction corpus that was also developed in this study. Five keywords were automatically extracted and tagged for each document, and a tag's sub-dataset was visualized in the form of a wordcloud based on the processed data. The proposed system prototype was evaluated both qualitatively and quantitatively by surveying ten experts. Questionnaire scores on the significance of the system's results, the usability of the proposed system design, and the need for it were all above four on a five-point scale. Moreover, in the quantitative evaluation, which estimated the accuracy of the system's results, the accuracy of the proposed system prototype was 84 percent on average. Thus, the evaluation results confirm the potential and feasibility of the proposed system.
Contents:
Chapter 1 Introduction: 1.1 Research Background; 1.2 Problem Statement; 1.3 Research Objectives; 1.4 Research Scope.
Chapter 2 Literature Review: 2.1 Document Management (2.1.1 Document Management System; 2.1.2 Document Management in the Construction Industry); 2.2 Crawling; 2.3 Text Mining (2.3.1 Classification; 2.3.2 Clustering); 2.4 Visualization; 2.5 Summary.
Chapter 3 A System Prototype Design for Construction Document Management: 3.1 Data Investigation and Collection (3.1.1 Text Data from Construction Sites; 3.1.2 Construction Related Text Data on the Web; 3.1.3 Text Data Collection Approaches); 3.2 Data Processing for Tagging and Visualization (3.2.1 Weight Calculation of Each Term in a Document; 3.2.2 Clarification with Construction Corpus); 3.3 Automated Document Tagging (3.3.1 Tags Representing Document's Specifications; 3.3.2 Tag Based System); 3.4 Dataset Visualization (3.4.1 Wordcloud Representing a Tag's Sub-dataset; 3.4.2 Visualization-based System).
Chapter 4 System Implementation and Evaluation: 4.1 Database of System Prototype; 4.2 Implementation of Data Processing (4.2.1 Weight Calculation of Each Term in a Document; 4.2.2 Clarification with Construction Corpus); 4.3 Developed System Prototype (4.3.1 Implementation of Automated Document Tagging; 4.3.2 Implementation of Dataset Visualization); 4.4 Evaluation (4.4.1 Qualitative Evaluation; 4.4.2 Quantitative Evaluation).
Chapter 5 Conclusions: 5.1 Summary; 5.2 Contributions and Future Study (5.2.1 Contributions; 5.2.2 Future Study).
Bibliography. Abstract (Korean).
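
The term-weighting and five-keyword tagging steps described in this abstract can be illustrated with a minimal sketch. The abstract does not name the weighting scheme, so TF-IDF is used here purely as an assumption; the function name `tfidf_tags` and the toy English tokens are hypothetical, and the POS tagging and construction-corpus clarification steps are omitted.

```python
import math
from collections import Counter

def tfidf_tags(docs, k=5):
    """Return the k highest-weighted terms for each tokenized document.

    Assumption: TF-IDF weighting; the abstract does not name the scheme.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    tags = []
    for doc in docs:
        tf = Counter(doc)
        weights = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        tags.append(ranked[:k])
    return tags

# Toy documents (hypothetical); the study itself used Korean web documents.
docs = [
    ["bridge", "girder", "crack", "bridge", "repair"],
    ["tunnel", "lining", "crack", "repair"],
    ["bridge", "deck", "paving"],
]
tags = tfidf_tags(docs, k=2)
```

With `k=5` the same function would yield five tags per document, and the tags collected across a sub-dataset could then be counted to size the words in a wordcloud.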

    HEALTH GeoJunction: place-time-concept browsing of health publications

    Background: The volume of health science publications is escalating rapidly. Thus, keeping up with developments is becoming harder, as is the task of finding important cross-domain connections. When geographic location is a relevant component of research reported in publications, these tasks are more difficult because standard search and indexing facilities have limited or no ability to identify geographic foci in documents. This paper introduces HEALTH GeoJunction, a web application that supports researchers in the task of quickly finding scientific publications that are relevant geographically and temporally as well as thematically.
Results: HEALTH GeoJunction is a geovisual analytics-enabled web application providing: (a) web services using computational reasoning methods to extract place-time-concept information from bibliographic data for documents and (b) visually-enabled place-time-concept query, filtering, and contextualizing tools that apply to both the documents and their extracted content. This paper focuses specifically on strategies for visually-enabled, iterative, facet-like, place-time-concept filtering that allows analysts to quickly drill down to scientific findings of interest in PubMed abstracts and to explore relations among abstracts and extracted concepts in place and time. The approach enables analysts to: find publications without knowing all relevant query parameters, recognize unanticipated geographic relations within and among documents in multiple health domains, identify the thematic emphasis of research targeting particular places, notice changes in concepts over time, and notice changes in places where concepts are emphasized.
Conclusions: PubMed is a database of over 19 million biomedical abstracts and citations maintained by the National Center for Biotechnology Information; achieving quick filtering is an important contribution due to the database size. Including geography in filters is important due to rapidly escalating attention to geographic factors in public health. The implementation of mechanisms for iterative place-time-concept filtering makes it possible to narrow searches efficiently and quickly from thousands of documents to a small subset that meets place-time-concept constraints. Support for a more-like-this query creates the potential to identify unexpected connections across diverse areas of research. Multi-view visualization methods support understanding of the place, time, and concept components of document collections and enable comparison of filtered query results to the full set of publications.

    Empirical machine translation and its evaluation

    In this thesis we have exploited current Natural Language Processing technology for Empirical Machine Translation and its Evaluation. On the one hand, we have studied the problem of automatic MT evaluation. We have analyzed the main deficiencies of current evaluation methods, which arise, in our opinion, from the shallow quality principles upon which they are based. Instead of relying on the lexical dimension alone, we suggest a novel path towards heterogeneous evaluations. Our approach is based on the design of a rich set of automatic metrics devoted to capturing a wide variety of translation quality aspects at different linguistic levels (lexical, syntactic and semantic). These linguistic metrics have been evaluated over different scenarios. The most notable finding is that metrics based on deeper linguistic information (syntactic/semantic) are able to produce more reliable system rankings than metrics which limit their scope to the lexical dimension, especially when the systems under evaluation are different in nature. However, at the sentence level, some of these metrics suffer a significant decrease in performance, which is mainly attributable to parsing errors. In order to improve sentence-level evaluation, apart from backing off to lexical similarity in the absence of parsing, we have also studied the possibility of combining the scores conferred by metrics at different linguistic levels into a single measure of quality. Two valid non-parametric strategies for metric combination have been presented. These offer the important advantage of not having to adjust the relative contribution of each metric to the overall score. As a complementary issue, we show how to use the heterogeneous set of metrics to obtain automatic and detailed linguistic error analysis reports.
On the other hand, we have studied the problem of lexical selection in Statistical Machine Translation. For that purpose, we have constructed a Spanish-to-English baseline phrase-based Statistical Machine Translation system and iterated across its development cycle, analyzing how to improve its performance through the incorporation of linguistic knowledge. First, we have extended the system by combining shallow-syntactic translation models based on linguistic data views. A significant improvement is reported. This system is further enhanced using dedicated discriminative phrase translation models based on Machine Learning techniques. These models allow for a better representation of the translation context in which phrases occur, effectively yielding an improved lexical choice. However, based on the proposed heterogeneous evaluation methods and the manual evaluations conducted, we have found that improvements in lexical selection do not necessarily imply an improved overall syntactic or semantic structure. The incorporation of dedicated predictions into the statistical framework requires, therefore, further study.
As a side question, we have studied one of the main criticisms against empirical MT systems, i.e., their strong domain dependence, and how its negative effects may be mitigated by properly combining external knowledge sources when porting a system into a new domain. We have successfully ported an English-to-Spanish phrase-based Statistical Machine Translation system trained on the political domain to the domain of dictionary definitions.
The two parts of this thesis are tightly connected, since the hands-on development of an actual MT system has allowed us to experience first-hand the role of the evaluation methodology in the development cycle of MT systems.
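
The appeal of combining metrics without tuning per-metric weights can be illustrated with a minimal sketch. It simply averages min-max-normalized scores with equal weight; this is an assumed stand-in, not a reconstruction of either of the two strategies presented in the thesis, and the function name, metric names, and scores are hypothetical.

```python
def uniform_combination(scores_by_metric):
    """Average min-max-normalized metric scores for each candidate translation.

    scores_by_metric maps a metric name to a list of scores, one per candidate.
    Assumption: every metric is "higher is better".
    """
    n = len(next(iter(scores_by_metric.values())))
    combined = [0.0] * n
    for scores in scores_by_metric.values():
        lo, hi = min(scores), max(scores)
        span = hi - lo
        for i, s in enumerate(scores):
            # A metric that gives every candidate the same score carries no signal.
            combined[i] += (s - lo) / span if span else 0.0
    return [c / len(scores_by_metric) for c in combined]

# Hypothetical scores for three candidate translations at three linguistic levels.
scores = {
    "lexical": [0.30, 0.50, 0.40],
    "syntactic": [0.60, 0.20, 0.40],
    "semantic": [0.10, 0.30, 0.50],
}
combined = uniform_combination(scores)
best = max(range(len(combined)), key=combined.__getitem__)
```

Because every metric contributes equally after normalization, no weights have to be fitted on held-out human judgements.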

    Lexical coverage in ELF

    The aim of this study was to determine how much vocabulary is needed to understand English in contexts where it is spoken internationally as a lingua franca (ELF). This information is critical to inform vocabulary size targets for second language (L2) learners of English. The current research consensus, based on native-English-speaker data, is that 6,000–7,000 word families plus proper nouns are needed. However, since English has become a global lingua franca, native speakers of English have become a minority: today, there are around two billion speakers of English worldwide, of whom fewer than a quarter are native speakers. This means that non-native speakers of English are more likely to interact with other non-native speakers than with native speakers. Thus, findings based solely on native-speaker data may not provide the most accurate information needed to inform vocabulary size targets for L2 learners of English. Indeed, this information needs to be supplemented with data from competent non-native speakers of English who can represent a legitimate model for L2 learners of English. This study uses the largest freely available corpus of general, spoken ELF in Europe: the one-million-word Vienna-Oxford International Corpus of English (VOICE). The word family was used as the lexical counting unit, and the lexical coverage of VOICE was calculated for various thresholds of the most frequent word families in the corpus. A comparative analysis was carried out to determine the lexical coverage of VOICE provided by frequency-ranked word lists based on data from the British National Corpus and the Corpus of Contemporary American English. The main findings of this study indicate that fewer than 3,000–4,000 word families plus proper nouns can provide the lexical resources needed to understand English in international contexts where it is spoken as a lingua franca. This is approximately half the number of word families (i.e. 6,000–7,000 word families plus proper nouns) that scholars have claimed are needed to understand spoken English. The findings of this study represent a substantial saving in vocabulary size targets for L2 learners of English who wish to be functional in understanding English spoken as an international lingua franca.
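
The coverage calculation described above (the share of running words accounted for by the N most frequent word families) can be sketched as follows. The function name, the tiny `family_of` mapping, and the sample sentence are hypothetical; a real analysis would use full word-family lists and would treat proper nouns separately, as the study does.

```python
from collections import Counter

def lexical_coverage(tokens, family_of, thresholds):
    """Return {threshold: share of tokens covered by the top-N word families}.

    family_of maps a word form to its word-family headword (hypothetical
    mapping); unmapped forms are treated as their own family.
    """
    families = Counter(family_of.get(tok, tok) for tok in tokens)
    # Token counts per family, most frequent family first.
    ranked = [count for _, count in families.most_common()]
    total = sum(ranked)
    return {n: sum(ranked[:n]) / total for n in thresholds}

tokens = "the meeting starts when the speakers meet and we speak".split()
family_of = {"meeting": "meet", "speakers": "speak", "starts": "start"}
coverage = lexical_coverage(tokens, family_of, thresholds=[1, 3, 7])
```

Reading the result at increasing thresholds shows how quickly coverage saturates, which is the kind of curve from which a vocabulary size target is derived.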

    Toponym Disambiguation in Information Retrieval

    In recent years, geography has acquired great importance in the context of Information Retrieval (IR) and, in general, of the automated processing of information in text. Mobile devices that are able to surf the web and at the same time report their position are now a common reality, together with applications that can exploit this data to provide users with locally customised information, such as directions or advertisements. Therefore, it is important to deal properly with the geographic information that is included in electronic texts. The majority of this kind of information is contained in place names, or toponyms. Toponym ambiguity represents an important issue in Geographical Information Retrieval (GIR), due to the fact that queries are geographically constrained. There has been a struggle to find specific geographical IR methods that actually outperform traditional IR techniques. Toponym ambiguity may constitute a relevant factor in the inability of current GIR systems to take advantage of geographical knowledge. Recently, some Ph.D. theses have dealt with Toponym Disambiguation (TD) from different perspectives, from the development of resources for the evaluation of Toponym Disambiguation (Leidner (2007)) to the use of TD to improve geographical scope resolution (Andogah (2010)). The Ph.D. thesis presented here introduces a TD method based on WordNet and carries out a detailed study of the relationship of Toponym Disambiguation to some IR applications, such as GIR, Question Answering (QA) and Web retrieval. The work presented in this thesis starts with an introduction to the applications in which TD may prove useful, together with an analysis of the ambiguity of toponyms in news collections. It would not be possible to study the ambiguity of toponyms without studying the resources that are used as placename repositories; these resources are the equivalent of language dictionaries, which provide the different meanings of a given word.
Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912

    Combining granularity-based topic-dependent and topic-independent evidences for opinion detection

    Opinion mining is a sub-discipline within Information Retrieval (IR) and Computational Linguistics. It refers to the computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in various online sources such as news articles, social media comments, and other user-generated content. It is also known by many other terms, such as opinion finding, opinion detection, sentiment analysis, sentiment classification, and polarity detection. Defined in a more specific and simpler context, opinion mining is the task of retrieving opinions on an issue as expressed by the user in the form of a query. There are many problems and challenges associated with the field of opinion mining. In this thesis, we focus on some major problems of opinion mining. One of the major challenges of opinion mining is to find opinions that specifically concern the given topic (query). A document can contain information about many topics at once, and it may contain opinionated text about each of the topics or about only a few of them. It therefore becomes very important to select the segments of the document that are relevant to the topic, together with their corresponding opinions. We address this problem at two levels of granularity: sentences and passages. In our first, sentence-level approach, we use WordNet semantic relations to find this association between topic and opinion. In our second, passage-level approach, we use a more robust IR model, namely the language model, to address this problem.
The basic idea behind both contributions for opinion-topic association is that a document containing more textual segments (sentences or passages) that are both opinionated and relevant to the topic is more opinionated than a document with fewer such segments. Most machine-learning-based approaches to opinion mining are domain-dependent, i.e., their performance varies from one domain to another. A domain- or topic-independent approach, on the other hand, is more general and can maintain its effectiveness across different domains. However, domain-independent approaches generally suffer from poor performance. Developing an approach that is both effective and general is a major challenge in opinion mining. Our contributions in this thesis include the development of an approach that uses simple heuristic functions to find opinionated documents. Entity-based opinion mining is becoming very popular among researchers in the IR community. It aims to identify the entities relevant to a given topic and to extract the opinions associated with them from a set of textual documents. However, identifying entities and determining their relevance is already a difficult task. We propose a system that takes into account both the information in the current news article and relevant earlier articles in order to detect the most important entities in the current news. In addition, we present our framework for opinion analysis and related tasks. This framework is based on content evidence and social evidence from the blogosphere for the tasks of opinion finding, opinion prediction, and multidimensional ranking. This early-stage contribution lays the groundwork for our future work.
The evaluation of our methods uses the TREC 2006 Blog collection and the TREC Novelty track 2004 collection. Most of the evaluations were carried out within the framework of the TREC Blog track.
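
The segment-counting idea stated in this abstract (a document with more segments that are both opinionated and on-topic is more opinionated) can be sketched in a few lines. Plain keyword overlap is used here only as an assumed stand-in for the WordNet relations and language models the thesis actually relies on; the function name, term sets, and sample documents are hypothetical.

```python
def opinion_score(segments, topic_terms, opinion_terms):
    """Count segments that mention the topic and contain an opinion word."""
    score = 0
    for segment in segments:
        words = set(segment.lower().split())
        # A segment counts only if it is both on-topic and opinionated.
        if words & topic_terms and words & opinion_terms:
            score += 1
    return score

# Hypothetical query terms and opinion lexicon.
topic = {"camera", "battery"}
opinion = {"great", "terrible", "love", "poor"}
doc_a = ["The camera is great", "I love the battery", "Shipping took a week"]
doc_b = ["The camera has a lens", "The weather was terrible"]
ranked = sorted(
    [("a", doc_a), ("b", doc_b)],
    key=lambda d: opinion_score(d[1], topic, opinion),
    reverse=True,
)
```

Here `doc_b` scores zero even though it contains both a topic word and an opinion word, because they never occur in the same segment; that is the point of working at sentence or passage granularity.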

    A Semi-Supervised Information Extraction Framework for Large Redundant Corpora

    The vast majority of text freely available on the Internet is not available in a form that computers can understand. There have been numerous approaches to automatically extracting information from human-readable sources. The most successful attempts rely on vast training sets of data. Others have succeeded in extracting restricted subsets of the available information. These approaches have limited use and require domain knowledge to be coded into the application. The current thesis proposes a novel framework for Information Extraction. From large sets of documents, the system develops statistical models of the data the user wishes to query, which generally avoid the limitations and complexity of most Information Extraction systems. The framework uses a semi-supervised approach to minimize human input. It also eliminates the need for external Named Entity Recognition systems by relying on freely available databases. The final result is a query-answering system that extracts information from large corpora with a high degree of accuracy.