32 research outputs found
Automatic indexing of scientific articles on Library and Information Science with SISA, KEA and MAUI
This article evaluates the SISA (Automatic Indexing System), KEA (Keyphrase Extraction Algorithm) and MAUI (Multi-Purpose Automatic Topic Indexing) automatic indexing systems to find out how they perform in relation to human indexing. SISA's algorithm is based on rules about the position of terms in the different structural components of the document, while the algorithms for KEA and MAUI are based on machine learning and the statistical features of terms. For evaluation purposes, a document collection of 230 scientific articles from the Revista Española de Documentación Científica, published by the Consejo Superior de Investigaciones Científicas (CSIC), was used, of which 30 were used for training tasks and were not part of the evaluation test set. The articles were written in Spanish and indexed by human indexers using a controlled vocabulary in the InDICES database, also belonging to the CSIC. The human indexing of these documents constitutes the baseline or gold-standard indexing against which the output of the automatic indexing systems is evaluated, by comparing term sets using the evaluation metrics of precision, recall, F-measure and consistency. The results show that the SISA system performs best, followed by KEA and MAUI.
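The term-set comparison the abstract describes can be sketched in a few lines. The formulas below use the standard definitions of precision, recall and F-measure; for consistency, Hooper's formulation (overlap over the union of both term sets) is assumed, since the abstract does not say which consistency measure was applied. The example term sets are hypothetical.

```python
def term_set_metrics(auto_terms, human_terms):
    """Compare an automatic indexer's term set against the human (gold) indexing."""
    auto, gold = set(auto_terms), set(human_terms)
    common = len(auto & gold)
    precision = common / len(auto) if auto else 0.0
    recall = common / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Hooper's consistency: shared terms over the union of both sets
    consistency = common / len(auto | gold) if auto or gold else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "consistency": consistency}

# Hypothetical term sets for one document
auto = {"automatic indexing", "machine learning", "evaluation"}
gold = {"automatic indexing", "evaluation", "controlled vocabularies",
        "information retrieval"}
print(term_set_metrics(auto, gold))
```

Averaging these per-document scores over the 200-article test set would yield system-level figures comparable to those reported in the article.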
Exploiting extensible background knowledge for clustering-based automatic keyphrase extraction
Keyphrases are single- or multi-word phrases that are used to describe the essential content of a document. Keyphrase extraction methods often draw on an external knowledge source such as WordNet to obtain relation information about terms and thus improve their results, but the drawback is that a sole knowledge source is often limited. This problem is identified as the coverage limitation problem. In this paper, we introduce SemCluster, a clustering-based unsupervised keyphrase extraction method that addresses the coverage limitation problem with an extensible approach that integrates an internal ontology (i.e., WordNet) with other knowledge sources to gain wider background knowledge. SemCluster is evaluated against three unsupervised methods (TextRank, ExpandRank, and KeyCluster) under the F1-measure metric. The evaluation results demonstrate that SemCluster has better accuracy and computational efficiency and is more robust when dealing with documents from different domains.
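The core clustering idea can be sketched with plain co-occurrence statistics standing in for SemCluster's WordNet-backed background knowledge. This is a minimal illustrative baseline, not the paper's actual method; every function name, threshold and parameter below is an assumption.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that",
             "for", "on", "as", "are", "by", "with", "this"}

def candidates(text):
    """Lowercase word candidates, minus stopwords and very short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]

def cooccurrence_vectors(words, window=5):
    """Represent each term by the counts of its neighbours within a window."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                vecs[w][words[j]] += 1
    return vecs

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_keyphrases(text, threshold=0.3, top_n=3):
    """Greedily cluster terms by similarity; return one exemplar per cluster."""
    words = candidates(text)
    vecs = cooccurrence_vectors(words)
    clusters = []  # each cluster is a list; its first term is the exemplar
    for term, _ in Counter(words).most_common():  # most frequent terms first
        for cl in clusters:
            if cosine(vecs[term], vecs[cl[0]]) >= threshold:
                cl.append(term)
                break
        else:
            clusters.append([term])
    # exemplars of the top_n largest clusters serve as keyphrases
    return [cl[0] for cl in sorted(clusters, key=len, reverse=True)[:top_n]]
```

SemCluster's contribution is precisely to replace the co-occurrence similarity above with semantic relatedness computed from WordNet plus extensible external sources, which widens coverage beyond what any single document or ontology provides.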
A new paradigm for open data-driven language learning systems design in higher education
This doctoral thesis presents three studies in collaboration with the open source FLAX project (Flexible Language Acquisition flax.nzdl.org). This research makes an original contribution to the fields of language education and educational technology by mobilising knowledge from computer science, corpus linguistics and open education, and proposes a new paradigm for open data-driven language learning systems design in higher education. Furthermore, the research presented in this thesis uncovers and engages with an infrastructure of open educational practices (OEP) that push at the parameters of policy for the reuse of open access research and pedagogic content in the design, development, distribution, adoption and evaluation of data-driven language learning systems.
Study 1 employs automated content analysis to mine the concept of open educational systems and practices from qualitative reflections spanning 2012-2019 with stakeholders from an on-going multi-site design-based research study with the FLAX project. Design considerations are presented for remixing domain-specific open access content for academic English language provision across formal and non-formal higher education contexts. Primary stakeholders in this ongoing research collaboration include the following: knowledge organisations – libraries and archives including the British Library and the Oxford Text Archive, universities in collaboration with Massive Open Online Course (MOOC) providers; an interdisciplinary team of researchers; and knowledge users in formal higher education – English for Academic Purposes (EAP) practitioners. Themes arising from the qualitative dataset point to affordances as well as barriers with the adoption of open policies and practices for remixing open access content for data-driven language learning applications in higher education against the backdrop of different business models and cultural practices present within participating knowledge organisations.
Study 2 presents a data-driven experiment in non-formal higher education by triangulating user query system log data with learner participant data from surveys (N=174) on the interface designs and usability of an automated open source digital library scheme, FLAX. Text and data mining (TDM) approaches common to natural language processing (NLP) were applied to pedagogical English language corpora derived from the content of two MOOCs (Harvard University with edX, and the University of London with Coursera) and one networked course (Harvard Law School with the Berkman Klein Center for Internet and Society), which were then linked to external open resources (e.g. Wikipedia, the FLAX Learning Collocations system, WordNet), so that learners could employ the information discovery techniques (e.g. searching and browsing) that they have become accustomed to using through search engines (e.g. Google, Bing) for discovering and learning the domain-specific language features of their interests. Findings indicate a positive user experience with interfaces that include advanced affordances for course content browsing, search and retrieval that transcend the MOOC platform and Learning Management System (LMS) standard. Further survey questions derived from an open education research bank from the Hewlett Foundation are reused in this study and presented against a larger dataset from the Hewlett Foundation (N=1921) on motivations for the uptake of open educational resources.
Study 3 presents a data-driven experiment in formal higher education from the legal English field to measure quantitatively the usefulness and effectiveness of employing the open Law Collections in FLAX in the teaching of legal English at the University of Murcia in Spain. Informants were divided into an experimental and a control group and were asked to write an essay on a given set of legal English topics, defined by the subject instructor as part of their final assessment. The experimental group consulted only the FLAX English Common Law MOOC collection as the single source of information to draft their essays, while the control group used any information source available from the Internet. Findings from an analysis of the two learner corpora of essays indicate that members of the experimental group appear to have acquired the specialised terminology of the area better than those in the control group, as attested by the higher term average obtained by the texts in the FLAX-based corpus (56.5) as opposed to the non-FLAX-based text collection, at 13.73 points below.
Linked Data Supported Information Retrieval
Search engines have become indispensable for locating content on the World Wide Web. Semantic Web and Linked Data technologies enable more detailed and unambiguous structuring of content and allow entirely new approaches to solving information retrieval problems. This thesis examines how information retrieval applications can benefit from the incorporation of Linked Data. New methods for computer-assisted semantic text analysis, semantic search, information prioritisation and visualisation are presented and comprehensively evaluated. Linked Data resources and their relationships are integrated into these methods in order to increase their effectiveness and usability. First, an introduction to the foundations of information retrieval and Linked Data is given. Then, new manual and automated methods for semantically annotating documents by linking them to Linked Data resources (entity linking) are presented. A comprehensive evaluation of these methods is carried out, and the underlying evaluation system is substantially improved. Building on the annotation methods, two new retrieval models for semantic search are presented and evaluated. These models are based on the generalized vector space model and incorporate semantic similarity, derived from taxonomy-based relations between the Linked Data resources found in documents and queries, into the ranking of search results. To further refine the computation of semantic similarity, a method for prioritising Linked Data resources is presented and evaluated. On this basis, visualisation techniques are demonstrated with the aim of improving the explorability and navigability of a semantically annotated document corpus.
Two applications are presented for this purpose: a Linked Data-based exploratory extension complementing a traditional keyword-based search engine, and a Linked Data-based recommender system.
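The taxonomy-based similarity underlying such retrieval models can be illustrated with a toy example. The miniature taxonomy, the Wu-Palmer-style measure, and the three "documents" below are assumptions for illustration only, not the thesis's actual data or scoring formula; in the thesis, relations between Linked Data resources play the role of this taxonomy.

```python
# Toy taxonomy as a child -> parent map (purely hypothetical terms).
TAXONOMY = {
    "poodle": "dog", "beagle": "dog", "dog": "animal",
    "cat": "animal", "animal": "entity",
}

def ancestors(term):
    """Path from a term up to the taxonomy root, inclusive."""
    path = [term]
    while path[-1] in TAXONOMY:
        path.append(TAXONOMY[path[-1]])
    return path

def depth(term):
    """Number of nodes on the path to the root (root has depth 1)."""
    return len(ancestors(term))

def taxonomy_sim(a, b):
    """Wu-Palmer-style similarity: scaled depth of the lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    common = [n for n in pa if n in pb]
    if not common:
        return 0.0
    return 2 * depth(common[0]) / (depth(a) + depth(b))

def gvsm_score(query_terms, doc_terms):
    # Generalized vector space model idea: instead of rewarding only exact
    # term matches, accumulate pairwise semantic similarities.
    return sum(taxonomy_sim(q, d) for q in query_terms for d in doc_terms)

docs = {"d1": ["poodle", "beagle"], "d2": ["cat"], "d3": ["entity"]}
ranking = sorted(docs, key=lambda d: gvsm_score(["dog"], docs[d]), reverse=True)
print(ranking)  # d1 (two kinds of dog) outranks d2 (cat) and d3 (entity)
```

The point of the toy is that a query for "dog" still ranks a document mentioning only "poodle" and "beagle" highest, which a purely keyword-based model would miss entirely.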
Computer Science & Technology Series : XIX Argentine Congress of Computer Science. Selected papers
CACIC’13 was the nineteenth Congress in the CACIC series. It was organized by the Department of Computer Systems at the CAECE University in Mar del Plata.
The Congress included 13 Workshops with 165 accepted papers, 5 Conferences, 3 invited tutorials, different meetings related to Computer Science Education (Professors, PhD students, Curricula) and an International School with 5 courses.
CACIC 2013 was organized following the traditional Congress format, with 13 Workshops covering a diversity of dimensions of Computer Science Research. Each topic was supervised by a committee of 3-5 chairs of different Universities.
The call for papers attracted a total of 247 submissions. An average of 2.5 review reports were collected for each paper, for a grand total of 676 review reports that involved about 210 different reviewers.
A total of 165 full papers, involving 489 authors and 80 Universities, were accepted, and 25 of them were selected for this book.
Red de Universidades con Carreras en Informática (RedUNCI)
Feature extraction and duplicate detection for text mining: A survey
Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature extraction is one of the important techniques in data reduction for discovering the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. The survey covers different text summarization, classification and clustering methods for discovering useful features, and also covers discovering query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user.