
3 research outputs found

Wikification of learning objects using metadata as an alternative context for disambiguation

Author: GELBUKH ALEXANDER
López Morteo Gabriel Alejandro
Martínez Reyes Magally
MELARA ABARCA REYNA
PEREZ LOPEZ MOISES
PEREZ MARTINEZ CLAUDIA
Publication venue: 'Instituto Politecnico Nacional/Centro de Investigacion en Computacion'
Publication date: 01/01/2014

We present a methodology to wikify learning objects. Our proposal focuses on two processes: word sense disambiguation and relevant phrase selection. The disambiguation process uses the learning object's metadata as either additional or alternative context. This increases the probability of success when a learning object has a low-quality context. Relevant phrases are selected by identifying the highest values of semantic relatedness between the main subject of a learning object and the phrases. This criterion is useful for achieving the didactic objectives of the learning object.
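The phrase-selection criterion described in this abstract can be illustrated with a minimal Python sketch. The abstract does not say which semantic relatedness measure the authors use, so the Jaccard overlap of linked-concept sets below is an assumption chosen only for illustration, and the function names and sample data are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two concept sets, in [0, 1] (stand-in relatedness measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_relevant_phrases(subject_links, phrase_links, top_k=2):
    """Return the top_k phrases most related to the main subject.

    subject_links: set of concepts linked to the learning object's main subject.
    phrase_links:  dict mapping each candidate phrase to its set of linked concepts.
    Phrases with zero relatedness are never selected.
    """
    scored = [(jaccard(subject_links, links), phrase)
              for phrase, links in phrase_links.items()]
    # Highest relatedness first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [phrase for score, phrase in scored[:top_k] if score > 0]


# Hypothetical data for a learning object whose main subject is "fractions".
subject = {"numerator", "denominator", "ratio", "division"}
candidates = {
    "rational number": {"ratio", "numerator", "denominator", "integer"},
    "division": {"ratio", "division", "quotient"},
    "pizza": {"food", "italy"},
}
selected = select_relevant_phrases(subject, candidates)
print(selected)
```

With this toy data, "pizza" shares no concepts with the subject and is therefore left unlinked, which mirrors the idea of keeping only phrases that serve the object's didactic goal.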

Red Mexicana de Repositorios Institucionales

Repositorio Institucional de la Universidad Autónoma del Estado de México

Methods for extracting data from the Internet

Author: Willers Joel
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2017

The advent of the Internet has yielded exciting new opportunities for the collection of large amounts of structured and unstructured social scientific data. This thesis describes two such methods for harvesting data from websites and web services: web-scraping and connecting to an application programming interface (API). I describe the development and implementation of tools for each of these methods. In my review of these two related yet distinct data collection methods, I provide concrete examples of each. To illustrate the first method, ‘scraping’ data from publicly available data repositories (specifically the Google Books Ngram Corpus), I developed a tool and made it available to the public on a web site. The Google Books Ngram Corpus contains groups of words used in millions of books that were digitized and catalogued. The corpus has been made available for public use, but in its current form, accessing the data is tedious, time-consuming, and error-prone. For the second method, utilizing an API from a web service (specifically the Twitter Streaming API), I used a code library and the R programming language to develop a program that connects to the Twitter API to collect public posts known as tweets. I review prior studies that have used these data, after which I report results from a case study involving references to countries. The relative prestige of nations is compared based on the frequency of mentions in English literature and mentions in tweets.
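The case study's core measurement, counting how often each country is mentioned in a set of texts, can be sketched as follows. The thesis reports using R; this Python version is only an illustration, and the function name, country list, and sample texts are hypothetical. It operates on texts that have already been collected and does not contact any web service.

```python
import re
from collections import Counter


def count_country_mentions(texts, countries):
    """Count case-insensitive whole-word mentions of each country name."""
    counts = Counter({country: 0 for country in countries})
    for text in texts:
        for country in countries:
            # \b keeps "France" from matching inside a longer word.
            pattern = r"\b" + re.escape(country) + r"\b"
            counts[country] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Hypothetical stand-ins for collected tweets.
sample_texts = [
    "Flying to France tomorrow, then Japan.",
    "france vs Brazil tonight!",
    "Nothing about nations here.",
]
mentions = count_country_mentions(sample_texts, ["France", "Japan", "Brazil", "Canada"])
print(mentions.most_common())
```

The same counting function could be applied to n-gram frequency tables or to tweet texts, which is what makes the two sources comparable on a single measure.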

Digital Repository @ Iowa State University (ISU)

Extração estruturada de dados em fontes heterogêneas com Web Crawlers (Structured data extraction from heterogeneous sources with web crawlers)

Author: Fabro Gustavo
Publication date: 01/07/2018

Final undergraduate project (Trabalho de Conclusão de Curso) submitted for the degree of Bachelor in Computer Science at the Universidade do Extremo Sul Catarinense, UNESC. As the volume of data on the web grows, so does the need for tools that help consume this information. One category of such data is news sources: a large number of portals are available, and a given topic may be covered by several different sites. The objective of this work was therefore to determine ways of extracting these data in structured form while automatically acquiring the sources according to the desired topic. Both the news content and its respective sources were extracted with web crawlers, agents that collect and parse data on the web. Structured extraction from previously unknown sources was made possible by reading the new HTML5 semantic tags and the metadata used for sharing articles on social networks. Both, when used correctly, proved efficient at indicating the parts of a document and thus serve as a common means of defining the information. The crawler's seeds were obtained through requests to the Google search engine. Finally, it was possible to identify semantic patterns of data representation in the technologies involved in web development, making it possible to distribute the data in forms amenable to automatic processing.
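The extraction idea in this abstract, reading HTML5 semantic tags together with social-sharing metadata, can be sketched with Python's standard `html.parser` module. The abstract does not name the metadata vocabulary; the sketch assumes Open Graph `og:` properties and the `<article>` element, and the class name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect og:* metadata and the text found inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.metadata = {}       # e.g. {"og:title": "..."}
        self.body_parts = []     # text chunks found inside <article>
        self._article_depth = 0  # > 0 while the parser is inside <article>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.metadata[attrs["property"]] = attrs.get("content") or ""
        elif tag == "article":
            self._article_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._article_depth > 0:
            self._article_depth -= 1

    def handle_data(self, data):
        if self._article_depth > 0 and data.strip():
            self.body_parts.append(data.strip())


# Hypothetical page from a previously unknown news source.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Sample headline">
<meta property="og:type" content="article">
</head><body>
<nav>Home | Sports</nav>
<article><h1>Sample headline</h1><p>First paragraph.</p></article>
<footer>Contact</footer>
</body></html>
"""

extractor = ArticleExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
title = extractor.metadata.get("og:title")
body = " ".join(extractor.body_parts)
print(title)
print(body)
```

Because the parser relies only on the semantic markup, the navigation and footer text are skipped without any site-specific rules, which is the property that makes the approach usable on sources not known in advance.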

UNESC