8 research outputs found

    Data Mining Revision Controlled Document History Metadata for Automatic Classification

    Version-controlled documents provide a complete history of changes, including what was changed, who made each change, and other revision metadata. Using cluster analysis over several sets of manipulated data, this research examines the revision history of Wikipedia in an attempt to find language-independent patterns that could assist automatic page classification software. Applying the cluster analysis to two sample data sets, we found no conclusive evidence that such patterns exist. Our work on the software, however, provides a foundation for further types of data manipulation and refined clustering algorithms to be used in future research into finding such patterns.
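    The clustering step described in this abstract can be sketched in a few lines. The following is an illustrative assumption only, not the authors' software: it supposes per-page features such as edit count, number of distinct editors, and mean time between edits have already been extracted from the revision metadata, and clusters them with scikit-learn's KMeans.

        # Hedged sketch: cluster pages by revision-history metadata features.
        # Feature choice and data are hypothetical, not from the original study.
        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.cluster import KMeans

        # One row per page: [edit_count, distinct_editors, mean_hours_between_edits]
        features = np.array([
            [1200.0, 310.0,   5.2],
            [  45.0,  12.0,  90.0],
            [ 800.0, 150.0,   9.7],
            [  30.0,   4.0, 200.0],
        ])

        scaled = StandardScaler().fit_transform(features)  # put features on comparable scales
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
        print(labels)  # cluster assignment per page, to be compared against known page classes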

    Multilingual Automatic Detection of Biased Statements in Wikipedia

    We propose a multilingual method for extracting biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia’s neutral point of view policy, and extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even for relatively small Wikipedias, such as the Bulgarian one, where 62,000 articles produced 5,000 biased sentences. We evaluate the method by manually annotating 520 sentences for Bulgarian and French, and 744 for English, assessing the level of noise, analyzing its sources, and examining the forms in which bias is expressed. Finally, we train and evaluate well-known classification algorithms on the data to estimate the quality and potential of the corpora for detecting biased sentences.
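    The before/after extraction step lends itself to a short sketch. The snippet below is a hedged illustration using Python's difflib; the naive sentence splitter and the toy revision texts are assumptions and stand in for the paper's actual pipeline.

        # Hedged sketch: keep sentences removed or rewritten between the last
        # revision tagged as biased and the first untagged (corrected) revision.
        import difflib
        import re

        def split_sentences(text):
            # naive splitter; a language-aware tokenizer would be used in practice
            return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

        def removed_or_rewritten(tagged_text, untagged_text):
            before = split_sentences(tagged_text)
            after = split_sentences(untagged_text)
            matcher = difflib.SequenceMatcher(a=before, b=after)
            biased = []
            for op, i1, i2, _, _ in matcher.get_opcodes():
                if op in ('delete', 'replace'):  # sentence removed or rewritten in the edit
                    biased.extend(before[i1:i2])
            return biased

        # Hypothetical revision texts:
        print(removed_or_rewritten(
            "The mayor did a great job. He was elected in 2010.",
            "The mayor's term was controversial. He was elected in 2010."))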

    Rich Linguistic Structure from Large-Scale Web Data

    The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data set curation. In this thesis, we argue for and illustrate an alternative view: that Web-scale data not only serves to improve the performance of simple models, but also can allow the use of qualitatively more sophisticated models that would not be deployable otherwise, leading to even further performance gains.

    Enabling entity retrieval by exploiting Wikipedia as a semantic knowledge source

    This dissertation research, PanAnthropon FilmWorld, aims to demonstrate direct retrieval of entities and related facts by exploiting Wikipedia as a semantic knowledge source, with the film domain as its proof-of-concept domain of application. To this end, a semantic knowledge base covering the film domain has been constructed from data extracted and derived from 10,640 Wikipedia pages on films and additional pages on film awards. The knowledge base currently contains 209,266 entities and 2,345,931 entity-centric facts. Both the knowledge base and the corresponding semantic search interface are based on a coherent classification of entities, and entity-centric facts are consistently represented as tuples. The semantic search interface (http://dlib.ischool.drexel.edu:8080/sofia/PA/) supports multiple types of semantic search beyond traditional keyword-based search, including the main General Entity Retrieval Query (GERQ) function, which retrieves all entities matching a specified entity type, subtype, and semantic conditions and thus corresponds to the main research problem. Two evaluations were performed to assess (1) the quality of information extraction and (2) the effectiveness of information retrieval through the semantic interface. The first was performed by inspecting 11,495 film-centric facts concerning 100 films; the results confirmed high data quality, with 99.96% average precision and 99.84% average recall. The second was an experiment with human subjects, who carried out a retrieval task using both the PanAnthropon interface and the Internet Movie Database (IMDb) interface so that task performance could be compared between the two. The results confirmed the higher effectiveness of the PanAnthropon interface over the IMDb interface (83.11% vs. 40.78% average precision; 83.55% vs. 40.26% average recall). Moreover, the subjects’ responses to the post-task questionnaire indicate that they found the PanAnthropon interface highly usable, easily understandable, and highly effective. The main contribution of this research is therefore the demonstration of the utility and feasibility of semantics-based direct entity retrieval.
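    The tuple representation of entity-centric facts and the GERQ function can be illustrated with a small sketch. The fact tuples, attribute names, and query helper below are hypothetical and only approximate the interface described; they are not the PanAnthropon implementation.

        # Hedged sketch: entity-centric facts as (entity, attribute, value) tuples
        # and a GERQ-style query over an entity type plus semantic conditions.
        facts = [
            ("Inception", "type", "film"),
            ("Inception", "genre", "science fiction"),
            ("Inception", "director", "Christopher Nolan"),
            ("Memento", "type", "film"),
            ("Memento", "genre", "thriller"),
            ("Memento", "director", "Christopher Nolan"),
        ]

        def gerq(facts, entity_type, conditions):
            """Return entities of entity_type whose facts satisfy every (attribute, value) condition."""
            matches = {e for e, a, v in facts if a == "type" and v == entity_type}
            for attr, value in conditions.items():
                matches &= {e for e, a, v in facts if a == attr and v == value}
            return sorted(matches)

        # e.g. science-fiction films directed by Christopher Nolan -> ['Inception']
        print(gerq(facts, "film", {"director": "Christopher Nolan", "genre": "science fiction"}))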