Data Mining Revision Controlled Document History Metadata for Automatic Classification
Version-controlled documents provide a complete history of the changes to the document, including what was changed, who made the change, and much more. Through the use of cluster analysis and several sets of manipulated data, this research examines the revision history of Wikipedia in an attempt to find language-independent patterns that could assist automatic page classification software. Cluster analysis of two sample data sets found no conclusive evidence that such patterns exist. Our work on the software, however, does provide a foundation for further types of data manipulation and refined clustering algorithms to be used in future research into finding such patterns.
Towards the automatic evaluation of stylistic quality of natural texts: constructing a special-purpose corpus of stylistic edits from the Wikipedia revision history
This thesis proposes an approach to automatic evaluation of the stylistic quality of natural texts through data-driven methods of Natural Language Processing. Advantages of data-driven methods and their dependency on the size of training data are discussed. The advantages of using Wikipedia as a source for textual data mining are also presented. The method in this project crucially involves a program for quick automatic extraction of sentences edited by users from the Wikipedia Revision History. The resulting edits have been compiled in a large-scale corpus of examples of stylistic editing. The complete modular structure of the extraction program is described and its performance is analyzed. Furthermore, the need to separate stylistic edits from factual ones is discussed and a number of Machine Learning classification algorithms for this task are proposed and tested. The program developed in this project was able to process approximately 10% of the whole Russian Wikipedia Revision History (200 gigabytes of textual data) in one month, resulting in the extraction of more than two million user edits. The best algorithm for the classification of edits into factual and stylistic ones achieved 86.2% cross-validation accuracy, which is comparable with state-of-the-art performance of similar models described in published papers. Degree programme: Master in Computational Linguistics and Language Technology (Master i Datalingvistikk og språkteknologi).
The classification of gene products in the molecular biology domain: Realism, objectivity, and the limitations of the Gene Ontology
Background: Controlled vocabularies in the molecular biology domain exist to facilitate data integration across database resources. One such tool is the Gene Ontology (GO), a classification designed to act as a universal index for gene products from any species. The Gene Ontology is used extensively in annotating gene products and analysing gene expression data, yet very little research exists from a library and information science perspective exploring the design principles, philosophy and social role of ontologies in biology.
Aim: To explore how molecular biologists, in creating the Gene Ontology, devised guidelines and rules for determining which scientific concepts are included in the ontology, and the criteria for how these concepts are represented.
Methods: A domain analysis approach was used to devise a mixed methodology to study the design of the Gene Ontology. Concept analysis of a GO term and a critical discourse analysis of GO developer mailing list texts were used to test whether ontological realism is a tenable basis for constructing objective ontologies. A comparison of the current GO vocabulary construction guidelines and a study of the reasons why GO terms are removed from the ontology further explored the justifications for the design of the Gene Ontology. Finally, a content analysis of published GO papers examined how authors use and cite GO data and terminology.
Results: Gene Ontology terms can be presented according to different epistemologies for concepts, indicating that ontological realism is not the only way objective ontologies can be designed. Social roles and the exercise of power were found to play an important role in determining ontology content, and poor synonym control, a lack of clear warrant for deciding terminology and arbitrary decisions to delete and invent new terms undermine the objectivity and universal applicability of the Gene Ontology. Authors exhibited poor compliance with GO data citation policies, and in re-wording and misquoting GO terminology, risk exacerbating the semantic problems this controlled vocabulary was designed to solve.
Conclusions: The failure of the Gene Ontology to define what is meant by a molecular function, the exercise of power by GO developers in clearing contentious concepts from the ontology, and the strict adherence to ontological realism, which marginalises social and subjective ways of classifying scientific concepts, limit the utility of the ontology as a tool to unify the molecular biology domain. These limitations to the Gene Ontology design could be overcome with the development of lighter, pluralistic, user-controlled "open ontologies" for gene products that can work alongside more traditional, "top-down" developed vocabularies.
Autopoietic-extended architecture: can buildings think?
To incorporate bioremedial functions into the performance of buildings and to balance
generative architecture's dominant focus on computational programming and digital
fabrication, this thesis first hybridizes theories of autopoiesis into extended cognition in order to
research biological domains that include synthetic biology and biocomputation. Under the
rubric of living technology I survey multidisciplinary fields to gather perspective for student
design of bioremedial and/or metabolic components in generative architecture where
generative not only denotes the use of computation but also includes biochemical,
biomechanical, and metabolic functions.
I trace computation and digital simulations back to Alan Turing's early 1950s
Morphogenetic drawings, reaction-diffusion algorithms, and pioneering artificial intelligence
(AI) in order to establish generative architecture's point of origin. I ask provocatively: Can
buildings think? as a question echoing Turing's own "Can machines think?" Thereafter, I
anticipate not only future bioperformative materials but also theories capable of underpinning
strains of metabolic intelligences made possible via AI, synthetic biology, and living technology.
I do not imply that metabolic architectural intelligence will be like human cognition. I
suggest, rather, that new research and pedagogies involving the intelligence of bacteria, plants,
synthetic biology, and algorithms define approaches that generative architecture should take in
order to source new forms of autonomous life that will be deployable as corrective
environmental interfaces. I call the research protocol autopoietic-extended design, theorizing it
as an operating system (OS), a research methodology, and an app schematic for design studios
and distance learning that makes use of in-field, e-, and m-learning technologies.
A quest of this complexity requires scaffolding for coordinating theory-driven teaching
with practice-oriented learning. Accordingly, I fuse Maturana and Varela's biological autopoiesis
and its definitions of minimal biological life with Andy Clark's hypothesis of extended cognition
and its cognition-to-environment linkages. I articulate a generative design strategy and student
research method explained via architectural history interpreted from Louis Sullivan's 1924
pedagogical drawing system, Le Corbusier's Modernist pronouncements, and Greg Lynn's
Animate Form. Thus, autopoietic-extended design organizes thinking about the generation of
ideas for design prior to computational production and fabrication, necessitating a fresh
relationship between nature/science/technology and design cognition. To systematize such a
program requires the avoidance of simple binaries (mind/body, mind/nature) as well as the
stationing of tool making, technology, and architecture within the realm of nature. Hence, I argue,
in relation to extended phenotypes, plant-neurobiology, and recent genetic research:
Consequently, autopoietic-extended design advances design protocols grounded in morphology,
anatomy, cognition, biology, and technology in order to appropriate metabolic and intelligent
properties for sensory/response duty in buildings.
At m-learning levels, smartphones, social media, and design apps source data from
nature for students to mediate on-site research by extending 3D pedagogical reach into new
university design programs. I intend the creation of a dialectical investigation of animal/human
architecture and computational history augmented by theory relevant to current algorithmic
design and fablab production. The autopoietic-extended design dialectic sets out ways to
articulate opposition/differences outside the Cartesian either/or philosophy in order to
prototype metabolic architecture, while dialectically maintaining: Buildings can think.