16 research outputs found

    Deep learning-based survival prediction for multiple cancer types using histopathology images

    Full text link
    Prognostic information at diagnosis has important implications for cancer treatment and monitoring. Although cancer staging, histopathological assessment, molecular features, and clinical variables can provide useful prognostic insights, improving risk stratification remains an active research area. We developed a deep learning system (DLS) to predict disease-specific survival across 10 cancer types from The Cancer Genome Atlas (TCGA). We used a weakly supervised approach without pixel-level annotations and tested three different survival loss functions. The DLS was developed using 9,086 slides from 3,664 cases and evaluated using 3,009 slides from 1,216 cases. In multivariable Cox regression analysis of the combined cohort including all 10 cancers, the DLS was significantly associated with disease-specific survival (hazard ratio of 1.58, 95% CI 1.28-1.70, p<0.0001) after adjusting for cancer type, stage, age, and sex. In a per-cancer adjusted subanalysis, the DLS remained a significant predictor of survival in 5 of 10 cancer types. Compared to a baseline model including stage, age, and sex, the c-index of the model demonstrated an absolute 3.7% improvement (95% CI 1.0-6.5) in the combined cohort. Additionally, our models stratified patients within individual cancer stages, particularly stage II (p=0.025) and stage III (p<0.001). By developing and evaluating prognostic models across multiple cancer types, this work represents one of the most comprehensive studies exploring the direct prediction of clinical outcomes using deep learning and histopathology images. Our analysis demonstrates the potential for this approach to provide prognostic information in multiple cancer types, and even within specific pathologic stages. However, given the relatively small number of clinical events, we observed wide confidence intervals, suggesting that future work will benefit from larger datasets.
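
    As a concrete illustration of the adjusted analysis described above, the sketch below fits a multivariable Cox model in which a DLS risk score is combined with stage, age, sex, and cancer type. The column names and the use of the lifelines library are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch of a multivariable Cox regression analysis like the one described
# above, using the lifelines library. The dataframe layout (column names) is an
# assumption for illustration, not the authors' actual pipeline.
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical per-case table: follow-up time, event indicator, DLS risk score,
# and baseline covariates, with categorical covariates one-hot encoded.
df = pd.read_csv("cases.csv")
covariates = pd.get_dummies(
    df[["time", "event", "dls_risk", "stage", "age", "sex", "cancer_type"]],
    columns=["stage", "sex", "cancer_type"],
    drop_first=True,
)

cph = CoxPHFitter()
cph.fit(covariates, duration_col="time", event_col="event")
cph.print_summary()                        # hazard ratios with confidence intervals
print("c-index:", cph.concordance_index_)  # discrimination of the fitted model
```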

    Wikipedia Navigation Vectors

    No full text
    <p>In this project, we learned embeddings for Wikipedia articles and <a href="https://www.wikidata.org/wiki/Wikidata:Main_Page">Wikidata</a> items by applying <a href="https://en.wikipedia.org/wiki/Word2vec">Word2vec</a> models to a corpus of reading sessions.</p><p>Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, for example). Consequently, applying Wor2vec to reading sessions results in article embeddings, where articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.</p><p>There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions, is that they learn from the actions of millions of humans who are using a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.</p><p>An additional feature of not relying on text or links, is that we can learn representations for <a href="https://www.wikidata.org/wiki/Help:Items">Wikidata items</a> by simply mapping article titles within each session to Wikidata items using <a href="https://www.wikidata.org/wiki/Help:Sitelinks">Wikidata sitelinks</a>. As a result, these Wikidata vectors are jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data sparsity issues for smaller Wikipedias, since the representations for articles in smaller Wikipedias are shared across many other potentially larger ones. Finally, instead of needing to generate a separate embedding for each Wikipedia in each language, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.</p><p>For detailed documentation, see the <a href="https://meta.wikimedia.org/wiki/Research:Wikipedia_Vectors" rel="mw:ExtLink">wiki page</a>.</p

    Trust Propagation with Mixed-Effects Models

    No full text
    Web-based social networks typically use public trust systems to facilitate interactions between strangers. These systems can be corrupted by misleading information spread under the cover of anonymity, or exhibit a strong bias towards positive feedback, originating from the fear of reciprocity. Trust propagation algorithms seek to overcome these shortcomings by inferring trust ratings between strangers from trust ratings between acquaintances and the structure of the network that connects them. We investigate a trust propagation algorithm based on user triads, in which the trust one user has in another is predicted from the ratings involving an intermediary user. The propagation function can be applied iteratively to propagate trust along paths between a source user and a target user. We evaluate this approach using the trust network of the CouchSurfing community, which consists of 7.6M trust-valued edges between 1.1M users. We show that our model outperforms one that relies only on the trustworthiness of the target user (a kind of public trust system). In addition, we show that performance is significantly improved by bringing in user-level variability using mixed-effects regression models.
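
    As a rough illustration, the sketch below fits a mixed-effects regression of the kind described above: the source-to-target rating in a triad is predicted from the two observed edges, with a random intercept per source user to capture user-level variability. The column names, data layout, and the use of statsmodels are assumptions for illustration.

```python
# Sketch: predicting the source->target trust rating from the two triad edges
# (source->intermediary and intermediary->target) with a random intercept per
# source user. Column names and data layout are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical triad table: one row per (source, intermediary, target) triad.
triads = pd.read_csv("triads.csv")  # columns: source, src_to_mid, mid_to_tgt, src_to_tgt

# Fixed effects for the two observed edges; random intercept grouped by source user.
model = smf.mixedlm(
    "src_to_tgt ~ src_to_mid + mid_to_tgt",
    data=triads,
    groups=triads["source"],
)
result = model.fit()
print(result.summary())
```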

    Wikipedia Clickstream

    No full text
    <p>This project contains data sets containing counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. For more information and documentation, see the link in the references section below.</p> <p>  </p

    Wikipedia Talk Labels: Aggression

    No full text
    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it has an aggressive tone. We also include some demographic data for each crowd-worker. See our wiki (https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each file and our research paper (https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this IPython notebook: https://github.com/ewulczyn/wiki-detox/blob/master/src/figshare/Wikipedia%20Talk%20Data%20-%20Getting%20Started.ipynb
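
    As a rough illustration of working with multiply-annotated comments like these, the sketch below collapses per-annotator judgments into a single majority-vote label per comment. The file and column names loosely follow the shape of such a release but are assumptions here, not the documented schema.

```python
# Sketch: collapsing multiple annotator judgments into one label per comment
# by majority vote. File and column names are assumptions for illustration.
import pandas as pd

comments = pd.read_csv("aggression_annotated_comments.tsv", sep="\t")
annotations = pd.read_csv("aggression_annotations.tsv", sep="\t")

# Majority vote over annotators: a comment is labeled aggressive if more than
# half of its annotators marked it as such.
labels = annotations.groupby("rev_id")["aggression"].mean() > 0.5

# Join the aggregated label back onto the comment text for downstream modeling.
data = comments.set_index("rev_id").join(labels.rename("is_aggressive"))
print(data["is_aggressive"].value_counts())
```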

    Wikipedia Talk Labels: Toxicity

    No full text
    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki (https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each file and our research paper (https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this IPython notebook: https://github.com/ewulczyn/wiki-detox/blob/master/src/figshare/Wikipedia%20Talk%20Data%20-%20Getting%20Started.ipynb
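
    Building on the aggregation step sketched above, a simple baseline classifier over aggregated labels of this kind might look like the following. The aggregated file, column names, and the scikit-learn pipeline are illustrative assumptions and are not taken from the referenced notebook.

```python
# Sketch: a bag-of-words baseline for the toxicity labels. The file with a
# `comment` text column and a boolean `is_toxic` column is an assumed layout.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("toxicity_aggregated.csv")  # hypothetical aggregated file

X_train, X_test, y_train, y_test = train_test_split(
    data["comment"], data["is_toxic"], test_size=0.2, random_state=0
)

clf = make_pipeline(
    TfidfVectorizer(max_features=10000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, probs))
```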