Effect of forename string on author name disambiguation
In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how much they are likely to refer to the same author. Despite such a crucial role of forenames, their effect on the performance of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contributions of forenames in author name disambiguation using multiple labeled data sets under varying ratios and lengths of full forenames, reflecting real-world scenarios in which an author is represented by forename variants (synonym) and some authors share the same forenames (homonym). The results show that increasing the ratios of full forenames substantially improves both heuristic and machine-learning-based disambiguation. Performance gains by algorithmic disambiguation are pronounced when many forenames are initialized or homonyms are prevalent. As the ratios of full forenames increase, however, they become marginal compared to those by string matching. Using a small portion of forename strings does not reduce much the performances of both heuristic and algorithmic disambiguation methods compared to using full-length strings. These findings provide practical suggestions, such as restoring initialized forenames into a full-string format via record linkage for improved disambiguation performances.
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/155924/1/asi24298.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/155924/2/asi24298_am.pd
A Data Science Approach to Understanding Residential Water Contamination in Flint
When the residents of Flint learned that lead had contaminated their water
system, the local government made water-testing kits available to them free of
charge. The city government published the results of these tests, creating a
valuable dataset that is key to understanding the causes and extent of the lead
contamination event in Flint. This is the nation's largest dataset on lead in a
municipal water system.
In this paper, we predict the lead contamination for each household's water
supply, and we study several related aspects of Flint's water troubles, many of
which generalize well beyond this one city. For example, we show that elevated
lead risks can be (weakly) predicted from observable home attributes. Then we
explore the factors associated with elevated lead. These risk assessments were
developed in part via a crowd sourced prediction challenge at the University of
Michigan. To inform Flint residents of these assessments, they have been
incorporated into a web and mobile application funded by Google.org.
We also explore questions of self-selection in the residential testing program,
examining which factors are linked to when and how frequently residents
voluntarily sample their water.
Comment: Applied Data Science track paper at KDD 2017. For associated
promotional video, see https://www.youtube.com/watch?v=0g66ImaV8A
Development of Computer Science Disciplines - A Social Network Analysis Approach
In contrast to many other scientific disciplines, computer science considers
conference publications. Conferences have the advantage of providing fast
publication of papers and of bringing researchers together to present and
discuss the paper with peers. Previous work on knowledge mapping focused on the
map of all sciences or a particular domain based on ISI published JCR (Journal
Citation Reports). Although this data covers most of the important journals, it
lacks computer science conference and workshop proceedings. That results in an
imprecise and incomplete analysis of the computer science knowledge. This paper
presents an analysis of the computer science knowledge network constructed from
all types of publications, aiming at providing a complete view of computer
science research. Based on the combination of two important digital libraries
(DBLP and CiteSeerX), we study the knowledge network created at
journal/conference level using citation linkage, to identify the development of
sub-disciplines. We investigate the collaborative and citation behavior of
journals/conferences by analyzing the properties of their co-authorship and
citation subgraphs. The paper draws several important conclusions. First,
conferences constitute social structures that shape the computer science
knowledge. Second, computer science is becoming more interdisciplinary. Third,
experts are the key success factor for sustainability of journals/conferences
- …