
    Understanding U.S. regional linguistic variation with Twitter data analysis

    We analyze a big data set of geo-tagged tweets collected over one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation at fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variations in the U.S. The discovered regions not only confirm past findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
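    The dimensionality-reduction step described above can be sketched with plain NumPy: a county-by-lexical-variable matrix is centered and projected onto its leading principal directions via the SVD. The data here are random placeholders, not the paper's Twitter corpus.

```python
import numpy as np

def pca_components(X, k):
    """Return the top-k principal component scores of the rows of X.

    X: (n_counties, n_lexical_vars) matrix of smoothed lexical frequencies.
    """
    Xc = X - X.mean(axis=0)          # center each lexical variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T             # project onto the first k directions

# Toy example: 6 "counties", 3 lexical-alternation frequencies.
rng = np.random.default_rng(0)
X = rng.random((6, 3))
scores = pca_components(X, 2)
print(scores.shape)   # (6, 2)
```

    The resulting score columns are mutually orthogonal, which is what makes them suitable inputs (the "orthogonal dimensions") for a downstream regionalization step.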

    Analyzing the Population Density Pattern in China with a GIS-Automated Regionalization Method: Hu Line Revisited

    The famous “Hu Line”, proposed by Hu Huanyong in 1935, divided China into two regions of comparable area that drastically differ in population: about 4% in the northwest part and 96% in the southeast. However, the Hu Line was proposed largely by visual examination of hand-made maps and arduous experiments with numerous configurations, and has been subject to criticism for lack of scientific rigor and accuracy. Furthermore, it has been over eight decades since the Hu Line was proposed. During that time, China sustained several major man-made and natural disasters (e.g., World War II, the subsequent Civil War, and the 1958–62 Great Famine), and also experienced major government-sponsored migrations, economic growth, and unprecedented urbanization. It is necessary to revisit the (in)stability of the Hu Line. By using a GIS-automated regionalization method, termed REDCAP (Regionalization with Dynamically Constrained Agglomerative Clustering and Partitioning), this study revisits the Hu Line in three aspects. First, by reconstructing the demarcation line with REDCAP from the latest (2010) census of county-level population, this study largely validates and refines the classic Hu Line. Second, this research seeks to uncover the underlying physical environment factors that shape such a contrast by proposing a habitation environment suitability index (HESI) model. Third, this study examines the change in population density and disparity over time by using all six censuses (1953, 1964, 1982, 1990, 2000, and 2010) since the founding of the People’s Republic of China. This study advances the methodological rigor in defining the Hu Line, solidifies the inherent connection between physical environment and population settlement, and strengthens the findings by extending the analysis across time epochs.
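    The core idea of contiguity-constrained regionalization can be illustrated with a small sketch: only spatially adjacent regions may merge, and merges are chosen greedily by attribute similarity. This is an illustrative toy, not the actual REDCAP algorithm (which uses dynamically constrained agglomerative clustering and partitioning); the counties and densities are invented.

```python
import numpy as np

def constrained_merge(values, adjacency, n_regions):
    """Greedy contiguity-constrained agglomerative clustering (illustrative
    stand-in for REDCAP): repeatedly merge the pair of *adjacent* regions
    whose mean population densities are most similar.

    values: 1-D array of county densities; adjacency: set of (i, j) pairs.
    """
    regions = {i: [i] for i in range(len(values))}
    adj = {tuple(sorted(p)) for p in adjacency}
    while len(regions) > n_regions:
        # pick the adjacent pair with the smallest mean-density gap
        best = min(adj, key=lambda p: abs(
            np.mean([values[k] for k in regions[p[0]]])
            - np.mean([values[k] for k in regions[p[1]]])))
        a, b = best
        regions[a].extend(regions.pop(b))
        # relabel b's adjacencies to a and drop the merged pair
        adj = {tuple(sorted((a if x == b else x, a if y == b else y)))
               for x, y in adj if (x, y) != best}
        adj = {p for p in adj if p[0] != p[1]}
    return list(regions.values())

# Toy chain of 5 counties with a sharp density break between index 2 and 3.
dens = np.array([900.0, 850.0, 800.0, 30.0, 20.0])
edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
print(constrained_merge(dens, edges, 2))   # [[0, 1, 2], [3, 4]]
```

    The two-region partition recovers the sharp density break, which is the same logic, at national scale, behind reconstructing a Hu-Line-like demarcation from county population data.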

    A Bibliographic View on Constrained Clustering

    A keyword search on constrained clustering on Web of Science returned just under 3,000 documents. We ran automatic analyses of those and compiled our own bibliography of 183 papers, which we analysed in more detail based on their topic and experimental study, if any. This paper presents general trends of the area and its sub-topics by Pareto analysis, using citation count and year of publication. We list available software and analyse the experimental sections of our reference collection. We found a notable lack of large comparison experiments. Among the topics we reviewed, application studies were most abundant recently, alongside deep learning, active learning, and ensemble learning. (18 pages, 11 figures, 177 references.)
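    A Pareto analysis over citation counts, as used above, boils down to ranking papers by citations and finding how few of them account for most of the field's citations. The helper below is a hypothetical sketch with invented counts, not the survey's actual data.

```python
def pareto_cutoff(citations, share=0.8):
    """Smallest number of papers that together account for `share` of all
    citations (a simple Pareto analysis; hypothetical helper)."""
    ranked = sorted(citations, reverse=True)
    total, running = sum(ranked), 0
    for i, c in enumerate(ranked, start=1):
        running += c
        if running >= share * total:
            return i
    return len(ranked)

# Toy bibliography: a few heavily cited papers dominate the total.
cites = [500, 300, 120, 40, 20, 10, 5, 3, 1, 1]
print(pareto_cutoff(cites))   # 2 -- two papers already cover 80% of citations
```

    The characteristic Pareto pattern is exactly this: a small head of highly cited papers covering the bulk of the citation mass.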

    Uncovering the Structures In Ecological Networks: Multiple Techniques For Multiple Purposes

    Ecosystem structure and function are the product of biological and ecological elements and their connections and interactions. Understanding structure and process in ecosystems is critical to ecological studies. Ecological networks, based on simple concepts in which biological and ecological elements are depicted as nodes with relationships between them described as links, have been recognized as a valuable means of clarifying the relationship between structure and process in ecosystems. Ecological network analysis has benefited from the advancement of techniques in social science, computer science, and mathematics, but attention must be paid to whether the designs of these techniques follow ecological principles and produce results that are ecologically meaningful and interpretable. The objective of this dissertation is to examine the suitability of these methods for various applications addressing different ecological concerns. Specifically, the studies that comprise this dissertation test methods that reveal the structure of various ecological networks by decomposing networks of interest into groups of nodes or aggregating nodes into groups. The key findings of each specific application are summarized below. In the first paper, REgionalization with Clustering And Partitioning (GraphRECAP) (Guo 2009) and Girvan and Newman's method (Girvan and Newman 2002) were compared in the study of finding compartments in the habitat network of ring-tailed lemurs (Lemur catta). The compartments are groups of nodes in which lemur movements are more prevalent within the groups than across the groups. GraphRECAP found compartments with a larger minimum number of habitat patches per compartment. These compartments are considered to be more robust to local extinctions because they had stronger within-compartment dispersal, greater traversability, and more alternative routes for escape from disturbance. 
The potential defect of Girvan and Newman's method, an unbalanced partitioning of graphs under certain circumstances, was believed to account for its lower performance. In the second study, Modularity-based Hierarchical Region Discovery (MHRD) and Edge-ratio-based Hierarchical Region Discovery (EHRD) were used to detect movement patterns in trajectories of 34 cattle (Bos taurus), 30 mule deer (Odocoileus hemionus), and 38 elk (Cervus elaphus) tracked by an automated telemetry system at Starkey National Forest in northeastern Oregon, USA. Both methods treated animal trajectories as a spatial and ecological graph, regionalized the graph such that animals have more movement within the regions than across the regions, and then investigated the movement patterns on the basis of regions. EHRD identified regions that more effectively captured the characteristics of different species' movements than MHRD. Clusters of trajectories identified by EHRD had higher cohesion within clusters and better separation between clusters on the basis of trajectory attributes extracted from the regions. The regions detected by EHRD also served as more effective predictors for classifying trajectories of different species, achieving higher classification accuracy with more simplicity. EHRD had better performance because it did not rely on the null model that MHRD compared against, which was invalid in this application. In the third study, a proposed Extended Additive Jaccard Similarity index (EAJS) overcame the weakness of the Additive Jaccard Similarity index (AJS) (Yodzis and Winemiller 1999) in the aggregation of species for the mammalian food web in the Serengeti ecosystem. As compared to AJS, the use of EAJS captured the similarity between species that have equivalent trophic roles. Clusters grouped using EAJS showed higher trophic similarities between species within clusters and stronger separation between species across clusters as compared to AJS. 
The EAJS clusters also exhibited patterns related to the habitat structure of plants and network topology associated with animal weights. The consideration of species feeding relations at a broader scale (i.e., not limited to adjacent trophic levels) accounted for the advantages of EAJS over AJS. The concluding chapter summarizes how the methods examined in the previous chapters perform in different ecological applications and examines whether the designs of these algorithms make ecological sense. It then provides suggestions on the selection of methods to answer different ecological questions in practice, and on the development and improvement of more ecologically oriented techniques.
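    The Jaccard-style trophic similarity underlying the third study can be sketched as follows. This is a simplified stand-in for the AJS/EAJS indices discussed above, not their published definitions: it sums the Jaccard overlap of two species' prey sets and of their predator sets, and the toy food web is invented.

```python
def trophic_jaccard(sp_a, sp_b):
    """Additive-Jaccard-style similarity of two species, each given as a
    (prey_set, predator_set) pair: Jaccard overlap of prey plus Jaccard
    overlap of predators. Simplified illustration, not the exact AJS/EAJS."""
    def jaccard(s, t):
        return len(s & t) / len(s | t) if s | t else 0.0
    prey_a, pred_a = sp_a
    prey_b, pred_b = sp_b
    return jaccard(prey_a, prey_b) + jaccard(pred_a, pred_b)

# Toy food web: two grazers share one food item and one predator.
zebra      = ({"grass"}, {"lion"})
wildebeest = ({"grass", "herbs"}, {"lion", "hyena"})
print(trophic_jaccard(zebra, wildebeest))   # 0.5 + 0.5 = 1.0
```

    EAJS's reported advantage comes from going beyond this pairwise, adjacent-level comparison to feeding relations at a broader scale; the sketch shows only the additive-Jaccard baseline it extends.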

    Remote Sensing in Applications of Geoinformation

    Remote sensing, especially from satellites, is a source of invaluable data which can be used to generate synoptic information for virtually all parts of the Earth, including the atmosphere, land, and ocean. In the last few decades, such data have evolved into a basis for accurate information about the Earth, leading to a wealth of geoscientific analysis focusing on diverse applications. Geoinformation systems based on remote sensing are increasingly becoming an integral part of the current information and communication society. The integration of remote sensing and geoinformation essentially involves combining data provided from both in a consistent and sensible manner. This process has been accelerated by technologically advanced tools and methods for remote sensing data access and integration, paving the way for scientific advances in a broadening range of remote sensing exploitations in applications of geoinformation. This volume hosts original research focusing on the exploitation of remote sensing in applications of geoinformation. The emphasis is on a wide range of applications, such as the mapping of soil nutrients, detection of plastic litter in oceans, urban microclimate, seafloor morphology, urban forest ecosystems, real estate appraisal, inundation mapping, and solar potential analysis.

    Potential contributions of remote sensing to ecosystem service assessments

    Ecological and conservation research has provided a strong scientific underpinning to the modeling of ecosystem services (ESs) over space and time, by identifying the ecological processes and components of biodiversity (ecosystem service providers, functional traits) that drive ES supply. Despite this knowledge, efforts to map the distribution of ESs often rely on simple spatial surrogates that provide incomplete and non-mechanistic representations of the biophysical variables they are intended to proxy. However, alternative data sets are available that allow for more direct, spatially nuanced inputs to ES mapping efforts. Many spatially explicit, quantitative estimates of biophysical parameters are currently supported by remote sensing, with great relevance to ES mapping. Additional parameters that are not amenable to direct detection by remote sensing may be indirectly modeled with spatial environmental data layers. We review the capabilities of modern remote sensing for describing biodiversity, plant traits, vegetation condition, ecological processes, soil properties, and hydrological variables, and highlight how these products may contribute to ES assessments. Because these products often provide more direct estimates of the ecological properties controlling ESs than the spatial proxies currently in use, they can support greater mechanistic realism in models of ESs. By drawing on the increasing range of remote sensing instruments and measurements, data sets appropriate to the estimation of a given ES can be selected or developed. In so doing, we anticipate rapid progress in the spatial characterization of ecosystem services, in turn supporting ecological conservation, management, and integrated land use planning.

    Metrics and methods for social distance

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Urban Studies and Planning, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 171-189).

    Distance measures are important for scientists because they illustrate the dynamics of geospatial topologies for physical and social processes. Two major types of distance are generally used for this purpose: Euclidean Distance measures the geodesic dispersion between fixed locations, and Cost Distance characterizes the ease of travel between two places. This dissertation suggests that close inter-place ties may be an effect of human decisions and relationships, and so embraces a third tier of distance, Social Distance: the conceptual or physical connectivity between two places as measured by the relative or absolute frequency, volume, or intensity of agent-based choices to travel, communicate, or relate from one distinct place to another. In the spatial realm, Social Distance measures have not been widely developed, and since the concept is relatively new, Chapter 1 introduces and defines geo-contextual Social Distance, its operationalization, and its novelty. With similar intentions, Chapter 2 outlines the challenges facing the integration of social flow data into the Geographic Information community. The body of this dissertation consists of three separate case studies in Chapters 3, 4, and 5, whose common theme is the integration of Social Distance into models of social processes in geographic space. Each chapter addresses one aspect of this topic. Chapter 3 presents a new visualization and classification method, called Weighted Radial Variation, for flow datasets. U.S. migration data at the county level for 2008 are used for this case study. Chapter 4 discusses a new computational method for predicting geospatial interaction, based on social theory of trip chaining and communication. U.S. flight, trip, and migration data for the years 1995-2008 are used in this study. 
Chapter 5 presents the results of a tandem analysis of social networks and geographic clustering. Roll call vote data for the U.S. House of Representatives in the 111th Congress are used to create a social network, which is then analyzed with regard to the geographic districts of each congressperson.

    by Clio Andris, Ph.D.
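    One plausible operationalization of the Social Distance concept defined above is to invert flow intensity: the heavier the two-way exchange (migrants, trips, calls) between two places, the smaller their social distance. The transform below is an illustrative sketch, not the thesis's exact metric, and the flow matrix is invented.

```python
import numpy as np

def social_distance(flows):
    """Turn an origin-destination flow matrix into a symmetric social-distance
    matrix: heavier two-way flow between places means a smaller distance.
    One plausible operationalization, not the thesis's exact formulation.
    """
    sym = flows + flows.T          # combine both directions of flow
    return 1.0 / (1.0 + sym)       # monotone decreasing in flow volume

# Toy example: 3 places; places 0 and 1 exchange many migrants.
F = np.array([[0., 80., 2.],
              [70., 0., 1.],
              [3., 2., 0.]])
D = social_distance(F)
print(D[0, 1] < D[0, 2])   # True: heavy flow -> socially closer
```

    Unlike Euclidean or Cost Distance, the resulting matrix reflects agent-based choices rather than geometry, so two far-apart places with intense exchange can be "close".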

    Analyzing complex data using domain constraints

    Data-driven research approaches are becoming increasingly popular in a growing number of scientific disciplines. While a data-driven research approach can yield superior results, generating the required data can be very costly. This frequently leads to small and complex data sets, in which it is impossible to rely on volume alone to compensate for all shortcomings of the data. To counter this problem, other reliable sources of information must be incorporated. In this work, domain knowledge, as a particularly reliable type of additional information, is used to inform data-driven analysis methods. This domain knowledge is represented as constraints on the possible solutions, which the presented methods can use to inform their analysis. The focus is on spatial constraints, a particularly common type of constraint, but the proposed techniques are general enough to be applied to other types of constraints. In this thesis, new methods using domain constraints for data-driven science applications are discussed. These methods have applications in feature evaluation, route database repair, and Gaussian mixture modeling of spatial data. The first application focuses on feature evaluation. The presented method receives two representations of the same data: one as the intended target and the other for investigation. It calculates a score indicating how much the two representations agree. A presented application uses this technique to compare a reference attribute set with different subsets to determine the importance and relevance of individual attributes. A second technique analyzes route data for constraint compliance. The presented framework allows the user to specify constraints and possible actions to modify the data. The presented method then uses these inputs to generate a version of the data that agrees with the constraints while otherwise reducing the impact of the modifications as much as possible. 
Two extensions of this scheme are presented: an extension to continuously valued costs, which are minimized, and an extension to constraints involving more than one moving object. The final application area is the modeling of multivariate measurement data collected at spatially distributed locations. The spatial information recorded with the data can be used as the basis for constraints. This thesis presents multiple approaches to building a model of this kind of data while complying with spatial constraints. The first approach is an interactive tool that allows domain scientists to generate a model of the data that complies with their knowledge about the data. The second is a Monte Carlo approach, which generates a large number of possible models, tests them for compliance with the constraints, and returns the best one. The final two approaches are based on the EM algorithm and use different ways of incorporating the information into their models. At the end of the thesis, two applications of the generated models are presented. The first is the prediction of the origin of samples, and the other is the visual representation of the extracted models on a map. These tools can be used by domain scientists to augment their tried and tested tools. The developed techniques are applied to a real-world data set collected in the archaeobiological research project FOR 1670 (Transalpine mobility and cultural transfer) of the German Research Foundation. The data set contains isotope ratio measurements of samples discovered at archaeological sites in the Alps region of central Europe. Using the presented data analysis methods, the data is analyzed to answer relevant domain questions. In a first application, the attributes of the measurements are analyzed for their relative importance and their ability to predict the spatial location of samples. 
Another presented application is the reconstruction of potential migration routes between the investigated sites. Then spatial models are built using the presented modeling approaches. Univariate outliers are determined and used to predict locations based on the generated models. These predictions are cross-referenced with the recorded origins. Finally, maps of the isotope distribution in the investigated regions are presented. The described methods and demonstrated analyses show that domain knowledge can be used to formulate constraints that inform the data analysis process to yield valid models from relatively small data sets and support domain scientists in their analyses.
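    The Monte Carlo approach described in this abstract (generate many candidate models, reject those violating the domain constraints, keep the best) can be sketched as follows. This is an illustrative toy, not the thesis's implementation: one-dimensional Gaussians stand in for the spatial mixture models, a simple interval on the mean stands in for the spatial constraint, and the data are synthetic.

```python
import numpy as np

def monte_carlo_fit(data, lower, upper, n_trials=500, seed=0):
    """Monte Carlo sketch of constraint-aware model fitting: sample candidate
    (mean, std) Gaussians, discard candidates whose mean violates the
    domain-knowledge constraint lower <= mean <= upper, and keep the best
    surviving candidate by log-likelihood."""
    rng = np.random.default_rng(seed)
    best, best_ll = None, -np.inf
    for _ in range(n_trials):
        mu = rng.uniform(data.min(), data.max())
        sigma = rng.uniform(0.1, data.std() * 2 + 0.1)
        if not (lower <= mu <= upper):      # constraint check: reject
            continue
        # Gaussian log-likelihood (constant terms dropped)
        ll = np.sum(-0.5 * ((data - mu) / sigma) ** 2 - np.log(sigma))
        if ll > best_ll:
            best, best_ll = (mu, sigma), ll
    return best

data = np.random.default_rng(1).normal(5.0, 1.0, size=200)
mu, sigma = monte_carlo_fit(data, lower=4.0, upper=6.0)
print(round(mu, 1), round(sigma, 1))
```

    The constraint acts purely as a rejection filter on candidate models, which is what makes the scheme easy to combine with arbitrary domain knowledge; the EM-based approaches mentioned above instead build the constraints into the fitting procedure itself.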