CONSTRAINED MULTI-GROUP PROJECT ALLOCATION USING MAHALANOBIS DISTANCE
Optimal allocation is one of the most active research areas in operations research using binary integer variables. The allocation of multi-constrained projects among several available options over a given planning horizon is an especially significant problem in the general area of item classification. The main goal of this dissertation is to develop an analytical approach for selecting the projects that are most attractive from an economic point of view and allocating them among several options, such as in-house engineers and private contractors (in transportation projects). In addition to the availability of funds, a relevant limiting resource is the availability of in-house manpower.
In this thesis, the concept of Mahalanobis distance (MD) will be used as the classification criterion. This is a generalization of the Euclidean distance that takes into account the correlation of the characteristics defining the scope of a project. The desirability of a given project to be allocated to an option is defined in terms of its MD to that particular option. Ideally, each project should be allocated to its closest option. This, however, may not be possible because of the available levels of each relevant resource.
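As a rough illustration of this classification criterion (a sketch only; the feature set, option histories, and numbers below are hypothetical rather than taken from the dissertation), the MD of a project to an option can be computed from that option's centroid and covariance, and each project assigned to its closest option when resources permit:

```python
# Minimal sketch (not the dissertation's implementation): compute the
# Mahalanobis distance from each project to each allocation option and pick
# the closest option, ignoring resource constraints for now.
# Feature names and data are hypothetical.
import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical project scope features: [cost ($M), duration (months), complexity score]
projects = np.array([
    [2.5, 18, 3.0],
    [0.8,  6, 1.5],
    [5.0, 30, 4.2],
])

# Hypothetical historical projects previously handled by each option,
# used to estimate that option's centroid and covariance.
history = {
    "in_house":   np.array([[1.0, 8, 1.8], [0.7, 5, 1.2], [1.4, 10, 2.0], [0.9, 7, 1.5]]),
    "contractor": np.array([[4.5, 28, 4.0], [6.0, 36, 4.5], [3.8, 24, 3.6], [5.2, 31, 4.1]]),
}

for p in projects:
    best = min(
        history,
        key=lambda opt: mahalanobis(
            p,
            history[opt].mean(axis=0),
            # pinv guards against a singular covariance estimated from few samples
            np.linalg.pinv(np.cov(history[opt], rowvar=False)),
        ),
    )
    print(p, "-> closest option:", best)
```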
The allocation process is formulated mathematically using two Binary Integer Programming (BIP) models. The first formulation maximizes the dollar value of benefits derived by the traveling public from the projects being implemented, subject to budget, total-MD, and in-house manpower constraints. The second formulation minimizes the total sum of MD, subject to budget and in-house manpower constraints.
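A minimal sketch of a formulation in this spirit (using the open-source PuLP modeller; the coefficients, caps, and names are invented for illustration, and the dissertation's exact models may differ) is:

```python
# Sketch of the first BIP formulation under assumed data: maximize public
# benefits of implemented projects subject to budget, total-MD, and in-house
# manpower limits. All coefficients are hypothetical.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, value

projects = ["P1", "P2", "P3"]
options = ["in_house", "contractor"]

benefit  = {"P1": 1.2, "P2": 0.7, "P3": 2.1}            # $M benefit if implemented
cost     = {"P1": 0.9, "P2": 0.4, "P3": 1.6}            # $M cost
md       = {("P1", "in_house"): 1.1, ("P1", "contractor"): 2.3,   # Mahalanobis distances
            ("P2", "in_house"): 0.6, ("P2", "contractor"): 3.0,
            ("P3", "in_house"): 2.8, ("P3", "contractor"): 0.9}
manpower = {"P1": 3, "P2": 1, "P3": 5}                  # person-months if done in-house

BUDGET, MD_CAP, MANPOWER_CAP = 2.2, 4.0, 6

x = LpVariable.dicts("x", (projects, options), cat=LpBinary)

model = LpProblem("project_allocation", LpMaximize)
model += lpSum(benefit[p] * x[p][o] for p in projects for o in options)

# Each project is allocated to at most one option.
for p in projects:
    model += lpSum(x[p][o] for o in options) <= 1

model += lpSum(cost[p] * x[p][o] for p in projects for o in options) <= BUDGET
model += lpSum(md[p, o] * x[p][o] for p in projects for o in options) <= MD_CAP
model += lpSum(manpower[p] * x[p]["in_house"] for p in projects) <= MANPOWER_CAP

model.solve()
for p in projects:
    for o in options:
        if value(x[p][o]) == 1:
            print(p, "->", o)
```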
The proposed solution methodology for the BIP models is based on the branch-and-bound method. In particular, one of the contributions of this dissertation is the development of a strategy for branching variables and node selection that is consistent with allocation priorities based on MD, which improves branch-and-bound performance and allows the method to handle large-scale applications. The suggested allocation process includes: (a) multiple allocation groups; (b) multiple constraints; and (c) different BIP models. Numerical experiments with different projects and options illustrate the application of the proposed approach.
GlySpy: A software suite for assigning glycan topologies from sequential mass spectral data
GlySpy is a suite of algorithms used to determine the structure of glycans. Glycans, which are orderly aggregations of monosaccharides such as glucose, mannose, and fucose, are often attached to proteins and lipids, and provide a wide range of biological functions. Previous biomolecule-sequencing algorithms have operated on linear polymers such as proteins or DNA but, because glycans form complicated branching structures, new approaches are required. GlySpy uses data derived from sequential mass spectrometry (MSn), in which a precursor molecule is fragmented to form products, each of which may then be fragmented further, gradually disassembling the glycan. GlySpy resolves the structures of the original glycans by examining these disassembly pathways.
The four main components of GlySpy are: (1) OSCAR (the Oligosaccharide Subtree Constraint Algorithm), which accepts analyst-selected MSn disassembly pathways and produces a set of plausible glycan structures; (2) IsoDetect, which reports the MSn disassembly pathways that are inconsistent with a set of expected structures, and which therefore may indicate the presence of alternative isomeric structures; (3) IsoSolve, which attempts to assign the branching structures of multiple isomeric glycans found in a complex mixture; and (4) Intelligent Data Acquisition (IDA), which provides automated guidance to the mass spectrometer operator, selecting glycan fragments for further MSn disassembly.
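To illustrate the kind of reasoning involved (a simplified sketch, not GlySpy's algorithms; the topology and residue masses below are approximate and invented), a glycan topology can be modelled as a tree whose single-cleavage fragments constrain which structures are consistent with an observed disassembly pathway:

```python
# Simplified sketch of the core idea behind checking MSn disassembly pathways
# against a candidate topology: model the glycan as a tree of monosaccharide
# residues and enumerate the masses of the fragments obtained by cleaving one
# glycosidic bond. Residue masses are approximate; the topology is invented.
RESIDUE_MASS = {"Hex": 162.053, "HexNAc": 203.079, "Fuc": 146.058}  # Da, approximate

# A toy branched topology: node = (residue, [children]).
glycan = ("HexNAc", [
    ("HexNAc", [
        ("Hex", [
            ("Hex", [("Hex", [])]),
            ("Hex", [("Hex", [])]),
        ]),
    ]),
    ("Fuc", []),
])

def subtree_mass(node):
    residue, children = node
    return RESIDUE_MASS[residue] + sum(subtree_mass(c) for c in children)

def single_cleavage_fragments(node):
    """Masses of the subtrees released by cleaving one glycosidic bond."""
    _, children = node
    fragments = []
    for child in children:
        fragments.append(subtree_mass(child))            # the lost branch
        fragments.extend(single_cleavage_fragments(child))
    return fragments

total = subtree_mass(glycan)
for frag in sorted(set(round(m, 3) for m in single_cleavage_fragments(glycan))):
    # A proposed topology is consistent with an observed product ion only if
    # some cleavage can account for the observed mass loss.
    print(f"branch fragment {frag:8.3f} Da  (complement {total - frag:8.3f} Da)")
```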
This dissertation provides a primer for the underlying interdisciplinary topics (carbohydrates, glycans, MSn, and so on) and also presents a survey of the relevant literature with a focus on currently available tools. Each of GlySpy's four algorithms is described in detail, along with results from their application to biologically derived glycan samples. A summary enumerates GlySpy's contributions, which include de novo glycan structural analysis, favorable performance characteristics, interpretation of higher-order MSn data, and the automation of both data acquisition and analysis.
Unsupervised multilingual learning
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 241-254).
For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance on core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Beyond these traditional NLP tasks, we also present a multilingual model for lost language decipherment. We test this model on the ancient Ugaritic language. Our results show that we can automatically uncover much of the historical relationship between Ugaritic and Biblical Hebrew, a known related language.
by Benjamin Snyder. Ph.D.
Real Time Crime Prediction Using Social Media
Crime is on the increase and has a detrimental influence on a nation's economy, despite several studies on crime prediction aimed at minimising crime rates. Historically, data mining techniques for crime prediction have relied on historical information and are mostly country-specific; in fact, only a few of the earlier studies on crime prediction follow a standard data mining procedure. Given the current worldwide trend in which criminals routinely publish their criminal intent on social media and invite others to view and/or engage in different crimes, an alternative and more dynamic strategy is needed. The goal of this research is to improve the performance of crime prediction models. This thesis therefore explores the potential of using information from social media (Twitter) for crime prediction in combination with historical crime data. Using data mining techniques, it also identifies the feature engineering most relevant to a United Kingdom dataset and most likely to improve crime prediction model performance. Additionally, this study presents a function that could be used across the United Kingdom for data cleansing, pre-processing, and feature engineering. A Shiny app was also used to display tweet sentiment trends so that crime can be prevented in near-real time.
Exploratory analysis is essential for revealing the data pre-processing and feature engineering needed before the data are fed into a machine learning model. Based on earlier documented studies, this is the first research to carry out a full exploratory analysis of historical British crime statistics using the historical stop-and-search dataset. Based on the findings of the exploratory study, an algorithm was created to clean the data and prepare it for further analysis and model creation. This provides a ready-to-use dataset for future research, particularly for non-experts constructing models to forecast crime or conducting investigations across around 32 police districts of the United Kingdom.
Moreover, this is the first study to present a complete collection of geospatial parameters for training a crime prediction model by combining demographic data from the same United Kingdom source with hourly sentiment polarity that was not restricted to a Twitter keyword search. Six base models frequently mentioned in the previous literature were selected, trained on the historical stop-and-search crime dataset, evaluated on test data, and finally validated on the London and Kent crime datasets. Two different datasets were created from Twitter and historical data (historical crime data with a Twitter sentiment score, and historical data without it). Six of the most prevalent machine learning classifiers (Random Forest, Decision Tree, K-nearest neighbour, support vector machine, neural network, and naïve Bayes) were trained and tested on these datasets. Additionally, the hyperparameters of each of the six models were tuned using a random grid search.
Voting classifiers and a logistic-regression stacked ensemble of different models were also trained and tested on the same datasets to enhance individual model performance. In addition, two stacked ensembles of multiple models were constructed so that the most suitable model for crime prediction on the UK dataset could be selected on the basis of performance. In terms of interpretation, this research differs from most earlier studies that employed Twitter data in that several methodologies were used to show how each attribute contributed to the construction of the model, and the findings were discussed and interpreted in the context of the study. Further, a Shiny app visualisation tool was designed to display the tweets' sentiment scores, text, users' screen names, and vicinity, which allows any criminal actions to be investigated in near-real time. The evaluation of the models revealed that Random Forest, Decision Tree, and K-nearest neighbour outperformed the other models; of these, Decision Tree and Random Forest performed consistently better when evaluated on test data.
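As a rough sketch of the modelling pipeline described above (with synthetic data and assumed feature names rather than the thesis's actual datasets), scikit-learn supports both the randomized hyperparameter search and the voting ensemble:

```python
# Minimal sketch under assumed features: combine historical crime features with
# an hourly tweet-sentiment score, tune a random forest with a randomized grid
# search, and compare against a soft-voting ensemble of several base classifiers.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "hour": rng.integers(0, 24, n),                  # hour of day
    "district": rng.integers(0, 32, n),              # hypothetical police district id
    "past_week_count": rng.poisson(3, n),            # recent crime count in the area
    "sentiment": rng.normal(0, 1, n),                # hourly tweet sentiment polarity
})
y = (data["past_week_count"] + (data["sentiment"] < -0.5)).gt(3).astype(int)  # toy label

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=0)

# Randomized hyperparameter search for the random forest.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [100, 200, 400], "max_depth": [None, 5, 10]},
    n_iter=5, cv=3, random_state=0,
)
search.fit(X_train, y_train)
print("tuned RF accuracy:", search.score(X_test, y_test))

# Soft-voting ensemble over three of the base learners named above.
vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="soft",
)
vote.fit(X_train, y_train)
print("voting ensemble accuracy:", vote.score(X_test, y_test))
```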
Detecting and mapping forest nutrient deficiencies: eucalyptus hybrid (Eucalyptus grandis x Eucalyptus urophylla) trees in KwaZulu-Natal, South Africa.
Doctoral Degree. University of KwaZulu-Natal, Pietermaritzburg. Abstract available in PDF.
Simulation of population changes of western dwarf mistletoe on Ponderosa pine
Western dwarf mistletoe (Arceuthobium campylopodum Engelm. f. campylopodum) is a parasite of ponderosa pine (Pinus ponderosa Laws.). The objectives of this investigation are: (a) to formulate a mathematical description of the process of dwarf mistletoe disease spread in a pine forest, (b) to use this description to predict the spread in a few cases of interest, and (c) from the results to make some general hypotheses concerning the process. The simulation is based on a young-growth, managed ponderosa pine stand, where the trees are evenly spaced (9 to 18 feet apart), are of uniform height (10 to 25 feet), and have a light to moderate infection level.
The model consists of four major submodels: tree growth, mistletoe seed production, seed dispersal, and infection establishment. The tree growth submodel provides information concerning the size, position, and number of susceptible branches. The seed production submodel relates the amount of inoculum present to plant age. The process of disease spread is partitioned into a series of sequentially operating events. The probabilities associated with the events from mistletoe seed production to seed interception by a susceptible branch are computed in the seed dispersal submodel; the probabilities of the subsequent events leading to infection are computed in the infection establishment submodel. Each submodel provides information for the next one, forming an interlocking set.
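The sequential-event structure can be illustrated with a small sketch (the probabilities and counts below are invented placeholders, not the fitted values used in the thesis):

```python
# Sketch of the sequential-event structure described above: the chance that one
# dispersed mistletoe seed produces a new infection is the product of the
# conditional probabilities of the events from dispersal through establishment.
seed_events = {
    "seed_survives_dispersal": 0.40,               # seed dispersal submodel
    "intercepted_by_susceptible_branch": 0.10,
    "overwinters_on_branch": 0.50,                 # infection establishment submodel
    "germinates_and_penetrates": 0.15,
}

p_infection_per_seed = 1.0
for event, p in seed_events.items():
    p_infection_per_seed *= p

seeds_per_plant = 100          # hypothetical annual seed crop (seed production submodel)
plants_per_infected_tree = 2   # one of the simulated infection levels

expected_new_infections_per_tree = (
    seeds_per_plant * plants_per_infected_tree * p_infection_per_seed
)
print(f"P(infection per seed) = {p_infection_per_seed:.4f}")
print(f"expected new infections per infected tree per year = "
      f"{expected_new_infections_per_tree:.2f}")
```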
Seven cases are examined using the complete simulation model. These include three tree spacings (9, 13, and 18 feet) with two moderate levels of infection (2 and 4 plants per infected tree), simulated for five years, and one case with a heavy infection level (15 plants and 9-foot spacing), simulated for ten years. The results are examined to assess changes in (a) the probability of infection with respect to tree spacing within a hypothetical stand, branchlet height, infection level, and time, and (b) the expected number of new infections.
The model shows that the probability of reinfection decreases as the crown volume around a given height becomes larger and the foliage becomes sparser. The probability of infection due to contagion is found to decrease by about half for an increase in stand spacing of five feet. In a stand with an initial infection rate of 0.60 and a spacing of 9 feet, the expected number of new infections per 100 trees at the end of the fifth year is found to be 283 plants where there is an initial level of 2 plants per infected tree, and 644 plants where there is a level of 4 plants per infected tree.
Based on examination of the behavior of the model, five hypotheses concerning the disease spread process are formulated. (1) Plants high in the crown of the pine trees are the most important ones with respect to disease spread. (2) Where infection levels are moderate (fewer than 5 infections per tree) and spacing is greater than 8 feet, vertical spread is accomplished primarily by reinfection. (3) It is possible for a tree to "outgrow" its infections. (4) In stands with spacing distances greater than 8 feet and a sparse mistletoe population, new infections are more likely to occur as a result of reinfection than as a result of contagion. (5) Increasing the spacing between trees reduces the probability of mistletoe infection from both reinfection and contagion. These hypotheses have practical importance for the management of young pine forests. They indicate that selective thinning should discriminate against trees with infections at the greatest heights. Also, in young stands with moderate infection levels, the chances are favorable for the trees to outgrow their infections if they are spaced such that growth conditions are optimum.
The role of structured induction in expert systems
A "structured induction" technique was developed and tested using a
rules- from -examples generator together with a chess -specific application
package. A drawback of past experience with computer induction, reviewed
in this thesis, has been the generation of machine -oriented rules opaque to
the user. By use of the structured approach humanly understandable rules
were synthesized from expert supplied examples. These rules correctly performed chess endgame classifications of sufficient complexity to be regarded as difficult by international master standard players. Using the "Interactive ID3" induction tools developed by the author, chess experts, with
a little programming support, were able to generate rules which solve problems considered difficult or impossible by conventional programming techniques. Structured induction and associated programming tools were
evaluated using the chess endgames Icing and Pawn vs. King (Black -tomove) and King and Pawn vs. King and Rook (White -to -move, White Pawn on
a7) as trial problems of measurable complexity.Structured solutions to both trial problems are presented, and implications of this work for the design of expert systems languages are assessed
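As a brief illustration of the induction step underlying rules-from-examples generators such as "Interactive ID3" (a generic textbook-style sketch with toy examples, not the thesis's code or endgame data):

```python
# Compact sketch of the core ID3 step: choose the attribute with the highest
# information gain over a table of expert-supplied examples, then recurse.
# The attributes and examples below are invented illustrations.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, attributes, label):
    base = entropy([r[label] for r in rows])
    def gain(attr):
        remainder = 0.0
        for v in {r[attr] for r in rows}:
            subset = [r[label] for r in rows if r[attr] == v]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def id3(rows, attributes, label):
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                                  # leaf: all examples agree
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # leaf: majority class
    attr = best_attribute(rows, attributes, label)
    rest = [a for a in attributes if a != attr]
    return {attr: {v: id3([r for r in rows if r[attr] == v], rest, label)
                   for v in {r[attr] for r in rows}}}

# Toy expert-supplied examples with invented attributes.
examples = [
    {"pawn_on_7th": "yes", "own_king_near": "yes", "outcome": "won"},
    {"pawn_on_7th": "yes", "own_king_near": "no",  "outcome": "drawn"},
    {"pawn_on_7th": "no",  "own_king_near": "yes", "outcome": "drawn"},
    {"pawn_on_7th": "no",  "own_king_near": "no",  "outcome": "drawn"},
]
print(id3(examples, ["pawn_on_7th", "own_king_near"], "outcome"))
```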