Research Enterprise Office Search Portal
All employees at University Technology Petronas need to access information
instantaneously in order to work effectively and efficiently. Is it easy to
collaborate and gather the right information at the right time? Is all the research within an
organization documented? Is it easily available to all employees? And what happens when
an employee leaves?
This project analyzes current practices and outcomes of search portals as they are
evolving in most organizations. The findings suggest that interest in search engines
across a variety of industries is very high, that the technological foundations are varied,
and that the major concerns revolve around achieving the right amount and type of
accurate research content and garnering support for contributing to the search portal.
Implications for practice and suggestions for future research are drawn from the
study findings.
This project focuses on the search function. The research examines how to make this search
portal useful to the University Technology Petronas (UTP) community, that is, the UTP
staff and lecturers. Search portal solutions of this kind are ideal for operations and
maintenance manuals that were once confined to 3-inch-thick binders sitting on the shelves
of many treatment plants. Moving the standard procedures, troubleshooting guidance, theory,
alarms, and equipment descriptions from those manuals to an electronic, web-based solution
offers many benefits. For one, the information can be updated and kept current much more
effectively, because it can be changed in one place and instantly updated at all access
points. With this search portal, staff and lecturers will be able to retrieve information
quickly and efficiently.
Towards a Query Optimizer for Text-Centric Tasks
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for
a variety of tasks. As a notable example, information extraction applications derive structured
relations from unstructured text; as another example, focused crawlers explore the web to locate
pages about specific topics. Execution plans for text-centric tasks follow two general paradigms
for processing a text database: either we can scan, or "crawl," the text database or, alternatively,
we can exploit search engine indexes and retrieve the documents of interest via carefully crafted
queries constructed in task-specific ways. The choice between crawl- and query-based execution
plans can have a substantial impact on both execution time and output "completeness" (e.g.,
in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain
intuition. In this article, we present fundamental building blocks to make the choice of execution
plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to
analyze query- and crawl-based plans in terms of both execution time and output completeness.
We adapt results from random-graph theory and statistics to develop a rigorous cost model for
the execution plans. Our cost model reflects the fact that the performance of the plans depends
on fundamental task-specific properties of the underlying text databases. We identify these
properties and present efficient techniques for estimating the associated parameters of the cost
model. We also present two optimization approaches for text-centric tasks that rely on the cost-model
parameters and select efficient execution plans. Overall, our optimization approaches
help build efficient execution plans for a task, resulting in significant efficiency and output
completeness benefits. We complement our results with a large-scale experimental evaluation
for three important text-centric tasks and over multiple real-life data sets.

Information Systems Working Papers Series
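The crawl-versus-query trade-off described above can be loosely illustrated with a toy cost comparison. Everything in the sketch below (the per-document and per-query timing constants, and the simple linear cost formulas) is a made-up assumption for illustration; the article's actual cost model is derived from random-graph theory and task-specific database properties.

```python
# Illustrative sketch only: a toy cost-based choice between a crawl-based
# and a query-based execution plan. All timing parameters are assumptions.

def crawl_cost(db_size, t_retrieve=0.01, t_process=0.1):
    """Scan plan: retrieve and process every document in the database."""
    return db_size * (t_retrieve + t_process)

def query_cost(num_queries, docs_per_query, t_query=0.5, t_process=0.1):
    """Query plan: issue crafted queries, process only returned documents."""
    return num_queries * (t_query + docs_per_query * t_process)

def choose_plan(db_size, num_queries, docs_per_query):
    """Pick the cheaper plan, as a cost-based optimizer would."""
    c, q = crawl_cost(db_size), query_cost(num_queries, docs_per_query)
    return ("crawl", c) if c <= q else ("query", q)

# When the documents of interest are sparse relative to the database,
# the query plan wins; when the database is small, scanning is cheaper.
sparse = choose_plan(db_size=1_000_000, num_queries=200, docs_per_query=50)
dense = choose_plan(db_size=1_000, num_queries=200, docs_per_query=50)
```

The point of the sketch is only that the better plan depends on database properties that must be estimated, which is exactly why the article argues for an informed, cost-based choice rather than heuristics.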
Information Inflation: Can The Legal System Adapt?
Information is fundamental to the legal system. Accordingly, lawyers must understand that information, as a cultural and technological edifice, has profoundly and irrevocably changed. There has been a civilization-wide morph, or pulse; one might say that information has evolved. This article discusses the new inflationary dynamic, which has caused written information to multiply by as much as ten-thousand-fold in recent years. The resulting landscape has stressed the legal system; indeed, it is becoming prohibitively expensive for lawyers even to search through information. This is particularly true in litigation.
Observable Behavior for Implicit User Modeling: A Framework and User Studies
This paper presents a framework for observable behavior that can be used as a basis for user modeling, and it reports the results of a pair of user studies that examine the joint utility of two specific behaviors. User models can be constructed by hand, or they can be learned automatically based on feedback provided by the user about the relevance of documents that they have examined. By observing user behavior, it is possible to obtain implicit feedback without requiring explicit relevance judgments. Four broad categories of potentially observable behavior are identified (examine, retain, reference, and annotate), and specific behaviors within a category are further subdivided based on the natural scope of the information objects being manipulated: segment, object, or class. Previous studies using Internet discussion groups (USENET news) have shown reading time to be a useful source of implicit feedback for predicting a user's preferences. The experiments reported in this paper extend that work to academic and professional journal articles and abstracts, and explore the relationship between printing behavior and reading time. Two user studies were conducted in which undergraduate students examined articles or abstracts from the telecommunications or pharmaceutical literature. The results showed that reading time can be used to predict the user's assessment of relevance, that the mean reading time for journal articles and technical abstracts is longer than has been reported for USENET news documents, and that printing events provide additional useful evidence about relevance beyond what can be inferred from reading time. The paper concludes with a brief discussion of the implications of the reported results.
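The core idea, inferring relevance from reading time and printing events, can be sketched as a simple threshold model. Everything below (the threshold-learning rule, the treatment of printing as overriding evidence, and the toy observations) is an illustrative assumption, not the studies' actual analysis.

```python
# Illustrative sketch: predicting relevance from observed behavior.
# The threshold rule and toy data are assumptions for illustration only.

def learn_threshold(observations):
    """Pick the reading-time threshold that best separates labeled examples.

    observations: list of (reading_time_s, relevant) pairs, e.g. collected
    alongside explicit relevance judgments during a user study.
    """
    best_t, best_acc = 0.0, -1.0
    for t in sorted(rt for rt, _ in observations):
        acc = sum((rt >= t) == rel for rt, rel in observations) / len(observations)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict_relevance(reading_time_s, printed, threshold_s):
    """Printing is strong 'retain'-category evidence beyond reading time."""
    return printed or reading_time_s >= threshold_s

# Toy observations: short reads judged non-relevant, long reads relevant.
obs = [(5.0, False), (12.0, False), (95.0, True), (130.0, True)]
threshold = learn_threshold(obs)
```

This mirrors the paper's finding at a toy scale: reading time alone gives a usable relevance signal, and a printing event adds evidence that a short reading time would otherwise miss.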
Personalized classification for keyword-based category profiles
Personalized classification refers to allowing users to define their own categories and automating the assignment of documents to these categories. In this paper, we examine the use of keywords to define personalized categories and propose the use of a Support Vector Machine (SVM) to perform personalized classification. Two scenarios have been investigated. The first assumes that the personalized categories are defined in a flat category space. The second assumes that each personalized category is defined within a pre-defined general category that provides a more specific context for the personalized category. The training documents for personalized categories are obtained from a training document pool using a search engine and a set of keywords. Our experiments delivered better classification results under the second scenario. We also conclude that the number of keywords used can be very small, and that increasing their number does not always lead to better classification performance.
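The pipeline this abstract describes (keywords select training documents from a pool, then an SVM learns the category) can be sketched roughly as follows. The tiny Pegasos-style SVM trainer, the toy document pool, and the keyword-matching rule are all illustrative assumptions rather than the paper's experimental setup.

```python
# Rough sketch of keyword-based personalized classification: training
# documents for a user-defined category are gathered by keyword matching
# over a pool, then a linear SVM assigns new documents to the category.

import random
from collections import Counter

def keyword_search(pool, keywords):
    """Label pool documents: positive if they contain any category keyword."""
    return [(Counter(doc.split()),
             1 if any(k in doc.split() for k in keywords) else -1)
            for doc in pool]

def train_svm(data, lam=0.01, epochs=50, seed=0):
    """Train a linear SVM with Pegasos-style stochastic subgradient steps."""
    rng = random.Random(seed)
    w, t = {}, 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            for f in w:                      # regularization: shrink weights
                w[f] *= 1.0 - eta * lam
            if margin < 1.0:                 # hinge-loss subgradient update
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w

def classify(w, doc):
    """Assign doc to the personalized category if its score is positive."""
    return sum(w.get(f, 0.0) * v for f, v in Counter(doc.split()).items()) > 0

# Toy pool: a "machine learning" category defined by a few keywords.
pool = [
    "neural network training deep learning",
    "svm kernel margin classifier",
    "deep network layers",
    "football match goal result",
    "basketball game result team",
    "tennis match result",
]
model = train_svm(keyword_search(pool, ["network", "svm", "kernel", "deep"]))
```

Note how few keywords suffice to bootstrap the positive class in this toy setting, which is consistent with the abstract's observation that a small keyword set can be enough.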