Empirical Study of Deep Learning for Text Classification in Legal Document Review
Predictive coding has been widely used in legal matters to find relevant or
privileged documents in large sets of electronically stored information. It
It significantly reduces review time and cost. Logistic Regression (LR) and Support
Vector Machines (SVM) are two popular machine learning algorithms used in
predictive coding. Recently, deep learning has received a lot of attention in many
industries. This paper reports our preliminary studies in using deep learning
in legal document review. Specifically, we conducted experiments to compare
deep learning results with results obtained using an SVM algorithm on four
datasets from real legal matters. Our results showed that CNN performed better
with larger volumes of training data and is a good fit for text
classification in the legal industry. Comment: 2018 IEEE International Conference on Big Data (Big Data)
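The SVM side of the comparison above can be sketched with a standard TF-IDF plus linear-SVM pipeline. This is a minimal illustration, not the paper's setup: the documents, labels, and test query below are invented toy data, not drawn from the legal matters studied.

```python
# Sketch of a TF-IDF + linear SVM baseline for responsiveness classification.
# All documents and labels are invented toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = [
    "merger agreement draft attached for review",
    "quarterly financial statements and audit notes",
    "lunch menu for the office party",
    "holiday schedule and parking reminders",
]
train_labels = ["responsive", "responsive", "non-responsive", "non-responsive"]

# Fit the whole pipeline: vectorize text, then train a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# Classify a new document that shares vocabulary with the responsive class.
print(model.predict(["please review the attached merger agreement"])[0])
```

In practice the CNN comparison in the paper would replace this pipeline with a word-embedding convolutional model, which, per the reported results, benefits from much larger training sets than this sketch suggests.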
Explainable Text Classification in Legal Document Review: A Case Study of Explainable Predictive Coding
In today's legal environment, lawsuits and regulatory investigations require
companies to embark upon increasingly intensive data-focused engagements to
identify, collect and analyze large quantities of data. When documents are
staged for review, the process can require companies to dedicate an
extraordinary level of resources, both with respect to human reviewers and
with respect to the use of technology-based techniques to intelligently
sift through data. For several years, attorneys have been using a variety of
tools to conduct this exercise, and most recently, they are accepting the use
of machine learning techniques like text classification to efficiently cull
massive volumes of data to identify responsive documents for use in these
matters. In recent years, a group of AI and Machine Learning researchers have
been actively researching Explainable AI. In an explainable AI system, actions
or decisions are human understandable. In typical legal `document review'
scenarios, a document can be identified as responsive, as long as one or more
of the text snippets in a document are deemed responsive. In these scenarios,
if predictive coding can be used to locate these responsive snippets, then
attorneys could easily evaluate the model's document classification decision.
When deployed with defined and explainable results, predictive coding can
drastically enhance the overall quality and speed of the document review
process by reducing the time it takes to review documents. The authors of this
paper propose the concept of explainable predictive coding and simple
explainable predictive coding methods to locate responsive snippets within
responsive documents. We also report our preliminary experimental results using
the data from an actual legal matter that entailed this type of document
review. Comment: 2018 IEEE International Conference on Big Data (Big Data)
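The snippet-level idea described above can be sketched in a few lines: score each snippet of a document with the same scoring function used for whole-document classification, and surface the highest-scoring snippet as the explanation. The keyword weights below are an invented stand-in for a trained model, not the paper's method.

```python
# Sketch: explain a document-level "responsive" decision by locating the
# snippet that contributes most to the score. Weights are invented toy values.
WEIGHTS = {"contract": 2.0, "breach": 3.0, "invoice": 1.5, "weather": -1.0}

def score(text):
    """Sum of term weights over the tokens in `text` (a linear model stand-in)."""
    return sum(WEIGHTS.get(tok, 0.0) for tok in text.lower().split())

def most_responsive_snippet(document, threshold=2.0):
    """Split a document into sentence-like snippets and return the
    highest-scoring one if it clears the responsiveness threshold."""
    snippets = [s.strip() for s in document.split(".") if s.strip()]
    best = max(snippets, key=score)
    return best if score(best) >= threshold else None

doc = "The weather was fine. We discussed the breach of contract terms. Lunch followed."
print(most_responsive_snippet(doc))  # → "We discussed the breach of contract terms"
```

Because a document is responsive as soon as one snippet is, returning the top-scoring snippet gives an attorney a concrete, human-checkable justification for the model's document-level decision.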
An Empirical Study of the Application of Machine Learning and Keyword Terms Methodologies to Privilege-Document Review Projects in Legal Matters
Protecting privileged communications and data from disclosure is paramount
for legal teams. Unrestricted legal advice, such as attorney-client
communications or litigation strategy, is vital to the legal process and is
exempt from disclosure in litigation or regulatory events. To protect this
information from being disclosed, companies and outside counsel must review
vast amounts of documents to determine those that contain privileged material.
This process is extremely costly and time consuming. As data volumes increase,
legal counsel employ methods to reduce the number of documents requiring review
while balancing the need to ensure the protection of privileged information.
Keyword searching is relied upon as a method to target privileged information
and reduce document review populations. Keyword searches are effective at
casting a wide net but return over-inclusive results -- most of which do not
contain privileged information -- and without detailed knowledge of the data,
keyword lists cannot be crafted to find all privileged material.
Overly inclusive keyword searching can also be problematic: even as it drives
up costs, it can cast `too wide a net' and thus produce unreliable
results. To overcome these weaknesses of keyword searching, legal
teams are using a new method to target privileged information called predictive
modeling. Predictive modeling can successfully identify privileged material but
little research has been published to confirm its effectiveness when compared
to keyword searching. This paper summarizes a study of the effectiveness of
keyword searching and predictive modeling when applied to real-world data. With
this study, this group of collaborators wanted to examine and understand the
benefits and weaknesses of both approaches for legal teams in identifying
privileged material in document populations. Comment: 2018 IEEE International Conference on Big Data (Big Data)
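The over-inclusiveness trade-off described above can be made concrete with precision and recall on a labeled population. The corpus, labels, and keyword list below are invented toy data; real privilege-review populations are of course vastly larger.

```python
# Sketch: measure a broad keyword list against hand labels on a toy corpus.
# Each entry maps a document id to (text, is_privileged). All data is invented.
docs = {
    "d1": ("advice from counsel on the pending litigation", True),
    "d2": ("counsel recommends a settlement strategy", True),
    "d3": ("general counsel announced the new office", False),
    "d4": ("team lunch scheduled for friday", False),
}
keywords = {"counsel", "litigation", "privileged"}

# Keyword search: a document is a hit if it contains any keyword.
hits = {d for d, (text, _) in docs.items()
        if keywords & set(text.lower().split())}
relevant = {d for d, (_, priv) in docs.items() if priv}

precision = len(hits & relevant) / len(hits)
recall = len(hits & relevant) / len(relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the keyword net catches every privileged document (recall 1.00) but also sweeps in a non-privileged one (precision 0.67), which is exactly the over-inclusiveness the study contrasts with predictive modeling.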
The Role of Document Structure and Citation Analysis in Literature Information Retrieval
Literature Information Retrieval (IR) is the task of searching relevant publications given a particular information need expressed as a set of queries. With the staggering growth of scientific literature, it is critical to design effective retrieval solutions that facilitate efficient access to it. We hypothesize that genre-specific characteristics of scientific literature, such as metadata and citations, are potentially helpful for enhancing scientific literature search. We conducted systematic and extensive IR experiments on open information retrieval test collections to investigate their roles in enhancing literature retrieval effectiveness. This thesis consists of three major parts. First, we examined the role of document structure in literature search through comprehensive studies of the retrieval effectiveness of a set of structure-aware retrieval models on ad hoc scientific literature search tasks. Second, under the language modeling retrieval framework, we studied exploiting citation and co-citation analysis results as sources of evidence for enhancing literature search. Specifically, we examined relevant document distribution patterns over partitioned clusters of document citation and co-citation graphs; we examined seven ways of modeling document prior probabilities of being relevant based on document citation and co-citation analysis; and we studied the effectiveness of boosting retrieved documents with scores of their neighborhood documents in terms of co-citation counts, co-citation similarities and Howard White's pennant scores. Third, we combined both structured retrieval features and citation-related features in developing machine-learned retrieval models for literature search and assessed the effectiveness of learning-to-rank algorithms and various literature-specific features. Our major findings are as follows.
State-of-the-art structure-aware retrieval models, though they reportedly perform well in known-item finding tasks, do not significantly outperform non-fielded baseline retrieval models in ad hoc literature information retrieval. Though relevant document distributions over citation and co-citation network graph partitions reveal favorable patterns, citation and co-citation analysis results on the current iSearch test collection only modestly improve retrieval effectiveness. However, priors derived from co-citation analysis outperform those derived from citation analysis, and pennant scores for document expansion outperform raw co-citation counts or cosine similarity of co-citation counts. Our learning-to-rank experiments show that in a heterogeneous collection setting, citation-related features can significantly outperform baselines. Ph.D., Information Studies -- Drexel University, 201
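The second part of the thesis — combining a language-model retrieval score with a citation-derived document prior — amounts to ranking by log P(d) + log P(q|d). The sketch below uses Dirichlet-smoothed query likelihood and a citation-count prior; the smoothing and prior forms, and all documents and counts, are illustrative assumptions, not the exact models evaluated in the thesis.

```python
import math

def query_likelihood(query, doc, collection, mu=100.0):
    """Dirichlet-smoothed log P(q|d) under a unigram language model."""
    doc_toks = doc.split()
    coll_toks = collection.split()
    score = 0.0
    for q in query.split():
        p_coll = coll_toks.count(q) / len(coll_toks)
        p = (doc_toks.count(q) + mu * p_coll) / (len(doc_toks) + mu)
        score += math.log(p) if p > 0 else math.log(1e-12)
    return score

def citation_prior(citations, total_citations):
    """Log prior proportional to a document's (add-one smoothed) citation count."""
    return math.log((citations + 1) / (total_citations + 2))

# Toy collection: (text, citation count). Invented for illustration.
docs = {
    "a": ("retrieval model for literature search", 50),
    "b": ("retrieval model variant", 2),
}
collection = " ".join(text for text, _ in docs.values())
total = sum(c for _, c in docs.values())

query = "retrieval model"
ranked = sorted(docs,
                key=lambda d: query_likelihood(query, docs[d][0], collection)
                + citation_prior(docs[d][1], total),
                reverse=True)
print(ranked)
```

On this toy data the shorter document scores marginally higher on query likelihood alone, but the heavily cited document wins once the citation prior is added — the kind of re-ordering effect the thesis measures with its seven prior formulations.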