Keyword Culling: A Best Practice Approach
By Bobby Malhotra, Esq., Senior Project Manager, Professional Services Group, Encore Legal Solutions
INTRODUCTION
In the days preceding electronic discovery, a large corporation may have had to produce several thousand documents during the course of a typical litigation. Today, that same organization may have to produce several million documents because relevant data may exist in many forms and locations, and on various media and devices. This may include data from email messages, databases, home computers, PDAs, corporate voice recordings, and instant messages. As the volume of electronically discoverable information expands, the need to cull, or select out, only the useful information from data sets becomes a critical component of a successful electronic document review project. However, an accurate and defensible culling process is not simple. The best practice is to leave data culling to experts who can collect and filter data in a forensically sound manner, using tools that have been vetted and tested in the litigation industry, in order to significantly decrease the possibility of altering or omitting relevant electronic data.
WHAT IS DATA CULLING?
Data culling is a method of searching and processing a pool of documents for specific information useful to your case. Documents containing the information you are looking for are extracted from the total pool of data. A good data culling tool that is properly applied will give you the capability to search the text in the body of a document, the email address fields (i.e., the To, From, CC, and BCC fields), and dates or date ranges in the properties, or “metadata,” of a document.
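To make the three search dimensions concrete, here is a minimal Python sketch of a cull that combines a date range with keyword and address matching. The Document class, its field names, and the cull function are hypothetical and exist only for illustration; they do not describe any particular culling product, and a real tool would also have to handle attachments, stemming, and far larger volumes.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Document:
    """Hypothetical, simplified record for one email or file."""
    doc_id: str
    body: str
    sent: date
    sender: str = ""
    recipients: list = field(default_factory=list)  # To, CC, and BCC combined


def cull(docs, keywords, addresses, start, end):
    """Keep documents dated within [start, end] that hit a keyword or an address."""
    keywords = [k.lower() for k in keywords]
    addresses = {a.lower() for a in addresses}
    hits = []
    for doc in docs:
        # Date filter from the document's metadata.
        if not (start <= doc.sent <= end):
            continue
        # Case-insensitive substring search of the body text.
        body = doc.body.lower()
        keyword_hit = any(k in body for k in keywords)
        # Match on the From field or any recipient field.
        parties = {doc.sender.lower()} | {r.lower() for r in doc.recipients}
        address_hit = bool(parties & addresses)
        if keyword_hit or address_hit:
            hits.append(doc)
    return hits
```

Note that a document with no extractable body text can only be returned by this sketch through its date and address fields, never through a keyword.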
Data culling is vitally important because it can weed out hundreds or thousands of documents that are not relevant to your case, saving time, money, and effort later in the litigation. On the other hand, if a data culling tool is not properly applied, it can alter the metadata of your documents, give you an under-inclusive pool of documents to review, and lead to harsh sanctions or other difficulties with judges or adversaries.
NULL AND NON-INDEXED DOCUMENTS
Most data culling tools create an index of the words within a document. This indexing process allows the data set to be searched. However, some files cannot be indexed because the filtering tool cannot extract and index the words within them. These files are commonly referred to as “non-indexed documents.” In addition, some documents do not contain text at all. These documents, commonly referred to as “null-indexed documents,” contain only images or are blank. JPGs, TIFFs, and GIFs are common examples of null-indexed documents because they do not contain text. Furthermore, PDFs, PowerPoints, and Word documents, among others, can also be null-indexed documents if they contain only images.
It is important to point out that there is nothing inherently wrong with null-indexed or non-indexed documents, and they are quite common. However, because the indexing process could not extract text from these files, they will likely not be returned as hits when you search the body text using onsite culling tools, even if they are genuinely relevant. For instance, many indexing tools will miss image files such as JPGs or GIFs, or non-text embedded TIFFs or PDFs, because there is no text in the file to search. Information in these types of files, such as schematics or photographs, could be a major source of relevant information that will be overlooked by an onsite culling tool. Moreover, many business Internet faxing programs, such as the Biscom fax server or Windows, convert faxes to JPG or other image formats. These documents are therefore also non-text embedded and could easily contain relevant information that many onsite culling tools will miss.
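The distinction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: classify, partition, and toy_extract are hypothetical names, and toy_extract is a stand-in for whatever text extractor a real indexing tool uses. The point is that files that fail extraction, or yield no text, are routed to a separate list for manual review instead of silently dropping out of keyword results.

```python
from typing import Callable, Dict, List, Tuple


def classify(data: bytes, extract_text: Callable[[bytes], str]) -> str:
    """Label a file as 'indexed', 'null-indexed', or 'non-indexed'."""
    try:
        text = extract_text(data)
    except Exception:
        # The extractor could not read the file at all.
        return "non-indexed"
    # Extraction succeeded; an empty result means there are no words to index.
    return "indexed" if text.strip() else "null-indexed"


def partition(files: Dict[str, bytes],
              extract_text: Callable[[bytes], str]) -> Tuple[List[str], List[str]]:
    """Split file names into keyword-searchable files and files needing manual review."""
    searchable, needs_review = [], []
    for name, data in files.items():
        if classify(data, extract_text) == "indexed":
            searchable.append(name)
        else:
            needs_review.append(name)
    return searchable, needs_review


def toy_extract(data: bytes) -> str:
    """Stand-in extractor (an assumption): decodes UTF-8 and raises on anything else."""
    return data.decode("utf-8")
```

A collection workflow built this way can report how many files fell into the review list, which gives counsel a number to account for rather than an unknown gap.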
SPOLIATION AND METADATA
Oftentimes it is not clear what common onsite culling tools actually “do” to data when they are run. Some tools may physically access files, and some may change the original files. Running a search tool that has not been properly tested carries the risk that the application will alter the data without even opening a file. For example, it is quite common for the Google Desktop application to create and alter various files while in operation.
Experts avoid altering metadata by writing scripts that collect thousands of documents in a forensically sound manner.
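As a rough illustration of what such a script might involve, the Python sketch below hashes a source file, copies it with its timestamps, and verifies that the copy is identical. It is a simplified sketch, not a description of any expert's actual tooling: the function names and the shape of the returned log record are assumptions, and a real forensic collection would typically also use write blockers, chain-of-custody records, and tools validated for the purpose.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect(source: Path, dest_dir: Path) -> dict:
    """Copy one file into dest_dir and return a log record for it."""
    # Record the source metadata first; on some systems merely reading a
    # file may update its last-access time.
    before = source.stat()
    source_hash = sha256_of(source)

    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    # copy2 copies the contents and attempts to preserve timestamps.
    shutil.copy2(source, target)

    # Verify that the copy is bit-for-bit identical to the source.
    if sha256_of(target) != source_hash:
        raise RuntimeError(f"hash mismatch after copying {source}")

    return {
        "source": str(source),
        "copy": str(target),
        "sha256": source_hash,
        "size": before.st_size,
        "modified_ns": before.st_mtime_ns,
    }
```

The log record gives a later reviewer something to check against: if the hash of the produced copy matches the recorded hash, the content has not changed since collection.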
TYPES OF DATA
Many onsite indexing programs work on some, but not all, types of data. For example, Lotus Notes NSF files, which serve as email storage databases, can contain multiple layers of encryption. As a result, an encrypted Lotus Notes NSF file will likely be searchable only from within the Lotus Notes application. Also, network-stored data will not be considered by onsite culling tools if they are used only at the desktop level.
VALIDATION
Expert testimony is recognized as a sound method for a court to determine the admissibility of digital evidence. When an expert is not utilized, there is no neutral third party validating the data collection and search method. Even if a litigant collected and searched data correctly, opposing counsel is likely to allege that documents were left out of the production because they know that a neutral third party is not available to testify on the validity and accuracy of the collection and search method. Consequently, this also increases the chances that opposing counsel will eventually seek discovery sanctions for the failure to properly search for and produce relevant information.
CONCLUSION
Because tools such as Google Desktop are becoming so commonplace, our clients often ask whether their internal IT staff can conduct onsite keyword searching and culling themselves in order to save money on collection, processing, and review.
As a best practice, Encore recommends retaining the services of an expert for your data collection and searching needs. Specifically, you should seek the assistance of a reputable vendor or consulting firm that can collect and search data in a forensically sound manner. Once the data is collected, the vendor or consultants should be able to filter that large pool of data down to a smaller, more specific results pool based on the search criteria (i.e., keywords) that you provide.
Retaining the services of an expert will ensure data integrity and defensibility and will minimize evidentiary issues, such as foundational objections, pertaining to the handling of electronic data. If it is not practical to retain an expert, then a simple date search is often the most effective method to cull data. A date-restricted collection will still allow for collection of non-text embedded TIFFs and PDFs and will ensure a more complete collection than most common onsite culling tools.
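A date-restricted selection can be sketched as follows. This is a minimal Python illustration, assuming the file system's last-modified time is the date of interest; the function name is hypothetical, and the sketch only selects files without copying them. Because the filter never looks at file contents, image-only TIFFs and PDFs are selected on the same footing as text documents.

```python
from datetime import datetime, timezone
from pathlib import Path


def files_modified_between(root: Path, start: datetime, end: datetime) -> list:
    """Return files under root whose last-modified time falls within [start, end].

    start and end must be timezone-aware datetimes.
    """
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Compare in UTC so the result does not depend on the local time zone.
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if start <= modified <= end:
            selected.append(path)
    return selected
```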
Bobby Malhotra, Esq. is National Project Manager for Encore Legal Solutions, Professional Services Group. In this role, he manages large-scale electronic discovery, paper discovery, and web hosting projects. During his career, Mr. Malhotra has served both as an information technologist, in and out of the legal industry, and as a litigation attorney. He has been a frequent speaker on electronic discovery and records management and can be reached at bmalhotra@encorelegal.com.