7 eDiscovery Processing Tips

This post was originally published by H5, a Blue-level Relativity Best in Service Partner. We thought it provided great tips and best practices for processing. It's the first in a series—check out the H5 blog for more e-discovery workflow tips.

The collection is done and you finally have the data. Now what? What’s the best way to tackle processing without breaking a sweat, the bank, or your timeline?

Alongside early case assessment (ECA), this phase of e-discovery presents an opportunity to undertake a defensible reduction of data and pass along only the material that will ultimately be necessary to review. Here are a few tips that may help.

Tip 1: Don’t assume that because you collected it, you have to process it.

Consider the data that has been collected. Are you dealing with a select set of individual, cherry-picked files? If so, you’re in luck—kick off the processing set and go have lunch!

More likely, you’re facing hard drive images and data that has come from a variety of sources in myriad formats. In most matters, it’s only the user-created content that is of interest.

The size of a hard drive collection can cause a mini-heart attack, but user-created content is likely only a fraction of it. Processing everything rarely makes sense when there are so many files that probably don’t matter.

Consider what you want to target (or omit). Can you focus on particular areas on the drive? Targeting the User folder may make sense. If you’re concerned you may miss something that way, try excluding files that are clearly not user-generated (e.g., program files, Windows folders, and other system-related data) and process the rest. More on this below.
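As a rough illustration of the exclude-and-process-the-rest approach, here is a minimal Python sketch. It assumes Windows-style paths taken from a file listing of the image, and the folder names in `EXCLUDED_TOP_LEVEL` are a hypothetical starting list, not a definitive one. Your processing platform likely has its own way to express the same filter.

```python
from pathlib import PureWindowsPath

# Hypothetical exclusion list (an assumption); tailor it to the matter and OS.
EXCLUDED_TOP_LEVEL = {"windows", "program files", "program files (x86)", "programdata"}


def is_probably_user_content(path: str) -> bool:
    """Return False when the first folder under the drive root is a known system folder."""
    parts = PureWindowsPath(path).parts
    # parts[0] is the drive anchor (e.g. "C:\\"); parts[1] is the first folder or file.
    if len(parts) < 2:
        return True
    return parts[1].lower() not in EXCLUDED_TOP_LEVEL


listing = [
    r"C:\Users\jdoe\Documents\memo.docx",
    r"C:\Windows\System32\kernel32.dll",
    r"C:\Program Files\App\app.exe",
]
to_process = [p for p in listing if is_probably_user_content(p)]
```

Anything the filter drops should still be logged, so the excluded set can be sampled later.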

Tip 2: Be prepared to recalibrate. Facts about the case may have changed when you weren’t looking.

Litigation is an evolving, iterative process, and things change that could impact what needs to be processed. For example, consider the collection timeline. Did collection occur early and broadly due to time constraints? Often, a wider-than-necessary net is cast to include custodians and data sources on the fringe, “just in case.” Has the focus changed? Have some custodians been disqualified? Has the timeline been revised? Even minor changes can result in significant data reduction.

Tip 3: Prioritize the data.

Prioritize what you know to be the most important from a document-count perspective. Identifying high-priority custodians and targeting email content will help yield the most documents in the shortest time and get reviewers started while you supplement with other data sources and custodians.

Tip 4: Reduce, reduce, reduce.

Each step you take to limit the size of the data set saves time and chips away at costs that grow with data volume. This is the perfect time to implement an ECA workflow to cull your data down even further. Here are some processing techniques to employ:

DeNIST: As a first pass, your best friend in reducing the volume of a collected hard drive image is the use of a DeNIST hash list to identify common system files.
Filter by file type: Determine what you may be able to eliminate, given your goals. You may not have known about hidden pockets of system or program files, but they’ll be obvious once you receive a report showing the volume and proportion of each file type.
Filter by date: If you can cull by date, do it. Understand the date range that the data population covers so you can cut data that could not possibly be in scope.
Keyword cull: Maybe you don’t know enough to keyword-target for inclusion, but what about getting rid of what’s clearly junk? Consider eliminating mass company email blasts sent to particular distribution lists or generic notifications that contain text such as “Do not reply.”
De-duplication and email threading: If you’re working on an internal investigation without pending production obligations, what about cutting out other redundant content? Consider email threading and narrowing the set to only non-duplicative email families. While results can vary, we’ve seen reductions as high as 80 percent.
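To make three of these techniques concrete, here is a minimal Python sketch of DeNIST filtering, a date cull, and exact-duplicate removal chained together. It assumes each file is already loaded as a simple record, and `FileRecord`, `cull`, and the sample data are hypothetical names invented for illustration. In practice a processing platform does this work for you, and the known-file hashes would come from a published DeNIST list rather than a hand-built set. Keyword culling and email threading are left out.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileRecord:
    # Hypothetical record shape; real platforms track far more metadata.
    path: str
    content: bytes
    modified: date


def sha1_hex(data: bytes) -> str:
    """Hash file content so identical files match regardless of name or location."""
    return hashlib.sha1(data).hexdigest()


def cull(records, known_system_hashes, start, end):
    """Apply DeNIST, date-range, and exact-duplicate filters, in that order."""
    kept = []
    seen_hashes = set()
    counts = {"denist": 0, "out_of_range": 0, "duplicate": 0}
    for rec in records:
        digest = sha1_hex(rec.content)
        if digest in known_system_hashes:
            counts["denist"] += 1          # matches a known system file
        elif not (start <= rec.modified <= end):
            counts["out_of_range"] += 1    # outside the agreed date range
        elif digest in seen_hashes:
            counts["duplicate"] += 1       # same content already kept once
        else:
            seen_hashes.add(digest)
            kept.append(rec)
    return kept, counts


records = [
    FileRecord("C:/Windows/notepad.exe", b"system binary", date(2019, 1, 5)),
    FileRecord("C:/Users/a/memo.docx", b"memo text", date(2019, 3, 1)),
    FileRecord("C:/Users/b/memo-copy.docx", b"memo text", date(2019, 3, 2)),
    FileRecord("C:/Users/a/old.docx", b"old text", date(2010, 1, 1)),
]
known = {sha1_hex(b"system binary")}  # stand-in for a real DeNIST hash list
kept, counts = cull(records, known, date(2018, 1, 1), date(2020, 12, 31))
```

Returning per-step counts alongside the kept files gives you the numbers needed to report how much each filter removed.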

Tip 5: Avoid blind OCR—not all images are created equal.

OCR can eat up time and money, so make sure every page counts. Does OCRing every image type make sense? How often does OCR of a .png or .jpg result in usable text? Think about where your data is coming from and how aggressive you need to be.

Consider targeting image types more likely to yield text of reasonable quality. TIFFs and non-searchable PDFs are a great place to start. You can always run additional OCR later if you find that unexpected image types might benefit.
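A first-pass OCR selection rule along these lines might look like the following Python sketch. It assumes processing has already attempted text extraction, so a PDF with no extracted text is treated as non-searchable. The extension list and the `needs_ocr` helper are illustrative assumptions, not a standard.

```python
from pathlib import PureWindowsPath

# Assumed first-pass targets: TIFFs and PDFs, per the tip above.
OCR_FIRST_PASS_EXTENSIONS = {".tif", ".tiff", ".pdf"}


def needs_ocr(path: str, extracted_text: str) -> bool:
    """Queue a file for OCR only if it is a targeted type with no usable text yet."""
    ext = PureWindowsPath(path).suffix.lower()
    if ext not in OCR_FIRST_PASS_EXTENSIONS:
        return False
    # A file that already yielded text during processing is searchable; skip it.
    return not extracted_text.strip()


queue = [
    p for p, text in [
        ("scan.TIF", ""),
        ("report.pdf", "Quarterly results"),
        ("photo.jpg", ""),
        ("fax.pdf", "  \n"),
    ]
    if needs_ocr(p, text)
]
```

If a later review shows that other image types carry useful text, widening the extension set and rerunning is a small change.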

Tip 6: Understand what you’re leaving behind.

While it can be helpful to get rid of as much noise as possible, you need to be confident about what you’re leaving on the cutting room floor:

Random sampling and review of excluded sets is an effective way to understand what data is being left behind and make adjustments, if necessary. The proper sample size and process are important (your vendor can help), so know what you need to look at for your data set and engagement goals.
Data landscape reporting is a quick way to understand the file type makeup of your data set. If you’re targeting user content and a bunch of Word documents show up in what you’re leaving behind, that may suggest a problem.
Don’t have time for in-depth review? Take a look at a file name list organized by source folder to provide context. You might be able to home in on files of interest with a fairly quick scan.
When utilizing key terms, consider treating processing exceptions (e.g., protected or corrupt files) and non-text files (e.g., audio/visual files or proprietary software files) as pass-throughs for reviewers, just in case they may be considered responsive.
Make sure you (or your vendor) keep the processing content live and active so you can return to the well quickly if scope or needs change.
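For a sense of what the sampling and landscape checks involve, here is a minimal Python sketch. The sample size uses the standard formula for estimating a proportion, with a finite population correction. The 95 percent confidence level, 5 percent margin of error, and worst-case proportion of 0.5 are assumed defaults, not recommendations for any particular matter, and the function names are hypothetical. Your vendor or a statistician should confirm the right parameters.

```python
import math
import random
from collections import Counter
from pathlib import PureWindowsPath
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Documents to review to estimate a proportion within +/- margin."""
    if population <= 0:
        return 0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    n0 = z ** 2 * p * (1 - p) / margin ** 2             # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                # finite population correction
    return min(population, math.ceil(n))


def draw_sample(excluded_paths, seed):
    """Draw a simple random sample; a fixed seed makes the draw reproducible."""
    paths = list(excluded_paths)
    return random.Random(seed).sample(paths, sample_size(len(paths)))


def file_type_landscape(excluded_paths):
    """Count excluded files by extension to spot user content left behind."""
    return Counter(PureWindowsPath(p).suffix.lower() or "(none)" for p in excluded_paths)


excluded = [f"C:/Windows/file{i}.dll" for i in range(9998)]
excluded += ["C:/Temp/a.docx", "C:/Temp/b.DOCX"]
to_review = draw_sample(excluded, seed=7)
landscape = file_type_landscape(excluded)
```

With these defaults, an excluded set of 10,000 files calls for a sample of 370. Recording the seed lets you reproduce the same sample later if the methodology is questioned.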

Tip 7: Leverage expertise.

The world of data is rife with complexities. Don’t underestimate the knowledge and ability of technical and statistical experts to provide guidance. Tools and methods to analyze, parse, sample, and otherwise manage data are constantly being developed and enhanced. Working in close partnership with your vendor at each stage to consider the inputs, goals, and time and cost constraints for each particular matter can lead to a much less stressful, more efficient, and likely more defensible engagement.