Seeing Triple: How 3 Culprits Complicate De-duplication

This article was first published on Acorn’s blog.

Over the past several years, we’ve heard a growing number of complaints from clients about large numbers of duplicates in their hosting and review databases. When we look into the matter further, we find that these “duplicates,” although substantively the same content, are not exact copies of each other from an evidentiary perspective. For example, migrating email to an archiving system might cause slight changes to the formatting of the header, with the result that the processing platform does not treat the emails as duplicates.

We are seeing an increasing need for defensible, technology-based de-duplication beyond just MD5 hashing. The industry standard is to use the MD5 hash to identify any duplicates of a document that has already been processed and suppress them, saving clients the time, money, and headache of reviewing duplicative data.
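To make the hash-based approach concrete, here is a minimal sketch of exact-duplicate suppression. The function names, document IDs, and sample content are hypothetical, and a real processing platform decides for itself which parts of a file feed the hash; this only illustrates the general idea that identical bytes produce an identical MD5 value, while any change at all produces a different one.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the MD5 hash of the content as a hex string."""
    return hashlib.md5(content).hexdigest()


def suppress_exact_duplicates(documents):
    """Keep the first document seen for each hash value.

    `documents` is an iterable of (doc_id, content_bytes) pairs.
    Returns (primaries, duplicates), where duplicates is a list of
    (duplicate_doc_id, primary_doc_id) pairs.
    """
    first_seen = {}  # hash value -> doc_id of the primary document
    primaries, duplicates = [], []
    for doc_id, content in documents:
        digest = md5_hex(content)
        if digest in first_seen:
            duplicates.append((doc_id, first_seen[digest]))
        else:
            first_seen[digest] = doc_id
            primaries.append(doc_id)
    return primaries, duplicates


# Hypothetical sample data: the third document differs only by a prefix.
docs = [
    ("DOC-001", b"Quarterly report attached."),
    ("DOC-002", b"Quarterly report attached."),
    ("DOC-003", b"[EXTERNAL] Quarterly report attached."),
]
primaries, duplicates = suppress_exact_duplicates(docs)
print(primaries)
print(duplicates)
```

In this sample, DOC-002 is suppressed as a duplicate of DOC-001, while DOC-003 survives as a primary document even though a reviewer would consider it the same content.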

As a first step, clients use solutions provided by RelativityOne such as the Processing Duplication Workflow application. Within this application, users can identify primary and duplicate documents, all custodians, and all source files for documents. It also provides capabilities to identify unique, primary and duplicate files based on a relational field.

This worked fairly well for a long time, but as email and archive systems have continued to evolve, the MD5 hash is no longer as effective at identifying duplicates as it once was. I want to lay out some of the business practices that are causing challenges with the MD5 hashing strategy for de-duplication and propose that we start using Message-ID in addition to metadata as a supplement to de-duplication.

Culprit #1: URL Rewriting

We’ve all seen the typical phishing email. It is usually formatted in a non-standard way, requests an invoice or documents we’ve never heard of, or comes from a person we know wouldn’t be sending or requesting this information. Those are the obvious ones. Every year we see them, each slightly different but always evolving to appear authentic. Phishing scams typically disguise hyperlinks so that they appear authentic on the surface but send users to a malicious site. One way to spot this bait and switch is to hover over the hyperlink and see whether it points to the location it should. One example I have seen is an email or text that claims to be from my bank, asks whether I authorized a charge, and includes a link to log in and approve that charge. When I look at the link it wants me to click, however, it has nothing to do with my bank.

A security feature you may not even be aware of, working hard in the background, is URL rewriting. IT security professionals created it as a mechanism to catch these attacks before users can click on the malicious link. When an email is received, the URLs in the body of the email are rewritten. When a user clicks a link, the request passes through a service that analyzes it for security before sending the user on to the original link location. If the link is determined to be bad, the site is blocked. While this is an excellent technology for thwarting malicious attacks, URL rewriting can create different hyperlinks in each copy of an email, so each copy produces a different hash value.
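The effect on hashing can be sketched in a few lines. Everything here is an assumption made for illustration: the gateway address `gateway.example.invalid`, the query parameter names, and the idea that the rewritten link carries the recipient’s address are all hypothetical, and real URL-rewriting services use their own formats. The point is only that once each copy carries a different rewritten link, the copies no longer hash to the same value.

```python
import hashlib
import re
from urllib.parse import quote

# Simplified URL matcher for the sketch; real mail gateways parse HTML and text bodies.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def rewrite_urls(body: str, recipient: str) -> str:
    """Wrap every URL in a link to a hypothetical security gateway.

    The rewritten link embeds the original URL and the recipient,
    so each recipient's copy of the email ends up with different text.
    """
    def wrap(match):
        original = quote(match.group(0), safe="")
        who = quote(recipient, safe="")
        return f"https://gateway.example.invalid/check?u={original}&r={who}"

    return URL_PATTERN.sub(wrap, body)


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


body = "Please review the invoice at https://vendor.example.com/invoice/123 today."
alice_copy = rewrite_urls(body, "alice@example.com")
bob_copy = rewrite_urls(body, "bob@example.com")

# Three copies of the "same" email, three different hash values.
print(md5_hex(body))
print(md5_hex(alice_copy))
print(md5_hex(bob_copy))
```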

Culprit #2: External Email Warnings

Another feature used to protect against phishing emails is the use of external email warnings.

Attackers will sometimes craft an email to look like it is coming from a trusted associate or organization. A warning at the top of the email stating that it was received from an external source lets readers know, before they click any links, that the email did not originate from inside the organization. This external warning message will also make an email unique when hashing.

The MD5 hash takes into account the body and header of an email during processing, so the system is going to treat this email as a unique document even if the only difference between it and the sender’s copy is “External” at the top of the email. This is not to say you shouldn’t implement these policies, but many companies have instituted them wholesale and may not have thought about the discovery implications for review cost and hosting size, never mind the frustration of having to review multiple copies of the same document.

Culprit #3: Emails Rebranded

In addition to phishing defenses, email marketing strategies are another source of email duplicates. A prominent marketing innovation that companies have started using is centralized signature management. This allows a company to customize signatures for marketing campaigns by adding or omitting information, promotions, content, et cetera, from a centralized third-party platform. These signatures are not integrated directly into the user’s email account but are applied when the email is sent.

The sender’s copy doesn’t have the signature, but a recipient’s copy does, even within the same organization. This creates unique copies of an email simply because the signature block differs between the two versions, even though this is substantively the exact same document.

Consider how this might affect your electronic discovery obligations and timelines if you must review multiple copies of the same document. It can quickly become an exponentially growing headache for companies.

What Can We Do About It?

Acorn has been working behind the scenes to support our clients in the never-ending duplication struggle. We have developed custom proprietary tools to analyze email data and identify duplicates within RelativityOne using a combination of the Message-ID and additional metadata fields.

Message-ID is a property added to outgoing emails by the sending mail system. It is retained in the recipient’s copy of the email as well. It isn’t always unique and it isn’t always populated, which is why it shouldn’t be the only method of de-duplication. Used in association with other metadata, however, this property can identify duplicative emails. In one project alone, we were able to reduce the review population by 15 percent. This translated into a savings of roughly $40,000 by reducing the number of documents to review.
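As a rough illustration of the idea, and not a description of Acorn’s proprietary tools, the sketch below groups emails by Message-ID combined with two other header fields. The choice of sender address and sent date as the supporting metadata is an assumption made for this example, as are the function names and sample messages. Emails with no Message-ID get no key and are set aside for another method, reflecting the caveat that the field isn’t always populated.

```python
from collections import defaultdict
from datetime import timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def dedup_key(raw_email: str):
    """Build a grouping key from Message-ID plus supporting metadata.

    Returns None when Message-ID is missing, so the caller can fall
    back to another de-duplication method for that email.
    """
    msg = message_from_string(raw_email)
    message_id = (msg.get("Message-ID") or "").strip().strip("<>")
    if not message_id:
        return None
    sender = parseaddr(msg.get("From", ""))[1].lower()
    try:
        sent = parsedate_to_datetime(msg.get("Date", ""))
        sent = sent.astimezone(timezone.utc).isoformat()
    except (TypeError, ValueError):
        sent = ""  # missing or unparseable Date header
    return (message_id, sender, sent)


def group_by_key(emails):
    """Group (doc_id, raw_email) pairs by key; collect unkeyed doc_ids separately."""
    groups, unkeyed = defaultdict(list), []
    for doc_id, raw in emails:
        key = dedup_key(raw)
        if key is None:
            unkeyed.append(doc_id)
        else:
            groups[key].append(doc_id)
    return dict(groups), unkeyed


# Hypothetical sample: same headers, but the recipient copy gained a banner and signature.
HEADERS = (
    "Message-ID: <abc123@mail.example.com>\n"
    "From: Pat Lee <pat@example.com>\n"
    "Date: Mon, 03 Jun 2024 14:05:00 +0000\n"
    "Subject: Contract draft\n\n"
)
sender_copy = HEADERS + "See attached draft.\n"
recipient_copy = HEADERS + "[EXTERNAL]\nSee attached draft.\n\nPat Lee | Spring promotion\n"
no_id_copy = "From: Pat Lee <pat@example.com>\nSubject: Contract draft\n\nSee attached draft.\n"

groups, unkeyed = group_by_key(
    [("DOC-101", sender_copy), ("DOC-102", recipient_copy), ("DOC-103", no_id_copy)]
)
print(groups)
print(unkeyed)
```

Here DOC-101 and DOC-102 land in the same group despite their different bodies, and DOC-103 is left unkeyed. Combining Message-ID with other fields, rather than relying on it alone, is meant to guard against the cases where the value is not unique.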

I’m not here to argue the efficacy of using the Message-ID and metadata for de-duplication purposes; I will leave that to the attorneys. What I can verify is the increasing number of duplicates in email collections that need to be managed.

About Acorn

Acorn provides high-touch, customized litigation support services that specialize in AI and advanced analytics for litigation applications, while providing rigorous customer service to the e-discovery industry. Want a free consultation on your case? Email info@acornls.com or visit acornls.com.

The banner graphic for this post was created by Sarah Vachlon.

From project management to IT and administration, Tracey Oldenburg wears every hat there is. She brings 30 years of litigation experience in the industry and is both a Relativity Certified Master and a Relativity Certified Administrator. She has designed custom templates, workflows, quality control protocols, and proprietary project management & reporting tools used in over 600 active litigation matters. She blends a pragmatic, business-oriented approach to e-discovery process management with a deep understanding of advanced technology offerings and implementation considerations.