
Introducing the EDRM Message ID Hash: Simplify Cross-Platform Email Duplicate Identification

Beth Patterson – ESPconnect

Have you ever received a supplementary production of email and needed to know what’s new and what’s been produced before? How about finding duplicate email messages across productions from different vendors or in different forms of production?

Identifying such “cross-platform” duplicates traditionally entailed reprocessing native email from multiple sources—assuming native forms were produced. It’s expensive and time consuming. Often, it just isn’t feasible.

Did you ever think, there must be a better way? We did. And that’s how the EDRM Message ID Hash (EDRM MIH) was born.

The Challenge: Unravelling the Cross-Platform Puzzle

Digitally fingerprinting or “hashing” email messages to identify duplicates requires that precisely the same parts of email messages be processed in precisely the same way to obtain matching “hash values.” However, no electronic discovery tool approaches the task quite like another, so duplicate messages between productions couldn’t be identified when the productions were processed by different tools or produced in different forms.

From opposite sides of the planet, these authors—Craig Ball and Beth Patterson—reasoned there must be a way to achieve cross-platform duplicate identification without obliging service providers to change their workflows. That idea brought us together at ILTA’s 2016 conference (see Craig’s 2016 blog post on the subject!) and served as the impetus to found the EDRM Duplicate Identification Project in March 2021.

What better chance to tackle the task than convening e-discovery thought leaders from around the globe (via Zoom) during a worldwide pandemic lockdown?

Backed by the EDRM, the independent e-discovery thought leadership organization, Beth led the project and corralled a brilliant and diverse Dupe ID Project Team of experts hailing from Australia, Finland, Japan, Israel, the UK, and the US: 30 volunteers and a mix of lawyers, technologists, and business leaders, with representation from law firms, corporates, product vendors, and service providers. Greg Houston, a senior architect in workflow enablement at Relativity, represented Relativity on the project team. Moved by a desire to benefit the entire industry, competitors collaborated to supply a solution. Other product vendors on the team included EDT, Nuix, and Reveal.

Initially, we sought to solve the cross-platform duplicate identification problem by writing a complex specification for processing email messages that required every tool and vendor to switch from “their” way to “our” way. We quickly saw the folly of that approach and turned instead to a simple, novel method that wouldn’t require anyone to replace existing deduplication methodologies—yet would prove astonishingly effective at cross-platform identification of duplicate email messages.

When Ian Folkman, a digital forensic investigator at Ericsson in Japan, joined the team almost a year into the project, he was initially unsure about what we were trying to achieve. But after spending time studying our proposed solution, he said: "I've been hashing and processing data for more than a decade; I thought deduplication was a solved problem. I was skeptical of creating another duplicate identification standard, fearing it would go unused. After joining the committee and learning about their goals, vision, and solution, I became convinced of the value it would add when dealing with data across tools, regions, and databases."

The Solution: The EDRM MIH

After two years of collaboration, which included setting out our objectives, brainstorming ideas, and then rigorously drafting and testing our proposal, the project team released the EDRM Message Identification Hash Specification 1.0 in February 2023, following a 30-day market consultation period.

As noted, electronic discovery service providers and software tools use algorithms to calculate hash values of segments of email messages, comparing those hash values to flag duplicates. Hash deduplication works well, but stumbles when minor variations prompt inconsistent outcomes for messages reviewers regard as being “the same.” Hash deduplication fails altogether when messages are exchanged in forms other than those native to email communications—a common practice in US electronic discovery, where efficient electronic forms are often printed to static page images.

The EDRM Message Identification Hash (MIH) strips away the complexity of message duplicate identification with little or no loss of effectiveness or utility. It does so by taking advantage of an underutilized feature of email communication standards called the “Message ID” and pairing it with the power of hash deduplication. If it sounds simple, it is, and by design. Crucially, it doesn’t require any difficult or expensive departure from the way parties engage in discovery and production of email messages. By hashing a single, consistent snippet of data unique to each email message, the EDRM MIH eliminates the need to conform to rigid (and ultimately error-prone) collection, normalization, and concatenation of message segments.
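To make the idea concrete, here is a minimal Python sketch of hashing a Message ID. It is an illustration, not the Specification: the normalization shown (trimming whitespace and one pair of surrounding angle brackets), the UTF-8 encoding, the MD5 default, and the lowercase hex output are all assumptions, and the function names are hypothetical. Check the EDRM MIH Specification for the exact rules before relying on any values this produces.

```python
import hashlib
from email import message_from_string
from typing import Optional


def normalize_message_id(raw: Optional[str]) -> Optional[str]:
    """Trim whitespace and one pair of surrounding angle brackets (assumed rule)."""
    if raw is None:
        return None
    value = raw.strip()
    if value.startswith("<") and value.endswith(">"):
        value = value[1:-1].strip()
    return value or None


def message_id_hash(raw_message_id: Optional[str], algorithm: str = "md5") -> Optional[str]:
    """Return a lowercase hex digest of the normalized Message ID, or None if absent.

    The default algorithm is an assumption made for this sketch.
    """
    normalized = normalize_message_id(raw_message_id)
    if normalized is None:
        return None  # e.g. a draft that was never sent and has no Message ID
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()


def hash_from_rfc822(raw_email: str) -> Optional[str]:
    """Pull the Message-ID header out of a raw message and hash it."""
    msg = message_from_string(raw_email)
    return message_id_hash(msg["Message-ID"])


# The same message yields the same value whether the ID comes from the native
# header or from a metadata field that dropped the angle brackets.
sample = "Message-ID: <1234@example.com>\nSubject: Hi\n\nBody"
print(hash_from_rfc822(sample) == message_id_hash("1234@example.com"))
```

Because only the Message ID is hashed, none of the usual decisions about which headers, body text, or attachments to include in the hash come into play.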

When parties produce emails with metadata in accompanying load files, they customarily include fields like Bates numbers/Document IDs, message dates, sender, recipients, and the like. Ideally, the composition of load files is specified in a well-crafted request for production or production protocol. For service providers, producing one more field of metadata is a trivial change, rarely requiring more effort than simply ticking a box.

Gaining the benefit of the EDRM Email Duplicate Identification Specification is as simple as requesting that load files contain an EDRM MIH for each email message produced. The EDRM Email Duplicate Identification Specification is an open specification, so no fees or permissions are required to use it, and leading e-discovery service and software providers like Relativity support the new specification. For others, it’s simple to generate the MIH without redesigning software or impeding workflows. In fact, the EDRM has made free tools available to support the specification.

Any party with the MIH of an email message can readily determine if a copy of the message exists in their collection. Armed with MIH values for emails, parties can flag duplicates even when those duplicates take different forms, enabling native message formats to be compared to productions supplied as TIFF or PDF images.

By requesting the EDRM MIH, parties receiving rolling or supplemental productions will know if they’ve received a message before, allowing reviewers to dedicate resources only to new and unique evidence. Email messages produced by different parties in different forms using different service providers can be compared to instantly surface or suppress duplicates. Cross-platform email duplicate identification means that email productions can be compared across matters, too. Parties receiving production can easily tell if the same message was or was not produced in other cases. Cross-platform support also permits a cross-border ability to assess whether a message is a duplicate without the need to share personally identifiable information restricted from dissemination by privacy laws.
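As a rough illustration of the supplemental-production scenario, the Python sketch below splits a new load file into messages already received and messages that are new. The comma-separated layout, the field names `DocID` and `EDRM_MIH`, and the placeholder hash values are all hypothetical; real load files vary in delimiter and field naming.

```python
import csv
import io
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]


def split_new_and_seen(
    rows: Iterable[Row],
    known_hashes: Set[str],
    mih_field: str = "EDRM_MIH",  # hypothetical load file field name
) -> Tuple[List[Row], List[Row]]:
    """Partition load file rows by whether their MIH was already received."""
    known = {h.strip().lower() for h in known_hashes}
    new_rows: List[Row] = []
    seen_rows: List[Row] = []
    for row in rows:
        mih = (row.get(mih_field) or "").strip().lower()
        if mih and mih in known:
            seen_rows.append(row)
        else:
            # Rows with no MIH (e.g. drafts) cannot be matched, so keep them for review.
            new_rows.append(row)
    return new_rows, seen_rows


# Placeholder values standing in for real MIH values.
HASH_A, HASH_B = "a" * 32, "b" * 32
prior_production = {HASH_A}
supplemental_load_file = io.StringIO(
    "DocID,EDRM_MIH\n"
    f"SUPP-0001,{HASH_A}\n"
    f"SUPP-0002,{HASH_B}\n"
    "SUPP-0003,\n"
)
new_rows, seen_rows = split_new_and_seen(
    csv.DictReader(supplemental_load_file), prior_production
)
print([r["DocID"] for r in new_rows])   # documents still needing review
print([r["DocID"] for r in seen_rows])  # documents already received
```

Note that the comparison needs only the hash values, which is what makes the cross-matter and cross-border checks described above possible without exchanging message content.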

There are isolated instances where the EDRM MIH isn’t effective (such as for deduplication of draft messages lacking Message IDs); the Specification and Guidelines detail these considerations, so make sure you review those before getting started with this protocol.

How to Get Started Using the EDRM MIH in Relativity

All litigants need do to begin reaping the benefits of cross-platform message duplicate identification is amend their Requests for Production to include the EDRM Message Identification Hash (MIH) among the metadata values routinely produced in load files. Because the specification is prominently published by the leading standards organization in e-discovery, the producing party’s service provider or litigation support staff likely know what’s required. But if not, refer them to the EDRM Email Duplicate Identification Specification & Guidelines published here.

Law in Order, a RelativityOne Gold Partner, has developed a new EDRM MIH Relativity application, which complies with the EDRM MIH Specification and generates EDRM MIH values. Visit the Relativity Community site to find out how to use the app as well as how to use the EDRM MIH in Relativity workflows.

Murali Baddula, chief digital officer at Law in Order, developed the Relativity application. Additionally, as a dedicated member of the Dupe ID Project Team, Murali was instrumental in the development and testing of the EDRM MIH solution. Murali offers valuable guidance for users of the protocol, suggesting: "A crucial initial action to consider is incorporating the EDRM MIH as a metadata field within your load files when submitting Requests for Production. Additionally, I advise my clients to make EDRM MIH a mandatory component in their Exchange Protocols when engaging with external parties or regulatory bodies. Embracing these two minor adjustments today can yield substantial advantages in the long run."

Also, to learn more about the EDRM MIH generally, see the EDRM website for comprehensive resources in our EDRM Email Duplicate Identification Toolkit. Developed by the Dupe ID Project Team for a range of stakeholders including parties, vendors, regulators, and courts, the toolkit comprises three parts:

1. How can you generate EDRM MIH values?

This section features logos and links to each of the products that have implemented the EDRM MIH, and will include a link to the Relativity/Law in Order app and resources. Also included in this section is the Small Dataset MIH Calculator, which is an Excel-based tool to generate EDRM MIH values for small sets of Message IDs.

2. What is the EDRM MIH and how do you use it?

  • Cross Platform Email Duplicate Identification, a document consisting of:
    • Project Overview
    • List of Contributors
    • EDRM Message Identification Hash (EDRM MIH) Specification (v1.0), which is a succinct, technical specification with advisory notes written for the target audience of vendors, like Relativity, who are implementing the EDRM MIH in their platform.
    • EDRM Email Duplicate Identification Guidelines (v1.0), which is the most important document to read to understand why and how to use the EDRM MIH before getting started! It is a non-technical reference that outlines the objectives, methodology, potential use cases, advantages, and usage considerations.
  • Introducing the EDRM Email Duplicate Identification Specification and Message ID Hash (EDRM MIH) White Paper by Craig Ball is a non-technical introduction to the EDRM MIH and is a must-read for any lawyer who wants to quickly understand the benefits of this protocol.
  • EDRM Email Duplicate Identification Infographic is a simple one-page explanation and a great visual tool for users who want a quick overview of the solution, including benefits and use cases.

3. If you are implementing the EDRM MIH in a product, how can you test and verify it?

  • Final EDRM MIH Example Data is a test data set for testing and verifying EDRM MIH implementations.
  • Final EDRM MIH Sample Email Index.xlsx is an Excel spreadsheet to be used for verification of Message ID extraction and MIH calculations for all emails in the example data set.
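For implementers, a verification pass against the sample index might look something like the Python sketch below. The field names `MessageID` and `ExpectedMIH` are hypothetical, not the spreadsheet's actual column headings, and the expected values here are generated by a stand-in function purely so the example runs. In practice they would be read from the published index.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]


def find_mismatches(
    index_rows: Iterable[Row],
    compute_mih: Callable[[str], Optional[str]],
    id_field: str = "MessageID",          # hypothetical field name
    expected_field: str = "ExpectedMIH",  # hypothetical field name
) -> List[Row]:
    """Return rows where the implementation under test disagrees with the index."""
    mismatches: List[Row] = []
    for row in index_rows:
        expected = row[expected_field].strip().lower()
        actual = (compute_mih(row[id_field]) or "").lower()
        if actual != expected:
            mismatches.append({**row, "ActualMIH": actual})
    return mismatches


def stand_in_expected(message_id: str) -> str:
    """Stand-in for published expected values (assumed rule: strip brackets, MD5)."""
    return hashlib.md5(message_id.strip().strip("<>").encode("utf-8")).hexdigest()


def buggy_candidate(message_id: str) -> str:
    """An implementation under test that forgets to remove the angle brackets."""
    return hashlib.md5(message_id.encode("utf-8")).hexdigest()


index_rows = [
    {"MessageID": "<x@example.com>", "ExpectedMIH": stand_in_expected("<x@example.com>")},
    {"MessageID": "y@example.com", "ExpectedMIH": stand_in_expected("y@example.com")},
]
failures = find_mismatches(index_rows, buggy_candidate)
print([row["MessageID"] for row in failures])  # only the bracketed ID is flagged
```

An empty result across every email in the example data set would suggest that both Message ID extraction and the hash calculation agree with the published values.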

Feedback Welcome! 

Relativity has been a valued supporter throughout this two-year journey. Simple yet powerful, the EDRM MIH solution promises to save time, money, and effort while enhancing the consistency and accuracy of email duplicate identification.

The EDRM MIH can be used for various purposes, including grouping email duplicates to enable a more efficient review process, or deduplicating a cross-platform email data set. We are excited that Relativity users will now have a way to utilize this new method of cross-platform email duplicate identification within the Relativity ecosystem.

The EDRM welcomes any feedback you may have on this new protocol or on any of the resources provided. We are interested in further ideas you may have and expect the use of the EDRM MIH to evolve over time. You can post any feedback or questions here.

Graphics for this article were created by Kael Rose.


Beth Patterson is director of ESPconnect, a leading legal technology and innovation consulting business, and is an adjunct professor in the University of Technology Sydney Faculty of Law. She is a recognised legal technology leader with extensive consulting experience developing legal tech strategies, underpinned by her unique insight into building multidisciplinary teams to create a collaborative culture between lawyers and technologists. She assists clients in addressing the challenges of digital disruption, with a focus on strategy, product selection & implementation, education, and developing multidisciplinary teams. She also consults widely across the legal ecosystem on the strategic implications of artificial intelligence, especially in this new era of generative AI.