- Hashes and how they apply to E-Discovery
I am going to provide a simple explanation of what hash values are and how they are
used for de-duplication in discovery of electronic documents. Hash values are also
used in computer forensics for verification purposes, but we will speak to that at
another time.
A hash function is a mathematical algorithm that, when run against an electronic file,
generates a short, fixed-length alphanumeric sequence (the hash value) calculated from
the file's contents. The algorithm is designed so that if even a single character or
bit of information is removed, added or changed within the file, it will produce a
completely different hash value. MD5 produces a 128-bit hash value, so the odds of two
different electronic files having the same MD5 hash value by chance are a staggering
1 in 340,282,366,920,938,463,463,374,607,431,768,211,456 (2 to the power of 128).
Other algorithms, such as SHA-1, produce longer hash values (160 bits), which makes a
chance match even less likely. An accidental match is far less likely than a
coincidental DNA or fingerprint match, and both of those are readily accepted by the courts.
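As a minimal sketch of the behaviour described above, the following Python snippet uses the standard library's hashlib module to hash two pieces of text that differ by a single character. The sample text and the helper name hash_bytes are illustrative assumptions, not part of any particular e-discovery tool.

```python
import hashlib

def hash_bytes(data: bytes, algorithm: str = "md5") -> str:
    """Return the hexadecimal hash value of data using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()

# Illustrative sample content; the second value differs by one character.
original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy cog"

print(hash_bytes(original))          # 32 hex characters (128 bits) for MD5
print(hash_bytes(altered))           # a different value from the line above
print(hash_bytes(original, "sha1"))  # 40 hex characters (160 bits) for SHA-1
print(2 ** 128)                      # number of possible MD5 hash values
```

Comparing the first two printed values shows how a one-character change alters the hash value.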
So how are hash values used in e-discovery?
We'll break e-discovery up into two sections, the first dealing with electronic
documents and the second with email repositories. As a result of email, electronic
documents have become extremely easy to distribute, and in large volumes.
When collecting electronic documents from custodians and backups for discovery purposes,
it is very likely that you will get many duplicates. Comparing the hash values
of these documents provides a quick method of de-duplicating these files.
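The file comparison described above might be sketched as follows. This is an illustration under stated assumptions rather than the method of any specific product: the function names (file_hash, group_by_hash, deduplicate), the choice of SHA-1, and the rule of keeping the first file seen in each group are all hypothetical choices made for the example.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path, algorithm: str = "sha1", chunk_size: int = 65536) -> str:
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_by_hash(paths):
    """Map each hash value to the list of files whose contents produce it."""
    groups = {}
    for path in paths:
        groups.setdefault(file_hash(path), []).append(path)
    return groups

def deduplicate(paths):
    """Return (unique, duplicates); the first file in each group is kept (an assumption)."""
    unique, duplicates = [], []
    for matches in group_by_hash(paths).values():
        unique.append(matches[0])
        duplicates.extend(matches[1:])
    return unique, duplicates
```

Because only the file contents are hashed, two copies of a document with different file names still fall into the same group.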
What about emails? Well, emails are a completely different animal. If you tried to
de-duplicate complete emails based on their hash values, you would have little luck.
Here is a quick explanation. If an email is sent to multiple parties, the copies
received should all be duplicates of each other, but their hash values would not match.
The reason is that each recipient's receiving email server adds new information to the
email, including the time received, which would most likely be different for each
recipient. Accordingly, when the complete emails are hashed, a different hash value
will be returned for each copy.
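To make this concrete, here is a small sketch in which the same message is delivered to two recipients. The addresses, server names and timestamps are invented for illustration, and the single Received header stands in for whatever information a real receiving server would add.

```python
import hashlib

# The message as sent (hypothetical content).
message = (
    "From: alice@example.com\r\n"
    "To: bob@bob.example, carol@carol.example\r\n"
    "Subject: Quarterly report\r\n"
    "\r\n"
    "Please see the attached figures.\r\n"
)

# Each receiving server prepends its own header with its own receipt time.
bob_copy = (
    "Received: from mail.example.com by mx.bob.example; "
    "Mon, 1 Jan 2024 09:00:02 -0500\r\n" + message
)
carol_copy = (
    "Received: from mail.example.com by mx.carol.example; "
    "Mon, 1 Jan 2024 09:00:07 -0500\r\n" + message
)

bob_hash = hashlib.md5(bob_copy.encode("utf-8")).hexdigest()
carol_hash = hashlib.md5(carol_copy.encode("utf-8")).hexdigest()

print(bob_hash == carol_hash)  # False: the whole-message hash values differ
```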
So how can you use hash values with emails to assist in email de-duplication? Most
email messages contain a unique identifier called the "Message ID", which is a Globally
Unique Identifier (GUID). Although comparing these values is an accepted method of
de-duplicating emails, it is not mandatory that every email have one, and not all
email servers assign Message IDs to their emails. As a result, you can end up with a
large number of emails to de-duplicate that have no Message ID. This is where using
hash values becomes useful.
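The Message ID comparison could look something like the sketch below, which uses Python's standard email package. The function name split_by_message_id and the policy of keeping the first message seen for each identifier are assumptions made for the example; messages without a Message ID are set aside for the hash-based approach.

```python
from email import message_from_string

def split_by_message_id(raw_messages):
    """Keep the first message seen per Message-ID; set aside those without one."""
    first_seen = {}
    without_id = []
    for raw in raw_messages:
        message = message_from_string(raw)
        message_id = message.get("Message-ID")  # header lookup is case-insensitive
        if message_id is None:
            without_id.append(raw)
        else:
            first_seen.setdefault(message_id.strip(), raw)
    return list(first_seen.values()), without_id
```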
A hash algorithm can be run against any combination of electronic values. In the case
of de-duplicating emails, you can generate a hash string from certain portions of the
email or a combination of email parts. The most common would include the Author, the
Recipients, the Subject Line, the Content, and the Date and Time Sent. Creating a hash
string of these combined values provides a far more accurate method of de-duplicating
emails than hashing the complete message, because the Sender's copy and the Recipients'
copies would all have matching values. Similar techniques are used for identifying
near duplicates.
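A hash built from the fields listed above might be sketched as follows. The function name email_dedup_hash, the choice of SHA-1, the field separator and the normalization rules (lower-casing addresses, sorting recipients, collapsing whitespace in the content) are all assumptions for illustration; real tools make their own choices here. The sketch also assumes simple single-part text messages.

```python
import hashlib
from email import message_from_string
from email.utils import getaddresses

def email_dedup_hash(raw: str) -> str:
    """Hash the author, recipients, subject, sent date and content of one email."""
    message = message_from_string(raw)
    if message.is_multipart():
        raise ValueError("this sketch only handles single-part messages")

    author = [addr.lower() for _, addr in getaddresses(message.get_all("From", []))]
    recipient_headers = message.get_all("To", []) + message.get_all("Cc", [])
    recipients = sorted(addr.lower() for _, addr in getaddresses(recipient_headers))
    subject = (message.get("Subject") or "").strip()
    sent = (message.get("Date") or "").strip()
    content = " ".join(message.get_payload().split())  # collapse whitespace

    # Join with a separator that is unlikely to occur inside the fields themselves.
    combined = "\x1f".join(
        [",".join(author), ",".join(recipients), subject, sent, content]
    )
    return hashlib.sha1(combined.encode("utf-8")).hexdigest()
```

Headers added in transit, such as the receiving server's timestamp, are left out of the combined value, so copies of the same email held by different recipients produce the same hash string.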
I trust this assists in understanding how hash values are used in e-discovery.
If you would like any more information, please do not hesitate to contact us.
Girts Jansons
Litigation Support Technical Specialist
JLS inc.
girts@jls.ca
Providing Discover-E Services since 1995
Cell: 705-715-6808
Phn: 800-979-9139