Forensic Hashing in Criminal and Civil Discovery
After reading an earlier IP/Decode post about hashing, my friend Jenny Rossman reached out to explain how law enforcement was using hash values to fight the spread of child pornography. For over a decade, Jenny had been a sex crimes prosecutor in Florida. She, alongside law enforcement, had been using the technique to identify suspects and secure convictions. It is a brilliant use of hashing that is also worth considering in civil cases, particularly trade secret litigation.
Using Forensic Hashing to Fight Child Pornography
As I wrote in the earlier post, hashing can convert files to shorter strings of numbers and letters (the "hash value"). To demonstrate this, below is a set of five files that contain different content. I computed their unique hash values using the MD5 algorithm:
Filename | MD5 Hash Value
File1 | 585960c5cf6ed77c10d37e8dfa66629f
File2 | 994d6db8e10d41ac5cc49f15281a5bef
File3 | fec2a0796d37905dec5b9ef0b24045bf
File4 | a3d95a3899c1050c146cd05c054cebf8
File5 | 748f65d8e5d27d17dd2f142a7b712392
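As a rough illustration of how values like these could be produced, here is a minimal Python sketch using the standard library's hashlib module. The function names md5_of_file and md5_of_bytes and the chunk size are illustrative choices, not part of the original demonstration, and the five files behind the table are not reproduced here.

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hash of a file's contents as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def md5_of_bytes(data):
    """Return the MD5 hash of an in-memory byte string."""
    return hashlib.md5(data).hexdigest()
```

The hash depends only on the bytes inside the file, so renaming a file leaves its hash value unchanged.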
Law enforcement, along with private entities, has been using these unique hash values like fingerprints to identify illicit digital materials. In practice, if law enforcement knows that File5 is child pornography from a previous investigation, then File5’s hash value can be used to identify other files with that same hash value. If there is a match, then there may be a crime. (U.S. v. Miller, 982 F.3d 412 (6th Cir. 2020), is a good read for those interested in how this practice implicates the Fourth Amendment.)
As I wrote in the previous post, the solution to speeding up nearly any search problem is hashing, and it provides the solution in this context as well. To find File5 in a suspect's computer, one would only need to run all files on the computer through an MD5 hash. After those hash values are generated, you search for File5's unique string: 748f65d8e5d27d17dd2f142a7b712392. Below are hash values for another set of randomized files that include the illicit File5:
Filename | MD5 Hash Value
File6 | 01cadc70bb61741a28915dd336f878d0
File7 | 748f65d8e5d27d17dd2f142a7b712392
File8 | 8259db3e9b95531adae71e740ff362b0
File9 | d76c67896451dc0d920dc39ed8c802fb
File10 | cdf2d0112d601302ede03f6eafea0ad4
File7's MD5 hash value is the same as File5's, so we have a match. Due to the math behind the MD5 hash algorithm, the odds of File7's content differing from File5's, but still resulting in the same hash value, are almost impossibly small: "In the real world the number of files required for there to be a 50% probability for an MD5 collision to exist is still 2^64 or 1.8x10^19 [that is 18,000,000,000,000,000,000 computer files]. The chance of an MD5 hash collision to exist in a computer case with 10 million files is still astronomically low."
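To get a feel for the numbers in that quote, the standard "birthday" approximation can be worked through in a few lines of Python. This is a back-of-the-envelope sketch that assumes MD5 outputs behave like uniformly random 128-bit values. It models accidental collisions only, and the function name is my own.

```python
import math

HASH_BITS = 128  # MD5 produces a 128-bit value

def collision_probability(num_files, bits=HASH_BITS):
    """Approximate chance that any two of num_files random inputs share a hash."""
    pairs = num_files * (num_files - 1) / 2
    # Birthday approximation: p ~= 1 - exp(-pairs / 2**bits).
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** bits)

print(collision_probability(10_000_000))  # on the order of 1e-25
print(collision_probability(2 ** 64))     # roughly 0.39
```

Under this approximation the even-odds point falls slightly above 2^64 files, the same order of magnitude as the quoted figure.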
Using hash values to find illicit material struck me as smart for a number of reasons. First, it is computationally fast, and with the number of digital files rapidly expanding, fast matters. Second, it is a minimally invasive search. The example above did not probe the contents of the searched laptop's files. The reviewer only converted each file to a content-free hash value; they never opened the files to view what was inside. And because hashing is a one-way street, the reviewer cannot work backwards from the hash value to the original files' content. This is an elegant solution: the privacy of the user is maintained to a large degree and, when one is searching for disturbing content, avoiding having to look at it benefits the reviewer as well.
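The search described above can be sketched in Python as follows. This is a simplified illustration rather than a description of any actual forensic tool. The function names and the directory path in the usage note are hypothetical. The code reads each file's bytes only to compute the hash and reports nothing but the paths of matching files.

```python
import hashlib
import os

def file_md5(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_matches(root_dir, known_hashes):
    """Return paths under root_dir whose MD5 hash appears in known_hashes."""
    known = {h.lower() for h in known_hashes}
    matches = []
    for folder, _subdirs, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                if file_md5(path) in known:
                    matches.append(path)
            except OSError:
                # Unreadable files are simply skipped in this sketch.
                continue
    return matches
```

A call such as find_matches("/path/to/evidence", {"748f65d8e5d27d17dd2f142a7b712392"}) would return the path of any file whose hash equals File5's. Looking up a hash in a set takes roughly constant time, so checking against millions of known hashes adds little cost per file.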
The solution is, however, not perfect. This is because hashing is sensitive: Flip one bit among millions and the result will be an image that is nearly identical to the original, but has a dramatically different hash value. Such a file would avoid law enforcement's detection.
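This sensitivity is easy to demonstrate. The Python sketch below uses about a megabyte of made-up sample bytes as a stand-in for an image file (an assumption for illustration only), flips a single bit, and compares the two MD5 values.

```python
import hashlib

def flip_bit(data, bit_index):
    """Return a copy of data with a single bit inverted."""
    changed = bytearray(data)
    changed[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(changed)

original = bytes(range(256)) * 4096   # about one megabyte of sample data
altered = flip_bit(original, 12345)   # identical except for one bit

hash_a = hashlib.md5(original).hexdigest()
hash_b = hashlib.md5(altered).hexdigest()
differing = sum(a != b for a, b in zip(hash_a, hash_b))

print(hash_a)
print(hash_b)
print(f"{differing} of 32 hex characters differ")
```

The two inputs differ in one bit out of more than eight million, yet the two hash values bear no visible resemblance to each other.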
To address this issue, Microsoft has built a more sophisticated solution: PhotoDNA. PhotoDNA performs a type of hashing, but does so at the image level rather than the file level. This means that while flipping a bit may give the file a new hash value, it will not alter the PhotoDNA value. Technologies such as PhotoDNA are thus keeping one step ahead of criminals.
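PhotoDNA's actual algorithm is not described in this post, and the sketch below is not it. To illustrate the general idea of hashing what an image looks like rather than its exact bytes, here is a much simpler stand-in technique, often called an "average hash", applied to a hypothetical 8x8 grayscale image written out as a grid of brightness values.

```python
import hashlib

def average_hash(pixels):
    """Simple perceptual hash: one bit per pixel, set when brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)

def file_hash(pixels):
    """MD5 of the raw pixel bytes, standing in for a file-level hash."""
    return hashlib.md5(bytes(v for row in pixels for v in row)).hexdigest()

# Hypothetical 8x8 grayscale image with brightness values 0, 4, 8, ... 252.
image = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]

# The same image with the lowest bit of one pixel flipped.
tweaked = [row[:] for row in image]
tweaked[0][0] ^= 1

print(file_hash(image) == file_hash(tweaked))        # False
print(average_hash(image) == average_hash(tweaked))  # True
```

Real perceptual hashes typically shrink and normalize the image first and compare values by similarity rather than exact equality, so they can tolerate resizing or recompression as well as tiny edits.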
Using Forensic Hashing in Civil Cases
In the criminal context, hashing solved two problems at once: how to find a file, and how to do so without viewing its contents. These problems arise in civil litigation as well, and hashing can provide a valuable solution there too.
For example, consider a common trade secret misappropriation fact pattern: former employee Rebecca left Company with a valuable Excel customer list ("List.xlsx"), then brought it to Competitor. List.xlsx will have a hash value (e.g., 7b98d3485b4f17206bc09aa2fe8d2c31) that will be useful during the investigation and litigation stages. During investigation, Company can use the hash value to probe its systems to see where the file was stored and when it was accessed. This could also help show that the file was kept in spaces that used reasonable security measures (a requirement of trade secret protection). If litigation follows, then Company's discovery requests can be more targeted and less invasive, because List.xlsx can be identified by both its name and its hash value.
To shortcut discovery and determine quickly if Rebecca did indeed steal the file, Company could propose early targeted discovery that requests only the hash values of all files to which Rebecca has had access (e.g., files on her laptop or in shared spaces she could reach). That targeted discovery would return only a list of hash values to which 7b98d3485b4f17206bc09aa2fe8d2c31 could be compared for a match; it would not disclose the content of Rebecca's or Competitor's files. If there is a match, then Company has a case; if there is not a match, then either Rebecca did not take the file or she has since modified it. To catch modified copies of the List.xlsx file, more sophisticated hashing algorithms could be used.
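The comparison step in that proposal is simple enough to sketch in Python. The manifest format here (one hash value per line, with no filenames) and the function name are assumptions made for illustration. An actual production format would be whatever the parties or the court agree on.

```python
TARGET_HASH = "7b98d3485b4f17206bc09aa2fe8d2c31"  # example value for List.xlsx from the text

def find_in_manifest(manifest_lines, target_hash):
    """Return the 1-based line numbers of manifest entries equal to target_hash."""
    target = target_hash.strip().lower()
    return [
        number
        for number, line in enumerate(manifest_lines, start=1)
        if line.strip().lower() == target
    ]

# Hypothetical manifest: one hash per line, with no filenames or content.
produced = [
    "01cadc70bb61741a28915dd336f878d0",
    "7B98D3485B4F17206BC09AA2FE8D2C31",
    "cdf2d0112d601302ede03f6eafea0ad4",
]
print(find_in_manifest(produced, TARGET_HASH))  # [2]
```

Normalizing case and whitespace matters in practice, because different tools print hash values in different styles. An empty result means only that no exact copy appears in the manifest.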
Civil litigators looking to "strike gold" by finding a misappropriated file should consider hashing as a valuable forensics tool that provides powerful searching without disclosing files' content.