
5. Design

To find large files with duplicate content, we use hashing. Specifically, we use the SHA-256 cryptographic hash function.

Hashing the contents of a file gives us a digest, that is, a string of characters whose value is entirely a function of the contents. If we hash two different files and obtain the same digest, we can be reasonably confident that their contents are identical. (It is theoretically possible to obtain the same digest from two different contents, but the probability is roughly equivalent to the probability that all the air molecules in a two-car garage will happen to wind up in the same half of the room at the same time.)
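
For example, here is a minimal sketch, in Python, of computing such a digest. The helper name file_digest and the block size are our own illustrative choices, not part of the application; reading the file in fixed-size blocks lets us hash arbitrarily large files in constant memory.

    import hashlib

    def file_digest(path, blocksize=1 << 20):
        """Return the SHA-256 digest of the file at 'path' as a
        string of hexadecimal characters."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                h.update(block)
        return h.hexdigest()

Two files with identical contents will always produce the same digest string, regardless of their names or locations.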

Here is the general flow of the application; a runnable sketch of the whole flow appears after the list.

  1. Create a small SQLite database containing one table named “path_hash” with three columns:

    • path, type VARCHAR, containing a file's absolute path name.

    • hash, type VARCHAR, containing the hash digest as a string of hexadecimal characters.

    • size, type INTEGER, containing the file's size in bytes.

    In this table, the path column is the primary key. We also create a secondary index on the combination of the hash and path columns, so that we can efficiently find all the files that share a given hash value.

  2. Examine every file in or under the selected starting directory. For each file that is at least the specified minimum size, hash it using the hashlib.sha256() function, and add a new row to the table with that file's path name, hash, and size.

  3. Go through the table in primary key order. For each absolute path name P, use the secondary index to find the set of path names {P0, P1, ...} whose files share P's hash digest.

    If that set has two or more members, and P is lexically the first in that set, write the separator line followed by one line for each Pi in the set, in lexical order, along with its size. (Testing whether P is lexically first guarantees that each group of duplicates is reported exactly once.)
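
Finally, here is the promised sketch of the whole flow, again in Python. It is illustrative rather than the application's actual code: the names find_dups and MIN_SIZE, the in-memory database, and the separator format are our own assumptions, and step 3 is carried out here with a single grouped SQL query, which yields the same groups as the per-path secondary-index lookup described above.

    import hashlib
    import os
    import sqlite3

    MIN_SIZE = 1_000_000        # assumed minimum file size in bytes

    def file_digest(path, blocksize=1 << 20):
        """Return the SHA-256 digest of the file at 'path' as a
        string of hexadecimal characters.  (Repeated from the
        earlier sketch so this file runs on its own.)"""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                h.update(block)
        return h.hexdigest()

    def find_dups(start_dir, db_path=":memory:"):
        conn = sqlite3.connect(db_path)

        # Step 1: the path_hash table, keyed by path, with a
        # secondary index on (hash, path) for grouping by digest.
        conn.execute("""CREATE TABLE path_hash
                        (path VARCHAR PRIMARY KEY,
                         hash VARCHAR,
                         size INTEGER)""")
        conn.execute("CREATE INDEX hash_path ON path_hash(hash, path)")

        # Step 2: hash every large-enough file under start_dir and
        # record its path, digest, and size.
        for dirpath, dirnames, filenames in os.walk(start_dir):
            for name in filenames:
                path = os.path.abspath(os.path.join(dirpath, name))
                try:
                    size = os.path.getsize(path)
                    if size < MIN_SIZE:
                        continue
                    digest = file_digest(path)
                except OSError:
                    continue    # unreadable file: skip it
                conn.execute("INSERT INTO path_hash VALUES (?, ?, ?)",
                             (path, digest, size))

        # Step 3: report each group of two or more paths sharing a
        # digest, in lexical order within each group.
        cursor = conn.execute(
            """SELECT hash, path, size FROM path_hash
               WHERE hash IN (SELECT hash FROM path_hash
                              GROUP BY hash HAVING COUNT(*) >= 2)
               ORDER BY hash, path""")
        last = None
        for digest, path, size in cursor:
            if digest != last:      # first member of a new group:
                print("-" * 64)     # write the separator line (format assumed)
                last = digest
            print("%12d  %s" % (size, path))
        conn.close()

    if __name__ == "__main__":
        find_dups(".")

One design note on the sketch: keeping the hashes in SQLite rather than in a Python dictionary means the scan's results can be persisted and re-queried later simply by passing a real file name as db_path.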