13. addFile(): Process one file

deduper
# - - -   a d d F i l e

def addFile(fileData, dirName, name):
    '''Add one file to the database if it qualifies.

      [ (fileData is a FileData instance) and
        (dirName is the absolute path of a directory) and
        (name is a name in directory (dirName)) ->
          let
            filePath == os.path.join(dirName, name)
          in ->
            if filePath refers to a readable non-link file of size
            fileData.minSize or greater ->
              fileData  +:=  a new FilePath with path (filePath)
                  and hash (sha256 hex digest of that file)
            else if filePath is unreadable ->
                sys.stderr  +:=  error message, (name) is unreadable
            else -> I ]
    '''

The first order of business is to assemble the absolute path name of the file to be considered.

deduper
    #-- 1
    # [ filePath  :=  dirName + "/" + name ]
    filePath = os.path.join(dirName, name)

Next we try to retrieve the file's information. This may fail, in which case we'll write an error message and return.

If the file is a soft link, the os.lstat() function returns the status tuple for the link itself, not for the file it references. In practice, this means the script ignores soft links: a link's mode word does not describe a regular file, so the test in step 3 rejects it. That is acceptable because links take very little disk space.

The constant stat.ST_MODE specifies the index within the status tuple of the mode word, and stat.ST_SIZE is the index of the file size.

deduper
    #-- 2
    # [ if filePath can be statted ->
    #     mode  :=  the file's mode word
    #     fileSize  :=  the file's size
    #   else ->
    #     sys.stderr  +:=  error message, (filePath) is unreadable
    #     return ]
    try:
        statusTuple = os.lstat(filePath)
        mode = statusTuple[stat.ST_MODE]
        fileSize = statusTuple[stat.ST_SIZE]
    except Exception as x:
        message("*** Can't stat {0}: {1}".format(filePath, x))
        return
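As a rough illustration of the two points above (that os.lstat() describes the link rather than its target, and that stat.ST_MODE and stat.ST_SIZE are indices into the status tuple), here is a minimal Python 3 sketch. The helper name describe() and the temporary files are hypothetical and not part of the deduper script; the sketch also assumes a platform where os.symlink() is permitted.

```python
import os
import stat
import tempfile


def describe(path):
    """Return (is_link, is_regular, size) using os.lstat(), which does not follow links.

    Hypothetical helper for illustration only; not part of the deduper script.
    """
    status = os.lstat(path)
    mode = status[stat.ST_MODE]   # same value as status.st_mode
    size = status[stat.ST_SIZE]   # same value as status.st_size
    return stat.S_ISLNK(mode), stat.S_ISREG(mode), size


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "target.bin")
    with open(target, "wb") as f:
        f.write(b"x" * 1000)

    link = os.path.join(tmp, "link")
    os.symlink(target, link)      # assumes symlinks are allowed on this platform

    target_info = describe(target)
    link_info = describe(link)
```

Here target_info reports a regular 1000-byte file, while link_info reports a link that is not a regular file, which is why a later S_ISREG() test filters links out.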

The stat.S_ISREG() function tests the mode word to see if it represents a regular file. We ignore directories, devices, and other oddments.

We record the minimum file size in the FileData instance because os.path.walk() passes only one caller-supplied argument to the visitor function, and the logic at this level needs both the database and the minimum size.

deduper
    #-- 3
    # [ if (mode does not describe a regular file) or
    #   (fileSize < fileData.minSize) ->
    #     return
    #   else -> I ]
    if ((not stat.S_ISREG(mode)) or
        (fileSize < fileData.minSize)):
        return
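Note that os.path.walk() exists only in Python 2; Python 3 replaced it with the os.walk() generator, which needs no visitor function, so the minimum size can be passed as an ordinary parameter. The following is a minimal sketch of the same filtering logic under that assumption. The function name find_candidates() and its min_size parameter are hypothetical, and unlike the original, the sketch silently skips files that cannot be statted.

```python
import os
import stat
import tempfile


def find_candidates(root, min_size):
    """Yield (path, size) for regular, non-link files of at least min_size bytes.

    Hypothetical Python 3 sketch; the original script uses os.path.walk().
    """
    for dir_name, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dir_name, name)
            try:
                status = os.lstat(path)
            except OSError:
                continue  # the original writes an error message here
            if stat.S_ISREG(status.st_mode) and status.st_size >= min_size:
                yield path, status.st_size


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    small = os.path.join(root, "small.txt")
    big = os.path.join(root, "sub", "big.bin")
    with open(small, "wb") as f:
        f.write(b"x" * 10)
    with open(big, "wb") as f:
        f.write(b"x" * 2000)

    found = list(find_candidates(root, min_size=1000))
    expected = [(big, 2000)]
```

Only the 2000-byte file passes the filter; the 10-byte file and the directories are ignored.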

At this point we know that the file is a regular file and sufficiently large to be of interest, although reading it may still fail. For the logic that computes the hash digest, see Section 14, “hashFile(): Compute a file's hash digest”. If that function fails for any reason, it returns None; otherwise it returns the hash digest as a string.

deduper
    #-- 4
    # [ if filePath is readable ->
    #     hash  :=  sha256 hex digest of that file
    #   else ->
    #     return ]
    hash = hashFile(filePath)
    if hash is None:
        return
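The real hashFile() is defined in Section 14. To make the contract used here concrete (a hex digest string on success, None on failure), the following is a hypothetical stand-in written for Python 3. The name hash_file_sketch() and the chunk_size parameter are assumptions for illustration and may differ from the actual implementation.

```python
import hashlib
import os
import tempfile


def hash_file_sketch(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path, or None if it cannot be read.

    Hypothetical stand-in for hashFile(); see Section 14 for the real function.
    """
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as f:
            # Read in fixed-size chunks so large files need not fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    except OSError:
        return None
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"abc")

    good = hash_file_sketch(sample)
    missing = hash_file_sketch(os.path.join(tmp, "no-such-file"))
```

A caller can then treat None exactly as addFile() does above: skip the file and return.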

Now we have everything we need to add a row to the table: the path, the hash, and the size. See Section 17, “class FileData: The database” for the .add() method.

deduper
    #-- 5
    # [ fileData  +:=  a new row for path=filePath, hash=hash,
    #                  and size=fileSize ]
    fileData.add(filePath, hash, fileSize)