7. Source code for bigfiles.py

The script starts out by making up a list of directory trees to be visited. If there are any command line arguments, these arguments make up the list of directories. If there are no arguments, we default to a list containing one entry, ".".

The following steps are done for each directory in the list:

  1. Look at every file in and under that directory. Use PathInfo to take a snapshot of that file, and build a list of these PathInfo instances. This process of “walking the directory tree” is easy because the os.path module has a function called walk() that handles the process of visiting every directory in the tree. For documentation on this function, see the Python Library Reference.

  2. Sort this list in descending order by the size attribute.

    Python makes it easy to sort a list: all list objects have a .sort() method that sorts the list in place. But how do we get a list of PathInfo objects to sort in descending order by size? The .sort() method can take as an optional argument a function that compares two objects. However, a more Pythonic way to do it is to define a new class called BigInfo that inherits from the PathInfo class. In that class we define a special method named .__cmp__() that tells Python how to order two objects of that class.

  3. Print a heading for this section of the report, then go through the sorted list and print one line for each PathInfo instance in that list.

One of the goals of object-oriented programming is to minimize, if not eliminate, the use of global variables. An earlier version of this program used a global variable to hold the list of PathInfo objects. A better way is to define a class that holds this list. The methods in the class have access to the list, but no code outside the class needs such access. This is in accord with the generally accepted software design principle of “information hiding”: we will feed the class constructor the name of a directory, and it will return an object that has everything we need to generate the final report.

We'll call this class BigReport, because it represents a report on big files. With this design, the overall program flow for each directory becomes:

  1. Instantiate a BigReport object. Pass the name of the directory to its constructor.

  2. Call the BigReport object's .genFiles() method, a generator that produces the lines of the report. (Generators were added to Python in version 2.2. See the Python language reference section on generators.)

7.1. bigfiles.py: Code prologue

The actual code for the bigfiles.py script starts with the Linux “pound bang line” that makes the script self-executing.

bigfiles.py
#!/usr/bin/env python
#================================================================
# bigfiles.py:  Script to show files in descending order by size.
#   For documentation in "literate programming" style, see:
#     http://www.nmt.edu/help/lang/python/examples/pathinfo/
#----------------------------------------------------------------

SCRIPT_NAME       =  "bigfiles.py"
EXTERNAL_VERSION  =  "1.1"

Next we need to import a few Python modules: sys for command line arguments and standard streams; os for numerous file- and directory-related functions; and of course pathinfo.py.

bigfiles.py
#================================================================
# Imports
#----------------------------------------------------------------
import sys, os
import pathinfo

7.2. bigfiles.py: The main program

The first step is to put together a list of the desired directories. If none are given on the command line, we create a list containing just "." for the current directory.

bigfiles.py
# - - - - -   m a i n   - - - - -

def main():
    """Main program."""

    print "=== %s %s ===" % (SCRIPT_NAME, EXTERNAL_VERSION)

    #-- 1 --
    # [ if sys.argv[1:] is empty ->
    #     dirList  :=  [ "." ]
    #   else ->
    #     dirList  :=  sys.argv[1:] ]
    dirList  =  sys.argv[1:]
    if  len(dirList) == 0:
        dirList  =  [ "." ]
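The same defaulting logic can be written more compactly, because an empty list counts as false in a Boolean context. This is only a sketch; the helper name dir_list_from_args is hypothetical and is not part of bigfiles.py.

```python
def dir_list_from_args(argv):
    """Return the directories named after the script name, or ["."] if none."""
    # argv[0] is the script name. An empty slice is false, so the "or"
    # operator supplies the default list.
    return argv[1:] or ["."]

print(dir_list_from_args(["bigfiles.py"]))
print(dir_list_from_args(["bigfiles.py", "/tmp", "src"]))
```

The explicit len() test in the script does the same job and may be easier to match against its intended-function comment.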

Next we go through the elements of dirList, generating a report for each one.

bigfiles.py
    #-- 2 --
    # [ sys.stdout  +:=  reports listing files below each
    #       directory named in dirList, with files in descending
    #       order by size ]
    for  dirName in dirList:
        #-- 2 body --
        # [ dirName is a string ->
        #     sys.stdout  +:=  a report listing the files below
        #         directory (dirName), with files in descending
        #         order by size ]
        report ( dirName )

7.3. report(): Generate one directory tree's report

The report() function generates the portion of the report for one directory subtree. The path name to the subtree is its argument.

bigfiles.py
# - - -   r e p o r t   - - -

def report ( dirName ):
    """Generate the report for one directory subtree.

      [ dirName is a string ->
          sys.stdout  +:=  a report listing the files below
              directory (dirName), with files in descending
              order by size ]
    """

This function has only three steps: write a report heading; instantiate the BigReport object containing the report data; and call that object's .genFiles() method to generate the lines of the report.

Each report starts with a line showing the name of the subtree's starting directory. This uses Python's os.path.realpath() function, which resolves soft links and relative path names to the actual absolute path name.

We also set basePath to the absolute path name corresponding to dirName. The BigReport object needs this so that it can display each file's path name relative to that base directory. For this, we use the os.path.abspath() function, which does not replace soft links with their real locations.

bigfiles.py
    #-- 1 --
    # [ basePath  :=  dirName's absolute path name
    #   sys.stdout  +:=  report heading showing dirName's real
    #                    absolute path ]
    basePath  =  os.path.abspath ( dirName )
    print "\n   === %s ===" % os.path.realpath ( dirName )

    #-- 2 --
    # [ bigReport  :=  a BigReport object describing all the
    #       accessible files in directory tree (dirName) ]
    bigReport  =  BigReport ( basePath )

    #-- 3 --
    # [ bigReport is a BigReport object ->
    #     sys.stdout  +:=  lines describing files in bigReport
    #         in descending order by size ]
    for  bigInfo in bigReport.genFiles():
        print bigInfo
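To illustrate the difference between os.path.abspath() and os.path.realpath() as used in step 1 above, here is a small sketch. It assumes a system that supports soft links, and it is written to run under Python 3; the directory names and the compare_paths helper are made up for the demonstration.

```python
import os
import tempfile

def compare_paths(path):
    """Return the pair (abspath, realpath) for the given path."""
    return os.path.abspath(path), os.path.realpath(path)

# Build a throwaway tree: <tmp>/target is a real directory and
# <tmp>/link is a soft link pointing to it.  The temporary directory
# name is resolved first so that only our own link is in play.
tmp = os.path.realpath(tempfile.mkdtemp())
target = os.path.join(tmp, "target")
link = os.path.join(tmp, "link")
os.mkdir(target)
os.symlink(target, link)

abs_name, real_name = compare_paths(link)
print(abs_name)     # still ends in "link": the soft link is kept
print(real_name)    # ends in "target": the soft link is resolved
```

This is why the heading can show the real location while the relative path names in the body of the report are computed from the name the user actually gave.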

7.4. class BigInfo: The PathInfo subclass

In order to make the file snapshots sort in descending order by size, we could just define a .__cmp__() method in the PathInfo class. However, the bigfiles.py script and the oldfiles.py script need different sorting behavior: the former sorts by size, while the latter sorts by modification timestamp.

So each of these scripts defines a new class, inheriting from PathInfo, that defines a .__cmp__() method that makes the objects sort correctly for that application.

So that a BigInfo instance can display the path name relative to the report's starting directory, its constructor requires an additional argument named basePath, the starting directory's absolute path.

Here's the beginning of the class declaration. Note that the class name is followed by the parent class name in parentheses.

bigfiles.py
#================================================================
# Functions and classes
#----------------------------------------------------------------


# - - - - -   c l a s s   B i g I n f o   - - - - -

class BigInfo(pathinfo.PathInfo):
    """Represents information about one file; sorts by size.

      Exports:
        BigInfo ( path, basePath ):
          [ (path is the path name to a file) and
            (basePath is the path name of some directory above
            path) ->
              return a new BigInfo instance with those values ]
        .__cmp__ ( self, other ):
          [ other is a BigInfo instance ->
              return cmp ( other.size, self.size ) ]
        .__str__ ( self ):
          [ return a string describing self's modification time,
            its size, and its path name relative to basePath ]

      State/Invariants:
        .__basePath:  [ as passed to constructor, read-only ]
    """

7.5. BigInfo.__init__(): Constructor

The constructor for this class differs from PathInfo's constructor in that it requires one additional argument, the base path.

bigfiles.py
# - - -   B i g I n f o . _ _ i n i t _ _   - - -

    def __init__ ( self, path, basePath ):
        """Constructor for BigInfo."""

First we call the parent class constructor. Then we store the basePath argument in the internal attribute __basePath.

bigfiles.py
        #-- 1 --
        pathinfo.PathInfo.__init__ ( self, path )

        #-- 2 --
        self.__basePath  =  basePath
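The two leading underscores in __basePath trigger Python's name mangling, which is what makes the attribute effectively internal to the class. The sketch below shows the effect with hypothetical stand-in classes named Base and Derived; it does not use the real pathinfo module.

```python
class Base(object):
    """Hypothetical stand-in for pathinfo.PathInfo."""
    def __init__(self, path):
        self.path = path

class Derived(Base):
    """Hypothetical stand-in for BigInfo."""
    def __init__(self, path, basePath):
        # Initialize the inherited part first, then the new attribute.
        Base.__init__(self, path)
        self.__basePath = basePath

d = Derived("/tmp/a/b", "/tmp")

# Inside the class the attribute is spelled self.__basePath, but Python
# stores it under the mangled name _Derived__basePath.
print(d._Derived__basePath)
print(hasattr(d, "__basePath"))
```

Code outside the class that asks for d.__basePath gets an AttributeError, which discourages callers from depending on the attribute.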

7.6. BigInfo.__cmp__(): The comparator method

When two instances of the PathInfo base class are compared, the .__cmp__() method in that class orders them by pathname.

In order to get BigInfo objects to sort in descending order by size (with the pathname as a tie-breaker), we redefine that method in this derived class.

bigfiles.py
# - - -   B i g I n f o . _ _ c m p _ _   - - -

    def __cmp__ ( self, other ):
        """Compare two BigInfo objects.

          [ other is a BigInfo object ->
              if self should precede other ->
                return a negative number
              else if self should follow other ->
                return a positive number
              else -> return 0 ]
        """

To make larger files precede smaller ones, we want to return a negative number if self.size is greater than other.size, a positive number if it is less, and zero if their .size attributes are equal. The cmp() function does this comparison, but backwards. So we can implement the comparison we want by inverting the sign of the result of cmp().

We need to consider more than the file sizes, however. If there are multiple files with the same size, in what order should they be shown? We'll use the pathname as a secondary key. That way, if for example there are a lot of files with length 0, those files will be grouped together but sorted by pathname.

So the first step is to call the cmp() function to compare the sizes, and negate its result so we get descending instead of ascending order. If this result is nonzero, we can return it to the caller.

bigfiles.py
        #-- 1 --
        compare  =  - cmp ( self.size, other.size )

        #-- 2 --
        if compare != 0:
            return compare

If the sizes are equal, we then call cmp() again on the .path attributes, and return that.

bigfiles.py
        #-- 3 --
        return cmp(self.path, other.path)
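The cmp() built-in and the .__cmp__() special method exist only in Python 2; Python 3 removed both. As a sketch of how the same ordering might be expressed there, the example below uses a sort key instead of a comparator. FileSnap is a hypothetical stand-in for BigInfo that carries only the two attributes the comparison uses.

```python
from collections import namedtuple

# Hypothetical stand-in for BigInfo: only the attributes used for ordering.
FileSnap = namedtuple("FileSnap", ["size", "path"])

def big_first_key(snap):
    """Sort key: larger sizes first, then path names in ascending order."""
    # Negating the size reverses the order for that field only, so the
    # path name still breaks ties in ascending order.
    return (-snap.size, snap.path)

snaps = [
    FileSnap(0, "b.txt"),
    FileSnap(512, "notes"),
    FileSnap(0, "a.txt"),
    FileSnap(4096, "big.bin"),
]
ordered = sorted(snaps, key=big_first_key)
for snap in ordered:
    print("%10d %s" % (snap.size, snap.path))
```

The two zero-length files end up next to each other, ordered by name, just as the tie-breaking rule above describes.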

7.7. BigInfo.__str__(): String conversion method

If you convert a PathInfo object to a string, it starts with the permissions. However, in the bigfiles.py script, we're assuming that the user is not going to be interested in permissions, but mainly in the file's size and pathname, and perhaps also its last modification time.

So, to change this format, we can define a __str__() method to override the base class's .__str__() method. This version of the method presents only the modification time, file size, and path name.

There is one refinement to make the display more readable. Because this version of .__str__() does not include the type code (d for directory, - for regular files), it is hard to tell which pathnames relate to directories. So we append a "/" to the pathname if it is a directory. This is the convention used by the output of the “ls -F” command to identify directories.

bigfiles.py
# - - -   B i g I n f o . _ _ s t r _ _   - - -

    def __str__ ( self ):
        """Format a BigInfo for printing."""

So that the reader of the report can tell which lines are for directories, we set suffix to a slash if this path is a directory, or to an empty string otherwise.

bigfiles.py
        #-- 1 --
        # [ if self represents a directory ->
        #     suffix  :=  "/"
        #   else ->
        #     suffix  :=  "" ]
        if  self.isDir():  suffix  =  "/"
        else:              suffix  =  ""

Next we find the path relative to self.__basePath. This code assumes that self.__basePath is the absolute path name of a directory above our path. To get the relative path, we can then just use os.path.abspath() to get our path's absolute path, then trim off the first len(self.__basePath) characters, plus one for the slash that separates those two parts.

bigfiles.py
        #-- 2 --
        # [ self.__basePath is the absolute path of a directory
        #   above self.path ->
        #     relPath  :=  path to self.path relative to
        #                  self.__basePath ]
        absPath  =  os.path.abspath ( self.path )
        relPath  =  absPath [ len(self.__basePath) + 1 : ]

There is one special case: the first line of the report is for the base path itself, whose absolute path is identical to self.__basePath, and relPath is now an empty string. In this case, we substitute "." for the path name, and set suffix to the empty string so that the line will not read "./".

bigfiles.py
        #-- 3 --
        if  relPath == "":
            relPath  =  "."
            suffix   =  ""
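Steps 2 and 3 can be tried in isolation. The sketch below wraps the slicing approach in a hypothetical function named rel_to_base and compares its result with the standard os.path.relpath() function (available since Python 2.6). It assumes, as the method does, that the base path is absolute, has no trailing slash, and lies at or above the other path.

```python
import os

def rel_to_base(absPath, basePath):
    """Return absPath relative to basePath, using the slicing approach.

    Assumes basePath is an absolute directory path, with no trailing
    slash, that is at or above absPath.
    """
    # Drop the base path plus the one separator that follows it.
    relPath = absPath[len(basePath) + 1:]
    # The base directory itself is left with an empty string; show it as ".".
    return relPath or "."

base = os.path.abspath("demo")                  # "demo" need not exist
inner = os.path.join(base, "sub", "file.txt")

print(rel_to_base(inner, base))
print(rel_to_base(base, base))
print(os.path.relpath(inner, base))
```

The slicing version would trim one character too many if the base path were the root directory "/", which is one reason os.path.relpath() may be the safer choice in new code.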

Finally we are ready to format and return the report line.

bigfiles.py
        #-- 4 --
        return ( "%s %10s %s%s" %
                 (self.modTime(), self.size, relPath, suffix) )
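To see how the format string lays out one line, here is a sketch that applies it to made-up values. The timestamp text is an assumption for illustration only; the actual format returned by .modTime() is defined in pathinfo.py. The function name format_line is hypothetical.

```python
def format_line(modTime, size, relPath, suffix):
    """Build one report line with the layout used by BigInfo.__str__()."""
    # %10s right-justifies the size in a field ten characters wide, so
    # the sizes line up in a column when the timestamps have equal width.
    return "%s %10s %s%s" % (modTime, size, relPath, suffix)

line = format_line("2009-03-14 15:09:26", 1234, "docs", "/")
print(line)
```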

7.8. class BigReport: The class for the whole application

The BigReport class is a container for all the information we need to produce the report. Its constructor takes the name of a directory, walks that directory subtree, and records all the file information. Its .genFiles() method is used to extract the resulting report in the desired order. Here is its interface:

bigfiles.py
# - - - - -   c l a s s   B i g R e p o r t   - - - - -

class BigReport:
    """Holds the big-files report.

      Exports:
        BigReport ( dir ):
          [ dir is a string ->
              if dir names a directory to which we have access ->
                return a BigReport object describing all the
                accessible files in that directory's subtree
              else -> raise OSError ]
        .genFiles():
          [ generate a sequence of BigInfo objects representing
            the files in self, in descending order by file size,
            with the path name as a secondary key ]

      Class invariants:
        .__bigList:
          [ a list of information on all the files in self
            as BigInfo objects, sorted ]
    """

7.9. BigReport.__init__(): Constructor

The constructor takes one argument, the name of the directory at the top of the subtree. The same name also serves as the base path against which relative path names are displayed. First we initialize the internal .__bigList attribute.

bigfiles.py
# - - -   B i g R e p o r t . _ _ i n i t _ _   - - -

    def __init__ ( self, dir ):
        """Constructor for the BigReport class."""

        #-- 1 --
        self.__bigList  =  []

To visit every file in the subtree, we use the os.path.walk() function. This function takes three arguments:

  1. The name of the directory subtree to be walked.

  2. A “visitor function” that will be called once for every directory in the subtree, including the starting directory.

  3. The third argument gets passed on to the visitor function as its first argument. We'll use this argument to pass the starting directory name to the visitor function, because the BigInfo object needs it to determine relative path names.

bigfiles.py
        #-- 2 --
        # [ dir is a string ->
        #     self.__bigList  :=  self.__bigList with BigInfo
        #         objects added representing every accessible
        #         file in the subtree named by dir ]
        os.path.walk ( dir, self.__visitor, dir )
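The os.path.walk() function exists only in Python 2; it was removed in Python 3, where os.walk() is the usual replacement. The sketch below shows how a similar traversal might look with os.walk(). The function name collect_sizes is hypothetical, and it records plain (size, path) pairs rather than BigInfo objects.

```python
import os
import tempfile

def collect_sizes(topDir):
    """Return (size, path) pairs for the regular files under topDir."""
    found = []
    # os.walk() yields one (dirPath, dirNames, fileNames) triple per
    # directory, so no visitor function or extra argument is needed.
    for dirPath, dirNames, fileNames in os.walk(topDir):
        for fileName in fileNames:
            filePath = os.path.join(dirPath, fileName)
            try:
                if os.path.isfile(filePath):
                    found.append((os.path.getsize(filePath), filePath))
            except OSError:
                # Skip files that vanish or cannot be examined.
                pass
    return found

# Small demonstration tree: one file at the top, one in a subdirectory.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "sub"))
for name, text in [("a.txt", "12345"), (os.path.join("sub", "b.txt"), "12")]:
    with open(os.path.join(top, name), "w") as f:
        f.write(text)

# Largest first, with the path name as the secondary key.
sizes = sorted(collect_sizes(top), key=lambda pair: (-pair[0], pair[1]))
print(sizes)
```

Because os.walk() is itself a generator, the starting directory name is simply available in the enclosing scope, and the third argument that os.path.walk() passes along is no longer needed.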

All that remains in the constructor is to sort the list.

bigfiles.py
        #-- 3 --
        # [ self.__bigList  :=  self.__bigList, sorted ]
        self.__bigList.sort()

7.10. BigReport.__visitor(): Visitor function for os.path.walk()

When we call os.path.walk(), we pass it this method as the “visitor function”. This function is called once for each directory in the subtree. As discussed in the Python Library Reference section on the os.path module, the visitor function takes three arguments:

  1. arg: The value passed as the third argument to os.path.walk() is passed on to the visitor function. In our method this parameter is named basePath: it holds the starting directory name, which we hand to the BigInfo constructor.

  2. dirName: The name of the directory we are currently visiting.

  3. nameList: A list of the names within this directory. This may include regular files, subdirectories, and soft links (and perhaps other creatures that are not of interest to this script). If the directory is empty, this argument will be an empty list.

This method must find all the regular files in nameList, take snapshots of them with the BigInfo constructor, and add those BigInfo instances to the self.__bigList list.

We can ignore subdirectories here, because os.path.walk() will take care of calling the visitor function for them.

bigfiles.py
# - - -   B i g R e p o r t . _ _ v i s i t o r   - - -

    def __visitor ( self, basePath, dirName, nameList ):
        """Visitor function for os.path.walk.

          [ (basePath is the absolute path name to a directory
            above dirName) and
            (dirName is the name of a directory) and
            (nameList is a list of the names within that 
            directory) ->
              self.__bigList  :=  self.__bigList with BigInfo
                  objects added representing the accessible
                  ordinary files in nameList ]
        """

The first step is to add an entry for the directory itself. It is unlikely that the directory will be inaccessible, but we use a try:/except: block just in case it is, so the script won't crash.

bigfiles.py
        #-- 1 --
        # [ self.__bigList  :=  self.__bigList with a BigInfo
        #       object added representing dirName ]
        try:
            dirInfo  =  BigInfo ( dirName, basePath )
            self.__bigList.append ( dirInfo )
        except OSError, detail:
            pass

Next, we iterate through the files in nameList, attempting to pass each one to BigInfo. If we don't have access to the file, that constructor will raise an OSError exception; in that case, we just discard that name and move on to the next one.

Each file's path name must be reconstructed by prepending dirName to it. We use the os.path.join() function to concatenate them with the correct separator.

Then, we append each BigInfo object to self.__bigList only if it is a regular file.

bigfiles.py
        #-- 2 --
        for  fileName in nameList:
            #-- 2 body --
            # [ if fileName names an accessible regular file ->
            #     self.__bigList  :=  self.__bigList with a new
            #         BigInfo object representing fileName
            #   else -> I ]

            #-- 2.1 --
            # [ filePath  :=  dirName + fileName ]
            filePath  =  os.path.join ( dirName, fileName )

            #-- 2.2 --
            # [ if filePath is an accessible path to a regular file ->
            #     self.__bigList  :=  self.__bigList + (a BigInfo
            #         showing the status of filePath)
            #   else -> I ]
            try:
                bigInfo  =  BigInfo ( filePath, basePath )
                if  bigInfo.isFile():
                    self.__bigList.append ( bigInfo )
            except OSError, detail:
                pass

Note the pass statement above: it causes names we cannot access to be skipped silently. The .isFile() test, in turn, silently excludes anything that is not a regular file, such as block and character device files.

7.11. BigReport.genFiles(): Generate the report

This method generates the elements of self.__bigList in order. Once the list is exhausted, we raise the special StopIteration exception to signify the end of generated values. To print the report, the caller can just use a print statement on each generated value: that will convert it to a string and print it.

bigfiles.py
# - - -   B i g R e p o r t . g e n F i l e s   - - -

    def genFiles ( self ):
        """Generate the BigInfo objects in self.__bigList."""
        for  bigInfo in self.__bigList:
            yield bigInfo

        raise StopIteration
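A generator also finishes when control falls off the end of its body, so the explicit raise is optional. This matters when porting: since Python 3.7, a StopIteration raised inside a generator body is converted to a RuntimeError. The sketch below uses a hypothetical stand-in class named Report, not the real BigReport.

```python
class Report(object):
    """Hypothetical stand-in for BigReport, holding an already sorted list."""
    def __init__(self, items):
        self.__items = list(items)

    def genItems(self):
        """Generate the stored items in order."""
        for item in self.__items:
            yield item
        # No explicit StopIteration here: reaching the end of the
        # function body ends the iteration.

report = Report(["big.bin", "notes", "a.txt"])
lines = [item for item in report.genItems()]
print(lines)
```

Each call to the generator method starts a fresh pass over the list, so the report can be produced more than once from the same object.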

7.12. Epilogue

These last few lines of the script invoke main(), but only if the script is being run (as opposed to being imported).

bigfiles.py
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

if  __name__  ==  "__main__":
    main()