The script starts out by making up a list of directory
trees to be visited. If there are any command line
arguments, these arguments make up the list of directories.
If there are no arguments, we default to a list containing
one entry, ".".
The following steps are done for each directory in the list:
Look at every file in and under that directory. Use
PathInfo to take a snapshot of that file, and build a list
of these PathInfo instances. This process of
“walking the directory tree” is easy
because the os.path module has
a function called walk() that
handles the process of visiting every directory in the
tree. For documentation on this function, see the
Python Library Reference.
Sort this list in descending order by the size attribute.
Python makes it easy to sort a list: all list objects
have a .sort() method that
sorts the list in place. But how do we get a list of
PathInfo objects to sort in descending order by size? The
.sort() method can take as an
optional argument a function that compares two objects.
However, a more Pythonic way to do it is to define a
new class called BigInfo that inherits from the PathInfo
class. In that class we define a special method named
.__cmp__() that tells Python how to order two objects of
that class.
Print a heading for this section of the report, then go
through the sorted list and print one line for each
PathInfo instance in that list.
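For comparison, Python 3 no longer accepts a comparison function in .sort() and does not call .__cmp__(), so the same ordering is usually expressed there with a key function. The following is a minimal sketch under that assumption. The FileSnap tuple and its field names are hypothetical stand-ins for PathInfo, not part of pathinfo.py.

```python
from collections import namedtuple

# Hypothetical stand-in for a PathInfo snapshot: just a path and a size.
FileSnap = namedtuple("FileSnap", ["path", "size"])

def sort_by_size_descending(snaps):
    """Return a new list ordered by size (largest first), then by path."""
    # Negating the size reverses the order for that key only, so ties
    # on size still fall back to ascending path order.
    return sorted(snaps, key=lambda snap: (-snap.size, snap.path))

snaps = [
    FileSnap("b.txt", 0),
    FileSnap("big.iso", 7000),
    FileSnap("a.txt", 0),
    FileSnap("notes.md", 120),
]
ordered = sort_by_size_descending(snaps)
for snap in ordered:
    print("%10d %s" % (snap.size, snap.path))
```

The script itself takes the subclass route described above; the key function is only an alternative way to state the same ordering.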
One of the goals of object-oriented programming is to
minimize, if not eliminate, the use of global variables.
An earlier version of this program used a global variable
to hold the list of PathInfo objects. A better way is to
define a class that holds this list. The methods in the
class have access to the list, but no code outside the
class needs such access. This is in accord with the
generally accepted software design principle of
“information hiding”: we will feed the class
constructor the name of a directory, and it will return an
object that has everything we need to generate the final
report.
We'll call this class BigReport,
because it represents a report on big files. With this
design, the overall program flow for each directory
becomes:
Instantiate a BigReport
object. Pass the name of the directory to its
constructor.
Call the BigReport object's .genFiles() method, a
generator that produces the lines of the report one at a
time, and print each one. (Generators were added to
Python in version 2.2. See the Python
language reference section on generators.)
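The design described above, a class that keeps its list private and hands results out through a generator, might be sketched as follows in Python 3 syntax. The class name SizeReport, its gen_lines() method, and the dictionary input are all invented for this illustration; the real BigReport class walks a directory tree instead.

```python
class SizeReport:
    """Hypothetical container: keeps its list private, exposes a generator."""

    def __init__(self, sizes_by_name):
        # Sort once in the constructor: largest first, then by name.
        self._entries = sorted(
            sizes_by_name.items(),
            key=lambda item: (-item[1], item[0]),
        )

    def gen_lines(self):
        """Yield one formatted report line per entry, in sorted order."""
        for name, size in self._entries:
            yield "%10d %s" % (size, name)

report = SizeReport({"a.txt": 10, "core": 5000, "b.txt": 10})
lines = list(report.gen_lines())
for line in lines:
    print(line)
```

Code outside the class never touches the list directly; it only iterates over what the generator yields.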
The actual code for the bigfiles.py script starts with the Linux
“pound bang line” that makes the script
self-executing.
#!/usr/bin/env python
#================================================================
# bigfiles.py: Script to show files in descending order by size.
#   For documentation in "literate programming" style, see:
#   http://www.nmt.edu/help/lang/python/examples/pathinfo/
#----------------------------------------------------------------

SCRIPT_NAME = "bigfiles.py"
EXTERNAL_VERSION = "1.1"
Next we need to import a few Python modules: sys for command line arguments and standard
streams; os for numerous file- and
directory-related functions; and of course pathinfo.py.
#================================================================
# Imports
#----------------------------------------------------------------

import sys, os
import pathinfo
The first step is to put together a list of the desired
directories. If none are given on the command line, we
create a list containing just "." for the
current directory.
# - - - - - m a i n - - - - -

def main():
    """Main program."""
    print "=== %s %s ===" % (SCRIPT_NAME, EXTERNAL_VERSION)

    #-- 1 --
    # [ if sys.argv[1:] is empty ->
    #     dirList := [ "." ]
    #   else ->
    #     dirList := sys.argv[1:] ]
    dirList = sys.argv[1:]
    if len(dirList) == 0:
        dirList = [ "." ]
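The same defaulting rule can be isolated in a small function, which makes it easy to try out without touching the real command line. This is only a sketch in Python 3 syntax; the function name choose_dirs is made up for illustration and does not appear in bigfiles.py.

```python
def choose_dirs(argv):
    """Return the directories named in argv[1:], or ["."] if there are none."""
    dir_list = argv[1:]
    if not dir_list:          # an empty list is false in Python
        dir_list = ["."]
    return dir_list

no_args = choose_dirs(["bigfiles.py"])
two_args = choose_dirs(["bigfiles.py", "/tmp", "/var/log"])
print(no_args)
print(two_args)
```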
Next we go through the elements of dirList, generating a report for each one.
    #-- 2 --
    # [ sys.stdout +:= reports listing files below each
    #       directory named in dirList, with files in descending
    #       order by size ]
    for dirName in dirList:
        #-- 2 body --
        # [ dirName is a string ->
        #     sys.stdout +:= a report listing the files below
        #         directory (dirName), with files in descending
        #         order by size ]
        report ( dirName )
The report() function generates
the portion of the report for one directory subtree. The
path name to the subtree is its argument.
# - - - r e p o r t - - -

def report ( dirName ):
    """Generate the report for one directory subtree.

      [ dirName is a string ->
          sys.stdout +:= a report listing the files below
              directory (dirName), with files in descending
              order by size ]
    """
This function has only three steps: write a report
heading; instantiate the BigReport object containing the report data; and call that
object's .genFiles() method to
generate the lines of the report.
Each report starts with a line showing the name of the
subtree's starting directory. This uses Python's os.path.realpath() function, which resolves soft
links and relative path names to the actual absolute path
name.
We also set basePath to the absolute path
name corresponding to dirName. This
is necessary to the BigReport object so
that it can display each file's path name relative to
that base directory. For this, we use the os.path.abspath() function, which does not
replace soft links with their real locations.
    #-- 1 --
    # [ basePath := dirName's absolute path name
    #   sys.stdout +:= report heading showing dirName's real
    #       absolute path ]
    basePath = os.path.abspath ( dirName )
    print "\n === %s ===" % os.path.realpath ( dirName )

    #-- 2 --
    # [ bigReport := a BigReport object describing all the
    #       accessible files in directory tree (dirName) ]
    bigReport = BigReport ( basePath )

    #-- 3 --
    # [ bigReport is a BigReport object ->
    #     sys.stdout +:= lines describing files in bigReport
    #         in descending order by size ]
    for bigInfo in bigReport.genFiles():
        print bigInfo
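The difference between os.path.abspath() and os.path.realpath() used in report() can be seen with a throwaway soft link. The sketch below is written for Python 3 and assumes a platform where the current user is allowed to create symbolic links; the directory names and the compare_paths helper are invented for the demonstration.

```python
import os
import tempfile

def compare_paths(link_path):
    """Return (abspath, realpath) for the same path name."""
    return os.path.abspath(link_path), os.path.realpath(link_path)

# Build a throwaway directory with a soft link pointing at a real directory.
# realpath() is applied to the scratch directory first, so that any links
# in the temporary directory's own path do not confuse the comparison.
scratch = os.path.realpath(tempfile.mkdtemp())
target = os.path.join(scratch, "target")
link = os.path.join(scratch, "link")
os.mkdir(target)
os.symlink(target, link)

abs_path, real_path = compare_paths(link)
print(abs_path)    # still ends in "link"
print(real_path)   # ends in "target"
```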
In order to make the file snapshots sort in descending
order by size, we could just define a .__cmp__() method in the PathInfo class.
However, the bigfiles.py script and the oldfiles.py script need
different sorting behavior: the former sorts by size,
while the latter sorts by modification timestamp.
So each of these scripts defines a new class, inheriting
from PathInfo, that defines a .__cmp__() method that makes the objects
sort correctly for that application.
So that a BigInfo instance can display the
path name relative to the report's starting directory,
its constructor requires an additional argument named
basePath, the starting directory's
absolute path.
Here's the beginning of the class declaration. Note that the class name is followed by the parent class name in parentheses.
#================================================================
# Functions and classes
#----------------------------------------------------------------

# - - - - - c l a s s B i g I n f o - - - - -

class BigInfo(pathinfo.PathInfo):
    """Represents information about one file; sorts by size.

      Exports:
        BigInfo ( path, basePath ):
          [ (path is the path name to a file) and
            (basePath is the path name of some directory above
            path) ->
              return a new BigInfo instance with those values ]
        .__cmp__ ( self, other ):
          [ other is a BigInfo instance ->
              return cmp ( other.size, self.size ) ]
        .__str__ ( self ):
          [ return a string describing self's modification time,
            its size, and its path name relative to basePath ]

      State/Invariants:
        .__basePath: [ as passed to constructor, read-only ]
    """
The constructor for this class differs from PathInfo's constructor in that it requires one
additional argument, the base path.
# - - - B i g I n f o . _ _ i n i t _ _ - - -

    def __init__ ( self, path, basePath ):
        """Constructor for BigInfo."""
First we call the parent class constructor.
Then we store the basePath argument in the
internal attribute __basePath.
        #-- 1 --
        pathinfo.PathInfo.__init__ ( self, path )

        #-- 2 --
        self.__basePath = basePath
When two instances of the PathInfo base class are compared,
the .__cmp__() method in that
class orders them by pathname.
In order to get BigInfo objects
to sort in descending order by size (with the pathname as
a tie-breaker), we redefine that method in this derived
class.
# - - - B i g I n f o . _ _ c m p _ _ - - -

    def __cmp__ ( self, other ):
        """Compare two BigInfo objects.

          [ other is a BigInfo object ->
              if self should precede other ->
                return a negative number
              else if self should follow other ->
                return a positive number
              else -> return 0 ]
        """
To make larger files precede smaller ones, we want to
return a negative number if self.size is greater than other.size, a positive number if it is
less, and zero if their .size
attributes are equal.
The cmp() function does this
comparison, but backwards. So we can implement the
comparison we want by inverting the sign of the result of
cmp().
We need to consider more than the file sizes, however. If there are multiple files with the same size, in what order should they be shown? We'll use the pathname as a secondary key. That way, if for example there are a lot of files with length 0, those files will be grouped together but sorted by pathname.
So the first step is to call the cmp() function to compare the sizes, and
negate its result so we get descending instead of
ascending order. If this result is nonzero, we can
return it to the caller.
        #-- 1 --
        compare = - cmp ( self.size, other.size )

        #-- 2 --
        if compare != 0:
            return compare
If the sizes are equal, we then call cmp() again on the .path attributes, and return that.
        #-- 3 --
        return cmp(self.path, other.path)
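In Python 3 there is no cmp() function and .__cmp__() is not consulted, so a class that wants to control its own sort order defines .__lt__() instead, which is the only comparison that list.sort() uses. The sketch below shows the same size-then-path ordering under that assumption. SizedPath is a hypothetical stand-in for BigInfo that takes its size as an argument rather than reading it from the file system.

```python
class SizedPath:
    """Hypothetical stand-in for BigInfo: a path plus its size in bytes."""

    def __init__(self, path, size):
        self.path = path
        self.size = size

    def _sort_key(self):
        # Larger sizes first (hence the negation), path as the tie-breaker.
        return (-self.size, self.path)

    def __lt__(self, other):
        # list.sort() and sorted() only need "less than" to order items.
        return self._sort_key() < other._sort_key()

    def __eq__(self, other):
        return self._sort_key() == other._sort_key()

items = [SizedPath("z.log", 0), SizedPath("a.log", 0), SizedPath("db", 900)]
items.sort()
print([(item.size, item.path) for item in items])
```

Comparing tuples does both steps at once: the second element is examined only when the first elements are equal, just as the path comparison above runs only when the sizes match.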
If you convert a PathInfo object to a string, it starts with
the permissions. However, in the bigfiles.py script, we're
assuming that the user is not going to be interested in
permissions, but mainly in the file's size and pathname,
and perhaps also its last modification time.
So, to change this format, we can define a __str__() method to override the base
class's .__str__() method.
This version of the method presents only the modification
time, file size, and path name.
There is one refinement to make the display more
readable. Because this version of .__str__() does not include the type code
(d for directory, - for regular files), it is hard to tell
which pathnames relate to directories. So we append a
"/" to the pathname if it is a
directory. This is the convention used by the output of
the “ls -F”
command to identify directories.
# - - - B i g I n f o . _ _ s t r _ _ - - -

    def __str__ ( self ):
        """Format a BigInfo for printing."""
So that the reader of the report can tell which lines are
for directories, we set suffix to a slash
if this path is a directory, or to an empty string
otherwise.
        #-- 1 --
        # [ if self represents a directory ->
        #     suffix := "/"
        #   else ->
        #     suffix := "" ]
        if self.isDir(): suffix = "/"
        else: suffix = ""
Next we find the path relative to self.__basePath. This code assumes that self.__basePath is the absolute path name of a
directory above our path. To get the relative path,
we can then just use os.path.abspath() to
get our path's absolute path, then trim off the first
len(self.__basePath) characters, plus
one for the slash that separates those two parts.
        #-- 2 --
        # [ self.__basePath is the absolute path of a directory
        #   above self.path ->
        #     relPath := path to self.path relative to
        #         self.__basePath ]
        absPath = os.path.abspath ( self.path )
        relPath = absPath [ len(self.__basePath) + 1 : ]
There is one special case: the first line of the report
is for the base path itself, whose absolute path is
identical to self.__basePath, and
relPath is now an empty string. In this
case, we substitute "." for the path name,
and set suffix to the empty string so
that the line will not read "./".
        #-- 3 --
        if relPath == "":
            relPath = "."
            suffix = ""
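The trimming step and its special case can be tried in isolation. The following sketch, in Python 3 syntax, assumes absolute POSIX-style paths and a base path with no trailing slash; the function name relative_to_base and the example paths are invented here. For these inputs the standard library's relpath() function gives the same answers, which is a useful cross-check.

```python
import posixpath

def relative_to_base(abs_path, base_path):
    """Trim base_path (and the following slash) from the front of abs_path.

    Assumes both are absolute POSIX paths and base_path is abs_path itself
    or a directory above it, with no trailing slash.
    """
    rel_path = abs_path[len(base_path) + 1:]
    if rel_path == "":
        # abs_path was the base directory itself.
        rel_path = "."
    return rel_path

base = "/home/pat/project"
inside = relative_to_base("/home/pat/project/src/main.py", base)
itself = relative_to_base(base, base)
print(inside)
print(itself)
# Cross-check against the standard library for the same inputs.
print(posixpath.relpath("/home/pat/project/src/main.py", base))
```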
Finally we are ready to format and return the report line.
        #-- 4 --
        return ( "%s %10s %s%s" %
                 (self.modTime(), self.size, relPath, suffix) )
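To see what the format string produces, here is a small stand-alone sketch in Python 3 syntax. The function format_line and the timestamp text are assumptions made for the example; the actual appearance of the first column depends on what PathInfo's .modTime() method returns.

```python
def format_line(mod_time, size, rel_path, is_dir):
    """Build one report line: time, right-justified size, path."""
    # Mark directories with a trailing slash, except for "." itself.
    suffix = "/" if is_dir and rel_path != "." else ""
    return "%s %10s %s%s" % (mod_time, size, rel_path, suffix)

dir_line = format_line("2010-03-01 09:15", 4096, "docs", True)
file_line = format_line("2010-03-01 09:15", 512, "docs/readme.txt", False)
top_line = format_line("2010-03-01 09:15", 4096, ".", True)
print(dir_line)
print(file_line)
print(top_line)
```

The %10s field right-justifies the size in ten columns, so the path names line up as long as no size needs more than ten digits.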
The BigReport class is a
container for all the information we need to produce the
report. Its constructor takes the name of a directory,
walks that directory subtree, and records all the file
information. Its .genFiles()
method is used to extract the resulting report in the
desired order. Here is its interface:
# - - - - - c l a s s B i g R e p o r t - - - - -

class BigReport:
    """Holds the big-files report.

      Exports:
        BigReport ( dir ):
          [ dir is a string ->
              if dir names a directory to which we have access ->
                return a BigReport object describing all the
                accessible files in that directory's subtree
              else -> raise OSError ]
        .genFiles():
          [ generate a sequence of BigInfo objects representing
            the files in self, in descending order by file size,
            with the path name as a secondary key ]

      Class invariants:
        .__bigList:
          [ a list of information on all the files in self
            as BigInfo objects, sorted ]
    """
The constructor takes one argument, the name of the
directory at the top of the subtree. That same name later
serves as the base path, so that each BigInfo object can
display its path name relative to that directory. First
we initialize the internal .__bigList
attribute.
# - - - B i g R e p o r t . _ _ i n i t _ _ - - -

    def __init__ ( self, dir ):
        """Constructor for the BigReport class."""
        #-- 1 --
        self.__bigList = []
To visit every file in the subtree, we use the os.path.walk() function. This function takes
three arguments:
The name of the directory subtree to be walked.
A “visitor function” that will be called once for every directory in the subtree, including the starting directory.
The third argument gets passed on to the visitor
function as its first argument. We'll use this
argument to pass the starting directory name to
the visitor function, because the BigInfo object needs it to determine
relative path names.
        #-- 2 --
        # [ dir is a string ->
        #     self.__bigList := self.__bigList with BigInfo
        #         objects added representing every accessible
        #         file in the subtree named by dir ]
        os.path.walk ( dir, self.__visitor, dir )
All that remains in the constructor is to sort the list.
        #-- 3 --
        # [ self.__bigList := self.__bigList, sorted ]
        self.__bigList.sort()
When we call os.path.walk(), we
pass it this method as the “visitor
function”. This function is called once for each
directory in the subtree. As discussed in the Python Library Reference section on the os.path module,
the visitor function takes three arguments:
arg: The value passed as the
third argument to os.path.walk() is passed on to the
visitor function. Here it is the starting directory's
path, which the method receives as basePath and hands to
each BigInfo constructor.
dirName: The name of the
directory we are currently visiting.
nameList: A list of the
names within this directory. This may include
regular files, subdirectories, and soft links (and
perhaps other creatures that are not of interest to
this script). If the directory is empty, this
argument will be an empty list.
This method must find all the regular files in nameList, take snapshots of them with the
BigInfo constructor, and add
those BigInfo instances to the
self.__bigList list.
We can ignore subdirectories here, because os.path.walk() will take care of calling
the visitor function for them.
# - - - B i g R e p o r t . _ _ v i s i t o r - - -

    def __visitor ( self, basePath, dirName, nameList ):
        """Visitor function for os.path.walk.

          [ (basePath is the absolute path name to a directory
            above dirName) and
            (dirName is the name of a directory) and
            (nameList is a list of the names within that
            directory) ->
              self.__bigList := self.__bigList with BigInfo
                  objects added representing the accessible
                  ordinary files in nameList ]
        """
The first step is to add an entry for the directory
itself. It is unlikely that the directory will be
inaccessible, but we use a try:/except: block just in case it
is, so the script won't crash.
        #-- 1 --
        # [ self.__bigList := self.__bigList with a BigInfo
        #       object added representing dirName ]
        try:
            dirInfo = BigInfo ( dirName, basePath )
            self.__bigList.append ( dirInfo )
        except OSError, detail:
            pass
Next, we iterate through the files in nameList, attempting to pass each one to
BigInfo. If we don't have
access to the file, that constructor will raise an
OSError exception; in that case,
we just discard that name and move on to the next one.
Each file's path name must be reconstructed by prepending
dirName to it. We use the
os.path.join() function
to combine them, since it inserts the path separator for us.
Then, we append each BigInfo
object to self.__bigList only if
it is a regular file.
        #-- 2 --
        for fileName in nameList:
            #-- 2 body --
            # [ if fileName names an accessible regular file ->
            #     self.__bigList := self.__bigList with a new
            #         BigInfo object representing fileName
            #   else -> I ]

            #-- 2.1 --
            # [ filePath := dirName + fileName ]
            filePath = os.path.join ( dirName, fileName )

            #-- 2.2 --
            # [ if filePath is an accessible path to a regular file ->
            #     self.__bigList := self.__bigList + (a BigInfo
            #         showing the status of filePath)
            #   else -> I ]
            try:
                bigInfo = BigInfo ( filePath, basePath )
                if bigInfo.isFile():
                    self.__bigList.append ( bigInfo )
            except OSError, detail:
                pass
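Note the pass statement above: any path name that raises
OSError is skipped silently. Inodes that are accessible but
are not regular files, such as block and character device
files, are filtered out by the .isFile() test.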
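The os.path.walk() function exists only in Python 2. In Python 3 the usual replacement is os.walk(), which yields one tuple per directory instead of calling a visitor function. The sketch below shows the same filtering logic in that style. It is an illustration only: collect_regular_files is an invented name, and it records plain (size, path) pairs obtained with os.lstat() rather than BigInfo objects.

```python
import os
import stat
import tempfile

def collect_regular_files(top):
    """Return (size, path) pairs for the regular files under top."""
    found = []
    for dir_name, _subdirs, file_names in os.walk(top):
        for file_name in file_names:
            file_path = os.path.join(dir_name, file_name)
            try:
                info = os.lstat(file_path)
            except OSError:
                # Inaccessible or vanished entry: skip it and move on.
                continue
            if stat.S_ISREG(info.st_mode):
                found.append((info.st_size, file_path))
    return found

# Small demonstration on a throwaway directory tree.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "sub"))
with open(os.path.join(top, "sub", "data.bin"), "wb") as out:
    out.write(b"x" * 300)
with open(os.path.join(top, "empty.txt"), "wb") as out:
    pass

found = collect_regular_files(top)
print(sorted(found, key=lambda pair: (-pair[0], pair[1])))
```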
This method generates the elements of self.__bigList in order. To print the
report, the caller can just use a print statement on each returned value:
that will convert it to a string and print it. After the last element, we
raise the special StopIteration
exception to signify the end of generated values. (Simply returning
from the method would have the same effect.)
# - - - B i g R e p o r t . g e n F i l e s - - -

    def genFiles ( self ):
        """Generate the BigInfo objects in self.__bigList."""
        for bigInfo in self.__bigList:
            yield bigInfo
        raise StopIteration
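One caution for anyone adapting this method to a current interpreter: the explicit raise works in the Python 2 of this script, but from Python 3.7 on, a StopIteration raised inside a generator body is converted to a RuntimeError (PEP 479). A generator there should simply fall off the end or use return. The sketch below contrasts the two forms; both function names are invented for the comparison.

```python
def gen_items(items):
    """Yield each item; the generator ends when the function returns."""
    for item in items:
        yield item
    # No explicit "raise StopIteration" here: falling off the end is enough.

def gen_items_old_style(items):
    """Same loop, but ending with an explicit raise as in older code."""
    for item in items:
        yield item
    raise StopIteration

print(list(gen_items(["a", "b", "c"])))

try:
    list(gen_items_old_style(["a", "b", "c"]))
    outcome = "finished normally"
except RuntimeError:
    # Python 3.7 and later convert the stray StopIteration (PEP 479).
    outcome = "RuntimeError"
print(outcome)
```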