

Describes the implementation of programs to extract New Mexico Tech's class schedules and publish their content in a machine-readable form.

This publication is available in Web form and also as a PDF document. Please forward any comments to

Table of Contents

1. Screen-scraping the NMT class schedules
2. What you will need
3. makexml: Translate HTML to XML
4. HTML analyses
5. pullsched: Build the sched.xml file
5.1. pullsched: Prologue
5.2. pullsched: Imports
5.3. pullsched: Manifest constants
5.4. pullsched: Specification functions
5.5. pullsched: main(): Main program
5.6. pullsched: makeBrowser(): Instantiate a mechanize.Browser
5.7. pullsched: loadSplash(): Load the splash page
5.8. pullsched: scrapeSplash(): Get the codes from the splash page
5.9. pullsched: findSemesters(): Build the semester code mapping
5.10. pullsched: findDepartments(): Build the department name mapping
5.11. pullsched: startTree(): Build the root element and department definitions
5.12. pullsched: buildAllSemesters(): Pull all schedule pages
5.13. pullsched: buildSemester(): Scan the departments for one semester
5.14. pullsched: pullDept(): Process one schedule page
5.15. pullsched: scrapeDetail(): Extract one page of schedules
5.16. pullsched: scrapeDetailRows(): Process the rows of a schedule table
5.17. pullsched: nonemptyRows(): Find the non-empty detail rows
5.18. pullsched: isCourseBoundary(): Course boundary detector
5.19. pullsched: buildCourse(): Build the course subtree
5.20. pullsched: isSectionBoundary(): Section boundary detector
5.21. pullsched: buildSection(): Build the section subtree
5.22. pullsched: buildSectionNode(): Build the section node
5.23. pullsched: class DetailRow: One row of the schedule table
5.24. pullsched: DetailRow.__init__()
5.25. pullsched: DetailRow._dissectCourseId(): Parse the section identifier
5.26. pullsched: DetailRow._dissectTimes(): Find the time range
5.27. pullsched: class Partition: Partitioned set class
5.28. pullsched: Partition.__init__()
5.29. pullsched: Partition.genSlices()
5.30. pullsched: message(): Write a message to the standard error stream
5.31. pullsched: fatal()
5.32. pullsched: Epilogue
6. The Python interface
6.1. Prologue
6.2. Imports
6.3. Manifest constants
6.4. Specification functions
6.5. acadYearToCal(): Convert academic year to calendar year
6.6. semesterName(): Produce the full semester name
6.7. class ClassSchedule
6.8. ClassSchedule.__init__()
6.9. ClassSchedule._getTimestamp(): Retrieve and translate the timestamp
6.10. ClassSchedule.lookupDeptName()
6.11. ClassSchedule.genDeptCodes()
6.12. ClassSchedule.genSemesters()
6.13. ClassSchedule._byDate(): Order semesters chronologically
6.14. ClassSchedule._semesterKey(): Extract the composite key for a semester
6.15. ClassSchedule.lookupSemester()
6.16. class SemesterSchedule
6.17. SemesterSchedule.__init__()
6.18. SemesterSchedule.lookupDeptName()
6.19. SemesterSchedule.lookupDept()
6.20. SemesterSchedule.genDepts()
6.21. SemesterSchedule.lookupCrn()
6.22. class DeptSchedule
6.23. DeptSchedule.__init__()
6.24. DeptSchedule.lookupCourse()
6.25. DeptSchedule.genCourses()
6.26. DeptSchedule.lookupCrn()
6.27. class CourseSchedule
6.28. CourseSchedule.__init__()
6.29. CourseSchedule.lookupCrn()
6.30. CourseSchedule.genSections()
6.31. class SectionSchedule
6.32. SectionSchedule.__init__()
6.33. SectionSchedule.genInstructors()
6.34. class ZoneUtc
7. Version history

1. Screen-scraping the NMT class schedules

This document describes the implementation of the scripts that create the file described in TCC Public Data Project: Class schedules. The general technique of extracting data from web pages is sometimes called Web-scraping or screen-scraping.
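As a rough illustration of the general technique (not the actual pullsched code), the sketch below pulls rows out of an HTML table and republishes them as XML. It uses only the Python standard library instead of mechanize, and the sample page, the `TableScraper` class, the `scrape_to_xml()` function, and the `schedule`/`section` element names are all hypothetical; they do not reflect the real schedule pages or the real sched.xml schema.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample page; the real schedule pages are laid out differently.
SAMPLE_HTML = """
<table>
  <tr><th>CRN</th><th>Course</th><th>Title</th></tr>
  <tr><td>12345</td><td>CSE 101</td><td>Intro to Computing</td></tr>
  <tr><td>12346</td><td>MATH 131</td><td>Calculus I</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # header rows contain no <td> cells; skip them
                self.rows.append(self._row)
            self._row = None


def scrape_to_xml(html):
    """Return an ElementTree root built from the table rows in html."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    root = ET.Element("schedule")       # hypothetical element names
    for row in scraper.rows:
        if len(row) != 3:               # ignore rows that do not fit the layout
            continue
        crn, course, title = row
        section = ET.SubElement(root, "section", crn=crn, course=course)
        section.text = title
    return root


if __name__ == "__main__":
    print(ET.tostring(scrape_to_xml(SAMPLE_HTML), encoding="unicode"))
```

The parser keeps only the state it needs (the current row and the current cell), which is usually enough for a regular table; pages with nested tables or irregular markup would need more careful handling.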

Two scripts are presented here in lightweight literate programming form: the actual code is embedded in prose that explains its functioning.

The project was developed using the Cleanroom software development protocol.