
Abstract

Describes the implementation of programs to extract New Mexico Tech's class schedules and publish their content in a machine-readable form.

This publication is available in Web form and also as a PDF document. Please forward any comments to tcc-doc@nmt.edu.

Table of Contents

1. Screen-scraping the NMT class schedules
2. What you will need
3. makexml: Translate HTML to XML
4. HTML analyses
5. pullsched: Build the sched.xml file
5.1. pullsched: Prologue
5.2. pullsched: Imports
5.3. pullsched: Manifest constants
5.4. pullsched: Specification functions
5.5. pullsched: main(): Main program
5.6. pullsched: makeBrowser(): Instantiate a mechanize.Browser
5.7. pullsched: loadSplash(): Load the splash page
5.8. pullsched: scrapeSplash(): Get the codes from the splash page
5.9. pullsched: findSemesters(): Build the semester code mapping
5.10. pullsched: findDepartments(): Build the department name mapping
5.11. pullsched: startTree(): Build the root element and department definitions
5.12. pullsched: buildAllSemesters(): Pull all schedule pages
5.13. pullsched: buildSemester(): Scan the departments for one semester
5.14. pullsched: pullDept(): Process one schedule page
5.15. pullsched: scrapeDetail(): Extract one page of schedules
5.16. pullsched: scrapeDetailRows(): Process the rows of a schedule table
5.17. pullsched: nonemptyRows(): Find the non-empty detail rows
5.18. pullsched: isCourseBoundary(): Course boundary detector
5.19. pullsched: buildCourse(): Build the course subtree
5.20. pullsched: isSectionBoundary(): Section boundary detector
5.21. pullsched: buildSection(): Build the section subtree
5.22. pullsched: buildSectionNode(): Build the section node
5.23. pullsched: class DetailRow: One row of the schedule table
5.24. pullsched: DetailRow.__init__()
5.25. pullsched: DetailRow._dissectCourseId(): Parse the section identifier
5.26. pullsched: DetailRow._dissectTimes(): Find the time range
5.27. pullsched: class Partition: Partitioned set class
5.28. pullsched: Partition.__init__()
5.29. pullsched: Partition.genSlices()
5.30. pullsched: message(): Write a message to the standard error stream
5.31. pullsched: fatal()
5.32. pullsched: Epilogue
6. class_sched.py: The Python interface
6.1. class_sched.py: Prologue
6.2. class_sched.py: Imports
6.3. class_sched.py: Manifest constants
6.4. class_sched.py: Specification functions
6.5. acadYearToCal(): Convert academic year to calendar year
6.6. semesterName(): Produce the full semester name
6.7. class_sched.py: class ClassSchedule
6.8. class_sched.py: ClassSchedule.__init__()
6.9. class_sched.py: ClassSchedule._getTimestamp(): Retrieve and translate the timestamp
6.10. class_sched.py: ClassSchedule.lookupDeptName()
6.11. class_sched.py: ClassSchedule.genDeptCodes()
6.12. class_sched.py: ClassSchedule.genSemesters()
6.13. class_sched.py: ClassSchedule._byDate(): Order semesters chronologically
6.14. class_sched.py: ClassSchedule._semesterKey(): Extract the composite key for a semester
6.15. class_sched.py: ClassSchedule.lookupSemester()
6.16. class_sched.py: class SemesterSchedule
6.17. class_sched.py: SemesterSchedule.__init__()
6.18. class_sched.py: SemesterSchedule.lookupDeptName()
6.19. class_sched.py: SemesterSchedule.lookupDept()
6.20. class_sched.py: SemesterSchedule.genDepts()
6.21. class_sched.py: SemesterSchedule.lookupCrn()
6.22. class_sched.py: class DeptSchedule
6.23. class_sched.py: DeptSchedule.__init__()
6.24. class_sched.py: DeptSchedule.lookupCourse()
6.25. class_sched.py: DeptSchedule.genCourses()
6.26. class_sched.py: DeptSchedule.lookupCrn()
6.27. class_sched.py: class CourseSchedule
6.28. class_sched.py: CourseSchedule.__init__()
6.29. class_sched.py: CourseSchedule.lookupCrn()
6.30. class_sched.py: CourseSchedule.genSections()
6.31. class_sched.py: class SectionSchedule
6.32. class_sched.py: SectionSchedule.__init__()
6.33. class_sched.py: SectionSchedule.genInstructors()
6.34. class_sched.py: class ZoneUtc
7. Version history

1. Screen-scraping the NMT class schedules

This document describes the implementation of the scripts that create the file documented in TCC Public Data Project: Class schedules. The general technique of extracting data from web pages is sometimes called web scraping or screen scraping.

Two scripts are presented here in lightweight literate programming form: the actual code is embedded in prose that explains how it works.

The project was developed using the Cleanroom software development protocol.
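
As a rough illustration of the screen-scraping technique named above, the following sketch parses an HTML table and republishes its rows as XML. It is not the pullsched code: it uses only the Python standard library rather than the tools the real scripts rely on, and the sample page, the class name TableScraper, the function scrape_to_xml(), and the element names schedule and section are all made up for this example. The actual layout of sched.xml is not assumed here.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row.

    Hypothetical helper for illustration; not part of pullsched.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # Completed rows, each a list of cell strings
        self._row = None    # Cells of the row being read, or None
        self._cell = None   # Text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None   # Discard any cell left unclosed


def scrape_to_xml(html):
    """Turn an HTML table into a small XML tree.

    The first table row is treated as column headings; every later
    row becomes a <section> element whose attributes are named after
    the headings. These element names are assumptions for the sketch.
    """
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    root = ET.Element("schedule")
    if not scraper.rows:
        return root
    header, *body = scraper.rows
    for row in body:
        section = ET.SubElement(root, "section")
        for name, value in zip(header, row):
            section.set(name.lower(), value)
    return root


# Invented sample data standing in for a downloaded schedule page.
SAMPLE_HTML = """
<table>
  <tr><th>CRN</th><th>Course</th><th>Title</th></tr>
  <tr><td>12345</td><td>CSE 113</td><td>Intro to Programming</td></tr>
  <tr><td>12346</td><td>MATH 131</td><td>Calculus I</td></tr>
</table>
"""

if __name__ == "__main__":
    root = scrape_to_xml(SAMPLE_HTML)
    print(ET.tostring(root, encoding="unicode"))
```

The sketch splits the work into two steps, reading the page into plain rows and then building the XML tree from those rows, so that a change in the page layout would affect only the first step.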