This function extracts all the textual content beneath a given element node. Here is the function header:
# - - - f i n d T e x t - - -
def findText ( node ):
    """Find all text and CDATA descendants of node.

      [ node is a DOM Element ->
          return all the text and CDATA descendants of node,
          converted to strings and concatenated ]
    """
This is not completely trivial because there may be more than just one DOM Text node child of a given programlisting element. The first complication is that, as we discussed above under Section 2, “Encoding the literate program”, there may be DocBook elements such as link and co embedded in the source text.
The author's first attempt used a clever XPath expression, "descendant::text()", to find all descendant text nodes and put them into a node-set. That expression finds text nodes even when they are embedded under child elements. Unfortunately, it did not return CDATA sections, which the DOM represents as a separate node type from ordinary Text nodes.
To pick up both regular text nodes and CDATA sections, we have to process all the children of the given node: each text node or CDATA section contributes its string to a list, and every other child node is processed recursively.
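To make the mix of node kinds concrete, here is a minimal sketch that parses a small sample and lists what the parent's children look like. It assumes Python's standard xml.dom.minidom module as the DOM implementation (the text above does not name one), and the sample markup is invented for illustration.

```python
from xml.dom import minidom

# Hypothetical sample: plain text, an embedded <link> element,
# and a CDATA section, all children of one programlisting element.
SAMPLE = (
    '<programlisting>x = 1 <link linkend="s2">see</link>'
    '<![CDATA[ if a < b: ]]></programlisting>'
)

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

# Every DOM node reports its kind through the standard .nodeName attribute.
child_names = [child.nodeName for child in root.childNodes]
print(child_names)   # ['#text', 'link', '#cdata-section']
```

The text inside the link element does not appear in this list because it is a grandchild, which is why the child elements have to be processed recursively.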
First we create an empty list to hold all the extracted strings.
    #-- 1 --
    result = []
Then we iterate over the children of the given node. The .childNodes attribute is a standard DOM attribute containing a list of all the node's children.
    #-- 2 --
    # [ result  +:=  all text and CDATA descendants of node,
    #       converted to strings ]
    for child in node.childNodes:
Inside the loop, we use the DOM Node interface's .nodeName attribute to check for text and CDATA nodes. If the child is neither, we recurse to process its descendants. The call to the built-in Python str() function is there because the DOM hands back Unicode strings, a complication we don't need; str() converts them to ordinary strings.
        #-- 2 loop --
        # [ child is a DOM node ->
        #     if child is a text node ->
        #       result  +:=  child's text as string
        #     else if child is a CDATA node ->
        #       result  +:=  child's text as string
        #     else ->
        #       result  +:=  text and CDATA descendants of child,
        #           concatenated ]
        if child.nodeName == '#text':
            result.append ( str(child.nodeValue) )
        elif child.nodeName == '#cdata-section':
            result.append ( str(child.data) )
        else:
            result.append ( findText ( child ) )
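The same test can also be written against the numeric .nodeType attribute instead of .nodeName. The sketch below is an alternative, not the method used in this program; it assumes xml.dom.minidom, and the helper name is_text_like is hypothetical.

```python
from xml.dom import Node, minidom

def is_text_like(child):
    """Hypothetical helper: True for DOM Text and CDATA section nodes."""
    return child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)

# Invented sample: a text node, an element, and a CDATA section.
doc = minidom.parseString('<p>a<b>c</b><![CDATA[d]]></p>')
flags = [is_text_like(child) for child in doc.documentElement.childNodes]
print(flags)   # [True, False, True]
```

Either form separates the two text-bearing node kinds from everything else; the .nodeName comparison used above has the advantage of reading directly as '#text' and '#cdata-section'.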
Finally, we assemble the resulting list into a single string and return it to the caller.
    #-- 3 --
    return "".join(result)
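For reference, the fragments above can be assembled and exercised as follows. This is a sketch under stated assumptions: it uses Python 3 and the standard xml.dom.minidom module (neither is specified in the text), and the sample markup is invented. Under Python 3 the str() calls are harmless no-ops, since the DOM values are already ordinary strings.

```python
from xml.dom import minidom

def findText(node):
    """Return all text and CDATA descendants of node, concatenated."""
    result = []
    for child in node.childNodes:
        if child.nodeName == '#text':
            result.append(str(child.nodeValue))
        elif child.nodeName == '#cdata-section':
            result.append(str(child.data))
        else:
            # Any other node kind: collect the text of its descendants.
            result.append(findText(child))
    return "".join(result)

# Hypothetical sample with an embedded <link> element and a CDATA section.
SAMPLE = (
    '<programlisting>x = <link linkend="s2">f</link>(1)'
    '<![CDATA[ # a < b]]></programlisting>'
)

doc = minidom.parseString(SAMPLE)
text = findText(doc.documentElement)
print(text)   # x = f(1) # a < b
```

The text under the link element and the CDATA content both end up in the result, in document order.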