4 XML Parsing for Java

Introduction to the XML Parsing for Java

Oracle XML parsing reads an XML document and uses DOM or SAX APIs to provide programmatic access to its content and structure. You can use parsing in validating or nonvalidating mode.

Prerequisites

This chapter assumes that you are familiar with the following technologies:

If you require a general introduction to the preceding technologies, consult the XML resources listed in "Related Documents" of the preface.

Standards and Specifications

The DOM Level 1, Level 2, and Level 3 specifications are W3C Recommendations. You can find links to the specifications for all three levels at the following URL:

http://www.w3.org/DOM/DOMTR

SAX is available in version 1.0, which is deprecated, and 2.0. It is not a W3C specification. You can find the documentation for SAX at the following URL:

http://www.saxproject.org/

XML Namespaces are a W3C Recommendation. You can find the specification at the following URL:

http://www.w3.org/TR/REC-xml-names

JCR 1.0 (also known as JSR 170) defines a standard Java API for applications to interact with content repositories.

JAXP version 1.2 includes an XSLT framework plus some updates to the parsing API to support DOM Level 2 and SAX version 2.0 and an improved scheme to locate pluggable implementations. JAXP provides support for XML schema and an XSLT compiler. You can access the JAXP specification at the following URL:

http://www.oracle.com/technetwork/java/index.html

See Also:

Chapter 31, "XDK Standards" for an account of the standards supported by the XDK

Large Node Handling

PL/SQL and Java APIs provide DOM stream access to XML nodes. Nodes in an XML document can now be far larger than 64 KB, so JPEG, Word, PDF, RTF, and HTML content can be stored more readily.

See Also:

Oracle XML DB Developer's Guide for complete details on the Java large node capabilities

XML Parsing in Java

XMLParser is the abstract base class for the XML parser for Java. An instantiated parser invokes the parse() method to read an XML document.

XMLDOMImplementation factory methods provide another way to parse binary XML and create a scalable DOM.

Figure 4-1 illustrates the basic parsing process, using XMLParser. The diagram does not apply to XMLDOMImplementation().

Figure 4-1 The XML Parser Process


The following APIs provide a Java application with access to a parsed XML document:

  • DOM API, which parses XML documents and builds a tree representation of the documents in memory. Use either a DOMParser object to parse with DOM or the XMLDOMImplementation interface factory methods to create a pluggable, scalable DOM.

  • SAX API, which processes an XML document as a stream of events, which means that a program cannot access random locations in a document. Use a SAXParser object to parse with SAX.

  • JAXP, which is a Java-specific API that supports DOM, SAX, and XSL. Use a DocumentBuilder or SAXParser object to parse with JAXP.

The sample XML document in Example 4-1 helps illustrate the differences among DOM, SAX, and JAXP.

Example 4-1 Sample XML Document

<?xml version="1.0"?>
  <EMPLIST>
    <EMP>
     <ENAME>MARY</ENAME>
    </EMP>
    <EMP>
     <ENAME>SCOTT</ENAME>
    </EMP>
  </EMPLIST>

DOM in XML Parsing

DOM builds an in-memory tree representation of the XML document. For example, the DOM API receives the document described in Example 4-1 and creates an in-memory tree as shown in Figure 4-2. DOM provides classes and methods to navigate and process the tree.
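The tree navigation described above can be sketched with the JAXP DocumentBuilder that ships with the JDK. This is a minimal illustration, not the XDK DOMParser: the class name DomTreeSketch and the method employeeNames() are hypothetical, and the document is an inline copy of Example 4-1.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomTreeSketch {
    // Inline copy of the document in Example 4-1 (whitespace removed).
    static final String EMPLIST_XML =
        "<?xml version=\"1.0\"?>"
        + "<EMPLIST><EMP><ENAME>MARY</ENAME></EMP>"
        + "<EMP><ENAME>SCOTT</ENAME></EMP></EMPLIST>";

    /** Builds the in-memory tree, then walks root -> EMP -> child elements. */
    static List<String> employeeNames(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            Element root = doc.getDocumentElement();
            for (Node emp = root.getFirstChild(); emp != null; emp = emp.getNextSibling()) {
                if (emp.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip text and comment nodes between elements
                }
                for (Node child = emp.getFirstChild(); child != null;
                        child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        names.add(child.getNodeName() + "=" + child.getTextContent());
                    }
                }
            }
            return names;
        } catch (Exception e) {
            // Wrapped so that callers of this sketch need no checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(employeeNames(EMPLIST_XML));
    }
}
```

Because the whole tree is in memory, the loop can move between siblings and children in any order, which is the random access that a stream-based API cannot offer.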

In general, the DOM API provides the following advantages:

  • DOM API is easier to use than SAX because it provides a familiar tree structure of objects.

  • Structural manipulations of the XML tree, such as re-ordering elements, adding to and deleting elements and attributes, and renaming elements, can be performed.

  • Interactive applications can store the object model in memory, enabling users to access and manipulate it.

  • DOM as a standard does not support XPath. However, most XPath implementations use DOM. The Oracle XDK includes DOM API extensions to support XPath.

  • A pluggable, scalable DOM can be created that considerably improves scalability and efficiency.
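As an illustration of XPath evaluated over a DOM tree, the following sketch uses the standard javax.xml.xpath package from the JDK. It does not use the XDK's DOM extensions; the class name DomXPathSketch and the method select() are hypothetical, and the input is an inline copy of Example 4-1.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomXPathSketch {
    static final String EMPLIST_XML =
        "<EMPLIST><EMP><ENAME>MARY</ENAME></EMP>"
        + "<EMP><ENAME>SCOTT</ENAME></EMP></EMPLIST>";

    /** Parses the XML into a DOM tree and returns the text of every node the expression selects. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                values.add(hits.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(EMPLIST_XML, "/EMPLIST/EMP/ENAME"));
        System.out.println(select(EMPLIST_XML, "/EMPLIST/EMP[2]/ENAME"));
    }
}
```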

DOM Creation

In Java XDK, there are three ways to create a DOM:

  • Parse a document using DOMParser. This has been the traditional XDK approach.

  • Create a scalable DOM using XMLDOMImplementation factory methods.

  • Use an XMLDocument constructor. This is not a common solution in XDK.

Scalable DOM

With Oracle 11g Release 1 (11.1), XDK provides scalable, pluggable support for DOM. This relieves problems of memory inefficiency, limited scalability, and lack of control over the DOM configuration.

For the scalable DOM, the configuration and creation are mainly supported using the XMLDOMImplementation class.

These are important aspects of scalable DOM:

  • Plug-in data. An external XML representation can be used directly by the scalable DOM without replicating the XML in an internal representation.

    Scalable DOM is created on top of plug-in XML data through the InfosetReader and InfosetWriter abstract interfaces. XML data can be in different forms, such as binary XML, XMLType, and third-party DOM.

  • Transient nodes. DOM nodes are created lazily and may be freed if not in use.

  • Binary XML

    The scalable DOM can use binary XML as both input and output format. Scalable DOM can interact with the data in two ways:

    • Through the abstract InfosetReader and InfosetWriter interfaces. Users can (1) use the BinXML implementation of InfosetReader and InfosetWriter to read and write BinXML data, and (2) use other implementations supplied by the user to read and write in other forms of XML infoset.

    • Through an implementation of the InfosetReader and InfosetWriter adaptor for BinXMLStream.

Scalable DOM support consists of the following:

Pluggable DOM Support

Pluggable DOM is an XDK mechanism that enables you to split the DOM API from the data layer. The DOM API is separated from the data by the InfosetReader and InfosetWriter interfaces.

Using pluggable DOM, XML data can be easily moved from one processor to another.

The DOM API includes unified standard APIs on top of the data to support node access, navigation, update processes, and searching capability.

Lazy Materialization

Using the lazy materialization mechanism, XDK only creates nodes that are accessed and frees unused nodes from memory. Applications can process very large XML documents with improved scalability.

Configurable DOM Settings

DOM configurations can be made to suit different applications. You can configure the DOM with different access patterns such as read-only, streaming, transient update, and shadow copy, so that memory use and performance match the needs of your application.

SAX in the XML Parser

Unlike DOM, SAX is event-based, so it does not build in-memory tree representations of input documents. SAX processes the input document element by element and can report events and significant data to callback methods in the application. The XML document in Example 4-1 is parsed as a series of linear events as shown in Figure 4-2.
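The event sequence can be made concrete with a small handler that records each callback. This sketch uses the SAX parser bundled with the JDK through JAXP rather than the XDK SAXParser class; the class name SaxEventsSketch and the method events() are hypothetical, and the input is an inline copy of Example 4-1.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {
    static final String EMPLIST_XML =
        "<EMPLIST><EMP><ENAME>MARY</ENAME></EMP>"
        + "<EMP><ENAME>SCOTT</ENAME></EMP></EMPLIST>";

    /** Parses the XML and returns the callbacks in the order the parser reported them. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                log.add("start document");
            }

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                log.add("start element: " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several chunks;
                // this sketch assumes short values that arrive whole.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("characters: " + text);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                log.add("end element: " + qName);
            }

            @Override
            public void endDocument() {
                log.add("end document");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        events(EMPLIST_XML).forEach(System.out::println);
    }
}
```

No tree is retained: once a callback returns, the parser moves on, so the application keeps only what the handler chooses to store.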

In general, the SAX API provides the following advantages:

  • It is useful for search operations and other programs that do not need to manipulate an XML tree.

  • It does not consume significant memory resources.

  • It is faster than DOM when retrieving XML documents from a database.

Figure 4-2 Comparing DOM (Tree-Based) and SAX (Event-Based) APIs


JAXP in the XML Parser

The JAXP API enables you to plug in an implementation of the SAX or DOM parser. The SAX and DOM APIs provided in the Oracle XDK are examples of vendor-specific implementations supported by JAXP.

In general, the advantage of JAXP is that you can use it to write interoperable applications. If an application uses features available through JAXP, then it can very easily switch the implementation.

The main disadvantage of JAXP is that it runs more slowly than vendor-specific APIs. In addition, several features are available through Oracle-specific APIs that are not available through JAXP APIs. Only some of the Oracle-specific features are available through the extension mechanism provided in JAXP. If an application uses these extensions, however, then the flexibility of switching implementation is lost.
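The pluggability can be seen in how an application obtains its parser: it asks a JAXP factory and never names a vendor class. The sketch below reports which implementation the lookup selected in the current JVM. The class name JaxpLookupSketch is hypothetical. On a stock JDK the result is the bundled implementation; a different one can be selected through the standard javax.xml.parsers.DocumentBuilderFactory and javax.xml.parsers.SAXParserFactory system properties.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class JaxpLookupSketch {
    /** Name of the DOM factory class that the JAXP lookup chose. */
    static String domImplementation() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Name of the SAX factory class that the JAXP lookup chose. */
    static String saxImplementation() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Application code depends only on javax.xml.parsers, so changing
        // the implementation requires no source changes here.
        System.out.println("DOM factory: " + domImplementation());
        System.out.println("SAX factory: " + saxImplementation());
    }
}
```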

Namespace Support in the XML Parser

The XML parser for Java can parse unqualified element types and attribute names as well as those in namespaces. Namespaces are a mechanism to resolve or avoid name collisions between element types or attributes in XML documents by providing "universal" names. Consider the XML document shown in Example 4-2.

Example 4-2 Sample XML Document Without Namespaces

<?xml version='1.0'?>
<addresslist>
  <company>
    <address>500 Oracle Parkway,
             Redwood Shores, CA 94065
    </address>
  </company>
  <!-- ... -->
  <employee>
    <lastname>King</lastname>
    <address>3290 W Big Beaver
             Troy, MI 48084
    </address>
  </employee>
  <!-- ... -->
</addresslist>

Without the use of namespaces, an application processing the XML document in Example 4-2 does not know whether the <address> tag refers to a company or employee address. As shown in Example 4-3, you can use namespaces to distinguish the <address> tags. The example declares the following XML namespaces:

http://www.oracle.com/employee/
http://www.oracle.com/company/

Example 4-3 associates the com prefix with the company namespace and the emp prefix with the employee namespace. Thus, an application can distinguish <com:address> from <emp:address>.

Example 4-3 Sample XML Document with Namespaces

<?xml version='1.0'?>
<addresslist>
<!-- ... -->
  <com:company 
    xmlns:com="http://www.oracle.com/company/">
    <com:address>500 Oracle Parkway,
             Redwood Shores, CA 94065
    </com:address>
  </com:company>
  <!-- ... -->
  <emp:employee
    xmlns:emp="http://www.oracle.com/employee/">
    <emp:lastname>King</emp:lastname>
    <emp:address>3290 W Big Beaver
             Troy, MI 48084
    </emp:address>
  </emp:employee>
</addresslist>

It is helpful to remember the following terms when parsing documents that use namespaces:

  • Namespace prefix, which is the short name that an xmlns declaration binds to a namespace URI. In Example 4-3, emp and com are namespace prefixes.

  • Local name, which is the name of an element or attribute without the namespace prefix. In Example 4-3, employee and company are local names.

  • Qualified name, which is the local name plus the prefix. In Example 4-3, emp:employee and com:company are qualified names.

  • Namespace URI, which is the URI assigned to xmlns. In Example 4-3, http://www.oracle.com/employee/ and http://www.oracle.com/company/ are namespace URIs.

  • Expanded name, which is obtained by substituting the namespace URI for the namespace prefix. In Example 4-3, http://www.oracle.com/employee/:employee and http://www.oracle.com/company/:company are expanded element names.
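The terms above map directly onto methods of the standard org.w3c.dom.Node interface. The following sketch parses a shortened version of Example 4-3 with the JDK's namespace-aware JAXP parser and reports the prefix, local name, namespace URI, and qualified name of each address element. The class name NamespaceTermsSketch and the method describeAddresses() are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceTermsSketch {
    // Shortened version of Example 4-3, using the same namespace URIs.
    static final String ADDRESSES_XML =
        "<addresslist>"
        + "<com:company xmlns:com=\"http://www.oracle.com/company/\">"
        + "<com:address>500 Oracle Parkway</com:address></com:company>"
        + "<emp:employee xmlns:emp=\"http://www.oracle.com/employee/\">"
        + "<emp:address>3290 W Big Beaver</emp:address></emp:employee>"
        + "</addresslist>";

    /** Describes every element whose local name is "address", in any namespace. */
    static List<String> describeAddresses(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // JAXP factories are not namespace aware unless asked to be.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList hits = doc.getElementsByTagNameNS("*", "address");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node n = hits.item(i);
                lines.add("prefix=" + n.getPrefix()
                    + " local=" + n.getLocalName()
                    + " uri=" + n.getNamespaceURI()
                    + " qualified=" + n.getNodeName());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        describeAddresses(ADDRESSES_XML).forEach(System.out::println);
    }
}
```

Both elements share the local name address; only the namespace URI tells the company address apart from the employee address.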

Validation in the XML Parser

Applications invoke the parse() method to parse XML documents. Typically, applications invoke initialization and termination methods in association with the parse() method. You can use the setValidationMode() method defined in oracle.xml.parser.v2.XMLParser to set the parser mode to validating or nonvalidating.

By parsing an XML document according to the rules specified in a DTD or an XML schema, a validating XML parser determines whether the document conforms to the specified DTD or XML schema. If the XML document does conform, then the document is valid, which means that the structure of the document conforms to the DTD or schema rules. A nonvalidating parser checks for well-formedness only.
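The difference between the two modes can be sketched with the JDK's JAXP parser, where setValidating(true) plays a role comparable to DTD validation mode. This is an illustration under that assumption, not the XDK setValidationMode() API. The class name DtdValidationSketch, the method check(), and the inline DTD are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdValidationSketch {
    // Internal DTD subset: EMPLIST holds one or more EMP, each with one ENAME.
    static final String DTD =
        "<!DOCTYPE EMPLIST ["
        + "<!ELEMENT EMPLIST (EMP+)>"
        + "<!ELEMENT EMP (ENAME)>"
        + "<!ELEMENT ENAME (#PCDATA)>"
        + "]>";

    /** Returns the problems found; an empty list means the parse succeeded. */
    static List<String> check(String xml, boolean validating) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // Warnings are ignored in this sketch.
                }

                @Override
                public void error(SAXParseException e) {
                    // Validity violations are recoverable errors.
                    problems.add("invalid: " + e.getMessage());
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // well-formedness violations stop the parse
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            problems.add("not well-formed: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        String invalid = DTD + "<EMPLIST><EMP><SALARY>1</SALARY></EMP></EMPLIST>";
        System.out.println("validating:    " + check(invalid, true));
        System.out.println("nonvalidating: " + check(invalid, false));
    }
}
```

The same document produces validity errors in validating mode and none in nonvalidating mode, because the nonvalidating parse checks well-formedness only.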

Table 4-1 shows the flags that you can use in setValidationMode() to set the validation mode in the Oracle XDK parser.

Table 4-1 XML Parser for Java Validation Modes

| Name | Value | The XML Parser . . . |
|---|---|---|
| Nonvalidating mode | NONVALIDATING | Verifies that the XML is well-formed and parses the data. |
| DTD validating mode | DTD_VALIDATION | Verifies that the XML is well-formed and validates the XML data against the DTD. The DTD defined in the <!DOCTYPE> declaration must be relative to the location of the input XML document. |
| Schema validation mode | SCHEMA_VALIDATION | Validates the XML Document according to the XML schema specified for the document. |
| LAX validation mode | SCHEMA_LAX_VALIDATION | Tries to validate part or all of the instance document as long as it can find the schema definition. It does not raise an error if it cannot find the definition. See the sample program XSDLax.java in the schema directory. |
| Strict validation mode | SCHEMA_STRICT_VALIDATION | Tries to validate the whole instance document, raising errors if it cannot find the schema definition or if the instance does not conform to the definition. |
| Partial validation mode | PARTIAL_VALIDATION | Validates all or part of the input XML document according to the DTD, if present. If the DTD is not present, then the parser is set to nonvalidating mode. |
| Auto validation mode | AUTO_VALIDATION | Validates all or part of the input XML document according to the DTD or XML schema, if present. If neither is present, then the parser is set to nonvalidating mode. |


In addition to setting the validation mode with setValidationMode(), you can use the oracle.xml.parser.schema.XSDBuilder class to build an XML schema and then configure the parser to use it by invoking the XMLParser.setXMLSchema() method. In this case, the XML parser automatically sets the validation mode to SCHEMA_STRICT_VALIDATION and ignores the schemaLocation and noNamespaceSchemaLocation attributes. You can also change the validation mode to SCHEMA_LAX_VALIDATION. The XMLParser.setDoctype() method is a parallel method for DTDs, but unlike setXMLSchema() it does not alter the validation mode.
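Supplying a prebuilt schema to the parser, rather than relying on schemaLocation hints in the instance, has a counterpart in the standard javax.xml.validation API. The sketch below uses that JDK API only; it does not use XSDBuilder or setXMLSchema(). The class name SchemaValidationSketch, the method isValid(), and the inline schema are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class SchemaValidationSketch {
    // Schema for the EMPLIST document: one or more EMP, each with one ENAME.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"EMPLIST\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"EMP\" maxOccurs=\"unbounded\">"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name=\"ENAME\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Builds the schema once, then validates the instance document against it. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            // With no ErrorHandler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // schema violation or malformed input
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<EMPLIST><EMP><ENAME>MARY</ENAME></EMP></EMPLIST>"));
        System.out.println(isValid("<EMPLIST><EMP><SALARY>1</SALARY></EMP></EMPLIST>"));
    }
}
```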

Compression in the XML Parser

You can use the XML compressor, which is implemented in the XML parser, to compress and decompress XML documents. The compression algorithm is based on tokenizing the XML tags. The assumption is that any XML document repeats a number of tags, so tokenizing these tags gives considerable compression. The degree of compression depends on the type of document: the longer the tags and the smaller the text content, the better the compression.
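To show why tokenizing repeated tags saves space, the following toy encoder writes each tag name in full only the first time it appears and writes a one-byte token afterward. It is a sketch of the general idea only and is not the XDK compressed format: the class name TagTokenSketch, the record markers, and the byte layout are all hypothetical, attributes are ignored, and no decoder is included.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TagTokenSketch {
    // Hypothetical record markers for this toy format only.
    static final int NEW_TAG = 0, KNOWN_TAG = 1, TEXT = 2, END_TAG = 3;

    /** Encodes SAX events, replacing repeated tag names with one-byte tokens. */
    static byte[] tokenize(String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        Map<String, Integer> tokens = new HashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) throws SAXException {
                try {
                    Integer token = tokens.get(qName);
                    if (token == null) {
                        // First occurrence: store the name and assign the next token.
                        tokens.put(qName, tokens.size());
                        out.writeByte(NEW_TAG);
                        out.writeUTF(qName);
                    } else {
                        // Repeat: one byte is enough for up to 256 distinct tags.
                        out.writeByte(KNOWN_TAG);
                        out.writeByte(token);
                    }
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length)
                    throws SAXException {
                try {
                    out.writeByte(TEXT);
                    out.writeUTF(new String(ch, start, length));
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName)
                    throws SAXException {
                try {
                    out.writeByte(END_TAG); // the open tag is implied by nesting
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    /** Builds a document with long tags and short text, the favorable case. */
    static String sample(int employees) {
        StringBuilder sb = new StringBuilder("<EMPLOYEE_LIST>");
        for (int i = 0; i < employees; i++) {
            sb.append("<EMPLOYEE_NAME>A</EMPLOYEE_NAME>");
        }
        return sb.append("</EMPLOYEE_LIST>").toString();
    }

    public static void main(String[] args) {
        String xml = sample(50);
        System.out.println("text length:      " + xml.length());
        System.out.println("tokenized length: " + tokenize(xml).length);
    }
}
```

With long tags and short text the tokenized stream is much smaller than the text, which matches the stated dependence of the compression ratio on tag length and text content.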

The Oracle XML parser generates a binary compressed output from an in-memory DOM tree or SAX events generated from an XML document. Table 4-2 describes the two types of compression.

Table 4-2 XML Compression with DOM and SAX

| Type | Description | Compression APIs |
|---|---|---|
| DOM-based | The goal is to reduce the size of the XML document without losing the structural and hierarchical information of the DOM tree. The parser serializes an in-memory DOM tree, corresponding to a parsed XML document, and generates a compressed XML output stream. The serialized stream regenerates the DOM tree when read back. | Use the writeExternal() method to generate compressed XML and the readExternal() method to reconstruct it. The methods are in the oracle.xml.parser.v2.XMLDocument class. |
| SAX-based | The SAX parser generates a compressed stream when it parses an XML file. SAX events generated by the SAX parser are handled by the SAX compression utility, which generates a compressed binary stream. When the binary stream is read back, the SAX events are generated. | To generate compressed XML, instantiate oracle.xml.comp.CXMLHandlerBase by passing an output stream to the constructor. Pass the object to SAXParser.setContentHandler() and then execute the parse() method. Use the oracle.xml.comp.CXMLParser class to decompress the XML. Note: CXMLHandlerBase implements both SAX 1.0 and 2.0, but because 1.0 is deprecated, it is recommended that you use the 2.0 API. |


The compressed streams generated from DOM and SAX are compatible, that is, you can use the compressed stream generated from SAX to generate the DOM tree and vice versa. As with XML documents in general, you can store the compressed XML data output in the database as a BLOB.

When a program parses a large XML document and creates a DOM tree in memory, it can affect performance. You can compress an XML document into a binary stream by serializing the DOM tree. You can regenerate the DOM tree without validating the XML data in the compressed stream. You can treat the compressed stream as a serialized stream, but the data in the stream is more controlled and managed than the compression implemented by Java's default serialization.

Note:

Oracle Text cannot search a compressed XML document. Decompression reduces performance. If you are transferring files between client and server, then HTTP compression can be easier.

Using XML Parsing for Java: Overview

The fundamental component of any XML development is XML parsing. XML parsing for Java is a standalone XML component that parses an XML document (and possibly also a standalone DTD or XML Schema) so that your program can process it. This section contains the following topics:

Note:

You can use the parser with any supported JavaVMs. With Oracle9i or higher you can load the parser into the database and use the internal Oracle9i JVM. For other database versions, run the parser in an external JVM and connect to a database through JDBC.

Using the XML Parser for Java: Basic Process

Figure 4-3 shows how to use the XML parser in a typical XML processing application.

Figure 4-3 XML Parser for Java


The basic process of the application shown in Figure 4-3 is as follows:

  1. The DOM or SAX parser parses input XML documents. For example, the program can parse XML data documents, DTDs, XML schemas, and XSL stylesheets.

  2. If you implement a validating parser, then the processor attempts to validate the XML data document against any supplied DTDs or XML schemas.

Running the XML Parser Demo Programs

Demo programs for the XML parser for Java are included in $ORACLE_HOME/xdk/demo/java/parser. The demo programs are distributed among the subdirectories described in Table 4-3.

Table 4-3 Java Parser Demos

Directory Contents These programs ...

common

class.xml
DemoUtil.java
empl.xml
family.dtd
family.xml
iden.xsl
NSExample.xml
traversal.xml

Provide XML files and Java programs for general use with the XML parser. For example, you can use the XSLT stylesheet iden.xsl to achieve an identity transformation of the XML files. DemoUtil.java implements a helper method to create a URL from a file name. This method is used by many of the other demo programs.

comp

DOMCompression.java
DOMDeCompression.java
SAXCompression.java
SAXDeCompression.java
SampleSAXHandler.java
sample.xml
xml.ser

Illustrate DOM and SAX compression:

  • DOMCompression.java compresses a DOM tree.

  • DOMDeCompression.java reads back a DOM from a compressed stream.

  • SAXCompression.java compresses the output from a SAX parser.

  • SAXDeCompression.java regenerates SAX events from the compressed stream.

  • SampleSAXHandler.java illustrates use of a handler to handle the events thrown by the SAX DeCompressor.

dom

AutoDetectEncoding.java
DOM2Namespace.java
DOMNamespace.java
DOMRangeSample.java
DOMSample.java
EventSample.java
I18nSafeXMLFileWritingSample.java
NodeIteratorSample.java
ParseXMLFromString.java
TreeWalkerSample.java

Illustrate uses of the DOM API:

  • DOM2Namespace.java shows how to use DOM Level 2.0 APIs.

  • DOMNamespace.java shows how to use Namespace extensions to DOM APIs.

  • DOMRangeSample.java shows how to use DOM Range APIs.

  • DOMSample.java shows basic use of the DOM APIs.

  • EventSample.java shows how to use DOM Event APIs.

  • NodeIteratorSample.java shows how to use DOM Iterator APIs.

  • TreeWalkerSample.java shows how to use DOM TreeWalker APIs.

jaxp

JAXPExamples.java
age.xsl
general.xml
jaxpone.xml
jaxpone.xsl
jaxpthree.xsl
jaxptwo.xsl
oraContentHandler.java

Illustrate various uses of the JAXP API:

  • JAXPExamples.java provides a few examples of how to use the JAXP 1.1 API to run the Oracle engine.

  • oraContentHandler.java implements a SAX content handler. The program invokes methods such as startDocument(), endDocument(), startElement(), and endElement() when it recognizes an XML tag.

sax

SAX2Namespace.java
SAXNamespace.java
SAXSample.java
Tokenizer.java

Illustrate various uses of the SAX APIs:

  • SAX2Namespace.java shows how to use SAX 2.0.

  • SAXNamespace.java shows how to use namespace extensions to SAX APIs.

  • SAXSample.java shows basic use of the SAX APIs.

  • Tokenizer.java shows how to use the XMLToken interface APIs. The program implements the XMLToken interface, which must be registered with the setTokenHandler() method. A request for XML tokens is registered with the setToken() method. During tokenizing, the parser does not validate the document and does not include or read internal or external entities.

xslt

XSLSample.java
XSLSample2.java
match.xml
match.xsl
math.xml
math.xsl
number.xml
number.xsl
position.xml
position.xsl
reverse.xml
reverse.xsl
string.xml
string.xsl
style.txt
variable.xml
variable.xsl

Illustrate the transformation of documents with XSLT:

  • XSLSample.java shows how to use the XSL processing capabilities of the Oracle XML parser. It transforms an input XML document with a given input stylesheet. This demo builds the result of XSL transformations as a DocumentFragment and so does not support xsl:output features.

  • XSLSample2.java transforms an input XML document with a given input stylesheet. The demo streams the result of the XSL transformation and so supports xsl:output features.

See Also: "Running the XSLT Processor Demo Programs"


Documentation for how to compile and run the sample programs is located in the README. The basic steps are as follows:

  1. Change into the $ORACLE_HOME/xdk/demo/java/parser directory (UNIX) or %ORACLE_HOME%\xdk\demo\java\parser directory (Windows).

  2. Set up your environment as described in "Setting Up the Java XDK Environment".

  3. Change into each of the following subdirectories and run make (UNIX) or Make.bat (Windows) at the command line. For example:

    cd comp;make;cd ..
    cd jaxp;make;cd ..
    cd sax;make;cd ..
    cd dom;make;cd ..
    cd xslt;make;cd ..
    

    The make file compiles the source code in each directory, runs the programs, and writes the output for each program to a file with an *.out extension.

  4. Open the *.out files to view the output for the programs.

Using the XML Parser Command-Line Utility

The oraxml utility, which is located in $ORACLE_HOME/bin (UNIX) or %ORACLE_HOME%\bin (Windows), is a command-line interface that parses XML documents. It checks for both well-formedness and validity.

To use oraxml, ensure that the following are true:

  • Your CLASSPATH is set up as described in "Setting Up the Java XDK Environment". In particular, make sure your CLASSPATH environment variable points to the xmlparserv2.jar file.

  • Your PATH environment variable can find the Java interpreter that comes with the version of the JDK that you are using.

Table 4-4 lists the oraxml command-line options.

Table 4-4 oraxml Command-Line Options

| Option | Purpose |
|---|---|
| -help | Prints the help message |
| -version | Prints the release version |
| -novalidate fileName | Checks whether the input file is well-formed |
| -dtd fileName | Validates the input file with DTD Validation |
| -schema fileName | Validates the input file with Schema Validation |
| -log logfile | Writes the errors to the output log file |
| -comp fileName | Compresses the input XML file |
| -decomp fileName | Decompresses the input compressed file |
| -enc fileName | Prints the encoding of the input file |
| -warning | Shows warnings |


For example, change into the $ORACLE_HOME/xdk/demo/java/parser/common directory. You can validate the document family.xml against family.dtd by executing the following on the command line:

oraxml -dtd -enc family.xml

The output should appear as follows:

The encoding of the input file: UTF-8
The input XML file is parsed without errors using DTD validation mode.

Parsing XML with DOM

The W3C standard library org.w3c.dom defines the Document interface as well as interfaces for the components of a DOM. The Oracle XML parser includes the standard DOM APIs and is compliant with the W3C DOM recommendation. Along with org.w3c.dom, Oracle XML parsing includes classes that implement the DOM APIs and extend them to provide features such as printing document fragments and retrieving namespace information.

Using the DOM API

To implement DOM-based components in your XML application, you can use the following XDK classes:

  • oracle.xml.parser.v2.DOMParser. This class implements an XML 1.0 parser according to the W3C recommendation. Because DOMParser extends XMLParser, all methods of XMLParser are available to DOMParser.

  • oracle.xml.parser.v2.XMLDOMImplementation. This class contains factory methods used to create a scalable, pluggable DOM.

    For purposes of this discussion, DOMs created with the XMLDOMImplementation class are referred to as scalable or pluggable DOM.

You can also make use of the DOMNamespace and DOM2Namespace classes, which are sample programs included in $ORACLE_HOME/xdk/demo/java/parser/dom.

DOM Parser Architecture

Figure 4-4 is an architectural diagram of the DOM Parser.

Figure 4-4 Basic Architecture of the DOM Parser


Performing Basic DOM Parsing

The program DOMSample.java is provided to illustrate the basic steps for parsing an input XML document and accessing it through a DOM.

The program receives an XML file as input, parses it, and prints the elements and attributes in the DOM tree.

Each step refers to a table that lists the methods and interfaces you can use at that point.

  1. Create a DOMParser object by calling the DOMParser() constructor. You can use this parser to parse input XML data documents as well as DTDs. The following code fragment from DOMSample.java illustrates this technique:

    DOMParser parser = new DOMParser();
    
  2. Configure parser properties. See Table 4-5.

    The following code fragment from DOMSample.java specifies the error output stream, sets the validation mode to DTD validation, and enables warning messages:

    parser.setErrorStream(System.err);
    parser.setValidationMode(DOMParser.DTD_VALIDATION);
    parser.showWarnings(true);
    
  3. Parse the input XML document by invoking the parse() method. The program builds a tree of Node objects in memory.

    This code fragment from DOMSample.java shows how to parse the document identified by an instance of the java.net.URL class:

    parser.parse(url);
    

    Note that the XML input can be a file, string buffer, or URL. As illustrated by the following code fragment, DOMSample.java accepts a filename as a parameter and calls the createURL helper method to construct a URL object that can be passed to the parser:

    public class DOMSample
    {
       static public void main(String[] argv)
       {
          try
          {
             if (argv.length != 1)
             {
                // Must pass in the name of the XML file.
                System.err.println("Usage: java DOMSample filename");
                System.exit(1);
             }
             ...
             // Generate a URL from the filename.
             URL url = DemoUtil.createURL(argv[0]);
             ...
    
  4. Invoke getDocument() to obtain a handle to the root of the in-memory DOM tree, which is an XMLDocument object. You can use this handle to access every part of the parsed XML document. The XMLDocument class implements the interfaces shown in Table 4-6.

    This code fragment from DOMSample.java illustrates this technique:

    XMLDocument doc = parser.getDocument();
    
  5. Obtain and manipulate DOM nodes of the retrieved document by calling various XMLDocument methods. See Table 4-7.

    The following code fragment from DOMSample.java calls the program's printElements() and printElementAttributes() helper methods to print the elements and attributes of the DOM tree:

    System.out.print("The elements are: ");
    printElements(doc);
     
    System.out.println("The attributes of each element are: ");
    printElementAttributes(doc);
    

    The program implements the printElements() method, which calls getElementsByTagName() to obtain a list of all the elements in the DOM tree. It then loops through each item in the list and calls getNodeName() to print the name of each element:

    static void printElements(Document doc)
    {
       NodeList nl = doc.getElementsByTagName("*");
       Node n;
    
       for (int i=0; i<nl.getLength(); i++)
       {
          n = nl.item(i);
          System.out.print(n.getNodeName() + " ");
       }
     
       System.out.println();
    }
    

    The program implements the printElementAttributes() method, which calls Document.getElementsByTagName() to obtain a list of all the elements in the DOM tree. It then loops through each element in the list and calls Element.getAttributes() to obtain the list of attributes for the element. It then calls Node.getNodeName() to obtain the attribute name and Node.getNodeValue() to obtain the attribute value:

    static void printElementAttributes(Document doc)
    {
       NodeList nl = doc.getElementsByTagName("*");
       Element e;
       Node n;
       NamedNodeMap nnm;
     
       String attrname;
       String attrval;
       int i, len;
     
       len = nl.getLength();
    
       for (int j=0; j < len; j++)
       {
          e = (Element)nl.item(j);
          System.out.println(e.getTagName() + ":");
          nnm = e.getAttributes();
     
          if (nnm != null)
          {
             for (i=0; i<nnm.getLength(); i++)
             {
                n = nnm.item(i);
                attrname = n.getNodeName();
                attrval = n.getNodeValue();
                System.out.print(" " + attrname + " = " + attrval);
             }
          }
          System.out.println();
       }
    }
    
  6. Reset the parser state by invoking the reset() method. The parser is now ready to parse a new document.
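The six steps can also be tried end to end with the JAXP parser bundled in the JDK. This is a sketch under the assumption that the XDK is not on the classpath: DocumentBuilder stands in for DOMParser, and the class name DomWalkthroughSketch and the method describeAll() are hypothetical. It parses each document from a string, lists every element with its attributes, and resets the builder before the next parse.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalkthroughSketch {
    /** Parses each document with one reused builder and lists elements and attributes. */
    static List<String> describeAll(String... documents) {
        List<String> lines = new ArrayList<>();
        try {
            // Steps 1 and 2: create and configure the parser.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            for (String xml : documents) {
                // Steps 3 and 4: parse and obtain the document handle.
                Document doc = builder.parse(new InputSource(new StringReader(xml)));
                // Step 5: "*" matches every element in document order.
                NodeList all = doc.getElementsByTagName("*");
                for (int i = 0; i < all.getLength(); i++) {
                    Element e = (Element) all.item(i);
                    StringBuilder line = new StringBuilder(e.getTagName());
                    NamedNodeMap attrs = e.getAttributes();
                    for (int j = 0; j < attrs.getLength(); j++) {
                        Node a = attrs.item(j);
                        line.append(" ").append(a.getNodeName())
                            .append("=").append(a.getNodeValue());
                    }
                    lines.add(line.toString());
                }
                // Step 6: return the builder to its initial state before reuse.
                builder.reset();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        describeAll(
            "<EMPLIST><EMP id=\"1\"><ENAME>MARY</ENAME></EMP></EMPLIST>",
            "<DEPT name=\"SALES\"/>"
        ).forEach(System.out::println);
    }
}
```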

Useful Methods and Interfaces

The following tables provide useful methods and interfaces to use in creating an application such as the one just created in "Performing Basic DOM Parsing".

Table 4-5 lists useful configuration methods.

Table 4-5 DOMParser Configuration Methods

| Method | Use this method to . . . |
|---|---|
| setBaseURL() | Set the base URL for loading external entities and DTDs. Call this method if the XML document is an InputStream. |
| setDoctype() | Specify the DTD to use when parsing. |
| setErrorStream() | Create an output stream for the output of errors and warnings. |
| setPreserveWhitespace() | Instruct the parser to preserve the whitespace in the input XML document. |
| setValidationMode() | Set the validation mode of the parser. Table 4-1 describes the flags that you can use with this method. |
| showWarnings() | Specify whether the parser should print warnings. |
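The effect of a whitespace setting on the resulting tree can be sketched with standard JAXP, where setIgnoringElementContentWhitespace() is the nearest counterpart. It is not the XDK setPreserveWhitespace() method, and the two may differ in detail. The class name WhitespaceSketch and the method rootChildCount() are hypothetical. The JAXP setting relies on a DTD content model, so the sketch declares one inline and turns on validation together with the setting.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class WhitespaceSketch {
    // The DTD declares EMPLIST as element-only content, so the line breaks
    // and indentation between EMP elements are ignorable whitespace.
    static final String XML =
        "<!DOCTYPE EMPLIST [<!ELEMENT EMPLIST (EMP+)><!ELEMENT EMP (#PCDATA)>]>\n"
        + "<EMPLIST>\n  <EMP>MARY</EMP>\n  <EMP>SCOTT</EMP>\n</EMPLIST>";

    /** Counts the child nodes of the root element under the chosen setting. */
    static int rootChildCount(boolean dropIgnorableWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // JAXP documents that this setting requires validating mode.
            factory.setValidating(dropIgnorableWhitespace);
            factory.setIgnoringElementContentWhitespace(dropIgnorableWhitespace);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("whitespace kept:    " + rootChildCount(false));
        System.out.println("whitespace dropped: " + rootChildCount(true));
    }
}
```

When whitespace is kept, the text nodes between elements appear as children of the root, so code that walks child nodes must skip them or check node types.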


Table 4-6 lists the interfaces that the XMLDocument class implements.

Table 4-6 Some Interfaces Implemented by XMLDocument

| Interface | Defines . . . |
|---|---|
| org.w3c.dom.Node | A single node in the document tree and methods to access and process the node. |
| org.w3c.dom.Document | A Node that represents the entire XML document. |
| org.w3c.dom.Element | A Node that represents an XML element. |


Table 4-7 lists some useful methods for obtaining nodes.

Table 4-7 Useful XMLDocument Methods

Method Use this method to . . .

getAttributes()

Generate a NamedNodeMap containing the attributes of this node (if it is an element) or null otherwise.

getElementsByTagName()

Retrieve recursively all elements that match a given tag name under a certain level. This method supports the * tag, which matches any tag. Call getElementsByTagName("*") through the handle to the root of the document to generate a list of all elements in the document.

getExpandedName()

Obtain the expanded name of the element. This method is specified in the NSName interface.

getLocalName()

Obtain the local name for this element. If an element name is <E1:locn xmlns:E1="http://www.oracle.com/"/>, then locn is the local name.

getNamespaceURI()

Obtain the namespace URI of this node, or null if it is unspecified. If an element name is <E1:locn xmlns:E1="http://www.oracle.com/"/>, then http://www.oracle.com/ is the namespace URI.

getNodeName()

Obtain the name of a node in the DOM tree.

getNodeValue()

Obtain the value of this node, depending on its type. This method is in the Node interface.

getPrefix()

Obtain the namespace prefix for an element.

getQualifiedName()

Obtain the qualified name for an element. If an element name is <E1:locn xmlns:E1="http://www.oracle.com/"/>, then E1:locn is the qualified name.

getTagName()

Obtain the name of an element in the DOM tree.
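
The following is a minimal sketch of how several of the methods in Table 4-7 fit together: it lists every element in a document along with its attributes. It assumes the JDK's standard JAXP DocumentBuilder as a stand-in for the XDK DOMParser. The class name NodeMethodsSketch and the sample XML are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeMethodsSketch {

    // Parses XML text with the JDK's JAXP parser (an assumed stand-in for DOMParser).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Returns one line per element: the tag name followed by name=value pairs.
    static List<String> describeElements(Document doc) {
        List<String> lines = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // "*" matches any tag
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            StringBuilder sb = new StringBuilder(e.getTagName());
            NamedNodeMap attrs = e.getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                Node a = attrs.item(j);
                sb.append(' ').append(a.getNodeName())
                  .append('=').append(a.getNodeValue());
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        Document doc = parse("<order id=\"7\"><item sku=\"A1\"/><item sku=\"B2\"/></order>");
        for (String line : describeElements(doc)) {
            System.out.println(line);
        }
    }
}
```

Because the call is made on the document handle, the list includes the root element as well as its descendants.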


Creating Scalable DOM

This section discusses how to create and use a scalable DOM.

This section contains the following topics:

Using Pluggable DOM

Pluggable DOM has the DOM API split from the data. The underlying data can be either internal or plug-in, and both can be in binary XML.

  • Internal Data

    To plug in internal data (XML text that has not been parsed), the XML text must be saved as binary XML and then parsed by the DOMParser. The parsed binary XML can then be plugged into the InfosetReader of the DOM API layer.

    The InfosetReader argument is the interface to the underlying XML data.

  • Plug-in Data

    Plug-in data is data that has already been parsed and therefore can be transferred from one processor to another without requiring parsing.

To create a pluggable DOM, XML data is plugged in through the InfosetReader interface on an XMLDOMImplementation object, for example:

public Document createDocument(InfosetReader reader) throws DOMException

The InfosetReader API is implemented on top of the XDK BinXMLStream. Optional adaptors for other forms of XML data such as DOM4J, JDOM, or JDBC may also be supported. Users can also plug in their own implementations.

InfosetReader serves as the interface between the scalable DOM API layer and the underlying data. It is a generic, stream-based pull API that accesses XML data. The InfosetReader retrieves sequential events from the XML stream and queries the state and data from these events. In the following example, the XML data is scanned to retrieve the QNames and attributes of all elements:

InfosetReader reader;
while (reader.hasNext())
{
   reader.next();
   if (reader.getEventType() == START_ELEMENT)
   {
      QName name = reader.getQName();
      TypedAttributeList attrList = reader.getAttributeList();
   }
}

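
The same pull pattern can be tried without the XDK. The following sketch uses the JDK's StAX XMLStreamReader, which is a different API from InfosetReader but follows the same hasNext()/next() loop, to collect the name and attributes of each element. The class name PullScanSketch and the sample input are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullScanSketch {

    // Scans the XML text and returns "name attr=value ..." for each start tag.
    static List<String> scan(String xml) {
        List<String> found = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // next() advances to the following event and returns its type
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    StringBuilder sb = new StringBuilder(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        sb.append(' ').append(reader.getAttributeLocalName(i))
                          .append('=').append(reader.getAttributeValue(i));
                    }
                    found.add(sb.toString());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("scan failed", e);
        }
        return found;
    }

    public static void main(String[] args) {
        for (String s : scan("<catalog><book id=\"b1\"/><book id=\"b2\"/></catalog>")) {
            System.out.println(s);
        }
    }
}
```
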
InfosetReader Options

The InfosetReader interface supports the following functionality:

Copying (optional): To support shadow copy of a DOM across documents, you can create a new copy of an InfosetReader with the Clone method, which ensures thread safety. An InfosetReader obtained from BinXMLStream always supports this operation.

Moving Focus (optional): To support lazy materialization, an InfosetReader may be able to move focus to any location specified by an Offset.

if (reader.hasSeekSupport())
   reader.seek(offset);

InfosetWriter

InfosetWriter is an extension of the InfosetReader interface that supports data writing. XDK provides an implementation on top of binary XML. Users cannot modify this implementation.

Saving XML Text as Binary XML

To create a scalable DOM from XML text, you must save the XML text as binary XML, before you can run DOMParser on it. You can save the XML text as either of the following:

  • Binary XML

  • References to binary XML: You can save the section reference of binary XML instead of actual data, if you know that the data source is available for deserialization.

The following example illustrates how to save as binary XML.

XMLDocument doc;
InfosetWriter writer;
doc.save(writer, false);
writer.close();
 

To save as references to binary XML, pass true as the second argument to the save method.

Using Lazy Materialization

Using lazy materialization, you can plug in an empty DOM, which can pull in more data when needed and free nodes when they are no longer needed.

Pulling Data on Demand

The plug-in DOM architecture creates an empty DOM, which contains a single Document node as the root of the tree. The rest of the DOM tree can be expanded later if it is accessed. A node may have unexpanded child and sibling nodes, but its parent and ancestors are always expanded. Each node maintains the InfosetReader.Offset property of the next node so that the DOM can pull additional data to create the next node.

Depending on the access method type, DOM nodes may expand more than the set of nodes returned:

  • DOM Navigation

    The DOM navigation interface allows access to neighboring nodes such as first child, last child, parent, previous or next sibling. If node creation is needed, it is always done in document order.

  • ID Indexing

    A DTD or XML schema can specify nodes with the type ID. If the DOM supports ID indexing, those nodes can be directly retrieved using the index. In the case of scalable DOM, retrieval by index does not cause the expansion of all previous nodes, but their ancestor nodes are materialized.

  • XPath Expressions

    XPath evaluation can cause materialization of all intermediate nodes in memory. For example, the descendant axis '//' results in the expansion of the whole subtree, although some nodes might be released after evaluation.

Freeing Nodes When No Longer Needed

Scalable DOM supports either manual or automatic dereferencing of nodes:

  • Automatic Dereferencing Using Weak References

    To enable automatic dereferencing, set the PARTIAL_DOM attribute to Boolean.TRUE.

    Supporting DOM navigation requires adding cross references among nodes. In automatic dereferencing mode, some of the links are weak references, which can be freed during garbage collection.

    Node release is based on the importance of the links: Links to parent nodes cannot be dropped because ancestors provide context for in-scope namespaces and it is difficult to retrieve dropped parent nodes using streaming APIs such as InfosetReader.

    The scalable DOM always holds its parent and previous sibling strongly but holds its children and following sibling weakly. When the Java Virtual Machine frees the nodes, references to them are still available in the underlying data so they can be recreated if needed.

  • Manual Dereferencing

    To enable manual dereferencing, set the attribute PARTIAL_DOM to Boolean.FALSE and create the DOM with plug-in XML data.

    In this mode, the DOM depends on the application to explicitly dereference a document fragment from the whole tree. There are no weak references. This mode is recommended if an application processes the data in a deterministic order, because it avoids the extra overhead of repeatedly releasing and recreating nodes.

    To dereference a node from all other nodes, call freeNode() on it. For example:

    Element root = doc.getDocumentElement();
    Node item = root.getFirstChild();
    while (item != null)
    {
       processItem(item);
       Node tmp = item;
       item = item.getNextSibling();
       ((XMLNode)tmp).freeNode();
    }
    

    The freeNode call has no effect on a non-scalable DOM.

    Note that dereferencing nodes is different from removing nodes from a DOM tree. The DOM tree does not change when freeNode is called on a DOM node. The node can still be accessed and recreated from its parent, previous, and following siblings. However, accessing the node through a variable that still holds it throws an error after the node has been freed.
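
The link policy described for automatic dereferencing (strong links to the parent and previous sibling, weak links forward) can be pictured with a small conceptual sketch. This is not the XDK implementation: the class LazyNodeSketch, its String array standing in for the underlying binary XML, and the dropNext() method that simulates the garbage collector clearing a weak link are all hypothetical.

```java
import java.lang.ref.WeakReference;

class LazyNodeSketch {
    private final String[] data;                // stand-in for the underlying data
    private final int offset;                   // position of this node in the data
    private final LazyNodeSketch previous;      // strong link: never dropped
    private WeakReference<LazyNodeSketch> next; // weak link: may be cleared

    LazyNodeSketch(String[] data, int offset, LazyNodeSketch previous) {
        this.data = data;
        this.offset = offset;
        this.previous = previous;
    }

    String getName() {
        return data[offset];
    }

    LazyNodeSketch getPreviousSibling() {
        return previous;
    }

    // Returns the next sibling, recreating it from the data if the weak
    // link was never set or has been cleared.
    LazyNodeSketch getNextSibling() {
        LazyNodeSketch n = (next == null) ? null : next.get();
        if (n == null && offset + 1 < data.length) {
            n = new LazyNodeSketch(data, offset + 1, this);
            next = new WeakReference<>(n);
        }
        return n;
    }

    // Simulates the collector clearing the weak forward link.
    void dropNext() {
        if (next != null) {
            next.clear();
        }
    }

    public static void main(String[] args) {
        LazyNodeSketch first = new LazyNodeSketch(new String[] {"a", "b", "c"}, 0, null);
        LazyNodeSketch b = first.getNextSibling();
        first.dropNext();                        // forward link is lost
        LazyNodeSketch again = first.getNextSibling();
        System.out.println(again.getName() + " recreated: " + (again != b));
    }
}
```

The recreated node is a different object from the one that was dropped, which is why a variable that still holds a freed node cannot be assumed to match the node returned by later navigation.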

Using Shadow Copy

With shadow copy, the underlying data can be shared to avoid data replication.

Cloning, a common operation in XML processing, can be done lazily with pluggable DOM.

When the copy method is used, it creates just the root node of the fragment being copied, and the subtree can be expanded on demand.

Data sharing is for the underlying data, not the DOM nodes themselves. The DOM specification requires that the clone and its original have different node identities, and that they have different parent nodes.

Incorporating DOM Updates

The DOM API supports update operations such as adding and deleting nodes and setting, deleting, changing, and inserting values. When a DOM is created by plugging in XML data, the underlying data is considered external to the DOM. DOM updates are visible from the DOM APIs but the data source remains the same. Normal update operations are available and do not interfere with each other.

To make a modified DOM persistent, you must explicitly save the DOM. This merges all the changes with the original data and serializes the data in persistent storage. If you do not save a modified DOM explicitly, the changes are lost once the transaction ends.

Using the PageManager Interface to Support Internal Data

When XML text is parsed with DOMParser and configured to create a scalable DOM, internal data is cached in the form of binary XML, and the DOM API layer is built on top of the internal data. This provides increased scalability, because the binary XML is more compact than DOM nodes.

For additional scalability, the scalable DOM can use backend storage for binary data through the PageManager interface. Then, binary data can be swapped out of memory when not in use.

This code example illustrates how to use the PageManager interface.

DOMParser parser = new DOMParser();
parser.setAttribute(PARTIAL_DOM, Boolean.TRUE); //enable scalable DOM
parser.setAttribute(PAGE_MANAGER, new FilePageManager("pageFile"));
...
// DOMParser other configuration
parser.parse(fileURL);
XMLDocument doc = parser.getDocument();

If the PageManager interface is not used, then the parser caches the whole document as binary XML.

Using Configurable DOM Settings

When you create a DOM with the XMLDOMImplementation class, you can configure the DOM to suit different applications and achieve maximum efficiency, using the setAttribute method in the XMLDOMImplementation class.

public void setAttribute(String name, Object value) throws IllegalArgumentException

For scalable DOM, call setAttribute for the PARTIAL_DOM and ACCESS_MODE attributes.

Note:

New attribute values always affect the next DOM, not the current one, so an instance of XMLDOMImplementation can be used to create DOMs with different configurations.

  • PARTIAL_DOM

    This attribute indicates whether the DOM is scalable (partial). It takes a Boolean value. DOM creation is scalable when the attribute is set to TRUE; nodes that are not in use are freed and recreated when needed. DOM creation is not scalable when the attribute is set to FALSE or is not set.

  • ACCESS_MODE

    This attribute controls the access of the DOM and applies to both scalable DOM and non-scalable DOM. It has the following values:

    • UPDATEABLE

      The DOM supports all DOM update operations.

      UPDATEABLE is the default value for the ACCESS_MODE attribute, in order to maintain backward compatibility with the XDK DOM implementation.

    • READ_ONLY

      The DOM supports read operations only.

      Any attempt to modify the DOM tree results in an error, but node creation such as cloning is allowed, as long as the new nodes are not added to the DOM tree.

    • FORWARD_READ

      This value allows forward navigation, such as getFirstChild(), getNextSibling(), and getLastChild(), but not backward access, such as getPreviousSibling().

      FORWARD_READ can still access parent and ancestor nodes.

    • STREAM_READ

      DOM access is limited to the stream of nodes in Document Order, similar to SAX-event access.

      Following the concept of current node in stream mode, the current node is the last node that has been accessed in document order. Applications can hold nodes in variables and revisit them, but using the DOM method to access any node before the current node causes a DOM error. However, accessing ancestor nodes and attribute nodes is always allowed.

      The following illustrates the DOM behavior in stream mode:

      Node parent = currentNode.getParentNode(); // OK, although parent is before the current node
      
      Node child = parent.getFirstChild(); // Error if the current node is not the first child of parent
      
      Attribute attr = parent.getFirstAttribute(); // OK, accessing attributes from an Element is always allowed
      

    The following lists the access modes from less restrictive to more restrictive.

    UPDATEABLE > READ_ONLY > FORWARD_READ > STREAM_READ

Performance Advantages to Configurable DOM Settings

DOM cannot be modified in READ_ONLY mode, so the whole write buffer is not needed.

DOM does not read backward in FORWARD_READ mode, except to the ancestor node. Therefore, the previous sibling link is not created.

DOM only maintains parent links and does not need to remember data location for a node in STREAM_READ mode. Therefore, it does not need to recreate any node that has been freed.

Scalable DOM Applications

Here is an application that creates and uses a scalable, pluggable DOM:

XMLDOMImplementation domimpl = new XMLDOMImplementation();
domimpl.setAttribute(XMLDocument.SCALABLE_DOM, Boolean.TRUE);
domimpl.setAttribute(XMLDocument.ACCESS_MODE,XMLDocument.UPDATEABLE);
XMLDocument scalableDoc = (XMLDocument) domimpl.createDocument(reader);

Here is an application that creates and uses a scalable, pluggable DOM based on binary XML, which is described in Chapter 5, "Using Binary XML for Java":

BinXMLProcessor proc = BinXMLProcessorFactory.createProcessor();
BinXMLStream bstr = proc.createBinXMLStream();
BinXMLEncoder enc = bstr.getEncoder();
enc.setProperty(BinXMLEncoder.ENC_SCHEMA_AWARE, false);
 
SAXParser parser = new SAXParser();
parser.setContentHandler(enc.getContentHandler());
parser.setErrorHandler(enc.getErrorHandler());
parser.parse(BinXMLUtil.createURL(xmlfile));
 
BinXMLDecoder dec = bstr.getDecoder();
InfosetReader reader = dec.getReader();
XMLDOMImplementation domimpl = new XMLDOMImplementation();
domimpl.setAttribute(XMLDocument.SCALABLE_DOM, Boolean.TRUE);
XMLDocument currentDoc = (XMLDocument) domimpl.createDocument(reader);

Performing DOM Operations with Namespaces

The DOM2Namespace.java program illustrates a simple use of the parser and namespace extensions to the DOM APIs. The program receives an XML document, parses it, and prints the elements and attributes in the document.

The initial four steps of "Performing Basic DOM Parsing", from parser creation to the getDocument() call, are basically the same for DOM2Namespace.java. The principal difference is in printing the DOM tree, which is step 5. The DOM2Namespace.java program does the following instead:

// Print document elements
printElements(doc);
 
// Print document element attributes
System.out.println("The attributes of each element are: ");
printElementAttributes(doc);

The printElements() method implemented by DOM2Namespace.java calls getElementsByTagName() to obtain a list of all the elements in the DOM tree. It then loops through each item in the list and casts each Element to an nsElement. For each nsElement it calls nsElement.getPrefix() to get the namespace prefix, nsElement.getLocalName() to get the local name, and nsElement.getNamespaceURI() to get the namespace URI:

static void printElements(Document doc)
{
   NodeList nl = doc.getElementsByTagName("*");
   Element nsElement;
   String prefix;
   String localName;
   String nsName;

   System.out.println("The elements are: ");
   for (int i=0; i < nl.getLength(); i++)
   {
      nsElement = (Element)nl.item(i);
 
      prefix = nsElement.getPrefix();
      System.out.println("  ELEMENT Prefix Name :" + prefix);
 
      localName = nsElement.getLocalName();
      System.out.println("  ELEMENT Local Name    :" + localName);
 
      nsName = nsElement.getNamespaceURI();
      System.out.println("  ELEMENT Namespace     :" + nsName);
   } 
   System.out.println();
}

The printElementAttributes() method calls Document.getElementsByTagName() to obtain a NodeList of the elements in the DOM tree. It then loops through each element and calls Element.getAttributes() to obtain the list of attributes for the element as a special list called a NamedNodeMap. For each item in the attribute list it calls nsAttr.getPrefix() to get the namespace prefix, nsAttr.getLocalName() to get the local name, and nsAttr.getNodeValue() to obtain the value:

static void printElementAttributes(Document doc)
{
   NodeList nl = doc.getElementsByTagName("*");
   Element e;
   Attr nsAttr; 
   String attrpfx;
   String attrname;
   String attrval; 
   NamedNodeMap nnm;
   int i, len;
 
   len = nl.getLength();
 
   for (int j=0; j < len; j++)
   {
      e = (Element) nl.item(j);
      System.out.println(e.getTagName() + ":");
 
      nnm = e.getAttributes();
 
      if (nnm != null)
      {
         for (i=0; i < nnm.getLength(); i++)
         {
            nsAttr = (Attr) nnm.item(i);
 
            attrpfx = nsAttr.getPrefix();
            attrname = nsAttr.getLocalName();
            attrval = nsAttr.getNodeValue();
 
            System.out.println(" " + attrpfx + ":" + attrname + " = " 
                               + attrval);
         }
      }
      System.out.println();
   }
}
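
Note that getPrefix() returns null for a name that has no prefix, so concatenating the prefix and local name directly prints "null:" in front of unprefixed attributes. The following sketch shows one way to build the qualified name defensively. It assumes the JDK's JAXP DocumentBuilder in namespace-aware mode rather than the XDK DOMParser, and the class name NamespaceSketch and the sample attributes are hypothetical.

```java
import java.io.StringReader;
import java.util.SortedMap;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamespaceSketch {

    // Parses XML text with namespace processing turned on.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Builds "prefix:local", or just "local" when the node has no prefix.
    static String qualified(Node n) {
        String prefix = n.getPrefix();
        return (prefix == null) ? n.getLocalName() : prefix + ":" + n.getLocalName();
    }

    // Maps each attribute's qualified name to its value, sorted by name.
    static SortedMap<String, String> attributes(Element e) {
        SortedMap<String, String> out = new TreeMap<>();
        NamedNodeMap nnm = e.getAttributes();
        for (int i = 0; i < nnm.getLength(); i++) {
            Attr a = (Attr) nnm.item(i);
            out.put(qualified(a), a.getNodeValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Element root = parse(
            "<E1:locn xmlns:E1=\"http://www.oracle.com/\" E1:code=\"x\" plain=\"y\"/>")
            .getDocumentElement();
        System.out.println(qualified(root) + " in " + root.getNamespaceURI());
        System.out.println(attributes(root));
    }
}
```

Namespace declarations such as xmlns:E1 also appear in the attribute map, so an application that wants only ordinary attributes needs to filter them out.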

Performing DOM Operations with Events

The EventSample.java program shows how to register various events with an event listener. For example, if a node is added to a specified DOM element, an event is triggered, which causes the listener to print information about the event.

The program follows these steps:

  1. Instantiate an event listener. When a registered change triggers an event, this event is passed to the event listener, which handles it. The following code fragment from EventSample.java shows the implementation of the listener:

    eventlistener evtlist = new eventlistener();
    ...
    class eventlistener implements EventListener
    {
       public eventlistener(){}
       public void handleEvent(Event e)
       {
          String s = " Event "+e.getType()+" received " + "\n";
          s += " Event is cancelable :"+e.getCancelable()+"\n";
          s += " Event is bubbling event :"+e.getBubbles()+"\n";
          s += " The Target is " + ((Node)(e.getTarget())).getNodeName() + "\n\n";
          System.out.println(s);
       }
    }
    
  2. Instantiate a new XMLDocument and then call getImplementation() to retrieve a DOMImplementation object. You can call the hasFeature() method to determine which features are supported by this implementation. The following code fragment from EventSample.java illustrates this technique:

    XMLDocument doc1 = new XMLDocument();
    DOMImplementation impl = doc1.getImplementation();
     
    System.out.println("The impl supports Events "+
                       impl.hasFeature("Events", "2.0"));
    System.out.println("The impl supports Mutation Events "+
                       impl.hasFeature("MutationEvents", "2.0"));
    
  3. Register desired events with the listener. The following code fragment from EventSample.java registers three events on the document node:

    doc1.addEventListener("DOMNodeRemoved", evtlist, false);
    doc1.addEventListener("DOMNodeInserted", evtlist, false);
    doc1.addEventListener("DOMCharacterDataModified", evtlist, false);
    

    The following code fragment from EventSample.java creates a node of type XMLElement and then registers three events on this node:

    XMLElement el = (XMLElement)doc1.createElement("element");
    ...
    el.addEventListener("DOMNodeRemoved", evtlist, false);
    el.addEventListener("DOMNodeRemovedFromDocument", evtlist, false);
    el.addEventListener("DOMCharacterDataModified", evtlist, false);
    ...
    
  4. Perform actions that trigger events, which are then passed to the listener for handling. The following code fragment from EventSample.java illustrates this technique:

    att.setNodeValue("abc");
    el.appendChild(el1);
    el.appendChild(text);
    text.setNodeValue("xyz");
    doc1.removeChild(el);
    

Performing DOM Operations with Ranges

According to the W3C DOM specification, a range identifies a range of content in a Document, DocumentFragment, or Attr. It selects the content between a pair of boundary-points that correspond to the start and the end of the range. Table 4-8 describes useful range methods accessible through XMLDocument.

Table 4-8 Useful Methods in the Range Class

Method Description

cloneContents()

Duplicates the contents of a range

deleteContents()

Deletes the contents of a range

getCollapsed()

Returns TRUE if the range is collapsed

getEndContainer()

Obtains the node within which the range ends

getStartContainer()

Obtains the node within which the range begins

selectNode()

Selects a node and its contents

selectNodeContents()

Selects the contents within a node

setEnd()

Sets the attributes describing the end of a range

setStart()

Sets the attributes describing the beginning of a range


The DOMRangeSample.java program illustrates some of the things that you can do with ranges.

The initial four steps of "Performing Basic DOM Parsing", from parser creation to the getDocument() call, are the same for DOMRangeSample.java. The DOMRangeSample.java program then proceeds by following these steps:

  1. After calling getDocument() to create the XMLDocument, create a range object with createRange() and call setStart() and setEnd() to set its boundaries. The following code fragment from DOMRangeSample.java illustrates this technique:

    XMLDocument doc = parser.getDocument();
    ...
    Range r = (Range) doc.createRange();
    XMLNode c = (XMLNode) doc.getDocumentElement();
     
    // set the boundaries
    r.setStart(c,0);
    r.setEnd(c,1);
    
  2. Call XMLDocument methods to obtain information about the range and manipulate its contents. Table 4-8 describes useful methods. The following code fragment from DOMRangeSample.java selects the contents of the current node and prints it:

    r.selectNodeContents(c);
    System.out.println(r.toString());
    

    The following code fragment clones the contents of a range and prints them:

    XMLDocumentFragment df =(XMLDocumentFragment) r.cloneContents();
    df.print(System.out);
    

    The following code fragment obtains and prints the start and end containers for the range:

    c = (XMLNode) r.getStartContainer();
    System.out.println(c.getText());
    c = (XMLNode) r.getEndContainer();
    System.out.println(c.getText());
    

Only some of the features of the demo program are described in this section. For more detail, refer to the demo program itself.
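
The range operations described above can also be tried with the W3C interfaces alone. The following sketch assumes that the JDK's bundled DOM implementation supports the Range feature, so that the parsed Document can be cast to org.w3c.dom.ranges.DocumentRange; that cast stands in for calling createRange() on an XMLDocument. The class name RangeSketch and the sample XML are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.ranges.DocumentRange;
import org.w3c.dom.ranges.Range;
import org.xml.sax.InputSource;

class RangeSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Returns the text covered by the first 'count' children of the root element.
    static String textOfFirstChildren(Document doc, int count) {
        Node root = doc.getDocumentElement();
        Range r = ((DocumentRange) doc).createRange();
        r.setStart(root, 0);      // boundary before the first child
        r.setEnd(root, count);    // boundary after the count-th child
        return r.toString();
    }

    // Clones the contents of the root element and counts the copied children.
    static int clonedChildCount(Document doc) {
        Range r = ((DocumentRange) doc).createRange();
        r.selectNodeContents(doc.getDocumentElement());
        DocumentFragment copy = r.cloneContents();
        return copy.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b>one</b><c>two</c></a>");
        System.out.println(textOfFirstChildren(doc, 1));
        System.out.println(clonedChildCount(doc));
    }
}
```

Cloning the contents leaves the original tree unchanged, whereas deleteContents() would remove the selected nodes from the document.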

Performing DOM Operations with TreeWalker

The W3C DOM Level 2 Traversal and Range specification defines the NodeFilter and TreeWalker interfaces. The XDK includes implementations of these interfaces.

A node filter is an object that can filter out certain types of Node objects. For example, it can filter out entity reference nodes but accept element and attribute nodes. You create a node filter by implementing the NodeFilter interface; each candidate Node object is then passed to its acceptNode() method. Typically, the acceptNode() method implementation calls getNodeType() to obtain the type of the node, compares it to static variables such as ELEMENT_NODE, ATTRIBUTE_NODE, and so forth, and then returns one of the static fields in Table 4-9 based on what it finds.

Table 4-9 Static Fields in the NodeFilter Interface

Field Description

FILTER_ACCEPT

Accepts the node. Navigation methods defined for NodeIterator or TreeWalker will return this node.

FILTER_REJECT

Rejects the node. Navigation methods defined for NodeIterator or TreeWalker will not return this node. For TreeWalker, the children of this node will also be rejected. NodeIterators treat this as a synonym for FILTER_SKIP.

FILTER_SKIP

Skips this single node. Navigation methods defined for NodeIterator or TreeWalker will not return this node. For both NodeIterator and TreeWalker, the children of this node will still be considered.


You can use TreeWalker objects to traverse a document tree or subtree using the view of the document defined by their whatToShow flags and filters (if any). You can use the XMLDocument.createTreeWalker() method to create a TreeWalker object by specifying the following:

  • A root node for the tree

  • A flag that governs the type of nodes it should include in the logical view

  • A filter for filtering nodes

  • A flag that determines whether entity references and their descendants should be included

Table 4-10 describes useful methods in the org.w3c.dom.traversal.TreeWalker interface.

Table 4-10 Useful Methods in the TreeWalker Interface

Method Description

firstChild()

Moves the tree walker to the first visible child of the current node and returns the new node. If the current node has no visible children, then it returns null and retains the current node.

getRoot()

Obtains the root node of the tree walker as specified when it was created.

lastChild()

Moves the tree walker to the last visible child of the current node and returns the new node. If the current node has no visible children, then it returns null and retains the current node.

nextNode()

Moves the tree walker to the next visible node in document order relative to the current node and returns the new node.


The TreeWalkerSample.java program illustrates some of the things that you can do with node filters and tree traversals.

The initial four steps of "Performing Basic DOM Parsing", from parser creation to the getDocument() call, are the same for TreeWalkerSample.java. The TreeWalkerSample.java program then proceeds by following these steps:

  1. Create a node filter object. The acceptNode() method in the nf class, which implements the NodeFilter interface, invokes getNodeType() to obtain the type of node. The following code fragment from TreeWalkerSample.java illustrates this technique:

    NodeFilter n2 = new nf();
    ...
    class nf implements NodeFilter
    {
      public short acceptNode(Node node)
      {
        short type = node.getNodeType();
     
        if ((type == Node.ELEMENT_NODE) || (type == Node.ATTRIBUTE_NODE))
           return FILTER_ACCEPT;
        if ((type == Node.ENTITY_REFERENCE_NODE))
           return FILTER_REJECT;
        return FILTER_SKIP;
      }
    }
    
  2. Invoke the XMLDocument.createTreeWalker() method to create a tree walker. The following code fragment from TreeWalkerSample.java uses the root node of the XMLDocument as the root node of the tree walker and includes all nodes in the tree:

    XMLDocument doc = parser.getDocument();
    ...
    TreeWalker tw = doc.createTreeWalker(doc.getDocumentElement(),NodeFilter.SHOW_ALL,n2,true);
    
  3. Obtain the root element of the TreeWalker object. The following code fragment illustrates this technique:

    XMLNode nn = (XMLNode)tw.getRoot();
    
  4. Traverse the tree. The following code fragment illustrates how to walk the tree in document order by calling the TreeWalker.nextNode() method:

    while (nn != null)
    {
      System.out.println(nn.getNodeName() + " " + nn.getNodeValue());
      nn = (XMLNode)tw.nextNode();
    }
    

    The following code fragment illustrates how to walk the left depth of the tree by calling the firstChild() method (you can traverse the right depth of the tree by calling the lastChild() method):

     while (nn != null)
     {
       System.out.println(nn.getNodeName() + " " + nn.getNodeValue());
       nn = (XMLNode)tw.firstChild();
     }
    

Only some of the features of the demo program are described in this section. For more detail, refer to the demo program itself.
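
The difference between FILTER_REJECT and FILTER_SKIP in Table 4-9 is easiest to see by running the same walk with each value. The following sketch assumes the JDK's bundled DOM implementation, whose Document objects can be cast to org.w3c.dom.traversal.DocumentTraversal; that cast stands in for XMLDocument.createTreeWalker(). The class name TreeWalkerSketch and the sample XML are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

class TreeWalkerSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Walks all elements in document order, applying 'verdict' to elements
    // named 'filteredName' and accepting every other element.
    static List<String> walk(Document doc, final String filteredName, final short verdict) {
        NodeFilter filter = new NodeFilter() {
            public short acceptNode(Node n) {
                return n.getNodeName().equals(filteredName)
                        ? verdict : NodeFilter.FILTER_ACCEPT;
            }
        };
        TreeWalker tw = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, filter, true);
        List<String> names = new ArrayList<>();
        for (Node n = tw.getRoot(); n != null; n = tw.nextNode()) {
            names.add(n.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b><d/></b><c/></a>");
        System.out.println(walk(doc, "b", NodeFilter.FILTER_REJECT)); // d is hidden with b
        System.out.println(walk(doc, "b", NodeFilter.FILTER_SKIP));   // d is still visited
    }
}
```
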

Parsing XML with SAX

SAX is a standard interface for event-based XML parsing. This section contains the following topics:

Using the SAX API

The SAX API, which is available in versions 1.0 and 2.0, is a set of interfaces and classes. We can divide the API into the following categories:

  • Interfaces implemented by the Oracle XML parser.

  • Interfaces that you must implement in your application. The SAX 2.0 interfaces are listed in Table 4-11.

    Table 4-11 SAX2 Handler Interfaces

    Interface Description

    ContentHandler

    Receives notifications from the XML parser. The major event-handling methods are startDocument() and endDocument(), which the parser invokes at the beginning and end of the document, and startElement() and endElement(), which it invokes when it recognizes an XML tag. This interface also defines the methods characters() and processingInstruction(), which are invoked when the parser encounters the text in an XML element or an inline processing instruction.

    DeclHandler

    Receives notifications about DTD declarations in the XML document.

    DTDHandler

    Processes notations and unparsed (binary) entities.

    EntityResolver

    Needed to perform redirection of URIs in documents. The resolveEntity() method is invoked when the parser must identify data identified by a URI.

    ErrorHandler

    Handles parser errors. The parser invokes the methods error(), fatalError(), and warning() in response to various parsing errors.

    LexicalHandler

    Receives notifications about lexical information such as comments and CDATA section boundaries.


  • Standard SAX classes.

  • Additional Java classes in org.xml.sax.helpers. The SAX 2.0 helper classes are as follows:

    • AttributesImpl, which makes a persistent copy of an Attributes object

    • DefaultHandler, which is a base class with default implementations of the SAX2 handler interfaces listed in Table 4-11

    • LocatorImpl, which makes a persistent snapshot of a Locator's values at a specified point in the parse

    • NamespaceSupport, which adds support for XML namespaces

    • XMLFilterImpl, which is a base class used by applications that need to modify the stream of events

    • XMLReaderFactory, which supports loading SAX parsers dynamically

  • Demonstration classes in the nul package.

Figure 4-5 illustrates how to create a SAX parser and use it to parse an input document.

Figure 4-5 Using the SAXParser Class


The basic stages for parsing an input XML document with SAX are as follows:

  1. Create a SAXParser object and configure its properties (see Table 4-5 for useful property methods). For example, set the validation mode of the parser.

  2. Instantiate an event handler. The program should provide implementations of the handler interfaces in Table 4-11.

  3. Register the event handlers with the parser. You must register your event handlers with the parser so that it knows which methods to invoke when a given event occurs. Table 4-12 lists registration methods available in SAXParser.

    Table 4-12 SAXParser Methods for Registering Event Handlers

    Method Use this method to . . .

    setContentHandler()

    Register a content event handler with an application. The org.xml.sax.helpers.DefaultHandler class implements the org.xml.sax.ContentHandler interface. Applications can register a new or different handler in the middle of a parse; the SAX parser must begin using the new handler immediately.

    setDTDHandler()

    Register a DTD event handler. If the application does not register a DTD handler, all DTD events reported by the SAX parser are silently ignored. Applications may register a new or different handler in the middle of a parse; the SAX parser must begin using the new handler immediately.

    setErrorHandler()

    Register an error event handler with an application. If the application does not register an error handler, all error events reported by the SAX parser are silently ignored; however, normal processing may not continue. It is highly recommended that all SAX applications implement an error handler to avoid unexpected bugs. Applications may register a new or different handler in the middle of a parse; the SAX parser must begin using the new handler immediately.

    setEntityResolver()

    Register an entity resolver with an application. If the application does not register an entity resolver, the XMLReader performs its own default resolution. Applications may register a new or different resolver in the middle of a parse; the SAX parser must begin using the new resolver immediately.


  4. Parse the input document with the SAXParser.parse() method. All SAX interfaces are assumed to be synchronous: the parse method must not return until parsing is complete. Readers must wait for an event-handler callback to return before reporting the next event.

  5. When the SAXParser.parse() method is called, the parser invokes the callback methods implemented in the application as it encounters the corresponding events. The methods are defined by the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces implemented in the event handler. For example, the parser calls the startElement() method when it encounters a start tag.
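
The stages above can be sketched with the SAX classes that ship with the JDK. This is a minimal illustration, not the XDK sample: it assumes the JDK's built-in JAXP parser rather than oracle.xml.parser.v2.SAXParser, and the SaxRegistrationSketch and RecordingHandler names are hypothetical. The registration methods are the standard org.xml.sax.XMLReader methods that Table 4-12 describes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: uses the JDK parser, not the Oracle XDK SAXParser.
final class SaxRegistrationSketch {

    /** Records element names and refuses to ignore recoverable errors. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            names.add(qName);
        }

        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e; // DefaultHandler.error() does nothing unless overridden
        }
    }

    static List<String> elementNames(String xml) {
        try {
            // Stage 1: create the parser.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // Stage 2: instantiate the event handler.
            RecordingHandler handler = new RecordingHandler();

            // Stage 3: register the handler for each kind of event.
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.setDTDHandler(handler);
            reader.setEntityResolver(handler);

            // Stages 4 and 5: parse() returns only after all callbacks have run.
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<order><item/><item/></order>"));
    }
}
```

Because parse() is synchronous, the handler's list is complete as soon as the call returns, so no extra synchronization is needed in this single-threaded sketch.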

Performing Basic SAX Parsing

The SAXSample.java program illustrates the basic steps of SAX parsing. The SAXSample class extends HandlerBase. The program receives an XML file as input, parses it, and prints information about the contents of the file.

The program follows these steps:

  1. Store the Locator. The Locator associates a SAX event with a document location. The SAX parser provides location information to the application by passing a Locator instance to the setDocumentLocator() method in the content handler. The application can use the object to obtain the location of any other content handler event in the XML source document. The following code fragment from SAXSample.java illustrates this technique:

    Locator locator;
    
  2. Instantiate a new event handler. The following code fragment from SAXSample.java illustrates this technique:

    SAXSample sample = new SAXSample();
    
  3. Instantiate the SAX parser and configure it. The following code fragment from SAXSample.java sets the mode to DTD validation:

    Parser parser = new SAXParser();
    ((SAXParser)parser).setValidationMode(SAXParser.DTD_VALIDATION);
    
  4. Register event handlers with the SAX parser. You can use the registration methods in the SAXParser class, but you must implement the handler interfaces yourself. The following code fragment registers the handlers:

    parser.setDocumentHandler(sample);
    parser.setEntityResolver(sample);
    parser.setDTDHandler(sample);
    parser.setErrorHandler(sample);
    

    The following code shows some of the DocumentHandler interface implementation:

    public void setDocumentLocator (Locator locator)
    {
      System.out.println("SetDocumentLocator:");
      this.locator = locator;
    }
    public void startDocument()
    {
      System.out.println("StartDocument");
    }
    public void endDocument() throws SAXException
    {
      System.out.println("EndDocument");
    }
    public void startElement(String name, AttributeList atts)
                                                   throws SAXException
    {
      System.out.println("StartElement:"+name);
      for (int i=0;i<atts.getLength();i++)
      {
        String aname = atts.getName(i);
        String type = atts.getType(i);
        String value = atts.getValue(i); 
        System.out.println("   "+aname+"("+type+")"+"="+value);
      }  
    }
    ...
    

    The following code shows the EntityResolver interface implementation:

    public InputSource resolveEntity (String publicId, String systemId)
                          throws SAXException
    {
      System.out.println("ResolveEntity:"+publicId+" "+systemId);
      System.out.println("Locator:"+locator.getPublicId()+" "+locator.getSystemId()+
                        " "+locator.getLineNumber()+" "+locator.getColumnNumber());
      return null;
    }
    

    The following code shows the DTDHandler interface implementation:

    public void notationDecl (String name, String publicId, String systemId)
    {
      System.out.println("NotationDecl:"+name+" "+publicId+" "+systemId);
    }
    public void unparsedEntityDecl (String name, String publicId,
                                    String systemId, String notationName)
    {
      System.out.println("UnparsedEntityDecl:"+name + " "+publicId+" "+
                          systemId+" "+notationName);
    }
    

    The following code shows the ErrorHandler interface implementation:

    public void warning (SAXParseException e)
               throws SAXException
    {
      System.out.println("Warning:"+e.getMessage());
    }
    public void error (SAXParseException e)
               throws SAXException
    {
      throw new SAXException(e.getMessage());
    }
    public void fatalError (SAXParseException e)
              throws SAXException
    {
      System.out.println("Fatal error");
      throw new SAXException(e.getMessage());
    }
    
  5. Parse the input XML document. The following code fragment converts the document to a URL and then parses it:

    parser.parse(DemoUtil.createURL(argv[0]).toString());
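
The Locator technique from step 1 can be tried in isolation. The following is a small sketch under stated assumptions: it uses the SAX 2.0 DefaultHandler and the JDK parser instead of the SAX 1.0 HandlerBase that SAXSample.java extends, and the LocatorSketch and LineRecorder names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: stores the Locator and consults it in a later callback.
final class LocatorSketch {

    static final class LineRecorder extends DefaultHandler {
        private Locator locator;
        final List<String> events = new ArrayList<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            // The parser passes the Locator before it reports startDocument().
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // A parser is not required to supply a Locator, so guard against null.
            int line = (locator == null) ? -1 : locator.getLineNumber();
            events.add(qName + "@" + line);
        }
    }

    static List<String> elementLines(String xml) {
        try {
            LineRecorder handler = new LineRecorder();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementLines("<a>\n  <b/>\n</a>"));
    }
}
```

The Locator is valid only during the callbacks of the current parse, so the sketch reads the line number inside startElement() rather than saving the Locator for later use.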
    

Performing Basic SAX Parsing with Namespaces

This section discusses the SAX2Namespace.java sample program, which implements an event handler named XMLDefaultHandler as a subclass of the org.xml.sax.helpers.DefaultHandler class. The easiest way to implement the ContentHandler interface is to extend the org.xml.sax.helpers.DefaultHandler class. The DefaultHandler class provides some default behavior for handling events, although typically the behavior is to do nothing.

The SAX2Namespace.java program overrides methods for only the events that it cares about. Specifically, the XMLDefaultHandler class implements only two methods: startElement() and endElement(). The startElement event is triggered whenever SAXParser encounters a new element within the XML document. When this event is triggered, the startElement() method prints the namespace information for the element.

The SAX2Namespace.java sample program follows these steps:

  1. Instantiate a new event handler of type DefaultHandler. The following code fragment illustrates this technique:

    DefaultHandler defHandler = new XMLDefaultHandler();
    
  2. Create a SAX parser and set its validation mode. The following code fragment sets the mode to DTD validation:

    Parser parser = new SAXParser();
    ((SAXParser)parser).setValidationMode(SAXParser.DTD_VALIDATION);
    
  3. Register event handlers with the SAX parser. The following code fragment registers handlers for the input document, the DTD, entities, and errors:

    parser.setContentHandler(defHandler);
    parser.setEntityResolver(defHandler);
    parser.setDTDHandler(defHandler);
    parser.setErrorHandler(defHandler);
    

    The following code shows the XMLDefaultHandler implementation. The startElement() and endElement() methods print the qualified name, local name, and namespace URI for each element (refer to Table 4-7 for an explanation of these terms):

    class XMLDefaultHandler extends DefaultHandler
    {
       public XMLDefaultHandler(){}
       public void startElement(String uri, String localName,
                                String qName, Attributes atts)
       throws SAXException
       {
          System.out.println("ELEMENT Qualified Name:" + qName);
          System.out.println("ELEMENT Local Name    :" + localName);
          System.out.println("ELEMENT Namespace     :" + uri);
     
          for (int i=0; i<atts.getLength(); i++)
          {
             qName = atts.getQName(i);
             localName = atts.getLocalName(i);
             uri = atts.getURI(i);
     
             System.out.println(" ATTRIBUTE Qualified Name   :" + qName);
             System.out.println(" ATTRIBUTE Local Name       :" + localName);
             System.out.println(" ATTRIBUTE Namespace        :" + uri);
     
             // You can get the type and value of the attributes either
             // by index or by the Qualified Name.
     
             String type = atts.getType(qName);
             String value = atts.getValue(qName);
     
             System.out.println(" ATTRIBUTE Type             :" + type);
             System.out.println(" ATTRIBUTE Value            :" + value);
     
             System.out.println();
     
          }
       }
       public void endElement(String uri, String localName,
                              String qName) throws SAXException
       {
          System.out.println("ELEMENT Qualified Name:" + qName);
          System.out.println("ELEMENT Local Name    :" + localName);
          System.out.println("ELEMENT Namespace     :" + uri);
       }
    }
    
  4. Parse the input XML document. The following code fragment converts the document to a URL and then parses it:

    parser.parse(DemoUtil.createURL(argv[0]).toString());
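
To see the three name forms side by side without the XDK, the following sketch reports them for each element. It assumes the JDK's JAXP parser, where namespace processing must be switched on explicitly at the factory; the NamespaceSketch class and the urn:example:po namespace are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: namespace-aware SAX parsing with the JDK parser.
final class NamespaceSketch {

    static List<String> describeElements(String xml) {
        final List<String> seen = new ArrayList<>();
        // Override only the event of interest; DefaultHandler ignores the rest.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                seen.add(qName + " -> {" + uri + "}" + localName);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Without this call the JDK parser leaves uri and localName empty.
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return seen;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p:order xmlns:p=\"urn:example:po\"><item/></p:order>";
        describeElements(xml).forEach(System.out::println);
    }
}
```

An element without a prefix and without a default namespace declaration is reported with an empty namespace URI, which is why the second entry has nothing between the braces.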
    

Performing SAX Parsing with XMLTokenizer

You can create a simple SAX parser as an instance of the XMLTokenizer class and use the parser to tokenize the input XML. Table 4-13 lists useful methods in the class.

Table 4-13 XMLTokenizer Methods

Method Description

setToken()

Register a new token for the XML tokenizer.

setErrorStream()

Register an output stream for errors.

tokenize()

Tokenize the input XML.


SAX parsers with Tokenizer features must implement the XMLToken interface. The callback method for XMLToken is token(), which receives an XML token and its corresponding value and performs an action. For example, you can implement token() so that it prints the token name followed by the value of the token.

The Tokenizer.java program accepts an XML document as input, parses it, and prints a list of the XML tokens. The program implements a doParse() method that does the following:

  1. Create a URL from the input XML stream:

    URL url = DemoUtil.createURL(arg);
    
  2. Create an XMLTokenizer parser as follows:

    parser  = new XMLTokenizer ((XMLToken)new Tokenizer());
    
  3. Register an output error stream as follows:

    parser.setErrorStream  (System.out);
    
  4. Register tokens with the parser. The following code fragment from Tokenizer.java shows just some of the registered tokens:

    parser.setToken (STagName, true);
    parser.setToken (EmptyElemTag, true);
    parser.setToken (STag, true);
    parser.setToken (ETag, true);
    parser.setToken (ETagName, true);
    ...
    
  5. Tokenize the XML document as follows:

    parser.tokenize (url);
    

    The token() callback method determines the action to take when a particular token is encountered. The following code fragment from Tokenizer.java shows some of the implementation of this method:

    public void token (int token, String value)
    {
       switch (token)
       {
       case XMLToken.STag:
          System.out.println ("STag: " + value);
          break;
       case XMLToken.ETag:
          System.out.println ("ETag: " + value);
          break;
       case XMLToken.EmptyElemTag:
          System.out.println ("EmptyElemTag: " + value);
          break;
       case XMLToken.AttValue:
          System.out.println ("AttValue: " + value);
          break;
       ...
       default:
          break;
       }
    }
    

Parsing XML with JAXP

JAXP enables you to use the SAX and DOM parsers and the XSLT processor in your Java program. This section contains the following topics:

Using the JAXP API

The JAXP APIs, which are listed in Table 4-14, have an API structure consisting of abstract classes that provide a thin layer for parser pluggability. Oracle implemented JAXP based on the Sun Microsystems reference implementation.

Table 4-14 JAXP Packages

Package Description

javax.xml.parsers

Provides standard APIs for DOM 2.0 and SAX 2.0 parsers. The package contains vendor-neutral factory classes that give you a SAXParser and a DocumentBuilder. DocumentBuilder creates a DOM-compliant Document object.

javax.xml.transform

Defines the generic APIs for processing XML transformation and performing a transformation from a source to a result.

javax.xml.transform.dom

Provides DOM-specific transformation APIs.

javax.xml.transform.sax

Provides SAX2-specific transformation APIs.

javax.xml.transform.stream

Provides stream- and URI-specific transformation APIs.


Using the SAX API Through JAXP

You can rely on the factory design pattern to create new SAX parser engines with JAXP. Figure 4-6 illustrates the basic process.

Figure 4-6 SAX Parsing with JAXP


The basic steps for parsing with SAX through JAXP are as follows:

  1. Create a new SAX parser factory with the SAXParserFactory class.

  2. Configure the factory.

  3. Create a new SAX parser (SAXParser) object from the factory.

  4. Set the event handlers for the SAX parser.

  5. Parse the input XML documents.
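
As an illustration of the "configure the factory" step, the following sketch turns on DTD validation at the factory and collects validation errors in the handler. It is a minimal example that assumes whichever JAXP implementation the JDK selects by default; the JaxpSaxValidationSketch class and the sample documents are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: the five JAXP SAX steps with a validating factory.
final class JaxpSaxValidationSketch {

    static List<String> validationErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            // Step 1: create the factory.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Step 2: configure it; here, request DTD validation.
            factory.setValidating(true);
            // Step 3: create the parser from the factory.
            SAXParser parser = factory.newSAXParser();
            // Step 4: supply the event handler; validity errors arrive in error().
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
            };
            // Step 5: parse the input.
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return errors;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
        System.out.println(validationErrors(dtd + "<note><to>Ann</to></note>"));
        System.out.println(validationErrors(dtd + "<note><from>Ann</from></note>"));
    }
}
```

Validity errors are recoverable in SAX terms, so the parse continues after each one; only well-formedness problems reach fatalError() and abort the parse.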

Using the DOM API Through JAXP

You can rely on the factory design pattern to create new DOM document builder engines with JAXP. Figure 4-7 illustrates the basic process.

Figure 4-7 DOM Parsing with JAXP


The basic steps for parsing with DOM through JAXP are as follows:

  1. Create a new DOM parser factory with the DocumentBuilderFactory class.

  2. Configure the factory.

  3. Create a new DOM builder (DocumentBuilder) object from the factory.

  4. Set the error handler and entity resolver for the DOM builder.

  5. Parse the input XML documents.
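
The DOM steps can be sketched in the same way. The example below assumes the default JAXP implementation in the JDK; the JaxpDomSketch class is hypothetical, and the entity resolver that maps every external entity to empty content is an illustrative policy choice, not something the XDK samples do.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: the five JAXP DOM steps with the JDK parser.
final class JaxpDomSketch {

    static Document parse(String xml) {
        try {
            // Step 1: create the factory.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Step 2: configure it.
            factory.setNamespaceAware(true);
            factory.setIgnoringComments(true);
            // Step 3: create the builder.
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Step 4: set the error handler and the entity resolver.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat recoverable errors as failures
                }
            });
            // Illustrative policy: every external entity resolves to empty content.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            // Step 5: parse the input into a DOM tree.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<family><member>Sarah</member><member>Bob</member></family>");
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getElementsByTagName("member").getLength());
    }
}
```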

Transforming XML Through JAXP

The basic steps for transforming XML through JAXP are as follows:

  1. Create a new transformer factory. Use the TransformerFactory class.

  2. Configure the factory.

  3. Create a new transformer from the factory and specify an XSLT stylesheet.

  4. Configure the transformer.

  5. Transform the document.
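
A compact sketch of these five steps follows. It reads the document and the stylesheet from strings so that it runs without the jaxpone.xml and jaxpone.xsl demo files. The JaxpTransformSketch class, the stylesheet, and its greeting parameter are all hypothetical, and the default TransformerFactory of the JDK is assumed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: factory, configuration, transformer, parameters, transform.
final class JaxpTransformSketch {

    // Illustrative stylesheet that emits plain text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:param name='greeting'/>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select='$greeting'/>, <xsl:value-of select='/person/first'/>!"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml, String stylesheet, String greeting) {
        try {
            // Step 1: create the factory.
            TransformerFactory factory = TransformerFactory.newInstance();
            // Step 2: configure the factory.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Step 3: create a transformer bound to the stylesheet.
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
            // Step 4: configure the transformer, here with a stylesheet parameter.
            transformer.setParameter("greeting", greeting);
            // Step 5: transform the source into the result.
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<person><first>Gianfranco</first><last>Pietraforte</last></person>";
        System.out.println(transform(xml, STYLESHEET, "Hello"));
    }
}
```

A Transformer is not thread-safe. When the same stylesheet is applied many times, compiling it once with TransformerFactory.newTemplates() and creating a Transformer per use is the usual alternative.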

Parsing with JAXP

The JAXPExamples.java program illustrates the basic steps of parsing with JAXP. The program implements the following methods and uses them to parse and perform additional processing on XML files in the /jaxp directory:

  • basic()

  • identity()

  • namespaceURI()

  • templatesHandler()

  • contentHandler2contentHandler()

  • contentHandler2DOM()

  • reader()

  • xmlFilter()

  • xmlFilterChain()

The program creates URLs for the jaxpone.xml and jaxpone.xsl sample XML files and then calls the preceding methods in sequence. The basic design of the demo is as follows (to save space only the basic() method is shown):

public class JAXPExamples
{
   public static void main(String argv[])
      throws TransformerException, TransformerConfigurationException,
             IOException, SAXException, ParserConfigurationException,
             FileNotFoundException
   {
      try {
         URL xmlURL = createURL("jaxpone.xml");
         String xmlID = xmlURL.toString();
         URL xslURL = createURL("jaxpone.xsl");
         String xslID = xslURL.toString();
         //
         System.out.println("--- basic ---");
         basic(xmlID, xslID);
         System.out.println();
         ...
      } catch(Exception err) {
         err.printStackTrace();
      }
   }
   //
   public static void basic(String xmlID, String xslID)
      throws TransformerException, TransformerConfigurationException
   {
      TransformerFactory tfactory = TransformerFactory.newInstance();
      Transformer transformer = tfactory.newTransformer(new StreamSource(xslID));
      StreamSource source = new StreamSource(xmlID);
      transformer.transform(source, new StreamResult(System.out));
   }
...
}

The reader() method in the JAXPExamples.java program shows a simple technique for parsing an XML document with SAX. It follows these steps:

  1. Create a new instance of a TransformerFactory and then cast it to a SAXTransformerFactory. The application can use the SAX factory to configure and obtain SAX parser instances. For example:

    TransformerFactory tfactory = TransformerFactory.newInstance();
    SAXTransformerFactory stfactory = (SAXTransformerFactory)tfactory;
    
  2. Create an XML reader by creating a StreamSource object from a stylesheet and passing it to the factory method newXMLFilter(). This method returns an XMLFilter object that uses the specified Source as the transformation instructions. For example:

    URL xslURL = createURL("jaxpone.xsl");
    String xslID = xslURL.toString();
    ...
    StreamSource streamSource = new StreamSource(xslID);
    XMLReader reader = stfactory.newXMLFilter(streamSource);
    
  3. Create a content handler and register it with the XML reader. The following example creates an instance of the class oraContentHandler, which is created by compiling the oraContentHandler.java program in the demo directory:

    ContentHandler contentHandler = new oraContentHandler();
    reader.setContentHandler(contentHandler);
    

    The following code fragment shows some of the implementation of the oraContentHandler class:

    public class oraContentHandler implements ContentHandler
    {
       private static final String TRADE_MARK = "Oracle 9i ";
     
       public void setDocumentLocator(Locator locator)
       {
          System.out.println(TRADE_MARK + "- setDocumentLocator");
       }
     
       public void startDocument()
          throws SAXException
       {
          System.out.println(TRADE_MARK + "- startDocument");
       }
     
       public void endDocument()
          throws SAXException
       {
          System.out.println(TRADE_MARK + "- endDocument");
       }
       ...
    
  4. Parse the input XML document by passing the InputSource to the XMLReader.parse() method. For example:

    InputSource is = new InputSource(xmlID);
    reader.parse(is);
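
The XMLFilter technique can be sketched end to end as follows. The content handler in this example receives the SAX events of the transformed document, not of the input. The sketch assumes the JDK's default TransformerFactory, which supports the SAXTransformerFactory.FEATURE_XMLFILTER feature; the XmlFilterSketch class and its stylesheet are hypothetical. It also sets the filter's parent reader explicitly instead of relying on the implementation to create one.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a stylesheet wrapped as an XMLFilter in front of a parser.
final class XmlFilterSketch {

    // Illustrative stylesheet: wraps the text of <greeting> in an <out> element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><out><xsl:value-of select='/greeting'/></out></xsl:template>"
      + "</xsl:stylesheet>";

    static String filter(String xml, String stylesheet) {
        try {
            TransformerFactory tfactory = TransformerFactory.newInstance();
            if (!tfactory.getFeature(SAXTransformerFactory.FEATURE_XMLFILTER)) {
                throw new IllegalStateException("XMLFilter is not supported");
            }
            SAXTransformerFactory stfactory = (SAXTransformerFactory) tfactory;
            XMLFilter filter =
                stfactory.newXMLFilter(new StreamSource(new StringReader(stylesheet)));

            // The filter reads its input events from a namespace-aware parent reader.
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            filter.setParent(spf.newSAXParser().getXMLReader());

            // Rebuild a simple text form of the events that leave the filter.
            final StringBuilder seen = new StringBuilder();
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    seen.append('<').append(qName).append('>');
                }
                @Override
                public void endElement(String uri, String localName, String qName) {
                    seen.append("</").append(qName).append('>');
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    seen.append(ch, start, length);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return seen.toString();
        } catch (TransformerConfigurationException | ParserConfigurationException
                 | SAXException | IOException e) {
            throw new IllegalArgumentException("Filtering failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(filter("<greeting>hello</greeting>", STYLESHEET));
    }
}
```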
    

Performing Basic Transformations with JAXP

You can use JAXP to transform any class of the interface Source into a class of the interface Result. Table 4-15 shows some sample transformations.

Table 4-15 Transforming Classes with JAXP

Use JAXP to transform this class . . . Into this class . . .

DOMSource

DOMResult

StreamSource

StreamResult

SAXSource

SAXResult


These transformations accept the following types of input:

  • XML documents

  • stylesheets

  • The ContentHandler class defined in oraContentHandler.java

For example, you can use the identity() method to perform a transformation in which the output XML document is the same as the input XML document. You can use the xmlFilterChain() method to apply three stylesheets in a chain.
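
An identity transformation needs no stylesheet: a Transformer obtained from the no-argument newTransformer() method copies the source to the result unchanged. The following sketch, which is not the identity() method from JAXPExamples.java, uses that to move a document from a StreamSource to a DOMResult and then from a DOMSource to a StreamResult. The IdentityTransformSketch class is hypothetical and the JDK's default TransformerFactory is assumed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: identity transformations between Source and Result types.
final class IdentityTransformSketch {

    static String roundTrip(String xml) {
        try {
            // newTransformer() without a stylesheet yields the identity transformation.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // StreamSource -> DOMResult: builds a DOM tree from the text.
            DOMResult dom = new DOMResult();
            identity.transform(new StreamSource(new StringReader(xml)), dom);

            // DOMSource -> StreamResult: serializes the tree back to text.
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(dom.getNode()), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<a><b>text</b></a>"));
    }
}
```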

The basic() method shows how to perform a basic XSLT transformation. The method follows these steps:

  1. Create a new instance of a TransformerFactory. For example:

    TransformerFactory tfactory = TransformerFactory.newInstance();
    
  2. Create a new XSL transformer from the factory and specify the stylesheet to use for the transformation. The following example specifies the jaxpone.xsl stylesheet:

    URL xslURL = createURL("jaxpone.xsl");
    String xslID = xslURL.toString();
    . . .
    Transformer transformer = tfactory.newTransformer(new StreamSource(xslID));
    
  3. Set the stream source to the input XML document. The following fragment from the basic() method sets the stream source to jaxpone.xml:

    URL xmlURL = createURL("jaxpone.xml");
    String xmlID = xmlURL.toString();
    . . .
    StreamSource source = new StreamSource(xmlID);
    
  4. Transform the document from a StreamSource to a StreamResult. The following statement runs the transformation and writes the result to System.out:

    transformer.transform(source, new StreamResult(System.out));
    

Compressing XML

The Oracle XDK enables you to use SAX or DOM to parse XML and then write the parsed data to a compressed binary stream. You can then reverse the process and reconstruct the XML data. This section contains the following topics:

Compressing and Decompressing XML from DOM

The DOMCompression.java and DOMDeCompression.java programs illustrate the basic steps of DOM compression and decompression. The most important DOM compression methods are the following:

  • XMLDocument.writeExternal() saves the state of the object by creating a binary compressed stream with information about the object.

  • XMLDocument.readExternal() reads the information written in the compressed stream by the writeExternal() method and restores the object.

Compressing a DOM Object

The basic technique for serialization is to create an XMLDocument by parsing an XML document, initialize an ObjectOutputStream, and then call XMLDocument.writeExternal() to write the compressed stream.

The DOMCompression.java program follows these steps:

  1. Create a DOM parser, parse an input XML document, and obtain the DOM representation. This technique is described in "Performing Basic DOM Parsing". The following code fragment from DOMCompression.java illustrates this technique:

    public class DOMCompression
    {
       static OutputStream out = System.out;
       public static void main(String[] args)
       {
          XMLDocument doc = new XMLDocument();
          DOMParser parser = new DOMParser();
          try
          {
            parser.setValidationMode(XMLParser.SCHEMA_VALIDATION);
            parser.setPreserveWhitespace(false);
            parser.retainCDATASection(true);
            parser.parse(createURL(args[0]));
            doc = parser.getDocument();
            ...
    
  2. Create a FileOutputStream and wrap it in an ObjectOutputStream for serialization. The following code fragment creates the xml.ser output file:

    OutputStream os = new FileOutputStream("xml.ser");
    ObjectOutputStream oos = new ObjectOutputStream(os);
    
  3. Serialize the object to the file by calling XMLDocument.writeExternal(). This method saves the state of the object by creating a binary compressed stream with information about this object. The following statement illustrates this technique:

    doc.writeExternal(oos);
    

Decompressing a DOM Object

The basic technique for decompression is to create an ObjectInputStream object and then call XMLDocument.readExternal() to read the compressed stream. The DOMDeCompression.java program follows these steps:

  1. Create a file input stream for the compressed file and wrap it in an ObjectInputStream. The following code fragment from DOMDeCompression.java creates a FileInputStream from the compressed file created in the previous section:

    InputStream is;
    ObjectInputStream ois;
    ...
    is = new FileInputStream("xml.ser");
    ois = new ObjectInputStream(is);
    
  2. Create a new XML document object to contain the decompressed data. The following code fragment illustrates this technique:

    XMLDocument serializedDoc = null;
    serializedDoc = new XMLDocument();
    
  3. Read the compressed file by calling XMLDocument.readExternal(). The following code fragment reads the data and prints it to System.out:

    serializedDoc.readExternal(ois);
    serializedDoc.print(System.out);
    

Compressing and Decompressing XML from SAX

The SAXCompression.java program illustrates the basic steps of parsing a file with SAX, writing the compressed stream to a file, and then reading the serialized data from the file. The important classes are as follows:

  • CXMLHandlerBase is a SAX handler that compresses XML data based on SAX events. To use SAX compression, instantiate this class and register it with the SAX parser by calling setContentHandler() or, for SAX 1.0, Parser.setDocumentHandler().

  • CXMLParser is an XML parser that regenerates SAX events from a compressed stream.

Compressing a SAX Object

The basic technique for serialization is to register a CXMLHandlerBase handler with a SAX parser, initialize an ObjectOutputStream, and then parse the input XML. The SAXCompression.java program follows these steps:

  1. Create a FileOutputStream and wrap it in an ObjectOutputStream. The following code fragment from SAXCompression.java creates the xml.ser file:

    String compFile = "xml.ser";
    FileOutputStream outStream = new FileOutputStream(compFile);
    ObjectOutputStream out = new ObjectOutputStream(outStream);
    
  2. Create the SAX event handler. The CXMLHandlerBase class implements the ContentHandler, DTDHandler, EntityResolver, and ErrorHandler interfaces. The following code fragment illustrates this technique:

    CXMLHandlerBase cxml = new CXMLHandlerBase(out);
    
  3. Create the SAX parser. The following code fragment illustrates this technique:

    SAXParser parser = new SAXParser();
    
  4. Configure the SAX parser. The following code fragment sets the content handler and entity resolver, and also sets the validation mode:

    parser.setContentHandler(cxml);
    parser.setEntityResolver(cxml);
    parser.setValidationMode(XMLConstants.NONVALIDATING);
    

    Note that oracle.xml.comp.CXMLHandlerBase implements both DocumentHandler and ContentHandler interfaces, but use of the SAX 2.0 ContentHandler interface is preferred.

  5. Parse the XML. The program writes the serialized data to the ObjectOutputStream. The following code fragment illustrates this technique:

    parser.parse(url);
    

Decompressing a SAX Object

The basic technique for deserialization of a SAX object is to create a SAX compression parser with the CXMLParser class, set the content handler for the parser, and then parse the compressed stream.

The SAXDeCompression.java program follows these steps:

  1. Create a SAX event handler. The SampleSAXHandler.java program creates a handler for use by SAXDeCompression.java. The following code fragment from SAXDeCompression.java creates the handler object:

    SampleSAXHandler xmlHandler = new SampleSAXHandler();
    
  2. Create the SAX parser by instantiating the CXMLParser class. This class implements the regeneration of XML documents from a compressed stream by generating SAX events from them. The following code fragment illustrates this technique:

    CXMLParser parser = new CXMLParser();
    
  3. Set the event handler for the SAX parser. The following code fragment illustrates this technique:

    parser.setContentHandler(xmlHandler);
    
  4. Parse the compressed stream and generate the SAX events. The following code receives a filename from the command line and parses the XML:

    parser.parse(args[0]);
    

Tips and Techniques for Parsing XML

This section contains the following topics:

Extracting Node Values from a DOM Tree

You can use the selectNodes() method in the XMLNode class to extract content from a DOM tree or subtree based on the select patterns allowed by XSL. You can use the optional second parameter of selectNodes() to resolve namespace prefixes, that is, to return the expanded namespace URI when given a prefix. The XMLElement class implements NSResolver, so a reference to an XMLElement object can be sent as the second parameter. XMLElement resolves the prefixes based on the input document. You can use the NSResolver interface if you need to override the namespace definitions.

The sample code in Example 4-4 illustrates how to use selectNodes().

Example 4-4 Extracting Contents of a DOM Tree with selectNodes()

//
// selectNodesTest.java
//
import java.io.*;
import oracle.xml.parser.v2.*;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
 
public class selectNodesTest
{
  public static void main(String[] args)
    throws Exception
  {
    // supply an xpath expression
    String pattern = "/family/member/text()";
    // accept a filename on the command line
    // run the program with $ORACLE_HOME/xdk/demo/java/parser/common/family.xml
    String file    = args[0];
 
    if (args.length == 2)
      pattern = args[1];
 
    DOMParser dp = new DOMParser();
 
    dp.parse(DemoUtil.createURL(file));  // include createURL from DemoUtil
    XMLDocument xd = dp.getDocument();
    XMLElement element = (XMLElement) xd.getDocumentElement();
    NodeList nl = element.selectNodes(pattern, element);
    for (int i = 0; i < nl.getLength(); i++)
    {
      System.out.println(nl.item(i).getNodeValue());
    } // end for
  } // end main
} // end selectNodesTest

To test the program, create a file with the code in Example 4-4 and then compile it in the $ORACLE_HOME/xdk/demo/java/parser/common directory. Pass the filename family.xml to the program as a parameter to traverse the <family> tree. The output should be as follows:

% java selectNodesTest family.xml
Sarah
Bob
Joanne
Jim

Now run the following to determine the values of the memberid attributes of all <member> elements in the document:

% java selectNodesTest family.xml //member/@memberid
m1
m2
m3
m4
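
If the XDK classes are not on the classpath, a comparable selection can be made with the standard javax.xml.xpath API. The sketch below is an analog of Example 4-4, not a replacement for selectNodes(): the XPathSelectSketch class is hypothetical, and the two-member document in main() is invented for illustration because the contents of family.xml are not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch: selecting node values with the standard XPath API.
final class XPathSelectSketch {

    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // NODESET asks for every matching node, returned as a NodeList.
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Text and attribute nodes both expose their content as the node value.
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException e) {
            throw new IllegalArgumentException("Selection failed", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample document, not the family.xml demo file.
        String xml = "<family><member memberid=\"m1\">Sarah</member>"
                   + "<member memberid=\"m2\">Bob</member></family>";
        System.out.println(select(xml, "/family/member/text()"));
        System.out.println(select(xml, "//member/@memberid"));
    }
}
```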

Merging Documents with appendChild()

Suppose that you want to write a program so that a user can fill in a client-side Java form and obtain an XML document. Suppose that your Java program contains the following variables of type String:

String firstname = "Gianfranco";
String lastname = "Pietraforte";

You can use either of the following techniques to insert this information into an XML document:

  • Create an XML document in a string and then parse it. For example:

    String xml = "<person><first>"+firstname+"</first>"+
         "<last>"+lastname+"</last></person>";
    DOMParser d = new DOMParser();
    d.parse(new StringReader(xml));
    Document xmldoc = d.getDocument();
    
  • Use DOM APIs to construct an XML document, creating elements and then appending them to one another. For example:

    Document xmldoc = new XMLDocument();
    Element e1 = xmldoc.createElement("person");
    xmldoc.appendChild(e1);
    Element e2 = xmldoc.createElement("firstname");
    e1.appendChild(e2);
    Text t = xmldoc.createTextNode(firstname);
    e2.appendChild(t);
    

Note that you can only use the second technique on a single DOM tree. For example, suppose that you write the code snippet in Example 4-5.

Example 4-5 Incorrect Use of appendChild()

XMLDocument xmldoc1 = new XMLDocument();
XMLElement e1 = (XMLElement) xmldoc1.createElement("person");
XMLDocument xmldoc2 = new XMLDocument();
XMLElement e2 = (XMLElement) xmldoc2.createElement("firstname");
e1.appendChild(e2);  // fails: e2 belongs to xmldoc2

The preceding code raises a DOM exception of WRONG_DOCUMENT_ERR when calling XMLElement.appendChild() because the owner document of e1 is xmldoc1 whereas the owner of e2 is xmldoc2. The appendChild() method only works within a single tree, but the code in Example 4-5 uses two different trees.

You can use the XMLDocument.importNode() method, introduced in DOM 2, or the XMLDocument.adoptNode() method, introduced in DOM 3, to copy and paste a DOM document fragment or a DOM node across different XML documents. The importNode() method copies the node into the target document, whereas adoptNode() moves it. Uncommenting either of the commented lines in Example 4-6 performs this task.

Example 4-6 Merging Documents with appendChild

XMLDocument doc1 = new XMLDocument();
XMLElement element1 = (XMLElement) doc1.createElement("person");
XMLDocument doc2 = new XMLDocument();
XMLElement element2 = (XMLElement) doc2.createElement("firstname");
// element2 = (XMLElement) doc1.importNode(element2, true);
// element2 = (XMLElement) doc1.adoptNode(element2);
element1.appendChild(element2);
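
The same behavior can be observed with the standard org.w3c.dom interfaces. The following sketch assumes the DOM implementation bundled with the JDK rather than XMLDocument and XMLElement, and the ImportNodeSketch class is hypothetical. It first shows that a direct appendChild() across documents fails, and then merges the element by importing it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch: moving content between two DOM documents.
final class ImportNodeSketch {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    /** Returns true if appending a node owned by another document is rejected. */
    static boolean directAppendFails() {
        Element person = newDocument().createElement("person");
        Element first = newDocument().createElement("firstname");
        try {
            person.appendChild(first); // the two elements have different owners
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.WRONG_DOCUMENT_ERR;
        }
    }

    /** Builds <person> in one document and imports <firstname> from another. */
    static Document mergeWithImport(String firstName) {
        Document doc1 = newDocument();
        Element person = doc1.createElement("person");
        doc1.appendChild(person);

        Document doc2 = newDocument();
        Element first = doc2.createElement("firstname");
        first.setTextContent(firstName);

        // importNode() makes a deep copy owned by doc1; doc2 keeps the original.
        Node copy = doc1.importNode(first, true);
        person.appendChild(copy);
        return doc1;
    }

    public static void main(String[] args) {
        System.out.println("direct append rejected: " + directAppendFails());
        Document merged = mergeWithImport("Gianfranco");
        System.out.println(merged.getDocumentElement().getTextContent());
    }
}
```

Calling adoptNode() instead of importNode() would move the original element into doc1 without copying it, which avoids the duplicate when the source document is no longer needed.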

Parsing DTDs

This section discusses techniques for parsing DTDs. It contains the sections:

Loading External DTDs

If you call the DOMParser.parse() method to parse the XML Document as an InputStream, then use the DOMParser.setBaseURL() method to recognize external DTDs within your Java program. This method points to a location where the DTDs are exposed.

The following procedure describes how to load and parse a DTD:

  1. Load the DTD as an InputStream. For example, assume that you want to validate documents against the /mydir/my.dtd external DTD. You can use the following code:

    InputStream is = MyClass.class.getResourceAsStream("/mydir/my.dtd");
    

    This code opens ./mydir/my.dtd in the first relative location in the CLASSPATH where it can be found, including the JAR file if it is in the CLASSPATH.

  2. Create a DOM parser and set the validation mode. For example, use this code:

    DOMParser d = new DOMParser();
    d.setValidationMode(XMLParser.DTD_VALIDATION);
    
  3. Parse the DTD. The following example passes the InputStream object to the DOMParser.parseDTD() method:

    d.parseDTD(is, "rootelementname");
    
  4. Get the document type and then set it. The getDoctype() method obtains the DTD object and the setDoctype() method sets the DTD to use for parsing. The following example illustrates this technique:

    d.setDoctype(d.getDoctype());
    

    The following code demonstrates an alternative technique. You can invoke the parseDTD() method to parse a DTD file separately and get a DTD object:

    d.parseDTD(new FileReader("/mydir/my.dtd"), "rootelementname");
    DTD dtd = d.getDoctype();
    d.setDoctype(dtd);

  5. Parse the input XML document. For example, the following code parses mydoc.xml:

    d.parse("mydoc.xml");
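A comparable flow can be sketched with the standard JAXP API in the JDK. This is an analogue and not the Oracle procedure above: JAXP has no parseDTD() or setDoctype() method, so the sketch supplies the DTD through an EntityResolver instead. The DTD text, the person.dtd system identifier used by callers, and the class name are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class; validates a document against a DTD that the program supplies.
class ExternalDtdValidation {

    // Hypothetical DTD text; it could equally be read with getResourceAsStream().
    static final String DTD =
        "<!ELEMENT person (firstname)><!ELEMENT firstname (#PCDATA)>";

    // Returns true if the document is well formed and valid against DTD.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Hand the parser our DTD whenever it asks for an external entity.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader(DTD)));
            // Validation errors are only reported, not thrown, unless a handler rethrows them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The document passed to isValid() must carry a DOCTYPE declaration with a system identifier so that the parser requests the external DTD from the resolver.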
    

Caching DTDs with setDoctype

The XML parser for Java provides for DTD caching in validation and nonvalidation modes through the DOMParser.setDoctype() method. After you set the DTD with this method, the parser caches this DTD for further parsing. Note that DTD caching is optional and is not enabled automatically.

Assume that your program must parse several XML documents with the same DTD. After you parse the first XML document, you can obtain the DTD from the parser and set it as in the following example:

DOMParser parser = new DOMParser();
// ... parse the first XML document with parser.parse() ...
DTD dtd = parser.getDoctype();
parser.setDoctype(dtd);

The parser caches this DTD and uses it for parsing subsequent XML documents. Example 4-7 provides a more complete illustration of how you can invoke DOMParser.setDoctype() to cache the DTD.

Example 4-7 DTDSample.java

/**
 * DESCRIPTION
 * This program illustrates DTD caching.
 */

import java.net.URL;
import java.io.*;
import org.xml.sax.InputSource;
import oracle.xml.parser.v2.*;
 
public class DTDSample
{
   static public void main(String[] args)
   {
      try
      {
         if (args.length != 3)
         {
            System.err.println("Usage: java DTDSample dtd rootelement xmldoc");
            System.exit(1);
         }
 
         // Create a DOM parser
         DOMParser parser = new DOMParser();
 
         // Configure the parser
         parser.setErrorStream(System.out);
         parser.showWarnings(true);
 
        // Create a FileReader for the DTD file specified on the command
        // line and wrap it in an InputSource
        FileReader r = new FileReader(args[0]);
        InputSource inSource = new InputSource(r);
 
        // Create a URL from the command-line argument and use it to set the 
        // system identifier
        inSource.setSystemId(DemoUtil.createURL(args[0]).toString());
 
        // Parse the external DTD from the input source. The second argument is 
        // the name of the root element.
        parser.parseDTD(inSource, args[1]);
        DTD dtd = parser.getDoctype();
 
        // Create a FileReader object from the XML document specified on the
        // command line
        r = new FileReader(args[2]);
 
        // Wrap the FileReader in an InputSource, create a URL from the filename,
        // and set the system identifier
        inSource = new InputSource(r);
        inSource.setSystemId(DemoUtil.createURL(args[2]).toString());

        // ********************
        parser.setDoctype(dtd);
        // ********************

        parser.setValidationMode(DOMParser.DTD_VALIDATION);
       // parser.setAttribute(DOMParser.USE_DTD_ONLY_FOR_VALIDATION,Boolean.TRUE);
        parser.parse(inSource);
 
        // Obtain the DOM tree and print
        XMLDocument doc = parser.getDocument();
        doc.print(new PrintWriter(System.out));
 
      }
      catch (Exception e)
      {
         System.out.println(e.toString());
      }
   }
}

If the cached DTD object is used only for validation, then set the DOMParser.USE_DTD_ONLY_FOR_VALIDATION attribute. Otherwise, the XML parser copies the DTD object and adds it to the resulting DOM tree. You can set the attribute as follows:

parser.setAttribute(DOMParser.USE_DTD_ONLY_FOR_VALIDATION,Boolean.TRUE);

Handling Character Sets with the XML Parser

This section contains the following topics:

Detecting the Encoding of an XML File on the Operating System

When reading an XML file stored on the operating system, do not use the FileReader class. Instead, use the XML parser to detect the character encoding of the document automatically. Given a binary FileInputStream with no external encoding information, the parser automatically determines the character encoding based on the byte-order mark and encoding declaration of the XML document. You can parse any well-formed document in any supported encoding with the sample code in the AutoDetectEncoding.java demo. This demo is located in $ORACLE_HOME/xdk/demo/java/parser/dom.

Note:

Include the proper encoding declaration in your document according to the specification. setEncoding() cannot set the encoding for your input document. Rather, it is used with oracle.xml.parser.v2.XMLDocument to set the correct encoding for printing.
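The following minimal sketch illustrates encoding autodetection from a byte stream. It assumes the standard JAXP parser in the JDK instead of the Oracle DOMParser, and the class and method names are hypothetical. The same text is encoded two ways and parsed from raw bytes with no external encoding information.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical class; shows that a parser reading bytes decodes them itself.
class EncodingAutoDetect {

    // Encodes the document text with the given charset, then parses the raw bytes.
    // The parser is never told the charset; it relies on the byte-order mark and
    // the encoding declaration inside the document.
    static String rootText(String xml, Charset charset) {
        byte[] bytes = xml.getBytes(charset);
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement().getTextContent();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The approach works only when the encoding declaration in the document matches the charset that was actually used to produce the bytes.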

Detecting the Encoding of XML Stored in an NCLOB Column

Suppose that you load XML into an NCLOB column of a database using UTF-8 encoding. The XML contains two UTF-8 multibyte characters:

G(0xc2,0x82)otingen, Br(0xc3,0xbc)ck_W

You write a Java stored function that does the following:

  1. Uses the default connection object to connect to the database.

  2. Runs a SELECT query.

  3. Obtains the oracle.jdbc.OracleResultSet object.

  4. Calls the OracleResultSet.getCLOB() method.

  5. Calls the getAsciiStream() method on the CLOB object.

  6. Executes the following code to get the XML into a DOM object:

    DOMParser parser = new DOMParser();
    parser.setPreserveWhitespace(true);
    parser.parse(istr);   // istr is the stream returned by getAsciiStream()
    XMLDocument xmldoc = parser.getDocument();
    

The program throws an exception stating that the XML contains an invalid UTF-8 encoding even though the character (0xc2, 0x82) is valid UTF-8. The problem is that the character can be distorted when the program calls the getAsciiStream() method. To solve this problem, invoke the getUnicodeStream() or getBinaryStream() method instead of getAsciiStream(). If this technique does not work, then try to print the characters to make sure that they are not distorted before they are sent to the parser when you call DOMParser.parse(istr).

Writing an XML File in a Nondefault Encoding

You should not use the FileWriter class when writing XML files because it depends on the default character encoding of the runtime environment. The output file can suffer from a parsing error or data loss if the document contains characters that are not available in the default character encoding.

UTF-8 encoding is popular for XML documents, but UTF-8 is not usually the default file encoding of Java. Using a Java class in your program that assumes the default file encoding can cause problems. To avoid these problems, you can use the technique illustrated in the I18nSafeXMLFileWritingSample.java program in $ORACLE_HOME/xdk/demo/java/parser/dom.

Note that System.out.println() is not a reliable way to output special characters, because it converts them with the default character encoding. Instead, wrap the output stream in an encoding-aware writer such as OutputStreamWriter. You can construct an OutputStreamWriter with an explicit encoding and use the write(char[], int, int) method to print, as in the following example:

/* Java encoding string for ISO8859-1*/
OutputStreamWriter out = new OutputStreamWriter(System.out, "8859_1");
out.write(...);
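The effect of choosing the encoding explicitly can be sketched as follows. The class and method names are hypothetical, and the sketch writes to an in-memory byte array so that the resulting bytes can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical class; writes text through an OutputStreamWriter with an explicit charset.
class ExplicitEncodingWriter {

    // Returns the bytes produced when text is written in the given charset.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered characters into the sink.
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

For example, the character é (U+00E9) becomes the single byte 0xE9 in ISO-8859-1 but the two bytes 0xC3 0xA9 in UTF-8, which is why the encoding declared in the document must agree with the encoding passed to the writer.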

Working with XML in Strings

Currently, there is no method that can directly parse an XML document contained in a String. You need to convert the string into an InputStream or InputSource object before parsing.

One technique is to create a ByteArrayInputStream that uses the bytes in the string. For example, assume that xmlDoc is a reference to a string of XML. You can use the technique shown in Example 4-8 to convert the string to a byte array, convert the array to a ByteArrayInputStream, and then parse.

Example 4-8 Converting XML in a String

// create parser
DOMParser parser=new DOMParser();
// create XML document in a string
String xmlDoc =
       "<?xml version='1.0'?>"+
       "<hello>"+
       "  <world/>"+
       "</hello>";
// convert string to bytes to stream
byte aByteArr [] = xmlDoc.getBytes();
ByteArrayInputStream bais = new ByteArrayInputStream(aByteArr,0,aByteArr.length);
//  parse and obtain DOM tree
parser.parse(bais);
XMLDocument doc = parser.getDocument();

Suppose that you want to convert the XMLDocument object created in the previous code back to a string. You can perform this task by wrapping a StringWriter in a PrintWriter. The following example illustrates this technique:

StringWriter sw = new StringWriter();
PrintWriter  pw = new PrintWriter(sw);
doc.print(pw);
String YourDocInString = sw.toString();

ParseXMLFromString.java, which is located in $ORACLE_HOME/xdk/demo/java/parser/dom, is a complete program that creates an XML document as a string and parses it.
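As an alternative sketch with the standard JAXP API in the JDK, you can wrap the string in a StringReader, which avoids the conversion to bytes altogether, and serialize the tree back to a string with an identity Transformer. This is not the Oracle DOMParser or XMLDocument.print() technique shown above, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; converts between a String and a DOM tree.
class XmlStringRoundTrip {

    // Parses XML held in a String by way of a character stream.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes a DOM tree to a String, without the XML declaration.
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(sw));
            return sw.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```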

Parsing XML Documents with Accented Characters

Assume that an input XML file contains accented characters such as an é. Example 4-9 shows one way to parse an XML document with accented characters.

Example 4-9 Parsing a Document with Accented Characters

DOMParser parser=new DOMParser(); 
parser.setPreserveWhitespace(true); 
parser.setErrorStream(System.err); 
parser.setValidationMode(false); 
parser.showWarnings(true);
parser.parse (new FileInputStream(new File("file_with_accents.xml")));

When you attempt to parse the XML file, the parser can sometimes throw an "Invalid UTF-8 encoding" exception. If you explicitly set the encoding to UTF-8, or if you do not specify it at all, then the parser interprets the byte for an accented character, which has a value greater than 127, as the first byte of a UTF-8 multibyte sequence. If the subsequent bytes do not form a valid UTF-8 sequence, then you receive an error.

This error means that your XML editor did not save the file with UTF-8 encoding. For example, it may have saved it with ISO-8859-1 encoding. An encoding is the scheme used to write Unicode character numbers to disk as bytes. Adding the following XML declaration to the top of an XML document does not itself cause your editor to write out the bytes representing the file to disk with UTF-8 encoding:

<?xml version="1.0" encoding="UTF-8"?>

One solution is to write accented characters as hexadecimal or decimal character references within the XML document, for example, &#xd9;. If you prefer not to use this technique, however, then you can set the encoding declaration based on the character set that you were using when you created the XML file. For example, try setting the encoding to ISO-8859-1 (Western European, Latin-1) or to something different, depending on the tool or operating system you are using.
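The failure and both remedies can be reproduced with a short sketch. It assumes the standard JAXP parser in the JDK instead of the Oracle DOMParser, so the exact error message differs. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical class; parses raw bytes and reports whether decoding succeeded.
class DeclaredEncodingCheck {

    // Returns the text of the root element, or null if the bytes cannot be
    // parsed, for example because they are not valid in the declared encoding.
    static String rootTextOrNull(byte[] bytes) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement().getTextContent();
        } catch (SAXException | IOException e) {
            // Malformed byte sequences surface here as a parse or I/O failure.
            return null;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A document saved as ISO-8859-1 but declared as UTF-8 yields null, because the single byte for é is not a complete UTF-8 sequence. The same bytes parse correctly when the declaration says ISO-8859-1, and a pure ASCII document that uses the character reference &#xe9; parses under any of these encodings.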

Handling Special Characters in Tag Names

Special characters such as &, $, and # are not legal in tag names. For example, if a document names tags after companies, and if the document includes the tag <A&B>, then the parser issues an error about invalid characters.

If you are creating an XML document from scratch, then you can work around this problem by using only valid NameChars. For example, you can name the tag <A_B>, <AB>, <A_AND_B> and so on. If you are generating XML from external data sources such as database tables, however, then XML 1.0 does not address this problem.

The datatype XMLType addresses this problem by providing the setConvertSpecialChars and convert functions in the DBMS_XMLGEN package. You can use these functions to control the use of special characters in SQL names and XML names. The SQL to XML name mapping functions escape invalid XML NameChar characters in the format of _XHHHH_, where HHHH is the Unicode value of the invalid character. For example, table name V$SESSION is mapped to XML name V_X0024_SESSION.

Escaping invalid characters is another workaround to give users a way to serialize names so that they can reload them somewhere else.
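The _XHHHH_ escaping format described above can be illustrated with a small sketch. This is not the DBMS_XMLGEN implementation: the class is hypothetical, and the test for legal name characters is deliberately simplified to ASCII letters, digits, underscore, hyphen, and period.

```java
// Hypothetical class; mimics the _XHHHH_ escaping format for illustration only.
class XmlNameEscaper {

    // Simplified legality check (an assumption, not the full XML 1.0 Name rule):
    // a name starts with an ASCII letter or '_', and continues with ASCII
    // letters, digits, '_', '-', or '.'.
    private static boolean isAllowed(char c, boolean first) {
        boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        if (first) {
            return letter || c == '_';
        }
        return letter || (c >= '0' && c <= '9') || c == '_' || c == '-' || c == '.';
    }

    // Replaces each disallowed character with _XHHHH_, where HHHH is the
    // four-digit hexadecimal Unicode value of that character.
    static String escape(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (isAllowed(c, i == 0)) {
                out.append(c);
            } else {
                out.append(String.format("_X%04X_", (int) c));
            }
        }
        return out.toString();
    }
}
```

With this sketch, escape("V$SESSION") returns V_X0024_SESSION, matching the example above. The sketch does not handle names that already contain a sequence of the form _XHHHH_, which a reversible mapping would have to escape as well.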