xmldiff/xmldiff.md
2026-03-13 10:52:43 +01:00

xmldiff — Implementation Plan

Goal

A Java library that takes two XML strings (left = expected, right = actual) and produces two HTML strings suitable for rendering a side-by-side diff. Each output is a <span> tree with inner spans annotated with CSS classes.

CSS Classes

| Class | Meaning |
| --- | --- |
| neutral | This token is identical on both sides |
| correct | This token is on the left side and differs from the right |
| wrong | This token is on the right side and differs from the left |
| skipped | Child content of an element whose tag name differs |

Diff Granularity Rules

| Token | If equal | If different |
| --- | --- | --- |
| Element name | neutral | Left → correct, right → wrong; all content (attrs, children, text) → skipped |
| Attribute name | neutral | Left attr name → correct, right attr name → wrong |
| Attribute value | neutral | Attr name stays neutral on both sides; left value → correct, right value → wrong |
| Text content | neutral | Left text → correct, right text → wrong |
| Element present only on left | n/a | Left subtree → correct, right → empty `<span></span>` |
| Element present only on right | n/a | Right subtree → wrong, left → empty `<span></span>` |

Attribute order is not significant; attributes are matched by name, not by position.
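The attribute rules above can be sketched as a lookup by name, which is what makes attribute order irrelevant. This is only an illustration: `AttrClassifier` and `Marks` are hypothetical names, and marking the value together with the name when an attribute is missing on the other side is an assumption taken from the example shape, not something the rules table states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: classifies one side's attributes against the other side's. */
final class AttrClassifier {

    /** CSS classes chosen for the name and the value of one attribute. */
    static final class Marks {
        final String nameClass;
        final String valueClass;

        Marks(String nameClass, String valueClass) {
            this.nameClass = nameClass;
            this.valueClass = valueClass;
        }
    }

    /**
     * @param own       attributes of the element being rendered (name to value)
     * @param other     attributes of the matched element on the opposite side
     * @param diffClass "correct" when rendering the left side, "wrong" for the right
     */
    static Map<String, Marks> classify(Map<String, String> own,
                                       Map<String, String> other,
                                       String diffClass) {
        Map<String, Marks> marks = new LinkedHashMap<>();
        for (Map.Entry<String, String> attr : own.entrySet()) {
            String name = attr.getKey();
            if (!other.containsKey(name)) {
                // Name not found on the other side. Assumption: the value is marked too.
                marks.put(name, new Marks(diffClass, diffClass));
            } else if (!attr.getValue().equals(other.get(name))) {
                // Same name, different value: only the value is marked.
                marks.put(name, new Marks("neutral", diffClass));
            } else {
                marks.put(name, new Marks("neutral", "neutral"));
            }
        }
        return marks;
    }
}
```

Because the other side is consulted with `containsKey` and `get`, `id="1" lang="en"` and `lang="en" id="1"` classify as fully neutral regardless of the order in which they were written.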

Output Format

Each output string is pretty-printed HTML. XML special characters (<, >, &, ") inside span text are HTML-escaped. Indentation uses 2 spaces per level. Output does not include an XML declaration.

Example shape:

<span class="neutral">&lt;root&gt;
  &lt;child </span><span class="correct">attr</span><span class="neutral">=&quot;</span><span class="correct">value</span><span class="neutral">&quot;&gt;
    </span><span class="correct">text here</span><span class="neutral">
  &lt;/child&gt;
&lt;/root&gt;</span>
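A minimal sketch of the escaping and span-wrapping rule described under Output Format. `HtmlSpans`, `escape`, and `span` are hypothetical names chosen for illustration; the plan does not name these helpers.

```java
/** Hypothetical helper for emitting one annotated span. */
final class HtmlSpans {

    /** Escapes the four XML special characters listed in the output format. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps already-unescaped XML text in a span carrying the given CSS class. */
    static String span(String cssClass, String text) {
        return "<span class=\"" + cssClass + "\">" + escape(text) + "</span>";
    }
}
```

Escaping happens inside `span`, so callers pass raw XML text such as `<child ` and never need to escape by hand; this avoids double-escaping when spans are concatenated.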

Dependencies

<!-- XML diffing -->
<dependency>
    <groupId>org.xmlunit</groupId>
    <artifactId>xmlunit-core</artifactId>
    <version>2.10.0</version>
</dependency>

<!-- Testing -->
<dependency>
    <groupId>org.junit.jupiter</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>5.11.0</version>
    <scope>test</scope>
</dependency>

XMLUnit 2.x is the diffing engine. Diff.getDifferences() yields Difference objects; each wraps a Comparison (via getComparison()) that exposes:

  • getType(): a ComparisonType enum value such as ELEMENT_TAG_NAME, ATTR_VALUE, ATTR_NAME_LOOKUP, TEXT_VALUE, CHILD_NODELIST_LENGTH, or CHILD_LOOKUP
  • getControlDetails().getXPath(): XPath of the affected node on the left side (may be null when the node exists only on the right)
  • getTestDetails().getXPath(): XPath of the affected node on the right side (may be null when the node exists only on the left)

Algorithm

Step 1 — Diff (DiffEngine)

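import org.xmlunit.builder.DiffBuilder;
import org.xmlunit.diff.DefaultNodeMatcher;
import org.xmlunit.diff.Diff;
import org.xmlunit.diff.ElementSelectors;

Diff diff = DiffBuilder
    .compare(leftXml)                       // control = left = expected
    .withTest(rightXml)                     // test = right = actual
    .withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byName))
    .ignoreWhitespace()
    .ignoreComments()                       // comments are out of scope (see Assumptions)
    .build();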

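For each Difference d in diff.getDifferences():
    c = d.getComparison()
    record c.getType() under c.getControlDetails().getXPath() in leftDiffs
    and under c.getTestDetails().getXPath() in rightDiffs
    (skip a null XPath: the node does not exist on that side)

    leftDiffs:  XPath → set of ComparisonType (one node can carry several differences)
    rightDiffs: XPath → set of ComparisonType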
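The two maps could be held in a small index like the one below. This is a sketch, not the planned DiffEngine: `DiffIndex` is a hypothetical name, and `DemoType` is a stand-in enum so the snippet compiles without XMLUnit on the classpath. In the real library the type parameter would be `org.xmlunit.diff.ComparisonType`.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Stand-in for org.xmlunit.diff.ComparisonType (assumption, for illustration only). */
enum DemoType { ELEMENT_TAG_NAME, ATTR_VALUE, TEXT_VALUE, CHILD_LOOKUP }

/** Hypothetical index of differences, keyed by XPath, one map per side. */
final class DiffIndex<T extends Enum<T>> {
    private final Class<T> typeClass;
    private final Map<String, Set<T>> left = new HashMap<>();
    private final Map<String, Set<T>> right = new HashMap<>();

    DiffIndex(Class<T> typeClass) {
        this.typeClass = typeClass;
    }

    /** Records one difference; either XPath may be null when the node is absent on that side. */
    void record(String controlXPath, String testXPath, T type) {
        add(left, controlXPath, type);
        add(right, testXPath, type);
    }

    boolean leftHas(String xpath, T type) {
        Set<T> types = left.get(xpath);
        return types != null && types.contains(type);
    }

    boolean rightHas(String xpath, T type) {
        Set<T> types = right.get(xpath);
        return types != null && types.contains(type);
    }

    private void add(Map<String, Set<T>> side, String xpath, T type) {
        if (xpath == null) {
            return; // nothing to annotate on this side
        }
        // A set per XPath keeps every difference reported for the same node.
        side.computeIfAbsent(xpath, k -> EnumSet.noneOf(typeClass)).add(type);
    }
}
```

Storing a set rather than a single type matters because one element can be reported several times, for example once for an attribute value and once for its child count; a plain `XPath → type` map would silently keep only the last one.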

Step 2 — Render (HtmlRenderer)

Walk each DOM tree independently, pretty-printing to HTML. At each node, look up its XPath in the relevant diff map to determine its CSS class.

Element node:

xp = xpathOf(node)
if this side's diff map (leftDiffs or rightDiffs) contains xp with type ELEMENT_TAG_NAME:
    emit tag name as correct/wrong
    emit all attributes + children recursively as skipped
else:
    emit tag name as neutral
    for each attribute (in document order):
        emit based on attr-level diff lookup
    recurse into children

Text node:

xp = xpathOf(node)
if leftDiffs/rightDiffs contains xp with type TEXT_VALUE:
    emit as correct / wrong
else:
    emit as neutral

Missing child (CHILD_LOOKUP on the node itself; CHILD_NODELIST_LENGTH is reported on the parent):

emit present side as correct/wrong
emit absent side as empty <span></span>

XPaths are computed from the DOM tree as each node is visited, matching the XPaths that XMLUnit generates (e.g. /root[1]/child[1]).
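One way to compute such XPaths from a JDK DOM node is sketched below. `XPaths.xpathOf` is a hypothetical helper; the `text()[n]` and `/@name` step formats are my understanding of what XMLUnit emits and should be checked against real output before relying on them. Note also that the indices only line up if the tree being walked has had the same whitespace-only text nodes removed that `ignoreWhitespace()` removes.

```java
import org.w3c.dom.Attr;
import org.w3c.dom.Node;

/** Hypothetical helper that builds positional XPaths such as /root[1]/child[2]. */
final class XPaths {

    static String xpathOf(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                return "";
            case Node.ATTRIBUTE_NODE:
                // Attributes hang off their owner element, not a DOM parent.
                return xpathOf(((Attr) node).getOwnerElement()) + "/@" + node.getNodeName();
            case Node.ELEMENT_NODE:
                return parentPath(node.getParentNode())
                        + "/" + node.getNodeName() + "[" + position(node) + "]";
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                return parentPath(node.getParentNode()) + "/text()[" + position(node) + "]";
            default:
                throw new IllegalArgumentException("Unsupported node type: " + node.getNodeType());
        }
    }

    private static String parentPath(Node parent) {
        return parent == null ? "" : xpathOf(parent);
    }

    /** 1-based index among preceding siblings that share the same XPath step. */
    private static int position(Node node) {
        int pos = 1;
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (sameStep(s, node)) {
                pos++;
            }
        }
        return pos;
    }

    private static boolean sameStep(Node a, Node b) {
        if (isText(a) && isText(b)) {
            return true;
        }
        return a.getNodeType() == Node.ELEMENT_NODE
                && b.getNodeType() == Node.ELEMENT_NODE
                && a.getNodeName().equals(b.getNodeName());
    }

    private static boolean isText(Node n) {
        return n.getNodeType() == Node.TEXT_NODE || n.getNodeType() == Node.CDATA_SECTION_NODE;
    }
}
```

Walking previous siblings on every call is quadratic in the number of siblings; a renderer that already iterates children in order could instead keep a per-name counter while descending.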

Step 3 — Output

XmlDiff.compare() calls DiffEngine, then calls HtmlRenderer once for the left tree and once for the right tree, returning a DiffResult.

Test Cases

| # | Scenario | Left class | Right class |
| --- | --- | --- | --- |
| 1 | Identical simple elements | all neutral | all neutral |
| 2 | Differing text content | text correct | text wrong |
| 3 | Differing attribute value | value correct (name neutral) | value wrong (name neutral) |
| 4 | Differing attribute name | name correct | name wrong |
| 5 | Differing element name | name correct, children skipped | name wrong, children skipped |
| 6 | Extra child on left only | child correct | empty span |
| 7 | Extra child on right only | empty span | child wrong |
| 8 | Attribute order differs, same names and values | all neutral | all neutral |
| 9 | Nested elements, partial diff | only the differing subtree marked correct | only the differing subtree marked wrong |
| 10 | Self-closing element, no diff | all neutral | all neutral |

Assumptions

  • Comments, processing instructions, and CDATA sections are ignored.
  • Whitespace-only text nodes between elements are ignored (XMLUnit ignoreWhitespace()).
  • Namespace prefixes are treated as plain text; no namespace-aware comparison.
  • The library is stateless; XmlDiff.compare() is safe to call concurrently.
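These assumptions could be applied when parsing the inputs for rendering, as in the sketch below. `InputNormalizer` is a hypothetical name. Reading "CDATA sections are ignored" as "CDATA is treated like ordinary text" is an assumption on my part, as is dropping processing instructions from the tree entirely. A new factory is created on every call because JAXP factories and builders are not guaranteed to be thread-safe, which keeps concurrent calls independent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: parses XML and removes nodes the diff does not consider. */
final class InputNormalizer {

    static Document parse(String xml) {
        try {
            // Fresh factory per call: no shared mutable state between threads.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(false);   // prefixes stay part of the plain name
            factory.setIgnoringComments(true);  // comments never reach the DOM
            factory.setCoalescing(true);        // CDATA becomes ordinary text (assumption)
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            prune(doc.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Removes whitespace-only text nodes and processing instructions, depth-first. */
    private static void prune(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before a possible removal
            boolean blankText = child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty();
            if (blankText || child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                parent.removeChild(child);
            } else {
                prune(child);
            }
            child = next;
        }
    }
}
```

The parser is left at its default entity settings here; a library that accepts untrusted XML would normally also disable DOCTYPE declarations and external entities.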