# xmldiff — Implementation Plan

## Goal

A Java library that takes two XML strings (left = expected, right = actual) and produces two HTML strings suitable for rendering a side-by-side diff. Each output is a `<span>` tree with inner spans annotated with CSS classes.

## CSS Classes

| Class      | Meaning                                                   |
|------------|-----------------------------------------------------------|
| `neutral`  | This token is identical in both sides                     |
| `correct`  | This token is on the **left** side and differs from right |
| `wrong`    | This token is on the **right** side and differs from left |
| `skipped`  | Child content of an element whose **tag name** differs    |

## Diff Granularity Rules

| Token                         | If equal   | If different                                                                         |
|-------------------------------|------------|--------------------------------------------------------------------------------------|
| Element name                  | `neutral`  | Left → `correct`, right → `wrong`; all content (attrs, children, text) → `skipped` |
| Attribute name                | `neutral`  | Left attr name → `correct`, right attr name → `wrong`                              |
| Attribute value               | `neutral`  | Attr name stays `neutral` on both sides; left value → `correct`, right value → `wrong` |
| Text content                  | `neutral`  | Left text → `correct`, right text → `wrong`                                        |
| Element present only on left  | —          | Left subtree → `correct`, right → empty `<span></span>`                            |
| Element present only on right | —          | Right subtree → `wrong`, left → empty `<span></span>`                              |

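Attribute **order is not significant**: attributes are matched by name, so two elements whose attributes differ only in order produce no difference.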
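Because attributes are compared without regard to order, one way to classify them is a lookup by name. The sketch below is illustrative only: `AttrCompare`, `classify`, and the string labels `"neutral"`, `"value"`, and `"name"` are hypothetical names, not part of the planned API, and the attributes are assumed to have already been copied out of the DOM into plain maps.

```java
import java.util.Map;
import java.util.TreeMap;

final class AttrCompare {
    private AttrCompare() {}

    /**
     * Classifies every attribute of one side against the other side.
     * "neutral" = same name and value, "value" = same name but different value,
     * "name"    = the attribute name is missing on the other side.
     */
    static Map<String, String> classify(Map<String, String> own, Map<String, String> other) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> attr : own.entrySet()) {
            String name = attr.getKey();
            if (!other.containsKey(name)) {
                result.put(name, "name");
            } else if (!other.get(name).equals(attr.getValue())) {
                result.put(name, "value");
            } else {
                result.put(name, "neutral");
            }
        }
        return result;
    }
}
```

Calling `classify(left, right)` gives the left-side labels and `classify(right, left)` gives the right-side labels; the insertion order of either map has no effect on the result.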

## Output Format

Each output string is pretty-printed HTML. XML special characters (`<`, `>`, `&`, `"`) inside span text are HTML-escaped. Indentation uses 2 spaces per level. Output does **not** include an XML declaration.

Example shape:

```html
<span class="neutral">&lt;root&gt;
  &lt;child </span><span class="correct">attr</span><span class="neutral">=&quot;</span><span class="correct">value</span><span class="neutral">&quot;&gt;
    </span><span class="correct">text here</span><span class="neutral">
  &lt;/child&gt;
&lt;/root&gt;</span>
```
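The escaping and indentation rules can be captured in a few small helpers. This is a minimal sketch under the rules stated in this section; the class name `HtmlSpans` and its method names are hypothetical.

```java
final class HtmlSpans {
    private HtmlSpans() {}

    /** HTML-escapes the four characters named in this section; all others pass through. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(ch);
            }
        }
        return sb.toString();
    }

    /** Wraps already-unescaped text in a span with the given CSS class. */
    static String span(String cssClass, String text) {
        return "<span class=\"" + cssClass + "\">" + escape(text) + "</span>";
    }

    /** Two spaces per nesting level. */
    static String indent(int depth) {
        StringBuilder sb = new StringBuilder(depth * 2);
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }
}
```

For instance, `HtmlSpans.span("correct", "x<y")` yields `<span class="correct">x&lt;y</span>`.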

## Dependencies

```xml
<!-- XML diffing -->
<dependency>
    <groupId>org.xmlunit</groupId>
    <artifactId>xmlunit-core</artifactId>
    <version>2.10.0</version>
</dependency>

<!-- Testing -->
<dependency>
    <groupId>org.junit.jupiter</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>5.11.0</version>
    <scope>test</scope>
</dependency>
```

**XMLUnit 2.x** is the diffing engine. `Diff.getDifferences()` yields `Difference` objects; each wraps a `Comparison` (via `getComparison()`) with:
- `getType()`: a `ComparisonType` enum value such as `ELEMENT_TAG_NAME`, `ATTR_VALUE`, `ATTR_NAME_LOOKUP`, `TEXT_VALUE`, `CHILD_NODELIST_LENGTH`, `CHILD_LOOKUP`, etc.
- `getControlDetails().getXPath()`: XPath of the affected node on the left side (expected to be `null` when the node exists only on the right)
- `getTestDetails().getXPath()`: XPath of the affected node on the right side (expected to be `null` when the node exists only on the left)

## Algorithm

### Step 1 — Diff (DiffEngine)

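```java
import java.util.HashMap;
import java.util.Map;
import org.xmlunit.builder.DiffBuilder;
import org.xmlunit.diff.*;

Diff diff = DiffBuilder
    .compare(leftXml)
    .withTest(rightXml)
    .withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byName))
    .ignoreWhitespace()
    .build();

// Record each difference under the XPath of the node it affects, per side.
Map<String, ComparisonType> leftDiffs = new HashMap<>();
Map<String, ComparisonType> rightDiffs = new HashMap<>();
for (Difference d : diff.getDifferences()) {
    Comparison c = d.getComparison();
    String leftXPath = c.getControlDetails().getXPath();   // may be null: node absent on the left
    String rightXPath = c.getTestDetails().getXPath();     // may be null: node absent on the right
    if (leftXPath != null) leftDiffs.put(leftXPath, c.getType());
    if (rightXPath != null) rightDiffs.put(rightXPath, c.getType());
}
```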
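Step 1 stores one `ComparisonType` per XPath. A single element can carry more than one difference at the same XPath (for example, XMLUnit reports both `ELEMENT_NUM_ATTRIBUTES` and `CHILD_NODELIST_LENGTH` against the element itself), so a single-valued map keeps only whichever type was recorded last. One possible alternative, sketched below, keeps a set of types per XPath. `DiffIndex` and its local `Kind` enum are hypothetical; `Kind` stands in for XMLUnit's `ComparisonType` only so that the sketch runs without the library on the classpath.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class DiffIndex {
    /** Stand-in for the subset of XMLUnit's ComparisonType that this plan uses. */
    enum Kind { ELEMENT_TAG_NAME, ATTR_VALUE, ATTR_NAME_LOOKUP, TEXT_VALUE, CHILD_LOOKUP }

    private final Map<String, EnumSet<Kind>> byXPath = new HashMap<>();

    /** Records a difference; a null XPath (node absent on this side) is ignored. */
    void record(String xpath, Kind kind) {
        if (xpath == null) {
            return;
        }
        byXPath.computeIfAbsent(xpath, k -> EnumSet.noneOf(Kind.class)).add(kind);
    }

    /** True if the node at this XPath has a recorded difference of the given kind. */
    boolean has(String xpath, Kind kind) {
        Set<Kind> kinds = byXPath.get(xpath);
        return kinds != null && kinds.contains(kind);
    }
}
```

With this shape the renderer asks `index.has(xp, Kind.TEXT_VALUE)` rather than comparing a single stored value, so recording a second difference never hides the first.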

### Step 2 — Render (HtmlRenderer)

Walk each DOM tree independently, pretty-printing to HTML. At each node, look up its XPath in the relevant diff map to determine its CSS class.

**Element node:**
```
xp = xpathOf(node)
diffs = leftDiffs when rendering the left tree, rightDiffs when rendering the right tree
if diffs contains xp with type ELEMENT_TAG_NAME:
    emit tag name as correct (left) / wrong (right)
    emit all attributes + children recursively as skipped
else:
    emit tag name as neutral
    for each attribute (in document order):
        emit based on attr-level diff lookup
    recurse into children
```

**Text node:**
```
xp = xpathOf(node)
if leftDiffs/rightDiffs contains xp with type TEXT_VALUE:
    emit as correct / wrong
else:
    emit as neutral
```

**Missing child (`CHILD_LOOKUP`; the parent element also reports `CHILD_NODELIST_LENGTH`):**
```
emit present side as correct/wrong
emit absent side as empty <span></span>
```
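Both the `skipped` rule and the one-sided subtree rule come down to pretty-printing a whole subtree under a single CSS class. The following is a rough sketch of that one operation using only the JDK's DOM API. `SubtreeRenderer`, `renderAs`, and `parse` are hypothetical names. The sketch also simplifies in two ways that a real renderer would need to address: it emits attributes in whatever order the DOM implementation returns them, and it does not re-escape XML entities in text or attribute values before HTML-escaping.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class SubtreeRenderer {
    private SubtreeRenderer() {}

    /** Pretty-prints the subtree rooted at an element and wraps it in one span. */
    static String renderAs(Node root, String cssClass) {
        StringBuilder xml = new StringBuilder();
        write(root, 0, xml);
        if (xml.length() > 0) {
            xml.setLength(xml.length() - 1); // drop the trailing newline
        }
        return "<span class=\"" + cssClass + "\">" + escape(xml) + "</span>";
    }

    private static void write(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                indent(depth, out).append(text).append('\n');
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments, processing instructions, etc. are skipped
        }
        indent(depth, out).append('<').append(node.getNodeName());
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.append(' ').append(a.getNodeName())
               .append("=\"").append(a.getNodeValue()).append('"');
        }
        if (!node.hasChildNodes()) {
            out.append("/>\n"); // self-closing form for empty elements
            return;
        }
        out.append(">\n");
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            write(c, depth + 1, out);
        }
        indent(depth, out).append("</").append(node.getNodeName()).append(">\n");
    }

    private static StringBuilder indent(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        return out;
    }

    private static String escape(CharSequence s) {
        return s.toString().replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Test helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

A caller would use `renderAs(element, "correct")` or `renderAs(element, "wrong")` for a one-sided child, and `renderAs(child, "skipped")` for each child of an element whose tag name differs.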

XPaths are computed from the DOM tree as each node is visited, matching the XPaths that XMLUnit generates (e.g. `/root[1]/child[1]`).
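A sketch of how `xpathOf` might be computed with the JDK's DOM API follows. The step formats for attributes (`/@name`) and text nodes (`/text()[n]`) are assumptions about XMLUnit's output and should be checked against real `getXPath()` values before relying on them. Because `ignoreWhitespace()` is in use, the renderer's DOM would also need whitespace-only text nodes removed for the `text()` positions to line up. `XPaths` and `parse` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XPaths {
    private XPaths() {}

    /** Builds a positional XPath such as /root[1]/child[2] for a DOM node. */
    static String xpathOf(Node node) {
        if (node == null) {
            return "";
        }
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                return "";
            case Node.ATTRIBUTE_NODE: {
                Attr attr = (Attr) node;
                return xpathOf(attr.getOwnerElement()) + "/@" + attr.getName();
            }
            case Node.ELEMENT_NODE:
                return xpathOf(node.getParentNode()) + "/" + node.getNodeName()
                        + "[" + position(node) + "]";
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                return xpathOf(node.getParentNode()) + "/text()[" + position(node) + "]";
            default:
                throw new IllegalArgumentException("Unsupported node type: " + node.getNodeType());
        }
    }

    /** 1-based index among preceding siblings that share the same XPath step. */
    private static int position(Node node) {
        int pos = 1;
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (sameStep(s, node)) {
                pos++;
            }
        }
        return pos;
    }

    private static boolean sameStep(Node a, Node b) {
        if (b.getNodeType() == Node.ELEMENT_NODE) {
            return a.getNodeType() == Node.ELEMENT_NODE
                    && a.getNodeName().equals(b.getNodeName());
        }
        return isText(a) && isText(b);
    }

    private static boolean isText(Node n) {
        return n.getNodeType() == Node.TEXT_NODE || n.getNodeType() == Node.CDATA_SECTION_NODE;
    }

    /** Test helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Positions are counted per element name, so the second `<child>` under `<root>` maps to `/root[1]/child[2]` regardless of how many differently named siblings precede it.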

### Step 3 — Output

`XmlDiff.compare()` calls `DiffEngine`, then calls `HtmlRenderer` once for the left tree and once for the right tree, returning a `DiffResult`.

## Test Cases

| #  | Scenario                      | Left class                         | Right class                      |
|----|-------------------------------|------------------------------------|----------------------------------|
| 1  | Identical simple elements     | all `neutral`                      | all `neutral`                    |
| 2  | Differing text content        | text `correct`                     | text `wrong`                     |
| 3  | Differing attribute value     | value `correct` (name `neutral`)   | value `wrong` (name `neutral`)   |
| 4  | Differing attribute name      | name `correct`                     | name `wrong`                     |
| 5  | Differing element name        | name `correct`, children `skipped` | name `wrong`, children `skipped` |
| 6  | Extra child on left only      | child `correct`                    | empty span                       |
| 7  | Extra child on right only     | empty span                         | child `wrong`                    |
| 8  | Attribute order differs       | all `neutral`                      | all `neutral`                    |
| 9  | Nested elements, partial diff | only differing subtree marked      | only differing subtree marked    |
| 10 | Self-closing element, no diff | all `neutral`                      | all `neutral`                    |

## Assumptions

- Comments, processing instructions, and CDATA sections are ignored.
- Whitespace-only text nodes between elements are ignored (XMLUnit `ignoreWhitespace()`).
- Namespace prefixes are treated as plain text; no namespace-aware comparison.
- The library is stateless; `XmlDiff.compare()` is safe to call concurrently.