Endpoint
Name: Tag Soup
Description: Converts badly formed HTML to clean XML
Id: core:web:tagSoup
Category: accessor
Identifier Syntax

Tag Soup is an accessor using Active URI syntax with the following base identifier:

Base
active:tagSoup

and the following argument (for details, see the NetKernel documentation on argument passing):

Argument: operand
Rules: Mandatory
Typing: Representation (java.lang.Object)
Description: HTML document to process
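
To illustrate how the base identifier and the operand argument combine, here is a minimal Java sketch that assembles the identifier in Active URI form (the base, followed by +name@value for each argument). The class and method names are hypothetical and are not part of NetKernel. Passing the operand by reference as a res: URI is an assumption about how the HTML is supplied.

```java
/** Hypothetical helper (not part of NetKernel): builds a Tag Soup identifier. */
final class TagSoupIdentifier {
    /** Base identifier documented for the accessor. */
    static final String BASE = "active:tagSoup";

    /**
     * Appends the mandatory operand argument in Active URI form.
     * No escaping is attempted, so the operand is assumed to be a simple URI.
     */
    static String forOperand(String operandUri) {
        if (operandUri == null || operandUri.isEmpty()) {
            throw new IllegalArgumentException("operand is mandatory");
        }
        return BASE + "+operand@" + operandUri;
    }
}
```

Calling TagSoupIdentifier.forOperand("res:/resources/page.html"), where the resource path is purely illustrative, yields active:tagSoup+operand@res:/resources/page.html, which could then be issued as a SOURCE request.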
Request Verbs

The following verb is supported:

Verb
SOURCE
Response

The response representation of this accessor for SOURCE requests is unknown.

This accessor throws no documented exceptions.

Import Requirements

To use the Tag Soup accessor, you must import the module urn:org:netkernel:web:core:

<import>
  <uri>urn:org:netkernel:web:core</uri>
</import>

The tagSoup accessor converts badly formed HTML to clean XML. Below is John Cowan's description of how the Tag Soup parser works...

TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up
bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation
HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up
properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never,
never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's
much, much more. For example, if the first tag is LI, it will supply the application with enclosing
HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the
same reason, overlapping tags are correctly restarted whenever possible: text like:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
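
The restart behaviour described above can be sketched with a stack of open elements. The Java code below is a toy illustration under stated assumptions, not TagSoup's actual implementation. It handles only attribute-free tags, lower-cases element names, drops unmatched end tags and stray "<" characters, and restarts interrupted elements immediately after the end tag that interrupted them. The class name OverlapRepair is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy sketch of restarting overlapping tags; not the real TagSoup parser. */
final class OverlapRepair {
    // Matches a simple start or end tag (no attributes), or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>|[^<]+");

    static String repair(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost element on top
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) == null) {            // plain text: copy through
                out.append(m.group());
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {          // start tag: open it
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {    // end tag for an open element
                List<String> restart = new ArrayList<>();
                String top;
                // Close every element nested inside the one being ended.
                while (!(top = open.pop()).equals(name)) {
                    out.append("</").append(top).append('>');
                    restart.add(0, top);         // keep outermost first
                }
                out.append("</").append(name).append('>');
                // Reopen the interrupted elements in their original order.
                for (String r : restart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // End tags with no matching open element are silently dropped.
        }
        // Close anything still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Given the input This is <B>bold, <I>bold italic, </b>italic, </i>normal text, OverlapRepair.repair returns This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text. The real parser does considerably more, such as supplying missing enclosing elements and default attributes.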