|Description:||Converts badly formed HTML to clean XML|
Tag Soup is an accessor using Active URI syntax with the following base identifiers:
and the following arguments (for more details on argument passing, see here):
|Representation (java.lang.Object)||HTML document to process|
The following verb is supported:
The response representation of this accessor for SOURCE requests is unknown.
This accessor throws no documented exceptions.
To use the Tag Soup accessor, you must import the module urn:org:netkernel:web:core.
The tagSoup accessor converts badly formed HTML to clean XML. Below is John Cowan's description of how the Tag Soup parser works...
TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
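The restarting of overlapping tags described above can be illustrated with a small, self-contained sketch. This is not TagSoup's implementation and does not use the accessor or the TagSoup library; it is a toy stack-based rewriter written for illustration only. The function name restart_overlapping is hypothetical, and the sketch assumes simple tags: it lowercases names, discards attributes, drops stray close tags, and knows nothing about void elements such as BR.

```python
import re

# Matches an opening or closing tag; attributes are matched but discarded.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def restart_overlapping(html: str) -> str:
    """Toy illustration: close and restart overlapping tags so they nest."""
    out = []          # output fragments
    open_tags = []    # stack of currently open element names
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])   # text before this tag
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not closing:
            open_tags.append(name)
            out.append(f"<{name}>")
        elif name in open_tags:
            # Close everything above the matching element, remembering
            # what was closed early so it can be restarted afterwards.
            restart = []
            while True:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
                restart.append(top)
            # Reopen the interrupted elements in their original order.
            for tag in reversed(restart):
                open_tags.append(tag)
                out.append(f"<{tag}>")
        # A close tag with no matching open element is silently dropped.
    out.append(html[pos:])
    # Close anything still open at the end of the input.
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)


if __name__ == "__main__":
    print(restart_overlapping(
        "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"))
```

Run on the input shown, the sketch closes the italic element before the bold one and then reopens it, which is the kind of properly nested result the description refers to. A real parser also has to handle attributes, implied elements such as HTML, BODY and UL, and element-specific rules, none of which this sketch attempts.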