Squeaky Clean


Squeaky Clean was written with HTML exported from Microsoft Office in mind. It rips out all the classes, styles, strange XML and conditionals. The result doesn't look the same afterwards, but at least the markup is nice and clean.
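
To illustrate the idea of stripping attributes, here is a minimal Python sketch. It is not Squeaky Clean's own code: the names DROP_ATTRS, strip_attributes and clean are hypothetical, and the standard library's xml.etree.ElementTree stands in for whatever parser the app uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of attributes to remove; the real app reads its
# list from its own configuration, which is not reproduced here.
DROP_ATTRS = {"class", "style"}


def strip_attributes(root, drop_attrs=DROP_ATTRS):
    """Delete every attribute named in drop_attrs from root and its descendants."""
    for element in root.iter():
        for name in list(element.attrib):
            if name in drop_attrs:
                del element.attrib[name]
    return root


def clean(markup):
    """Parse well-formed markup, strip the unwanted attributes, serialise it again."""
    root = ET.fromstring(markup)
    strip_attributes(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(clean('<p class="MsoNormal" style="margin:0">Hi <b class="x">there</b></p>'))
```

The text content and element structure survive; only the presentational attributes go, which is why the page looks different afterwards.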

This makes it easy to go back in and reimplement the styles using sensible CSS. Alternatively, you can edit 'Clean.xml' to stop style and class attributes being removed.

Documents will be converted into UTF-8 from whatever charset they started in. Installing iconv will extend the charset support to include multi-byte charsets, such as East Asian and Arabic charsets. By default, most single-byte charsets and Unicode are supported.
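
The conversion step amounts to decoding from the source charset and re-encoding as UTF-8. A rough Python sketch, assuming the source charset is already known (the function name to_utf8 is hypothetical, and Python's built-in codecs are used here rather than iconv):

```python
def to_utf8(data: bytes, source_charset: str) -> bytes:
    """Decode bytes from source_charset and re-encode them as UTF-8.

    Raises LookupError for an unknown charset name and
    UnicodeDecodeError if the bytes are not valid in that charset.
    """
    text = data.decode(source_charset)
    return text.encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "café".encode("latin-1")
    print(to_utf8(latin1_bytes, "latin-1"))
```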

This program uses an XML parser to read the HTML. This means that if the source file is far from well-formed XML, it will fail to parse. I have no interest in writing a robust HTML parser, so you'll either have to fix the file or use some other tool. The parser is not too strict about quotes and the like. You can even tell it not to look for child tags by adding tags to the "nochild" section of the config file 'Clean.xml'.
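
To get a feel for what trips up an XML parser, the following Python sketch checks whether a fragment is well-formed. It uses the standard library's xml.etree.ElementTree, which is stricter than the parser described above, so treat it only as a rough pre-check; the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # A self-closed <br/> is fine for an XML parser...
    print(is_well_formed("<p>one<br/>two</p>"))
    # ...but an unclosed <br>, common in HTML, is not.
    print(is_well_formed("<p>one<br>two</p>"))
```

Unclosed tags like the second case are the kind of thing you would need to fix by hand before cleaning.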

The attributes and elements that get deleted are configurable via the file 'Clean.xml' distributed with the app. It works for the one file I needed cleaned, but I expect it'll need work to be useful for the general case. Please read the comments in that file and edit it if necessary for your own files. Generally useful changes should be sent back to me for inclusion in future versions.

Future versions may dig into the CSS and clean it up instead of just deleting it all, but that would require more invasive parsing. Anyway, this is just an alpha release.


Version: v0.10 [Alpha]


0.10 [Alpha]

Initial Release:
  • Basic load, clean and display of HTML.
  • Log window for status and error messages.
  • XML based parsing, cleans out specified attributes and tags.
  • Inline editor for cleaning up by hand.
  • Options specified in 'Clean.xml' give the user some control over the attributes and elements that get nuked.