Next Designs

Cleaning your pages with Tidy.NET

HTML Tidy was created by Dave Raggett to fix up common errors in webpage markup. It is powerful and easy to use, and it is a tremendous resource for anyone who maintains a website.

Originally written in C, it has since been ported, or at least made accessible through wrappers, to other languages including Perl, Python, and C++. Charles Reitzel created a wrapper for .NET, which I have used in the past, but reaching through the COM interop layer would sometimes produce unusual behaviour.

After some poking around, I discovered a native .NET implementation called Tidy.NET, hosted on SourceForge. It is a great piece of software that unfortunately suffers from a dearth of documentation. Although it is beta software, I have used it to parse tens of thousands of pages and have found it reliable. Your mileage may vary.

Below is a very simple example that illustrates how to use Tidy.NET to clean a page.

    using System;
    using System.IO;
    //Tidy.NET types; the namespace name TidyNet is assumed here
    using TidyNet;

    //The Tidy object
    Tidy doc = new Tidy();

    //The TidyMessageCollection holds all error, warning and info
    //messages that Tidy generates
    TidyMessageCollection tmc = new TidyMessageCollection();

    //These streams are the input and output streams for the markup
    MemoryStream input = new MemoryStream();
    MemoryStream output = new MemoryStream();

    //Set some Tidy options, refer to the HTML Tidy docs for more info
    doc.Options.DocType = DocType.Strict;
    doc.Options.Xhtml = true;
    doc.Options.LogicalEmphasis = true;
    doc.Options.DropFontTags = true;
    doc.Options.DropEmptyParas = true;
    doc.Options.QuoteAmpersand = true;
    doc.Options.TidyMark = false;
    doc.Options.MakeClean = true;
    doc.Options.IndentContent = true;
    doc.Options.SmartIndent = true;
    doc.Options.Spaces = 4;
    doc.Options.WrapLen = 100;
    doc.Options.CharEncoding = CharEncoding.UTF8;

    //Turn our html into an array of bytes
    //(html is assumed to be a string holding the markup to clean)
    byte[] byteArray = System.Text.Encoding.UTF8.GetBytes(html);

    //Write out the byte array to the input stream
    input.Write(byteArray, 0, byteArray.Length);

    //Reset the position of the memory stream to the beginning,
    //otherwise Tidy would start reading at the end and see no data
    input.Position = 0;

    //Parse the input stream, outputting to output, with messages written
    //to our collection of Tidy messages
    doc.Parse(input, output, tmc);

    //Let's check each message
    foreach (TidyMessage message in tmc)
    {
        //If Tidy reported an error, we want to trap for it
        if (message.Level == MessageLevel.Error)
        {
            //Throw a simple ApplicationException
            throw new ApplicationException(String.Format("{0} at line {1} column {2}",
                message.Message, message.Line, message.Column));
        }
    }

    //If we got this far, Tidy was able to successfully clean the source.
    string cleanedMarkUp = System.Text.Encoding.UTF8.GetString(output.ToArray());
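
The step that resets input.Position is easy to overlook, so here is a minimal sketch of why it matters. It uses only System.IO and makes no Tidy.NET calls; the class name StreamRewindDemo, the helper ReadRemaining and the sample markup are made up for illustration. It also shows one possible alternative, assuming you do not need to write to the stream afterwards: building the MemoryStream directly from the byte array, which starts at position 0.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRewindDemo
{
    // Reads everything from the stream's current position to its end
    // and decodes it as UTF-8 text.
    public static string ReadRemaining(Stream stream)
    {
        var copy = new MemoryStream();
        stream.CopyTo(copy);
        return Encoding.UTF8.GetString(copy.ToArray());
    }

    public static void Main()
    {
        string html = "<p>Hello</p>";
        byte[] bytes = Encoding.UTF8.GetBytes(html);

        // Approach from the example above: write, then rewind.
        var written = new MemoryStream();
        written.Write(bytes, 0, bytes.Length);

        // Without rewinding, the position sits at the end, so nothing is read.
        Console.WriteLine(ReadRemaining(written).Length); // prints 0

        written.Position = 0;
        Console.WriteLine(ReadRemaining(written));        // prints <p>Hello</p>

        // Alternative: wrap the bytes directly; the position starts at 0.
        var wrapped = new MemoryStream(bytes);
        Console.WriteLine(ReadRemaining(wrapped));        // prints <p>Hello</p>
    }
}
```

Either way, the stream handed to a parser should be positioned at the start of the markup. Wrapping the byte array saves two lines, while the write-then-rewind form is handy when the markup arrives in several chunks.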
