Cleaning your pages with Tidy.NET
HTML Tidy was created by Dave Raggett to fix common errors in web page markup. It is easy to use and powerful, and it is a tremendous resource for anyone who maintains a website.
Originally written in C, it has since been ported to, or at least made accessible through a wrapper from, other languages, including Perl, Python, and C++.
Charles Reitzel created a wrapper for .NET, which I have used in the past, but reaching through the COM interop layer would sometimes produce unusual behaviour.
After some poking around, I discovered a native .NET implementation called Tidy.NET, hosted on SourceForge.
This is a great piece of software that unfortunately suffers from a dearth of documentation. It is beta software, but I have used it to parse tens of thousands of pages,
and I have found it to be reliable. Your mileage may vary.
Below is a very simple example that illustrates how to use Tidy.NET to clean a page.
using System;
using System.IO;
using TidyNet; //Assumed namespace for the Tidy.NET types; check your assembly

//The Tidy object
Tidy doc = new Tidy();

//The TidyMessageCollection holds all error, warning and info
//messages that Tidy generates
TidyMessageCollection tmc = new TidyMessageCollection();

//These streams are the input and output streams for the markup
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();

//Set some Tidy options, refer to the HTML Tidy docs for more info
doc.Options.DocType = DocType.Strict;
doc.Options.Xhtml = true;
doc.Options.LogicalEmphasis = true;
doc.Options.DropFontTags = true;
doc.Options.DropEmptyParas = true;
doc.Options.QuoteAmpersand = true;
doc.Options.TidyMark = false;
doc.Options.MakeClean = true;
doc.Options.IndentContent = true;
doc.Options.SmartIndent = true;
doc.Options.Spaces = 4;
doc.Options.WrapLen = 100;
doc.Options.CharEncoding = CharEncoding.UTF8;

//Turn our html into an array of bytes
//(html is assumed to be a string holding the source markup)
byte[] byteArray = System.Text.Encoding.UTF8.GetBytes(html);

//Write out the byte array to the input stream
input.Write(byteArray, 0, byteArray.Length);

//Reset the position of the memory stream to the beginning,
//otherwise Tidy would start reading at the end and see no input
input.Position = 0;

//Parse the input stream, outputting to output, with messages written
//to our collection of Tidy messages
doc.Parse(input, output, tmc);

//Let's check each message
foreach (TidyMessage message in tmc)
{
    //Warnings and info messages are ignored; only errors are fatal
    if (message.Level == MessageLevel.Error)
    {
        //Throw a simple ApplicationException
        throw new ApplicationException(String.Format("{0} at line {1} column {2}",
            message.Message, message.Line,
            message.Column));
    }
}

//If we got this far, Tidy was able to successfully clean the source.
string cleanedMarkUp = System.Text.Encoding.UTF8.GetString(output.ToArray());
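
The stream handling in the listing above (encode the string, write it to a stream, rewind, and later decode the output) does not depend on Tidy.NET at all, so it can be pulled into a small helper. What follows is only a sketch: the class name MarkupStreams and its two methods are hypothetical names made up for illustration, and the sketch uses nothing beyond the standard System.IO and System.Text classes. Building the MemoryStream directly from the byte array leaves its position at zero, which removes the separate Write and Position steps.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Tidy.NET): converts between strings
// and the MemoryStreams that a stream-based parser reads and writes.
public static class MarkupStreams
{
    // Wrap the markup in a readable stream. The MemoryStream(byte[])
    // constructor starts with Position == 0, so no rewind is needed.
    public static MemoryStream ToStream(string markup)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(markup));
    }

    // Decode everything that was written to the stream. ToArray()
    // returns the full contents regardless of the current Position.
    public static string FromStream(MemoryStream stream)
    {
        return Encoding.UTF8.GetString(stream.ToArray());
    }
}
```

With a helper like this, the input stream for the earlier listing could be created with MarkupStreams.ToStream(html) and the cleaned markup read back with MarkupStreams.FromStream(output).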