Next Designs

Cleaning your pages with Tidy.NET

HTML Tidy was created by Dave Raggett to fix up common errors in webpage markup. It is easy to use and powerful, and it is a tremendous resource for anyone who maintains a website.

Originally written in C, it has since been ported to, or at least made accessible through a wrapper from, other languages, including Perl, Python, and C++. Charles Reitzel created a wrapper for .NET, which I have used in the past, but reaching through the COM interop layer would sometimes produce unusual behaviour.

After some poking around, I discovered a native .NET implementation called Tidy.NET, hosted on SourceForge. This is a great piece of software that unfortunately suffers from a dearth of documentation. It is beta software, but I have used it to parse tens of thousands of pages and have found it to be reliable. Your mileage may vary.

Below is a very simple example that illustrates how to use Tidy.NET to clean a page.

    //Namespaces used below (TidyNet is assumed to be the Tidy.NET namespace)
    using System;
    using System.IO;
    using TidyNet;

    //The Tidy object
    Tidy doc = new Tidy();

    //The TidyMessageCollection holds all error, warning, and info
    //messages that Tidy generates
    TidyMessageCollection tmc = new TidyMessageCollection();

    //These streams are the input and output streams for the markup
    MemoryStream input = new MemoryStream();
    MemoryStream output = new MemoryStream();

    //Set some Tidy options, refer to the HTML Tidy docs for more info
    doc.Options.DocType = DocType.Strict;
    doc.Options.Xhtml = true;
    doc.Options.LogicalEmphasis = true;
    doc.Options.DropFontTags = true;
    doc.Options.DropEmptyParas = true;
    doc.Options.QuoteAmpersand = true;
    doc.Options.TidyMark = false;
    doc.Options.MakeClean = true;
    doc.Options.IndentContent = true;
    doc.Options.SmartIndent = true;
    doc.Options.Spaces = 4;
    doc.Options.WrapLen = 100;
    doc.Options.CharEncoding = CharEncoding.UTF8;

    //Turn our html into an array of bytes
    //(html is a string holding the markup to clean, defined elsewhere)
    byte[] byteArray = System.Text.Encoding.UTF8.GetBytes(html);

    //Write out the byte array to the input stream
    input.Write(byteArray, 0, byteArray.Length);

    //Reset the position of the memory stream to the beginning
    input.Position = 0;

    //Parse the input stream, outputting to output, with messages written
    //to our collection of Tidy messages
    doc.Parse(input, output, tmc);

    //Let's check each message
    foreach (TidyMessage message in tmc)
    {
        //If Tidy has reported an error, we want to trap for it
        if (message.Level == MessageLevel.Error)
        {
            //Throw a simple ApplicationException
            throw new ApplicationException(String.Format(
                "{0} at line {1} column {2}",
                message.Message, message.Line, message.Column));
        }
    }

    //If we got this far, Tidy was able to successfully clean the source.
    string cleanedMarkUp = System.Text.Encoding.UTF8.GetString(output.ToArray());
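
The stream handling in the listing above can be factored into two small helpers. What follows is only a minimal sketch that uses nothing beyond the .NET base class library; the names TidyStreams, ToInputStream and ReadAll are hypothetical and are not part of Tidy.NET. It relies on the fact that a MemoryStream constructed from a byte array starts at position 0, so the explicit position reset is not needed.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class for illustration; not part of Tidy.NET.
public static class TidyStreams
{
    // Wraps the markup in a stream that is ready to be read.
    // A MemoryStream built from a byte array starts at position 0,
    // so there is no need to reset Position before parsing.
    public static MemoryStream ToInputStream(string html)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(html));
    }

    // Decodes everything written to the output stream as UTF-8.
    // ToArray returns the full contents regardless of the current position.
    public static string ReadAll(MemoryStream output)
    {
        return Encoding.UTF8.GetString(output.ToArray());
    }
}
```

With these in place, the parse call from the listing would read doc.Parse(TidyStreams.ToInputStream(html), output, tmc), and the final line would become TidyStreams.ReadAll(output).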

Comments:

Sorry for typo, here is correct version. How do I use it NOT to add html, head and body tags? I do not need them.
Posted at 10/21/2011 10:22:28 AM
