Rob Garrett - Blogs


Software/Technology Discussion

Software and Technology Tid-bits

HTML to XHTML using SGMLReader

Chris Lovett wrote a wonderful .NET library called SGMLReader which, when passed a regular HTML file, will spit out XHTML. The library relies on SGML parsing and uses a DTD (Document Type Definition) file to parse loosely structured HTML.
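
To make the idea concrete, here is a minimal sketch of calling the library from C#. The HtmlToXhtml class and its Convert method are hypothetical names used only for illustration. The SgmlReader members shown (DocType, CaseFolding, InputStream) are written as I recall them from the library, so check them against the version you download. The output is well-formed XML; whether it validates as XHTML depends on the input markup.

```csharp
using System.IO;
using System.Xml;
using Sgml; // namespace used by the SgmlReader library

// Hypothetical helper, not part of SgmlReader itself.
public static class HtmlToXhtml
{
    public static string Convert(string html)
    {
        SgmlReader reader = new SgmlReader();
        reader.DocType = "HTML";                   // ask for the library's HTML DTD
        reader.CaseFolding = CaseFolding.ToLower;  // XHTML element names are lowercase
        reader.InputStream = new StringReader(html);

        // SgmlReader derives from XmlReader, so an XmlDocument can load from it
        // directly; unclosed or badly nested tags are repaired while reading.
        XmlDocument document = new XmlDocument();
        document.PreserveWhitespace = true;
        document.Load(reader);

        return document.OuterXml;
    }
}
```

A call such as HtmlToXhtml.Convert("<p>Hello<br>world") should come back with the missing end tags filled in and the br element self-closed.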

I've been playing with Chris's library this week and trying to convert HTML to XHTML on the fly using an ASP.NET Http Filter. The code below consists of the following:
  • An HttpModule, which sets the filter property of the response object for all ASPX page requests.
  • An Http filter, which is a custom stream, to manipulate the HTML content for the requested page.
  • The entry in the web.config file, required for the module to operate.
So, how does the code work?

To understand my code, you should know a little about how Http modules work in ASP.NET. I am not going to broach this subject in this post, but you can find out all you need to know about Http modules on MSDN.  My module code intercepts each incoming web request sent through IIS, checks to see if an ASPX page has been requested, and if so, sets the filter property of the response object to a new instance of XhtmlFilter.
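
As a rough illustration of the module described above (not the original listing), a sketch might look like the following. The class name XhtmlFilterModuleSketch, the IsAspxRequest helper, and the CreateFilter hook are all assumptions made for this example. CreateFilter returns the existing stream unchanged so that the sketch compiles on its own; in the real module it would return a new XhtmlFilter wrapping that stream.

```csharp
using System;
using System.IO;
using System.Web;

// Illustrative sketch only; names are hypothetical.
public class XhtmlFilterModuleSketch : IHttpModule
{
    public void Init(HttpApplication application)
    {
        // BeginRequest fires for every request that ASP.NET handles.
        application.BeginRequest += OnBeginRequest;
    }

    public void Dispose()
    {
        // Nothing to release in this sketch.
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        HttpApplication application = (HttpApplication)sender;
        if (!IsAspxRequest(application.Request.FilePath))
        {
            return; // leave images, style sheets, and so on alone
        }

        HttpResponse response = application.Response;
        // Response.Filter currently holds the default filter. Wrap it so the
        // custom stream can forward its converted output there.
        response.Filter = CreateFilter(response.Filter);
    }

    // True when the requested file has an .aspx extension (case-insensitive).
    public static bool IsAspxRequest(string filePath)
    {
        return filePath != null &&
               filePath.EndsWith(".aspx", StringComparison.OrdinalIgnoreCase);
    }

    // Hypothetical hook: a real module would return the XHTML filter here,
    // constructed around the downstream stream. Pass-through by default.
    protected virtual Stream CreateFilter(Stream downstream)
    {
        return downstream;
    }
}
```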

When ASP.NET is ready to push processed HTML content back to the client browser, via IIS, it uses the filter property in the response object.  The default filter used by ASP.NET is System.Web.HttpResponseStreamFilterSink.  However, it is possible to assign a custom filter to the filter property, which will perform some additional processing before forwarding the content to the default filter.  This is exactly what my filtering code does.  My XhtmlFilter class is a stream-derived class, which captures incoming HTML data from ASP.NET, converts it to XHTML using SGMLReader, and then forwards the changed content to the default filter.

The XhtmlFilter class inherits from a base class, HttpFilterBase, which abstracts away the stream functions not supported by Http filters.  Inside my filter class I make use of a MemoryStream object to capture all the HTML content pushed by the framework before processing the content with SGMLReader. As SGMLReader parses the HTML data, the converted XHTML is streamed out to the default filter, and on to the client browser.
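
The buffering approach described above can be sketched roughly as follows. This is not the original XhtmlFilter: the class name BufferingFilter and the convert delegate are assumptions, with the delegate standing in for the SGMLReader step. For simplicity the sketch converts the whole page in one go when the stream is closed, rather than streaming output as the parser runs, and it folds the unsupported stream members into the same class instead of a separate base class.

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative sketch only; names are hypothetical.
public class BufferingFilter : Stream
{
    private readonly Stream downstream;             // the filter that was installed before
    private readonly Encoding encoding;             // encoding of the response body
    private readonly Func<string, string> convert;  // for example HTML to XHTML
    private readonly MemoryStream buffer = new MemoryStream();
    private bool closed;

    public BufferingFilter(Stream downstream, Encoding encoding,
                           Func<string, string> convert)
    {
        this.downstream = downstream;
        this.encoding = encoding;
        this.convert = convert;
    }

    // Write may be called many times per page, so just collect the bytes.
    public override void Write(byte[] data, int offset, int count)
    {
        buffer.Write(data, offset, count);
    }

    // Once the whole page has been captured, convert it and forward the result.
    public override void Close()
    {
        if (closed) return;
        closed = true;
        string html = encoding.GetString(buffer.ToArray());
        byte[] result = encoding.GetBytes(convert(html));
        downstream.Write(result, 0, result.Length);
        downstream.Close();
        base.Close();
    }

    // Nothing is forwarded until Close, so Flush has no work to do.
    public override void Flush() { }

    // A response filter is write-only and forward-only.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] data, int offset, int count)
    {
        throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }
    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }
}
```

Inside a module, such a filter could be installed with response.Filter = new BufferingFilter(response.Filter, response.ContentEncoding, convert), where convert is whatever HTML to XHTML function you supply. Passing the response's own encoding avoids hard-coding UTF-8.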

The code follows....

XhtmlFilterModule.cs

XhtmlFilter.cs

HttpFilterBase.cs

Module entry in web.config
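
For reference, an Http module is normally registered in classic ASP.NET with an entry along these lines. This is a generic sketch rather than the original listing: the module name, namespace, and assembly name are placeholders to be replaced with your own.

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- type is "Namespace.ClassName, AssemblyName"; both are placeholders here -->
      <add name="XhtmlFilterModule"
           type="MyCompany.Web.XhtmlFilterModule, MyCompany.Web" />
    </httpModules>
  </system.web>
</configuration>
```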

Published Tuesday, August 09, 2005 3:16 PM by Rob Garrett


Comments

 

Chris Hynes said:

I'm currently implementing a filter of my own. Thanks for the great code.

One thing you may want to do is pass Response.Encoding to the filter rather than assuming it's UTF-8.
October 4, 2005 9:08 PM
 

Phil Morris said:

Why not just make sure the original ASPX files are XHTML compliant?  SgmlReader will kill ASP.NET controls, so I wrote AspxCleaner as a wrapper around SgmlReader to 'protect' the ASP.NET controls and other .NET constructs that would otherwise be destroyed.  You can get it at http://www.dotnetwebtools.com, and it's free!  I haven't gone so far as to write code to replace deprecated HTML tags, but will make my source code available to anyone interested in doing so.
June 2, 2006 11:59 PM
 

Rob Garrett said:

Thanks for your additions, Phil.  I couldn't agree more about making sure that ASPX files are XHTML compliant.
June 5, 2006 9:27 AM
 

Lothar said:

Great to have people willing to share their work. I'd love to see the source code for Phil Morris's AspxCleaner. Could you make the source available from your website?
October 5, 2006 1:16 PM
 

Skydiver420 said:

Hi! I recently came across the interesting SgmlReader from Chris Lovett, but I had 2 major issues! GoDotNet.com has been shut down, and the sample is nowhere to be found on MSDN. Posting comments on Chris's web site produces an error (access denied). So can anybody put back the code? Thanks!
February 15, 2008 10:27 PM


Blurb


Rob Garrett is a British expat living in Maryland, USA. Rob is a trained software engineer experienced in Windows .NET development.

Rob enjoys listening to Rock music, posting to blogs, driving in the country with the sunroof open, beer (not in conjunction with country driving) and spending time with his family.
