Rob Garrett - Blogs


Software/Technology Discussion

Software and Technology Tid-bits

I-Filters

Today I carved out a chunk of the day to work on I-Filters. I-Filters are COM dynamic link libraries that convert known file types to text under Windows XP/2K/2K3. The OS's indexing service uses I-Filters to convert PDF and Office file types to text so the indexer can tokenize the words contained in those files.

I wrote a test application that calls an I-Filter library given a file name and converts it to text. The correct filter is determined by examining the file extension and querying the registry (I-Filters are registered with associated file extensions). My code works great with Office documents but barfs when using Adobe's 6.0 I-Filter.
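
As a rough illustration of the registry lookup described above, here is a minimal sketch. It assumes the common registration layout, where the extension key has a PersistentHandler subkey whose default value names a handler, and the handler's PersistentAddinsRegistered entry for the IFilter interface ID names the filter's CLSID. The names FilterLookup, FindFilterClsid and ReadClassesRootDefault are made up for this example and are not from my test application. Some file types register the handler under the document's class rather than under the extension; this sketch does not follow that path.

```csharp
using System;
using Microsoft.Win32;

static class FilterLookup
{
    // Interface ID of IFilter, used as a subkey name under PersistentAddinsRegistered.
    const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // readDefault (hypothetical helper) returns the default value of a key
    // under HKEY_CLASSES_ROOT, or null when the key or value is missing.
    public static string FindFilterClsid(string extension, Func<string, string> readDefault)
    {
        // Step 1: extension (for example ".pdf") -> persistent handler GUID.
        string handler = readDefault(extension + @"\PersistentHandler");
        if (string.IsNullOrEmpty(handler))
            return null;

        // Step 2: persistent handler -> CLSID of the registered IFilter.
        return readDefault(@"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IFilterIid);
    }

    // Reader backed by the real registry (Windows only).
    public static string ReadClassesRootDefault(string subKey)
    {
        using (RegistryKey key = Registry.ClassesRoot.OpenSubKey(subKey))
        {
            // Passing null asks for the key's default value.
            return key == null ? null : key.GetValue(null) as string;
        }
    }
}
```

On a Windows machine the call would look like FilterLookup.FindFilterClsid(".pdf", FilterLookup.ReadClassesRootDefault). Passing the reader as a delegate keeps the two-step lookup separate from the registry access, which makes it easy to try out against a fake table.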

Below is a synopsis of the method that does the work of invoking the filter (leave a comment if you want the rest of the code). The CLSID is the class ID of the filter, read from the registry.

(Apologies for no syntax highlighting)

private static string ExecuteFilter(string clsID, string sourceFile)
{
  string result = String.Empty;
  // Some filters are not reentrant, such as Adobe PDF filter.
  lock(_lock)
  {
    object itfc = null;
    try
    {
      // Get the filter type from CLSID.
      Type t = Type.GetTypeFromCLSID(new Guid(clsID));
      if (null != t)
      {
        // Get filter instance.
        itfc = Activator.CreateInstance(t);
        // Cast to IFilter, then to IPersistFile so the file can be loaded.
        IFilter ifilt = (IFilter)(itfc);
        System.Runtime.InteropServices.UCOMIPersistFile ipf =
           (System.Runtime.InteropServices.UCOMIPersistFile)(ifilt);
        // Load source.
        ipf.Load(sourceFile, 0);
        // Initialize.
        uint i = 0;
        int hr = 0;
        STAT_CHUNK chunk = new STAT_CHUNK();
        ifilt.Init(IFILTER_INIT.NONE, 0, null, ref i);
        // Read the file in chunks.
        StringBuilder masterBuffer = new StringBuilder();
        while (0 == hr)
        {
          // Read next chunk structure.
          try
          {
            hr = ifilt.GetChunk(out chunk);
          }
          catch (COMException ex)
          {
            // GetChunk will throw an exception
            // when there are no more chunks to read - tsk.
            if (FILTER_E_END_OF_CHUNKS == ex.ErrorCode)
              hr = ex.ErrorCode;
            else
              throw; // Rethrow without resetting the stack trace.
          }

          // if chunk is text..
          if (0 == hr && CHUNKSTATE.CHUNK_TEXT == chunk.flags)
          {
            // Read text to buffer.
            uint bufferSize = CHUNK_SIZE;
            int hr2 = 0;
            while (0 == hr2)
            {
              bufferSize = CHUNK_SIZE;
              StringBuilder buffer = new StringBuilder((int)bufferSize);
              hr2 = ifilt.GetText(ref bufferSize, buffer);
              // Append only when the call actually returned text.
              if (0 == hr2 || FILTER_S_LAST_TEXT == hr2)
                masterBuffer.Append(buffer.ToString(0, (int)bufferSize));
            }

            // Did we get an error?
            if (FILTER_E_NO_MORE_TEXT != hr2 && FILTER_S_LAST_TEXT != hr2)
              throw new Exception("Failed reading data from chunk!");
          }
        }
        // Assign result.
        result = masterBuffer.ToString();
      }
    }
    catch (Exception ex)
    {
      throw new FileLoadException("Failed to read data from filter!", ex);
    }
    finally
    {
      if (null != itfc)
        Marshal.ReleaseComObject(itfc);
    }
  }
  return result;
}
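
The method above relies on a few HRESULT constants that are not shown. The sketch below lists the values as defined in the Windows SDK header filterr.h and shows the shape of a text-reading loop that stops on any result other than S_OK. The TextDrain class and the GetTextCall delegate are hypothetical stand-ins for the real IFilter.GetText call, used so the loop can be exercised without a COM filter.

```csharp
using System;
using System.Text;

static class FilterCodes
{
    // HRESULT values from the Windows SDK header filterr.h.
    public const int FILTER_E_END_OF_CHUNKS = unchecked((int)0x80041700);
    public const int FILTER_E_NO_MORE_TEXT = unchecked((int)0x80041701);
    public const int FILTER_S_LAST_TEXT = 0x00041709;
}

static class TextDrain
{
    // Hypothetical stand-in for IFilter.GetText: returns an HRESULT
    // and hands back the text read by that call.
    public delegate int GetTextCall(out string text);

    public static string ReadAll(GetTextCall getText)
    {
        StringBuilder all = new StringBuilder();
        int hr = 0;

        // Keep reading only while the previous call returned S_OK (0).
        while (0 == hr)
        {
            string piece;
            hr = getText(out piece);

            // S_OK and FILTER_S_LAST_TEXT both carry text; anything else does not.
            if (0 == hr || FilterCodes.FILTER_S_LAST_TEXT == hr)
                all.Append(piece);
        }

        // The two expected ways for a chunk's text to end.
        if (FilterCodes.FILTER_E_NO_MORE_TEXT != hr && FilterCodes.FILTER_S_LAST_TEXT != hr)
            throw new InvalidOperationException("Failed reading data from chunk!");

        return all.ToString();
    }
}
```

Looping on "result is S_OK" rather than on "result is not FILTER_S_LAST_TEXT" means a filter that ends a chunk with FILTER_E_NO_MORE_TEXT still terminates the loop.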


Published Tuesday, January 11, 2005 5:24 PM by Rob Garrett


Comments

 

Chris Hynes said:

Been checking around and I found a couple of hints as to what might be going on. First of all, the Adobe IFilter is apartment threaded. I see a lot of people recommending changing its ThreadingModel registry value to Both. See this page for more: http://www.adobe.com/support/techdocs/327014.html.

Also, I found this thread (http://sqljunkies.com/WebLog/acencini/articles/716.aspx) which has a bunch of info, as well as a recommendation to check whether the buffer returned was the right size... Can you check for Adobe IFilter returning a smaller sized chunk of text the last time and know that it has completed at that point?
January 12, 2005 7:48 PM
 

Rob Garrett said:

Cool, thanks, I'll check out these links.

I tried commenting out some of the code to see what is causing the filter to go bad. It seems that just instantiating the object using Activator.CreateInstance(t) is enough to cause the exception after the object goes out of scope.

I'll investigate further ....
January 12, 2005 10:49 PM
 

Doug said:

IFilters have been driving me insane for an embarrassing amount of time, so any code you have would be greatly appreciated. Other examples I've found mysteriously resolve to different IFilters for two files of the same type, and C#/Interop is not my strong point.
February 14, 2005 8:14 PM
 

Chris Anzalone said:

Rob

Thanks for posting the code. I just downloaded the Adobe IFilter 6.0 and plan to dig in. I would appreciate any other code you are willing to share on the subject as I am pretty inexperienced in this topic.

Just to highlight that point... I'm guessing that the reason this works well with Microsoft Office documents is that you have Office (2003?) installed on the machine and the code is able to find the appropriate CLSID from the registry?

Or is it that just having indexing service installed is enough to locate the appropriate IFilter for these documents?

Anyway, thanks again.
April 10, 2005 10:06 AM
 

Rob Garrett said:

Correct, I have Office 2K3 installed, which installs an IFilter and registers its CLSIDs in the registry.
April 10, 2005 1:53 PM
 

Raynald Gérault said:

Rob

Thanks for the post. I am trying to convert a folder with several document types (PDF, DOC...) to HTML, I think your code could help me to do this.

Raynald
April 19, 2005 8:58 AM
 

Matt Daniel said:

April 21, 2005 3:15 PM
 

Jason Manfield said:

I downloaded IFilter 6.0 and am converting PDF files to text. However, when my application exits, it crashes with the following message:

The instruction at “0X02e161b3” referenced memory at “0x038027b0”. The memory could not be “read”.



Click OK to terminate the program. The popup window has the title: Font Capture: ConvertPDF.exe – Application Error



I am using Visual Studio 2005 and my .NET version is 2.0.50215.


September 16, 2005 3:12 PM
 

Rob Garrett said:

Right, I had the same problem. I ended up using version 5 instead.
September 16, 2005 3:40 PM
 

shweta said:

I am shweta, working at span services. I am presently working on a project which is basically a search engine, so we are planning to implement IFilter.
Can you please share with me how efficient IFilters are, what the known and unknown issues regarding IFilters are, and any other IFilter-related problems?

Regards

shweta

September 26, 2005 3:34 AM
 

Rob Garrett said:

IMHO IFilters are not the best technology in the world - I find that each IFilter implementation is slightly different to the next, even though they support the same interface. So getting them to work can be a trial and error exercise. It also depends on what programming language you're planning on using. IFilters are COM based, so there's the headache of making COM interop calls from your application. C# does a great job, but if the threading model is not correct, some IFilters will blow up. Of course, if you're concerned with efficiency and performance, COM will be a bottleneck for you.

If you're implementing a search engine, I would suggest IFilters for offline indexing - index in batch and store the indexed data in a repository, which can be searched faster online. Using IFilters for dynamic searching would not be efficient.
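
To make the offline-indexing suggestion concrete, here is a minimal sketch of the idea: run the filters once in a batch job, store which words appear in which files, and answer searches from that store. The BatchIndex class, its in-memory dictionary and the naive word splitting are assumptions made for the example; a real repository would be persisted and would tokenize more carefully.

```csharp
using System;
using System.Collections.Generic;

class BatchIndex
{
    // Word -> set of file paths containing that word (case-insensitive).
    readonly Dictionary<string, HashSet<string>> _postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Offline step: call once per document with the text the filter produced.
    public void Add(string path, string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            HashSet<string> paths;
            if (!_postings.TryGetValue(word, out paths))
                _postings[word] = paths = new HashSet<string>();
            paths.Add(path);
        }
    }

    // Online step: a dictionary lookup, with no filter or COM call involved.
    public IEnumerable<string> Search(string word)
    {
        HashSet<string> paths;
        return _postings.TryGetValue(word, out paths) ? paths : new HashSet<string>();
    }
}
```

The slow part (filter extraction) happens only in Add, so search time no longer depends on how fast any particular IFilter is.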
September 26, 2005 9:03 AM
 

Sandeep Krishna said:

Hi Rob Garrett,

I'm working on a search engine product. Here I even need to look into .mpp files (Microsoft Project Plan). We found IFilters to be the best way to extract text and are using the samples provided on this site, but could not locate a good IFilter for .mpp files; very little work has been done on this file type. We found one named "msprofilt_w.dll", but it isn't worth much and does not serve our purpose.

Can you please help me out with the best IFilter for .mpp files, if you have come across any?

My email ID is mentioned in the URL. It would be a great help if you could suggest one.
September 29, 2005 3:40 AM
 

Rob Garrett said:

Unfortunately, this is where I ran into problems also. I was able to write infrastructure code to make use of IFilters, but had a hard time finding IFilter DLLs for certain file types that existed and worked. I never came across a filter for MPP files. Sorry I could not be any further help.
September 29, 2005 9:29 AM
 

shweta said:

Hi Rob

Can you please share your experience of working with IFilter DLLs?
How efficient are IFilters, and what issues did you face (if any)?
Are there any constraints or other problems regarding memory, processing speed, or multithreaded environments?

As I explained earlier, we are working on a search engine that is going to index gigabytes (500 GB) of data, so lots of threads, memory, and processes will come into the picture. So please share your experience of working with IFilters.
September 30, 2005 2:21 AM
 

Rob Garrett said:

Hi Shweta,
This blog post, its comments, and any software constitute my sole experience with IFilters. Below is a summary with regard to your specific concerns:

1. IFilters are NOT efficient - period. You have to consider the COM Interop and marshalling bottleneck, and you're also at the mercy of whoever implemented the IFilter. If performance is key, then the best approach is to extract the text data from the file directly.

2. Issues that I have experienced are (and not limited to) - bad performance, IFilters crashing (sometimes due to the threading model, sometimes not), some IFilters do not adhere to the spec, and others require some registry hacking to get them to work. On a general level, not all filters return all data - the Visio filter only returns sparse text data.

3. Memory and processing have not been a big issue for me (except for the fact that IFilters are not particularly fast). My colleagues and I wrote a thread-based system, which would convert files with IFilters concurrently. We limited the number of concurrent jobs to prevent memory and CPU usage explosion. Once again, memory and CPU usage are at the mercy of the implementation of the IFilter.

4. Do you know what file types you're planning on indexing? You're indexing a lot of data - if the data type is pretty much static, then I would advise that you index the data in the files directly. (Data indexing can take a while, even without the overhead of IFilters and COM.) If you're trying to index any and every file type on the system, then you may be limited in options, in which case the trade-off between performance and convenience may be better justified with IFilters.
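
The job-limiting idea in point 3 can be sketched as follows. This is an illustration only, not the system described above: the ThrottledConverter class and the convert delegate (standing in for a call such as ExecuteFilter) are hypothetical, and the sketch uses thread-pool tasks, which run in the multithreaded apartment. An apartment-threaded filter may need dedicated STA threads instead.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledConverter
{
    // Converts every file, running at most maxConcurrent conversions at once.
    public static Dictionary<string, string> ConvertAll(
        IEnumerable<string> files, Func<string, string> convert, int maxConcurrent)
    {
        var results = new Dictionary<string, string>();
        var tasks = new List<Task>();

        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            foreach (string file in files)
            {
                // Blocks here until one of the running jobs releases its slot.
                gate.Wait();
                string current = file;
                tasks.Add(Task.Run(() =>
                {
                    try
                    {
                        string text = convert(current);
                        lock (results) { results[current] = text; }
                    }
                    finally
                    {
                        // Free the slot even if the conversion throws.
                        gate.Release();
                    }
                }));
            }

            // Wait for the remaining jobs before the semaphore is disposed.
            Task.WaitAll(tasks.ToArray());
        }

        return results;
    }
}
```

The semaphore caps how many filters are loaded and running at any moment, which is what keeps memory and CPU usage bounded regardless of how many files are queued.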
September 30, 2005 10:57 AM
 

shweta said:

Hi Rob

Thanks for sharing your valuable web log with us.

Actually, we have to index 16 different formats, such as:

1. Text File
2. HTML File
3. PDF (Acrobat)
4. RTF
5. Word Perfect (Corel)
6. Microsoft Word
7. Microsoft Excel
8. Microsoft Access
9. Microsoft PowerPoint
10. Microsoft Visio
11. Microsoft Project
12. OutLook PST (Microsoft)
13. EML (Microsoft)
15. NSF - Mail Client
16. ZIP

For nearly 12 of these formats we have found IFilters. As I explained earlier, the database will be in the gigabytes, so we have to write a thread-based system which reads a number of files concurrently. I also feel that IFilter behaviour changes frequently: sometimes it works fine and sometimes it crashes. The system we are designing will be totally automated, so that will become an issue.

Can you help me out or suggest an alternative way to extract text from all these file types? We can't read the files directly.





October 4, 2005 1:52 AM
 

Rob Garrett said:

Check out my other post at http://robgarrett.com/Blogs/software/archive/2005/01/24/468.aspx. Raldo has extensive knowledge of IFilters and may be able to help you further.
October 6, 2005 9:27 AM
 

TrackBack said:

October 6, 2005 9:31 AM
 

sunny said:

Hi Rob,

I'm looking for an IFilter to fetch data from .mdb files. Please help me with this. If there is any other efficient way to fetch details from .mdb files apart from IFilters, that would be a great help.
October 14, 2005 10:31 AM
 

Rob Garrett said:

Hi sunny,
I wish I could help you with extracting data from MDB files, however, my post is only concerned with using IFilters that exist - not about implementing them. I too have had the same problem finding an IFilter to extract data from MDB files.

Bottom line: Microsoft don't really want you snooping around in their proprietary application data files because you'd be able to read them from other non-MS applications.

Good luck.
October 14, 2005 3:13 PM
 

kunal ramesh lalwani said:

Hi Rob,
I am developing a web-based application in which registered users can upload files as attachments directly into a BLOB field of the database. Just before the upload occurs, I need to extract text from the file and store it as a column in the database. I searched a lot about IFilters and came up with lots of solutions. I just need to detect the extension of the file being uploaded (mostly they are Microsoft Office documents), hence I have to use OffFilt.dll, which is the IFilter for Microsoft Office files. I would like to avoid searching the registry for the CLSID of OffFilt.dll, creating an instance, and then getting the chunks, text, and properties of the Office document. If there is a way to get this directly from the Indexing Server, I would be grateful.


Thanks in advance
Kunal Ramesh Lalwani
Software Developer
Fahm Softwares
India
October 15, 2005 1:52 AM
 

shweta said:

Hi kunal,

If you want to know which IFilter to load for a particular type of file, there is no need, because the IFilter API (LoadIFilter) internally identifies which IFilter it should load for a particular file type.
October 17, 2005 7:39 AM
 

Venky said:

Hi Rob

Can you put some code on how to read properties of document.
I guess the code has to be extended like

else if (0 == hr && CHUNKSTATE.CHUNK_VALUE == chunk.flags)
{
// Need complete code to do ifilt.GetValue()
}
November 4, 2005 1:09 AM
 

alon said:

Hi.
I am trying to extract text from Excel documents using Microsoft IFilter (OFFFilt.dll)

My problem is with the extraction order: the extracted text is not in the order it appears in the spreadsheet (down then over / over then down). I think it is related to the time the cells were created/modified.

Can I control that extraction order?
Thanx alon.
January 3, 2006 5:36 AM
 

Vijay Kumar Raja.Grandhi said:

I went through your article. Can anybody help me by providing sample code for extracting text from a given file? When I try the above code I get errors for files with extensions such as TIFF and PUB. For PDF, it throws an error after the text has been extracted.

I saw that somebody named swetha is working on a similar sort of application to the one I am currently working on. So please, can you provide me with code samples that you have developed?

Help me, my friends. I am in deep, deep trouble completing this task.
I am very weak in COM Interop and marshalling.

Please help me.
Thank you in advance for your help

my mail id is gvkraj23@gmail.com


February 24, 2006 7:21 AM
 

shweta said:

Hi rob,

Have you faced this kind of problem before: the HTML IFilter is not able to extract the text in a JavaScript tag, such as whatever is in document.write()?

Let me know if you have a solution.

Thanks
shweta
March 11, 2006 6:07 AM
 

Eyal Post said:

March 12, 2006 8:35 AM
 

Swati Sinha said:

Hi everyone!

I am developing a search engine application in ASP.NET by calling MS's indexing service. This works fine for Office docs and HTML files, but for .pdf and .zip files I guess we have to download the IFilters and integrate them with our documents.

I am new to this world of IFilters and just followed the instructions given on the following link.
http://www.developerland.com/dotnet/enterprise/382.aspx

I added the dll to the registry and did everything that was given but my application doesn't seem to work.

Can somebody please help?

cheers,
swati
April 10, 2006 6:52 AM


Blurb


Rob Garrett is a British Expat living in Maryland USA. Rob is a trained software engineer and experienced in Windows .NET development.

Rob enjoys listening to Rock music, posting to blogs, driving in the country with the sunroof open, beer (not in conjunction with country driving) and spending time with his family.
