Cross-platform document indexing with Lucene and OpenOffice

If you have ever worked with Lucene, you know that before you can index your documents you have to convert them to text. Lucene uses this text to create the index for the document.
For most formats there are converters, written either in Java or as native code. The problem is that you need to write indexing classes for a lot of different document types, you have to use a lot of different libraries, and you have to make sure your deployment machine contains those libraries…

…or you just install OpenOffice (http://www.openoffice.org)!

OpenOffice can open most productivity tool formats, and support for even more formats is in the making as we speak. Word, Excel, PowerPoint, Access, dBase, WordPerfect, all StarOffice and OpenOffice formats, Lotus 1-2-3, HTML, XML and many more are no problem for the latest version of OpenOffice.

All that remains is to get a document converted to text so that Lucene can index it. This takes nothing more than a short macro. I assume you have installed OpenOffice 1.9+ Beta.

  1. Open Writer
  2. Go to Tools -> Macros -> Organize Macros
  3. Click on the Libraries tab
  4. Select ‘My Macros & Dialogs’
  5. Click on New
  6. Type ‘MyLibraries’
  7. Click on ‘Modules’
  8. Click on ‘MyLibraries’
  9. Click on New
  10. Type ‘Conversion’
  11. Select ‘Conversion’
  12. Click on Edit
  13. Copy/paste the following code into the editor:

Sub ConvertWordToTxt( cFile )
    cURL = ConvertToURL( cFile )

    ' Open the document hidden, so that no window appears.
    ' Just blindly assume that the document is of a type that OOo will
    ' correctly recognize and open, without specifying an import filter.
    oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, _
        Array( MakePropertyValue( "Hidden", True ) ) )

    ' Swap the extension for .txt. This assumes the file name ends in a
    ' dot plus three characters, such as .doc.
    cFile = Left( cFile, Len( cFile ) - 4 ) + ".txt"
    cURL = ConvertToURL( cFile )

    ' Save the document using the plain-text export filter.
    oDoc.storeToURL( cURL, _
        Array( MakePropertyValue( "FilterName", "Text" ) ) )

    oDoc.close( True )
End Sub

' Build a com.sun.star.beans.PropertyValue from an optional name and value.
Function MakePropertyValue( Optional cName As String, Optional uValue ) As com.sun.star.beans.PropertyValue
    Dim oPropertyValue As New com.sun.star.beans.PropertyValue
    If Not IsMissing( cName ) Then
        oPropertyValue.Name = cName
    EndIf
    If Not IsMissing( uValue ) Then
        oPropertyValue.Value = uValue
    EndIf
    MakePropertyValue() = oPropertyValue
End Function

Now hit Ctrl+S and close Writer.

You can now run this code like this:

"c:\program files\OpenOffice.org 1.9.79\program\soffice" -invisible "macro:///MyLibraries.Conversion.ConvertWordToTxt(DOCUMENTNAME)"

Substitute DOCUMENTNAME with the path to your document, and the document will be converted to a .txt file with the same name, next to the original.
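
Since Lucene lives in Java, the same call can be made from Java code instead of by hand. What follows is only a rough sketch under a few assumptions: the class OfficeConverter and its method names are made up for illustration, the macro name is the ConvertWordToTxt Sub defined above, and the path to soffice is whatever applies on your machine. The output path is derived with the same "drop four characters, add .txt" rule the macro uses.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper (hypothetical name) that drives the conversion macro from Java. */
final class OfficeConverter {

    // Library, module and Sub as created in the steps above.
    static final String MACRO = "macro:///MyLibraries.Conversion.ConvertWordToTxt";

    /** Builds the argument list for the soffice call; nothing is executed here. */
    static List<String> buildCommand(String sofficePath, String documentPath) {
        return Arrays.asList(sofficePath, "-invisible", MACRO + "(" + documentPath + ")");
    }

    /** Mirrors the macro's naming rule: drop the last four characters, append ".txt". */
    static Path textOutputFor(String documentPath) {
        String stem = documentPath.substring(0, documentPath.length() - 4);
        return Paths.get(stem + ".txt");
    }

    /** Starts soffice, waits for it to exit, and returns the expected .txt path. */
    static Path convert(String sofficePath, String documentPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(sofficePath, documentPath))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("soffice exited with code " + exitCode);
        }
        // Assumption: the macro has written the file by the time soffice exits.
        return textOutputFor(documentPath);
    }
}
```

Calling OfficeConverter.convert with your soffice path and a document path should then leave you with the path of the .txt file, ready to be read and handed to Lucene. Passing the arguments as a list lets ProcessBuilder take care of quoting paths that contain spaces.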

I will explore using this within Lucene a bit later!
