Archive for May, 2005

Cross-platform document indexing with Lucene and OpenOffice

If you have ever worked with Lucene, you know that to index your documents you first have to convert them to text. Lucene uses this text to create the index for the document.
For most formats there are converters, written either in Java or as native code. The problem is that you need to write indexing classes for a lot of different document types, use a lot of different libraries, and make sure your deployment machine contains those libraries…

…or you just install OpenOffice (http://www.openoffice.org)!

OpenOffice is capable of opening most productivity tool formats and even more formats are in the making as we speak. Opening Word, Excel, Powerpoint, Access, dBase, WordPerfect, all StarOffice and OpenOffice formats, Lotus 1-2-3, HTML, XML and much more are no problem for the latest version of OpenOffice.

Now all that remains is to get a document converted for indexing with Lucene. This is straightforward. I assume you have installed OpenOffice 1.9+ Beta.

  1. Open Writer
  2. Go to Tools -> Macros -> Organize Macros
  3. Click on the Libraries tab
  4. Select ‘My Dialogs & Dialogs’
  5. Click on New
  6. Type ‘MyLibraries’
  7. Click on ‘Modules’
  8. Click on ‘MyLibraries’
  9. Click on New
  10. Type ‘Conversion’
  11. Select ‘Conversion’
  12. Click on Edit
  13. Copy/paste the following code into the editor:

Sub ConvertWordToTxt( cFile )
   cURL = ConvertToURL( cFile )

   ' Open the document.
   ' Just blindly assume that the document is of a type that OOo will
   ' correctly recognize and open, without specifying an import filter.
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, Array(_
            MakePropertyValue( "Hidden", True ) ) )

   ' Replace the three-letter extension (for example ".doc") with ".txt".
   cFile = Left( cFile, Len( cFile ) - 4 ) + ".txt"
   cURL = ConvertToURL( cFile )

   ' Save the document using a filter.
   oDoc.storeToURL( cURL, Array(_
            MakePropertyValue( "FilterName", "Text" ) ) )

   oDoc.close( True )
End Sub

' Helper that builds a single PropertyValue for the Array() arguments above.
Function MakePropertyValue( Optional cName As String, Optional uValue ) As com.sun.star.beans.PropertyValue
   Dim oPropertyValue As New com.sun.star.beans.PropertyValue
   If Not IsMissing( cName ) Then
      oPropertyValue.Name = cName
   EndIf
   If Not IsMissing( uValue ) Then
      oPropertyValue.Value = uValue
   EndIf
   MakePropertyValue() = oPropertyValue
End Function

Now hit Ctrl-S and close Writer.

You can now run this code like this:

"c:\program files\OpenOffice.org 1.9.79\program\soffice" -invisible "macro:///MyLibraries.Conversion.ConvertWordToTxt(DOCUMENTNAME)"

Substitute DOCUMENTNAME with your document and your document will be converted to .txt.
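
As a first step towards calling this from Java, here is a minimal sketch that builds the same command line for the ConvertWordToTxt Sub from the listing above and predicts the name of the resulting .txt file. The class name, the method names, the install path and the example document path are my own assumptions for illustration; the actual process start is left commented out because it needs OpenOffice and the macro to be installed.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: builds the soffice command line for the ConvertWordToTxt macro. */
final class SofficeTextConversion {

    /** Hypothetical helper: the argument list for one conversion run. */
    static List<String> buildCommand(String sofficePath, String documentPath) {
        // Macro URL naming the ConvertWordToTxt Sub in MyLibraries.Conversion.
        String macroUrl = "macro:///MyLibraries.Conversion.ConvertWordToTxt(" + documentPath + ")";
        return Arrays.asList(sofficePath, "-invisible", macroUrl);
    }

    /** Mirrors the macro: drop the last four characters (".doc") and append ".txt". */
    static String textFileFor(String documentPath) {
        return documentPath.substring(0, documentPath.length() - 4) + ".txt";
    }

    public static void main(String[] args) throws Exception {
        // Assumed install location; adjust to your own machine.
        String soffice = "c:\\program files\\OpenOffice.org 1.9.79\\program\\soffice";
        // Assumed example document when no argument is given.
        String document = args.length > 0 ? args[0] : "c:\\docs\\report.doc";

        System.out.println(buildCommand(soffice, document));
        System.out.println("Expect text in: " + textFileFor(document));

        // To really run the conversion (needs OpenOffice plus the macro):
        // new ProcessBuilder(buildCommand(soffice, document)).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list to ProcessBuilder avoids quoting problems with the space in "program files". The generated text file could then be read and handed to Lucene.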

I will explore using this within Lucene a bit later!

Making your own storage server with cheap hardware

Yesterday I started working on a cheap storage server. I had the following requirements:

  1. I needed a versioning file system (so every document I store is versioned)
  2. I needed access via Webdav, also for this versioning
  3. It should be easy to install
  4. I wanted to use cheap hardware (the 400 MHz system gathering dust in my cupboard should do nicely)
  5. I wanted user-friendly tools to support the system (under both Windows and Linux)
  6. It had to be secure (SSL enabled on all levels)
  7. It must be searchable, whatever document I throw in (doc, xls, odt, wpd, txt, xml, html, mp3 etc)

After a bit of searching, it became clear I would use Apache 2 and Subversion under Linux to achieve my goal.

The steps I took to install it all were:

  1. Install Fedora Core 3, server install with all installation options marked off and SELinux off (!)
  2. Get the network verified and working
  3. Import the GPG key:
    rpm --import http://www.fedora.us/FEDORA-GPG-KEY
  4. Install Apache: yum install httpd
  5. Install mod_ssl: yum install mod_ssl
  6. Install Subversion: yum install subversion
  7. Install mod_dav_svn: yum install mod_dav_svn

Now your FC installation is ready to go; it has already set up a key for you, so SSL is ready to use.

Make the repository directory and create the repository in it:

mkdir /home/svn
svnadmin create /home/svn

chown -R apache.apache /home/svn

Make the directory for your security:

mkdir /home/secsvn

chgrp -R apache /home/secsvn
chmod -R 750 /home/secsvn

You have to change the configuration of Subversion under Apache; it is stored here:

/etc/httpd/conf.d/subversion.conf

Add the following:

<Location /svn>
DAV svn
SVNPath /home/svn
SVNAutoversioning on
AuthType Basic
AuthName "My SVN Repository"
AuthUserFile /home/secsvn/passwd
AuthzSVNAccessFile /home/secsvn/access
Require valid-user
</Location>

Make the password file:

htpasswd -cm /home/secsvn/passwd john

and type a password.

Then make the access file (finish the input with Ctrl-D):

cat > /home/secsvn/access
[/]
john = rw

This means that john can read and write the entire repository. You can add more users if you want, of course; read the SVN book to see the syntax.

Start Apache:

/etc/init.d/httpd start

To make sure it starts when the OS boots:

chkconfig httpd on

Now it should work. Make sure the firewall is open for Apache on port 80; if you are not sure whether it is, type:

system-config-securitylevel

and add Apache to the list of allowed services.

From another computer, try:

http://your.ip/svn/

This should ask you for a name and password; use the user 'john' created earlier.

If you enter the user name and password and it does not work, look in the /var/log/httpd/error_log file for clues. If there is something wrong with permissions, you probably mounted the repository on another drive and have SELinux enabled. To fix this, either turn SELinux off or give the repository the right SELinux security context (Google this).
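
The browser check can also be scripted. The following is a minimal sketch, assuming the /svn location and Basic authentication configured above; the class and method names are my own and the URL, user and password are passed on the command line. It sends one GET request with an Authorization header and prints the HTTP status code.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch only: checks that the repository URL answers with Basic authentication. */
final class SvnDavCheck {

    /** Builds the value of the HTTP Authorization header for Basic authentication. */
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Sends one GET request and returns the HTTP status code. */
    static int fetchStatus(String url, String user, String password) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Authorization", basicAuthHeader(user, password));
        try {
            // 200 means the login was accepted, 401 means it was rejected.
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: SvnDavCheck <url> <user> <password>");
            return;
        }
        System.out.println(fetchStatus(args[0], args[1], args[2]));
    }
}
```

Run it with something like "java SvnDavCheck http://your.ip/svn/ john yourpassword". A 403 would suggest that the access file does not grant the user any rights on the path.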

[TODO: howto switch on SSL for this service]

I will describe how I built the search here later; it turned out my network card was broken, and the network switched off automatically after 30 minutes of use on this system.

Executing files with Windows native handler

As I am currently building a cross-platform file manager for our document management tool in Java, I need a way to have files executed by the Windows shell (and by other OSes, but for most clients it is Windows) just as they would be when you click on them in the Windows file explorer.

Doing this from Java is quite easy and a lot of people have already figured it out: you simply use the rundll32 command to execute the following:

rundll32 url.dll,FileProtocolHandler URL

where URL can be something on the internet or on your local drive. Say you want to 'run' a .txt file on your c: drive:

rundll32 url.dll,FileProtocolHandler file:///mytextfile.txt

This opens the file, on my system in Notepad.

The problem with this method is that if there is no handler, nothing will happen…

After playing around a bit with the registry I found the following:

reg query HKCR\.ext

gives you information about the extension and what Windows will do with it.

In general, as I have discovered (correct me if I am wrong please!), if <NO NAME> is in the result, the file can and will be executed by Windows. You can even use it to extract the file type. Example:

C:\>reg query HKCR\.dot

! REG.EXE VERSION 3.0

HKEY_CLASSES_ROOT\.dot
    <NO NAME>    REG_SZ    OpenOffice.org.dot

HKEY_CLASSES_ROOT\.dot\PersistentHandler

So the code for running a file with the default associated application under Windows would be:

import java.io.InputStream;

String file = "/test.html";
// Extension including the leading dot, for example ".html".
String ext = file.substring(file.lastIndexOf('.'));

// Ask the registry whether Windows knows this extension.
Process p = Runtime.getRuntime().exec("reg query HKCR\\" + ext);
InputStream in = p.getInputStream();
p.waitFor();
byte[] buffer = new byte[in.available()];
in.read(buffer);
String result = new String(buffer);
in.close();

// A default value (<NO NAME>) means a handler is registered.
if (result.indexOf("<NO NAME>") > -1) {
    p = Runtime.getRuntime().exec("rundll32 url.dll,FileProtocolHandler file://" + file);
} else {
    System.err.println("No handler available for " + file);
}


There are probably ways to do this faster/easier, but currently we are not in need of this. However, if you know ways to do it, please comment on this post.
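
The registry output can also be used to extract the file type, as mentioned earlier. Here is a minimal sketch of that idea, assuming the output format of REG.EXE 3.0 shown in the example; the class and method names are my own, and other versions of reg may label the default value differently.

```java
/** Sketch only: pulls the file type out of "reg query" output. */
final class RegQueryParser {

    /**
     * Returns the value on the default-value line (the one starting with
     * "<NO NAME>"), or null when the output has no such line.
     */
    static String fileTypeOf(String regOutput) {
        for (String line : regOutput.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("<NO NAME>")) {
                int marker = trimmed.indexOf("REG_SZ");
                if (marker >= 0) {
                    // Everything after the type marker is the value itself.
                    String value = trimmed.substring(marker + "REG_SZ".length()).trim();
                    return value.isEmpty() ? null : value;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample taken from the .dot example in this post.
        String sample = "! REG.EXE VERSION 3.0\n\n"
                + "HKEY_CLASSES_ROOT\\.dot\n"
                + "    <NO NAME>   REG_SZ  OpenOffice.org.dot\n\n"
                + "HKEY_CLASSES_ROOT\\.dot\\PersistentHandler\n";
        System.out.println(fileTypeOf(sample));
    }
}
```

For the .dot example this gives OpenOffice.org.dot; a null result can be treated the same way as the "No handler available" branch.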

Installing Linux on an Acer Ferrari 3400

I recently got to pick a new notebook at work to program Java on. Because I like gadgets and have wanted to try out those nice 64 bit processors for a long time, I picked the Acer Ferrari. The Ferrari logo is a bit of a shame (not having enough money to buy the car but still owning a Ferrari is kind of sad). The red colour is nice though, and the system is thin and not too heavy.

Windows 64 bit is too new and not included. And why would I want to use Windows anyway?
Lazy as I am, I did not check whether any Linux would run on it. I just assumed it would. After a bit of web searching I decided to try some distributions: Ubuntu 64 bit, Knoppix 64 bit and, last but not least, Fedora Core 3 64 bit.

FC3 was the only Linux which installed nicely and detected almost everything. Be sure to start the install with "linux nofb", otherwise you'll see a very black screen during the install. With that option the install goes flawlessly.

To use the integrated wireless LAN under a 64 bit system, you'll need the 64 bit Windows driver for this Broadcom chipset. You can pick up the driver here: http://ubuntuforums.org/attachment.php?attachmentid=186
Then install ndiswrapper (http://ndiswrapper.sourceforge.net). Getting it to work is straightforward:

tar xvzf ndiswrapper.tgz
cd ndiswrapper
make
make install
ndiswrapper -i netbc564.inf
modprobe ndiswrapper
iwlist wlan0 scan

Now you should see a list of detected networks; if not, something is wrong.

If your computer hangs after the modprobe command, you probably have more than 1 GB of memory and not the latest version of ndiswrapper. I fixed this in ndiswrapper together with the guys maintaining it (long live Open Source!); it took one day to figure out what it was and fix the bug (https://sourceforge.net/tracker/?func=detail&atid=604450&aid=1169978&group_id=93482).

After the wireless station is detected, you should make the network work:

ifconfig wlan0 some_ip up
route add default gw some_gw

When using WEP or other encryption, some more steps are required (Google for iwlist, iwconfig etc).

I work happily with a 64 bit Linux system now, but I still have one problem: the battery status is shown incorrectly. I use FC3 with KDE and its default notebook app for showing the battery status. If anyone has a fix, please comment on this post!