How to make Namazu document filter
- for Namazu 2.0 -

2001/9/21 Kenji Suzuki
2001/7/7 Kenji Suzuki
version 0.0.3

-----------------------------------------------
This document is under construction.
The description of add_magic() is not exact.
If you find errors, omissions, or unclear points, please inform me.
-----------------------------------------------

** What is a document filter?

A document filter is a module (Perl script) that extracts information
(text) from files to be indexed. Namazu can handle various kinds of
files by providing a filter for each kind. "Weighted scoring" and
"making summary" can also be done in a document filter.

** Where document filters are installed

Document filters are installed into {prefix}/share/namazu/filter/.
By default, this is /usr/local/share/namazu/filter/.
If you install a new document filter into this directory, the filter
is used automatically.

** Interface of document filters

The following subroutines must be defined in a document filter:

    mediatype()
    status()
    recursive()
    pre_codeconv()
    post_codeconv()
    add_magic($)
    filter($$$$$)

* mediatype()
    Returns the media type of the files the filter processes, for
    example:

        text/x-hdml
        application/postscript
        application/x-compress
        etc.

    If a filter can handle several media types, return all of them
    as an array (eg. mailnews.pl).
    We recommend returning IANA registered media types.

* status()
    Normally returns yes.
    If a document filter uses an external command and that command is
    not installed on the system, the filter cannot handle documents
    correctly. In that case, return no.

* recursive()
    An HTML file compressed with gzip must first be handled as
    application/x-gzip and then, after decompression, as text/html.
    If you want filters to be applied recursively like this, return 1.
    Otherwise return 0.

* pre_codeconv()
    If you want the Japanese Kanji code of a document to be converted
    before filter() is called, return 1. Otherwise return 0.
    Namazu uses EUC internally.

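The four subroutines described so far could be sketched as follows for
a gzip-style filter. This is only an illustration, not a complete
filter: post_codeconv(), add_magic() and filter() are omitted. The
package name examplefilter, the helper _find_command() and the variable
$COMMAND are hypothetical names introduced here, and the sketch assumes
that status() returns the strings 'yes' and 'no'.

```perl
package examplefilter;    # hypothetical filter name
use strict;
use warnings;
use File::Spec;

# Hypothetical external command that this filter would depend on.
my $COMMAND = 'gzip';

# Hypothetical helper: search the directories in PATH for an
# executable file with the given name. Returns its path, or nothing.
sub _find_command {
    my ($name) = @_;
    for my $dir (File::Spec->path()) {
        my $path = File::Spec->catfile($dir, $name);
        return $path if -f $path && -x _;
    }
    return;
}

# Media type(s) handled by this filter, returned as a list.
sub mediatype() {
    return ('application/x-gzip');
}

# 'yes' if the external command is available, otherwise 'no'
# (assumption: the return values are these two strings).
sub status() {
    return defined _find_command($COMMAND) ? 'yes' : 'no';
}

# The uncompressed content has its own media type (eg. text/html),
# so ask for it to be filtered again.
sub recursive() {
    return 1;
}

# Compressed data is binary, so request no Kanji code conversion
# before filter() is called.
sub pre_codeconv() {
    return 0;
}

1;
```

A filter that handles plain text directly, without an external
command, would instead return yes unconditionally from status() and 0
from recursive().
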
* post_codeconv()
    If you want the Japanese Kanji code of a document to be converted
    after filter() is called, return 1. Otherwise return 0.
    Namazu uses EUC internally.

* add_magic()
    If File::MMagic fails to recognize a file type, you can add
    recognition rules with the following File::MMagic methods.

    $magic->addSpecials
        eg:
        $magic->addSpecials('text/x-hdml', '<[Hh][Dd][Mm][Ll][^>]*>');
        $magic->addSpecials("text/plain; x-type=rfc",
                            "^Network Working Group",
                            "^Request [fF]or Comments",
                            "^Obsoletes:",
                            "^Category:",
                            "^Updates:");

    $magic->addFileExts
        Specify a file extension.
        This is probably intended for documents, such as Microsoft
        Office suite files, for which a magic entry cannot be written
        correctly.
        eg:
        $magic->addFileExts('^rfc\d+\.txt$', 'text/plain; x-type=rfc');
        $magic->addFileExts('\\.tex$', 'application/x-tex');

    $magic->addMagicEntry
        Specify a magic entry.
        eg:
        $magic->addMagicEntry('0 string \