If the target files all share a single media type, specifying it with the --media-type=mtype option can gain a 10-20% speed improvement. If the target files are Mail files, you can use --mailnews instead of --media-type=message/rfc822. Similarly, if the target files are HTML files generated by MHonArc, you can use --mhonarc instead of --media-type='text/html; x-type=mhonarc'.
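For example, a hypothetical invocation for indexing a Mail directory might look like the following (the index and mail paths are assumptions, not taken from this manual):

    % mknmz --mailnews -O /usr/local/namazu/index ~/Mail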
By default, $ON_MEMORY_MAX in mknmzrc is set to 5 MB. Through this value, mknmz limits the amount of documents processed in memory at once. If you use a machine with 1 GB of memory, you will have no trouble setting $ON_MEMORY_MAX to 100 MB. By doing so, the number of writes to working files is reduced.
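As a sketch, the corresponding line in mknmzrc would look like the following, assuming the value is given in bytes (as the 5 MB default suggests):

    $ON_MEMORY_MAX = 100 * 1024 * 1024;   # 100 MB; the default is 5 MB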
Indexing takes a lot of memory. If you encounter an "Out of memory!" error while running mknmz, the following precaution can be considered. By default, $ON_MEMORY_MAX is set to 5 MB; this value assumes your machine has 64 MB of memory. If your machine has less memory, decrease this value.
By default, the following rules are applied for score weighting. These values were decided empirically and have no theoretical foundation.
<title> 16
<h1> 8
<h2> 7
<h3> 6
<h4> 5
<h5> 4
<h6> 3
<a> 4
<strong>, <em>, <code>, <kbd>, <samp>, <cite>, <var> 2
Moreover, for <meta name="keywords" content="foo bar">, the keywords foo bar are given a score of 32.
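The weighting can be modeled roughly as follows. This is an illustrative Python sketch, not the actual mknmz code; in particular, treating untagged body text as weight 1 is an assumption.

    # Illustrative model of Namazu's score weighting (not the mknmz code).
    WEIGHTS = {
        "title": 16, "h1": 8, "h2": 7, "h3": 6, "h4": 5, "h5": 4, "h6": 3,
        "a": 4, "strong": 2, "em": 2, "code": 2, "kbd": 2, "samp": 2,
        "cite": 2, "var": 2,
    }
    META_KEYWORDS_WEIGHT = 32

    def add_occurrence(scores, word, tag=None):
        """Count one occurrence of `word`, weighted by its enclosing tag."""
        if tag == "meta-keywords":
            weight = META_KEYWORDS_WEIGHT
        else:
            weight = WEIGHTS.get(tag, 1)   # plain body text: 1 (assumption)
        scores[word] = scores.get(word, 0) + weight

    scores = {}
    add_occurrence(scores, "namazu", "title")   # adds 16
    add_occurrence(scores, "namazu")            # adds 1 more
    print(scores)                               # {'namazu': 17}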
Namazu decodes &quot;, &amp;, &lt;, &gt; as well as named entities and numeric character references in the ranges &#9;-&#10; and &#32;-&#126;. Since the internal encoding is EUC-JP, the right half of ISO-8859-1 (0x80-0xff) cannot be used. For the same reason, numeric character references in UCS-4 cannot be used.
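A minimal Python sketch of this decoding follows; mknmz does this internally in Perl, and only the four named entities are handled here for brevity.

    import re

    NAMED = {"quot": '"', "amp": "&", "lt": "<", "gt": ">"}

    def decode_entities(text):
        """Decode &quot; &amp; &lt; &gt; and numeric references,
        restricted to the code ranges 9-10 and 32-126."""
        def repl(m):
            if m.group(1):                        # named entity, e.g. &amp;
                return NAMED[m.group(1)]
            code = int(m.group(2))                # numeric ref, e.g. &#65;
            if code in (9, 10) or 32 <= code <= 126:
                return chr(code)
            return m.group(0)                     # out of range: keep as-is
        return re.sub(r"&(?:(quot|amp|lt|gt)|#(\d+));", repl, text)

    print(decode_entities("&lt;tag&gt; &#65; &#160;"))   # "<tag> A &#160;"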
Spaces and tabs at the beginning and end of lines, and >, |, #, : at the beginning of lines, are removed. If a line ends with a Japanese character, the newline code is ignored. (This prevents Japanese words from being segmented at the end of a line.) This processing is particularly effective for Mail files. Moreover, English words hyphenated across line breaks are recovered.
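The following Python sketch illustrates the idea; the "Japanese character" test and the exact symbol set are assumptions, and hyphenation recovery is omitted for brevity.

    import re

    def clean_line(line):
        """Strip surrounding blanks and leading quote marks (> | # :)."""
        line = line.strip(" \t\r\n")
        return re.sub(r"^[>|#:]+\s*", "", line)

    def join_lines(lines):
        """Join lines, dropping the newline after a Japanese (here:
        non-ASCII, an assumption) character so words are not split."""
        out = ""
        for line in map(clean_line, lines):
            if out and ord(out[-1]) > 0x7f:
                out += line                   # no separator after Japanese
            else:
                out += (" " if out else "") + line
        return out

    lines = ["  > This is a mail, and this  ", "line was wrapped.  "]
    print(join_lines(lines))   # "This is a mail, and this line was wrapped."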
HTML defines the structure of documents. A simple digest can be made by using the heading information defined by <h[1-6]>. By default, the length of the digest is set to 200 characters. If the words from the headings are not enough, more words are supplemented from the beginning of the document. If the target is a text file, the first 200 characters of the document are simply used.
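A rough Python sketch of this digest construction, under the assumption that a crude regex tag stripper is acceptable for illustration:

    import re

    def make_digest(html, length=200):
        """Build a digest from <h1>-<h6> contents, padded with body text."""
        headings = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html,
                              re.I | re.S)
        digest = " ".join(headings)
        if len(digest) < length:
            body = re.sub(r"<[^>]+>", " ", html)      # crude tag stripper
            body = re.sub(r"\s+", " ", body).strip()
            digest = (digest + " " + body).strip()
        return digest[:length]

    print(make_digest("<h1>Namazu</h1><p>a full-text search system</p>"))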
When dealing with Mail/News files, attribution lines such as "foo@example.jp wrote:" and quotation bodies beginning with, for example, >, are not included in the Mail/News digest. Note that although these lines are excluded from the digest, they are still included in the search targets.
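A sketch of this filtering in Python; the patterns are illustrative assumptions, and the real mknmz heuristics differ in detail.

    import re

    ATTRIBUTION = re.compile(r"wrote:\s*$")   # e.g. "foo@example.jp wrote:"
    QUOTE_BODY  = re.compile(r"^\s*>")        # e.g. "> quoted text"

    def digest_lines(lines):
        """Drop quotation lines from the digest only; searching still
        sees the full message text."""
        return [l for l in lines
                if not ATTRIBUTION.search(l) and not QUOTE_BODY.match(l)]

    msg = ["foo@example.jp wrote:", "> old text", "New reply text."]
    print(digest_lines(msg))                  # ['New reply text.']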
Symbol handling is rather a difficult task. Consider the sentence (foo is bar.). If we separate it only by spaces, "(foo", "is", "bar.)" will be indexed, and neither foo nor bar can be searched.

The easiest solution to this problem is to remove all symbols. However, we sometimes wish to search words that contain symbols, as in .emacs or TCP/IP. For a symbol-embedded string "tcp/ip", Namazu decomposes it into the 3 terms "tcp/ip", "tcp", "ip" and registers each of them independently. For (tcp/ip), Namazu decomposes it into the 4 terms "(tcp/ip)", "tcp/ip", "tcp", "ip". Note that no recursive processing is done: ((tcp/ip)) is decomposed into "((tcp/ip))", "(tcp/ip)", "tcp", "ip". The indexes for the first example (foo is bar.) are separated as "(foo", "foo", "is", "bar.)", "bar.", "bar", so foo or bar can be searched.
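A minimal Python sketch of this decomposition, matching the examples above; the exact symbol set mknmz recognizes is an assumption here.

    import re

    def peel(token):
        """Remove one layer of leading/trailing symbol characters."""
        if token and not token[0].isalnum():
            token = token[1:]
        if token and not token[-1].isalnum():
            token = token[:-1]
        return token

    def decompose(token):
        """Register the token itself, the token with one symbol layer
        peeled, and the bare alphanumeric parts. No recursion, as
        described above."""
        terms = [token]
        peeled = peel(token)
        if peeled != token:
            terms.append(peeled)
        for part in re.split(r"[^A-Za-z0-9]+", token):
            if part and part not in terms:
                terms.append(part)
        return terms

    print(decompose("tcp/ip"))      # ['tcp/ip', 'tcp', 'ip']
    print(decompose("(tcp/ip)"))    # ['(tcp/ip)', 'tcp/ip', 'tcp', 'ip']
    print(decompose("((tcp/ip))"))  # ['((tcp/ip))', '(tcp/ip)', 'tcp', 'ip']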
A straightforward phrase-searching implementation would lead to an unacceptably large index. To reduce the index size, Namazu converts words into hash values.
If the search expression is given as a phrase "foo bar", Namazu first performs AND searching for "foo" and "bar", and then filters the results by the phrase information.
The phrase information is handled in 2-word units and recorded as a 16-bit hash value. For this reason, phrases of more than 2 words cannot be searched accurately. For the phrase search "foo bar baz", documents that merely contain "foo bar" and "bar baz" in separate places will also be retrieved:

    ... foo bar ...
    ... bar baz ...
When a collision of hash values occurs, wrong search results may be returned. But at least the words foo, bar, and baz are all included even in mistakenly retrieved documents.
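The scheme can be sketched in Python as follows. The hash function here is an arbitrary stand-in, not the one Namazu actually uses; only the 16-bit, 2-word-pair idea is from the text.

    def phrase_hash(w1, w2):
        """A 16-bit hash over a 2-word unit (stand-in hash function)."""
        h = 0
        for ch in w1 + " " + w2:
            h = (h * 31 + ord(ch)) & 0xffff     # keep 16 bits
        return h

    def phrase_filter(doc_words, query):
        """After the AND search, keep a document only if every adjacent
        pair of query words has a matching pair hash in the document."""
        doc_pairs = {phrase_hash(a, b)
                     for a, b in zip(doc_words, doc_words[1:])}
        q = query.split()
        return all(phrase_hash(a, b) in doc_pairs
                   for a, b in zip(q, q[1:]))

    # "foo bar baz" also matches a document that only contains
    # "foo bar" and "bar baz" in separate places, as described above:
    doc = "x foo bar y z bar baz".split()
    print(phrase_filter(doc, "foo bar baz"))    # True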
Updating is done not by removing the information of deleted documents from the index, but by recording which documents have been deleted. In other words, the index itself is left intact, and the IDs of the deleted documents are simply recorded alongside the original index.

If index updates caused by deleted or updated documents are repeated, the recorded information about deleted documents grows, and the efficiency of the index is gradually lost. In this case, we recommend cleaning up the garbage with gcnmz.
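The approach amounts to a tombstone list, sketched below in Python; the class and its layout are illustrative, not Namazu's on-disk format.

    class Index:
        """Sketch: deletions are recorded, not applied, as described above."""
        def __init__(self, postings):
            self.postings = postings      # word -> set of document IDs
            self.deleted = set()          # IDs of deleted documents

        def delete(self, doc_id):
            self.deleted.add(doc_id)      # the postings stay intact

        def search(self, word):
            return self.postings.get(word, set()) - self.deleted

    # As deletions accumulate, every search pays for this filtering;
    # gcnmz rewrites the index so the deleted-ID list can be emptied.
    idx = Index({"namazu": {1, 2, 3}})
    idx.delete(2)
    print(idx.search("namazu"))           # {1, 3}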