If the target files all share a single media type, specifying it with the --media-type=mtype option can gain a 10-20% speed improvement. If the target files are Mail files, you can use --mailnews instead of --media-type=message/rfc822. Similarly, if the target files are HTML files generated by MHonArc, you can use --mhonarc instead of --media-type='text/html; x-type=mhonarc'.
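For example, a hypothetical invocation for indexing a Mail directory might look like the following (the index and mail paths are assumptions, not taken from this manual):

    % mknmz --mailnews -O /usr/local/namazu/index ~/Mail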
By default, $ON_MEMORY_MAX in mknmzrc is set to 5 MB. Through this value, mknmz limits the amount of documents processed in memory at once. If you use a machine with 1 GB of memory, you will have no trouble setting $ON_MEMORY_MAX to 100 MB. By doing so, the number of writes to working files is reduced.
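As a sketch, the corresponding line in mknmzrc would look like the following, assuming the value is given in bytes (as the 5 MB default suggests):

    $ON_MEMORY_MAX = 100 * 1024 * 1024;   # 100 MB; the default is 5 MB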
Indexing takes a lot of memory. If you encounter an "Out of memory!" error while running mknmz, the following precaution can be considered. By default, $ON_MEMORY_MAX is set to 5 MB; this value assumes your machine has 64 MB of memory. If your machine has less memory, decrease this value.
By default, the following rules are applied for score weighting. These values were decided empirically and have no theoretical foundation.
<title> 16
<h1> 8
<h2> 7
<h3> 6
<h4> 5
<h5> 4
<h6> 3
<a> 4
<strong>, <em>, <code>, <kbd>, <samp>, <cite>, <var> 2
Moreover, for <meta name="keywords" content="foo bar">, the keywords foo bar are given a score of 32.
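The weighting can be modeled roughly as follows. This is an illustrative Python sketch, not the actual mknmz code; in particular, treating untagged body text as weight 1 is an assumption.

    # Illustrative model of Namazu's score weighting (not the mknmz code).
    WEIGHTS = {
        "title": 16, "h1": 8, "h2": 7, "h3": 6, "h4": 5, "h5": 4, "h6": 3,
        "a": 4, "strong": 2, "em": 2, "code": 2, "kbd": 2, "samp": 2,
        "cite": 2, "var": 2,
    }
    META_KEYWORDS_WEIGHT = 32

    def add_occurrence(scores, word, tag=None):
        """Count one occurrence of `word`, weighted by its enclosing tag."""
        if tag == "meta-keywords":
            weight = META_KEYWORDS_WEIGHT
        else:
            weight = WEIGHTS.get(tag, 1)   # plain body text: 1 (assumption)
        scores[word] = scores.get(word, 0) + weight

    scores = {}
    add_occurrence(scores, "namazu", "title")   # adds 16
    add_occurrence(scores, "namazu")            # adds 1 more
    print(scores)                               # {'namazu': 17}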
Namazu decodes &quot;, &amp;, &lt;, &gt; as well as named entities and numeric character references in the ranges &#9;-&#10; and &#32;-&#126;. Since the internal encoding is EUC-JP, the right half of ISO-8859-1 (0x80-0xff) cannot be used. For the same reason, numeric character references in UCS-4 cannot be used.
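A minimal Python sketch of this decoding follows; mknmz does this internally in Perl, and only the four named entities are handled here for brevity.

    import re

    NAMED = {"quot": '"', "amp": "&", "lt": "<", "gt": ">"}

    def decode_entities(text):
        """Decode &quot; &amp; &lt; &gt; and numeric references,
        restricted to the code ranges 9-10 and 32-126."""
        def repl(m):
            if m.group(1):                        # named entity, e.g. &amp;
                return NAMED[m.group(1)]
            code = int(m.group(2))                # numeric ref, e.g. &#65;
            if code in (9, 10) or 32 <= code <= 126:
                return chr(code)
            return m.group(0)                     # out of range: keep as-is
        return re.sub(r"&(?:(quot|amp|lt|gt)|#(\d+));", repl, text)

    print(decode_entities("&lt;tag&gt; &#65; &#160;"))   # "<tag> A &#160;"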
Spaces and tabs at the beginning and end of lines, and >, |, #, : at the beginning of lines, are removed. If a line ends with a Japanese character, the newline code is ignored. (This prevents Japanese words from being segmented at the end of a line.) This processing is particularly effective for Mail files. Moreover, English words hyphenated across line breaks are recovered.
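The following Python sketch illustrates the idea; the "Japanese character" test and the exact symbol set are assumptions, and hyphenation recovery is omitted for brevity.

    import re

    def clean_line(line):
        """Strip surrounding blanks and leading quote marks (> | # :)."""
        line = line.strip(" \t\r\n")
        return re.sub(r"^[>|#:]+\s*", "", line)

    def join_lines(lines):
        """Join lines, dropping the newline after a Japanese (here:
        non-ASCII, an assumption) character so words are not split."""
        out = ""
        for line in map(clean_line, lines):
            if out and ord(out[-1]) > 0x7f:
                out += line                   # no separator after Japanese
            else:
                out += (" " if out else "") + line
        return out

    lines = ["  > This is a mail, and this  ", "line was wrapped.  "]
    print(join_lines(lines))   # "This is a mail, and this line was wrapped."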
HTML defines the structure of documents. A simple digest can be made by using the heading information defined by <h[1-6]>. By default, the length of the digest is set to 200 characters. If the words from the headings are not enough, more words are supplemented from the beginning of the document. If the target is a text file, the first 200 characters of the document are simply used.
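A rough Python sketch of this digest construction, under the assumption that a crude regex tag stripper is acceptable for illustration:

    import re

    def make_digest(html, length=200):
        """Build a digest from <h1>-<h6> contents, padded with body text."""
        headings = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html,
                              re.I | re.S)
        digest = " ".join(headings)
        if len(digest) < length:
            body = re.sub(r"<[^>]+>", " ", html)      # crude tag stripper
            body = re.sub(r"\s+", " ", body).strip()
            digest = (digest + " " + body).strip()
        return digest[:length]

    print(make_digest("<h1>Namazu</h1><p>a full-text search system</p>"))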
When dealing with Mail/News files, attribution lines such as "foo@example.jp wrote:" and quotation bodies beginning with, for example, >, are not included in the Mail/News digest. Note that although these lines are excluded from the digest, they are still included in the search targets.
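A sketch of this filtering in Python; the patterns are illustrative assumptions, and the real mknmz heuristics differ in detail.

    import re

    ATTRIBUTION = re.compile(r"wrote:\s*$")   # e.g. "foo@example.jp wrote:"
    QUOTE_BODY  = re.compile(r"^\s*>")        # e.g. "> quoted text"

    def digest_lines(lines):
        """Drop quotation lines from the digest only; searching still
        sees the full message text."""
        return [l for l in lines
                if not ATTRIBUTION.search(l) and not QUOTE_BODY.match(l)]

    msg = ["foo@example.jp wrote:", "> old text", "New reply text."]
    print(digest_lines(msg))                  # ['New reply text.']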
Symbol handling is rather a difficult task. Consider the sentence (foo is bar.). If we separate it only by spaces, "(foo", "is", "bar.)" will be indexed, and neither foo nor bar can be searched.

The easiest solution to this problem is to remove all symbols. However, we sometimes wish to search words that contain symbols, as in .emacs or TCP/IP. For a symbol-embedded string "tcp/ip", Namazu decomposes it into the 3 terms "tcp/ip", "tcp", "ip" and registers each of them independently. For (tcp/ip), Namazu decomposes it into the 4 terms "(tcp/ip)", "tcp/ip", "tcp", "ip". Note that no recursive processing is done: ((tcp/ip)) is decomposed into "((tcp/ip))", "(tcp/ip)", "tcp", "ip". The indexes for the first example (foo is bar.) are separated as "(foo", "foo", "is", "bar.)", "bar.", "bar", so foo or bar can be searched.
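A minimal Python sketch of this decomposition, matching the examples above; the exact symbol set mknmz recognizes is an assumption here.

    import re

    def peel(token):
        """Remove one layer of leading/trailing symbol characters."""
        if token and not token[0].isalnum():
            token = token[1:]
        if token and not token[-1].isalnum():
            token = token[:-1]
        return token

    def decompose(token):
        """Register the token itself, the token with one symbol layer
        peeled, and the bare alphanumeric parts. No recursion, as
        described above."""
        terms = [token]
        peeled = peel(token)
        if peeled != token:
            terms.append(peeled)
        for part in re.split(r"[^A-Za-z0-9]+", token):
            if part and part not in terms:
                terms.append(part)
        return terms

    print(decompose("tcp/ip"))      # ['tcp/ip', 'tcp', 'ip']
    print(decompose("(tcp/ip)"))    # ['(tcp/ip)', 'tcp/ip', 'tcp', 'ip']
    print(decompose("((tcp/ip))"))  # ['((tcp/ip))', '(tcp/ip)', 'tcp', 'ip']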
A straightforward phrase-searching implementation would lead to an unacceptably large index. To reduce the index size, Namazu converts words into hash values.
If the search expression is given as a phrase "foo bar", Namazu first performs AND searching for "foo" and "bar", and then filters the results by the phrase information.
The phrase information is handled in 2-word units and recorded as a 16-bit hash value. For this reason, phrases of more than 2 words cannot be searched accurately. For the phrase search "foo bar baz", documents that merely contain "foo bar" and "bar baz" in separate places will also be retrieved:

    ... foo bar ...
    ... bar baz ...
When a collision of hash values occurs, wrong search results may be returned. But at least the words foo, bar, and baz are all included even in mistakenly retrieved documents.
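The scheme can be sketched in Python as follows. The hash function here is an arbitrary stand-in, not the one Namazu actually uses; only the 16-bit, 2-word-pair idea is from the text.

    def phrase_hash(w1, w2):
        """A 16-bit hash over a 2-word unit (stand-in hash function)."""
        h = 0
        for ch in w1 + " " + w2:
            h = (h * 31 + ord(ch)) & 0xffff     # keep 16 bits
        return h

    def phrase_filter(doc_words, query):
        """After the AND search, keep a document only if every adjacent
        pair of query words has a matching pair hash in the document."""
        doc_pairs = {phrase_hash(a, b)
                     for a, b in zip(doc_words, doc_words[1:])}
        q = query.split()
        return all(phrase_hash(a, b) in doc_pairs
                   for a, b in zip(q, q[1:]))

    # "foo bar baz" also matches a document that only contains
    # "foo bar" and "bar baz" in separate places, as described above:
    doc = "x foo bar y z bar baz".split()
    print(phrase_filter(doc, "foo bar baz"))    # True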
Updating is done not by removing the information of deleted documents from the index, but by recording which documents have been deleted. In other words, the index itself is left intact, and the IDs of the deleted documents are simply recorded alongside the original index.

If index updates caused by deleted or updated documents are repeated, the recorded information about deleted documents grows, and the efficiency of the index is gradually lost. In this case, we recommend cleaning up the garbage with gcnmz.
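The approach amounts to a tombstone list, sketched below in Python; the class and its layout are illustrative, not Namazu's on-disk format.

    class Index:
        """Sketch: deletions are recorded, not applied, as described above."""
        def __init__(self, postings):
            self.postings = postings      # word -> set of document IDs
            self.deleted = set()          # IDs of deleted documents

        def delete(self, doc_id):
            self.deleted.add(doc_id)      # the postings stay intact

        def search(self, word):
            return self.postings.get(word, set()) - self.deleted

    # As deletions accumulate, every search pays for this filtering;
    # gcnmz rewrites the index so the deleted-ID list can be emptied.
    idx = Index({"namazu": {1, 2, 3}})
    idx.delete(2)
    print(idx.search("namazu"))           # {1, 3}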