Namazu-users-en(old)


Re: mhonarc.pl modifications with MHonArc-2.6.3

From: Earl Hood <earl@xxxxxxxxxxxx>
Date: Mon, 28 Apr 2003 12:15:02 -0500
X-ml-name: namazu-users-en
X-mail-count: 00398
References: <yfmfzo3jj93.wl@u.ki.nu>

On April 28, 2003 at 14:14, Makoto Fujiwara wrote:

> I have started to use MHonArc-2.6.3 very recently.
> There were some default changes on handling 2 bytes characters.
> But if I have the line
> <CharsetConverters>
> iso-2022-jp; iso_2022_jp::str2html; iso2022jp.pl
> </CharsetConverters>
> I am getting the behaviors as the previous version (2.5.x or before).

The changes to the defaults were made to provide consistency in the
default handling of character sets in v2.6.  The change in
iso-2022-jp default handling is highlighted in the MHonArc
release notes.

> MHonArc now understands MIME mail, (not very recently), sounds great,
> thanks Earl, and I don't need to have
>    /usr/local/bin/nkf -me
> pre-processing for the input.

I'm confused by this statement, since MHonArc has understood MIME
for a long time.  I assume you are referring to the additional
character encoding support included in v2.6.

> (1) Internal multi-byte-chars:
> One problem not totally related to the Namazu was: with original
> mhonarc code, if I have a multi-byte strings defined for a variable
> in .mhonarc.mrc file, the output will be the mixture of ISO-2022-JP
> and EUC-JAPAN.
>  There are two assumptions for this observation:
>
> (a) I will process the article with Namazu and Namazu needs
> <CharsetConverters> defined with str2html type processing,
> not with "#x86FB;" type encoding.
>
> (b) multi-byte chars value in .mhonarc.mrc will be processed by
> Perl/MHonArc, needs be not shift-lock 7 bit type charset.  I used
> EUC-JAPAN defining variables in .mhonarc.mrc file.
>
> To solve this (1) charset mixture problem,
> I have currently using Jcode::convert(\$_,'euc') in iso-2022-jp.pl
> and processing all the text in EUC-JAPAN.
>
>  I will post this part to mhonarc-users Mailing List later probably.

If I may try to clarify: you were using one encoding in your mhonarc
resource file, but you have mail messages that use a different
encoding.  Am I correct in my understanding?

Mixed encodings will always be a problem.  In MHonArc v2.6, you do
have the ability to normalize different encodings into one encoding.
For example, if you use EUC in your resource page layout, you could
have MHonArc encode all messages into EUC when processed.  See
the TEXTENCODE resource for details.

Now, if you edit mhonarc resource files in one encoding, but want the
data to be mapped to another encoding, then you should filter your
resource file.  For example, you edit your resource file in EUC and
then convert it to ISO-2022-JP before passing it to mhonarc.

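A minimal sketch of such a resource-file filter, assuming Perl 5.8 or
later with the Encode module.  The helper name euc_to_jis is
hypothetical, not part of MHonArc or Namazu.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: convert resource-file bytes from EUC-JP to
# ISO-2022-JP.  Takes a byte string and returns a converted copy.
sub euc_to_jis {
    my ($octets) = @_;
    # from_to() converts in place and returns undef on failure.
    defined from_to($octets, 'euc-jp', 'iso-2022-jp')
        or die "conversion from euc-jp failed\n";
    return $octets;
}

# Small demonstration on a literal value.
my $euc = "\xC6\xFC\xCB\xDC";    # two kanji encoded as EUC-JP bytes
my $jis = euc_to_jis($euc);
printf "%d EUC-JP bytes -> %d ISO-2022-JP bytes\n",
    length $euc, length $jis;
```

In practice the same routine could be applied to the whole contents of
the EUC-edited resource file, with the result written to a second file
that is the one handed to mhonarc.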
> (2) filter/mhonarc.pl
> MHonArc retains MIME B-Encoding on Subject: and From: info in the line
> as:
> /<!--X-Subject: ([^-]+) -->/) {
> and mhonarc.pl returns encoded text in the fields value.

You can avoid the encoding by utilizing the TEXTENCODE resource
in MHonArc.  TEXTENCODE will cause the data to be pre-decoded
and stored in the encoding you specify.  It is best when mapping
everything to UTF-8, but it can be used to map to any encoding.

You may also want to look at the DECODEHEADS resource.

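As a rough illustration of the TEXTENCODE suggestion, a resource-file
entry along the following lines should cause message data to be stored
as UTF-8.  The converter function and source file shown are assumed to
be the ones named in the MHonArc TEXTENCODE documentation; verify them
against the documentation for your installed release before relying
on them.

```
<TextEncode>
utf-8; MHonArc::UTF8::to_utf8; MHonArc/UTF8.pm
</TextEncode>
```

The entry has the same three-field shape as the CharsetConverters
entry quoted earlier: a charset name, a Perl function, and the file
that defines it.
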
However, you do touch upon a general problem for archives that
do not use TEXTENCODE and there is non-ASCII encoded data in
the Subject.  A potential general solution is to utilize Perl's
Encode module within the namazu filter to decode the text
data to the designated encoding in namazu.rc.

Since Encode is only available in Perl 5.8 and later, multi-module
checks could be made (similar to how MHonArc 2.6 does charset
processing), or the issue could just be documented as a limitation
for those using older versions of Perl.

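The idea above can be sketched roughly as follows.  This is only an
illustration, not a patch to the actual namazu filter: the routine
name decode_header_field is hypothetical, the target charset is passed
in by the caller rather than read from namazu.rc, and it assumes an
Encode release that ships the MIME-Header encoding.  When Encode
cannot be loaded, the raw header value is returned unchanged.

```perl
use strict;
use warnings;

# Probe for Encode at run time so that older Perls degrade gracefully.
my $HAVE_ENCODE = eval { require Encode; 1 } ? 1 : 0;

# Hypothetical helper: decode RFC 2047 encoded-words in a header value
# and re-encode the result into $charset (for example 'euc-jp').
sub decode_header_field {
    my ($raw, $charset) = @_;
    return $raw unless $HAVE_ENCODE;
    my $copy = $raw;    # work on a copy; leave the caller's value alone
    my $out  = eval {
        my $text = Encode::decode('MIME-Header', $copy);
        Encode::encode($charset, $text);
    };
    # On any failure (unknown charset, bad data) fall back to the input.
    return defined $out ? $out : $raw;
}

# Demonstration: a Base64 encoded-word in ISO-2022-JP, mapped to EUC-JP.
my $subject = '=?ISO-2022-JP?B?GyRCRnxLXBsoQg==?=';
my $euc     = decode_header_field($subject, 'euc-jp');
print unpack('H*', $euc), "\n";    # hex dump of the resulting bytes
```

Because the routine takes any header value, the same call would serve
for From: and other headers as well as Subject:.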
> So I have modifications in mhonarc.pl so that it returns
> the string after 'MIME::Base64::decode'd + euc conversion.
>
> This mod needs two more external resources, Jcode.pm and MIME::Base64.pm.
>
> (I am not saying this is the good solution, but just telling I have
> this kind of problem and avoided by this patch.)

Good catch.  You are right in implying that your patch is not
necessarily the best solution.  Its main problem is that it only
solves your particular need and not the general problem.

A proper patch could key off the namazu.rc lang setting and
try to generally map the non-ASCII encoded data to the given locale.

Since this problem is not unique to MHonArc (i.e. Namazu can index
regular mail and news posts), a general decoding routine should be
made available to namazu filters that decodes non-ASCII encoded data
in mail headers.  It is worth noting that other mail headers beyond
the Subject: header can include non-ASCII encoded data.

--ewh

References:
- mhonarc.pl modifications with MHonArc-2.6.3
  - From: Makoto Fujiwara
