Namazu-users-en(old)
mhonarc.pl modifications with MHonArc-2.6.3
- From: Makoto Fujiwara <makoto@xxxxx>
- Date: Mon, 28 Apr 2003 14:14:48 +0900
- X-ml-name: namazu-users-en
- X-mail-count: 00397
Sorry to bring up a MHonArc topic on the Namazu ML.
I started using MHonArc-2.6.3 very recently.
Some defaults for handling two-byte characters have changed.
But if I have the lines
<CharsetConverters>
iso-2022-jp; iso_2022_jp::str2html; iso2022jp.pl
</CharsetConverters>
I get the same behavior as the previous versions (2.5.x or earlier).
MHonArc now understands MIME mail (not a very recent change), which is
great, thanks Earl, and I no longer need
/usr/local/bin/nkf -me
pre-processing for the input.
(1) Internal multi-byte-chars:
One problem, not directly related to Namazu: with the original
mhonarc code, if I define a multi-byte string for a variable
in the .mhonarc.mrc file, the output is a mixture of ISO-2022-JP
and EUC-JAPAN.
This observation rests on two assumptions:
(a) I process the articles with Namazu, and Namazu needs
<CharsetConverters> defined with str2html-type processing,
not with "#x86FB;"-type encoding.
(b) Multi-byte values in .mhonarc.mrc are processed by
Perl/MHonArc, so they must not be in a shift-locked 7-bit charset.
I used EUC-JAPAN to define variables in the .mhonarc.mrc file.
To solve this charset mixture problem (1),
I am currently using Jcode::convert(\$_,'euc') in iso2022jp.pl
and processing all the text in EUC-JAPAN.
I will probably post this part to the mhonarc-users mailing list later.
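For illustration only, the ISO-2022-JP to EUC conversion described
above can be sketched with Perl's bundled Encode module instead of
Jcode. This is an assumption made to keep the example self-contained;
it is not the code used in iso2022jp.pl, and the helper name
jis_to_euc is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper (not part of MHonArc or Namazu): convert a byte
# string from ISO-2022-JP to EUC-JP and return the converted copy.
sub jis_to_euc {
    my ($bytes) = @_;
    # from_to() rewrites its first argument in place.
    from_to($bytes, 'iso-2022-jp', 'euc-jp');
    return $bytes;
}

# Three kanji in ISO-2022-JP: ESC $ B, the 7-bit JIS bytes, then ESC ( B.
my $jis = "\e\$BF|K\\8l\e(B";
my $euc = jis_to_euc($jis);

# Show the resulting EUC-JP bytes as hex.
print unpack("H*", $euc), "\n";    # prints c6fccbdcb8ec
```

After the conversion the text no longer contains escape sequences,
which is why values taken from the messages and values defined in
.mhonarc.mrc can then share one charset.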
(2) filter/mhonarc.pl
MHonArc retains the MIME B-encoding of the Subject: and From: values in
the comment lines that mhonarc.pl matches, for example with:
if ($mha_head =~ /<!--X-Subject: ([^-]+) -->/) {
so mhonarc.pl returns the still-encoded text as the field values.
So I modified mhonarc.pl so that it returns
the string after MIME::Base64::decode and conversion to EUC.
This mod needs two more external modules, Jcode and MIME::Base64.
(I am not saying this is a good solution, just reporting that I have
this kind of problem and worked around it with this patch.)
One strange thing is that the charset name shows up with two j's,
iso-2022-jjp, and I don't know why for the time being.
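As a rough, standalone sketch of the decoding step described in (2):
the example below replaces each B-encoded word in a header value with
its decoded text and converts it to EUC-JP. It assumes the standard
charset spelling iso-2022-jp (not the two-j form seen above), uses the
core Encode module in place of Jcode, and the helper name
b_encoded_to_euc is hypothetical.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(from_to);

# Hypothetical helper: decode every "=?iso-2022-jp?B?...?=" encoded
# word in $str and convert the decoded bytes to EUC-JP.
sub b_encoded_to_euc {
    my ($str) = @_;
    $str =~ s{=\?iso-2022-jp\?b\?([0-9A-Za-z+/=]+)\?=}{
        my $bytes = decode_base64($1);              # Base64 -> ISO-2022-JP bytes
        from_to($bytes, 'iso-2022-jp', 'euc-jp');   # in-place charset conversion
        $bytes;                                     # replacement text
    }egi;
    return $str;
}

# A Subject-like value with one encoded word (three kanji).
my $header = 'Re: =?ISO-2022-JP?B?GyRCRnxLXDhsGyhC?=';
my $euc    = b_encoded_to_euc($header);

# Show the result as hex so the EUC-JP bytes are visible.
print unpack("H*", $euc), "\n";
```

Text outside the encoded words is left untouched, so the later
stripping of "Re:" and "[list-name]" prefixes in mhonarc.pl would
still apply to the decoded subject.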
Makoto Fujiwara, Chiba Japan
--- /home/makoto/mhonarc.pl-1.23.8.2 Sun Apr 27 18:07:50 2003
+++ mhonarc.pl Mon Apr 28 13:39:23 2003
@@ -31,6 +31,8 @@
require 'gfilter.pl';
require 'html.pl';
require 'mailnews.pl';
+use MIME::Base64;
+use Jcode;
#
# This pattern specifies MHonArc's file names.
@@ -155,15 +157,24 @@
my $mha_head = shift;
if ($mha_head =~ /<!--X-Subject: ([^-]+) -->/) {
+# print " (1) ",$1,"\n";
my $subject = uncommentize($1);
+ $subject = base64toeuc($subject);
1 while ($subject =~ s/\A\s*(re|sv|fwd|fw)[\[\]\d]*[:>-]+\s*//i);
$subject =~ s/\A\s*\[[^\]]+\]\s*//;
+# print " (2) ",$subject,"\n";
$fields->{'subject'} = $subject;
}
if ($mha_head =~ /<!--X-From-R13: ([^-]+) -->/) {
- $fields->{'from'} = mrot13(uncommentize($1));
+ my $from = uncommentize($1);
+ $from = mrot13($from);
+ $from = base64toeuc($from);
+ $fields->{'from'} = $from;
} elsif ($mha_head =~ /<!--X-From: ([^-]+) -->/) {
- $fields->{'from'} = uncommentize($1);
+ my $from = uncommentize($1);
+ # No mrot13() here: the X-From value is not ROT13-encoded.
+ $from = base64toeuc($from);
+ $fields->{'from'} = $from;
}
if ($mha_head =~ /<!--X-Message-Id: ([^-]+) -->/) {
$fields->{'message-id'} = '<' . uncommentize($1). '>';
@@ -173,6 +184,12 @@
}
}
+sub base64toeuc {
+ my $str = shift;
+ $str =~ s/=\?iso-2022-jjp\?b\?([0-9A-Za-z\+\/\=]+)\?=/MIME::Base64::decode($1)/egi;
+ Jcode::convert(\$str,'euc');
+ return $str;
+}
sub uncommentize {
my($txt) = $_[0];
$txt =~ s/&#(\d+);/pack("C",$1)/ge;