Namazu-users-en(old)


mhonarc.pl modifications with MHonArc-2.6.3

From: Makoto Fujiwara <makoto@xxxxx>
Date: Mon, 28 Apr 2003 14:14:48 +0900
X-ml-name: namazu-users-en
X-mail-count: 00397

Sorry to bring up a MHonArc topic on the Namazu ML.

I started using MHonArc-2.6.3 very recently.
Some defaults for handling two-byte characters have changed.
But if I have these lines in the resource file
<CharsetConverters>
iso-2022-jp; iso_2022_jp::str2html; iso2022jp.pl
</CharsetConverters>
I get the same behavior as in the previous versions (2.5.x or before).

MHonArc now understands MIME mail (and has for a while), which is great,
thanks Earl, so I no longer need to pre-process the input with
   /usr/local/bin/nkf -me

(1) Internal multi-byte chars:
One problem, not directly related to Namazu: with the original
mhonarc code, if I define a multi-byte string for a variable
in the .mhonarc.mrc file, the output is a mixture of ISO-2022-JP
and EUC-JAPAN.
 There are two assumptions behind this observation:

(a) I process the articles with Namazu, and Namazu needs
<CharsetConverters> defined with str2html-type processing,
not with "&#x86FB;"-type encoding.

(b) Multi-byte values in .mhonarc.mrc are processed by
Perl/MHonArc, so they must not be in a shift-locked 7-bit charset.  I used
EUC-JAPAN when defining the variables in the .mhonarc.mrc file.

To solve this charset mixture problem (1),
I am currently using Jcode::convert(\$_,'euc') in iso2022jp.pl
and processing all the text in EUC-JAPAN.
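
For reference, a minimal sketch of this kind of ISO-2022-JP to EUC conversion. It assumes the core Encode module in place of Jcode, and the helper name jis_to_euc is hypothetical; it is not the code used in iso2022jp.pl.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper (not part of MHonArc or Namazu): convert a byte
# string from ISO-2022-JP (7-bit, escape-sequence based) to EUC-JP
# (8-bit, no shift states). Encode stands in for Jcode here.
sub jis_to_euc {
    my ($bytes) = @_;
    from_to($bytes, 'iso-2022-jp', 'euc-jp');   # converts $bytes in place
    return $bytes;
}

# Three kanji in ISO-2022-JP: ESC $ B, three two-byte characters, ESC ( B
my $jis = "\e\$BF|K\\8l\e(B";
my $euc = jis_to_euc($jis);

# Print the converted bytes in hex to make the 8-bit encoding visible.
print join(' ', map { sprintf '%02X', ord } split //, $euc), "\n";
```

After the conversion the string no longer contains escape sequences, so it can be mixed with EUC values defined in the resource file.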

 I will probably post this part to the mhonarc-users mailing list later.

(2) filter/mhonarc.pl
MHonArc keeps the MIME B-encoding of the Subject: and From: headers in the
HTML comments that mhonarc.pl matches with patterns such as
   /<!--X-Subject: ([^-]+) -->/
so mhonarc.pl returns the still-encoded text as the field value.

So I modified mhonarc.pl so that it returns the string after
decoding it with MIME::Base64::decode and converting it to EUC.

This modification needs two more external modules, Jcode and MIME::Base64.

(I am not saying this is a good solution; I am just reporting that I had
this kind of problem and worked around it with this patch.)
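
The decode-then-convert step can also be tried outside mhonarc.pl. The following is only a sketch under assumptions: it uses the core Encode module instead of Jcode, the helper name b_encoded_to_euc is hypothetical, and it matches the normal charset spelling iso-2022-jp.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(from_to);

# Hypothetical helper: replace each "=?iso-2022-jp?B?...?=" encoded word
# with its Base64-decoded bytes, then convert the whole string from
# ISO-2022-JP to EUC-JP. Encode is an assumption; the patch uses Jcode.
sub b_encoded_to_euc {
    my ($str) = @_;
    $str =~ s{=\?iso-2022-jp\?b\?([0-9A-Za-z+/=]+)\?=}{decode_base64($1)}egi;
    from_to($str, 'iso-2022-jp', 'euc-jp');   # converts $str in place
    return $str;
}

# A Subject value with one B-encoded word after an ASCII prefix.
my $subject = 'Re: =?ISO-2022-JP?B?GyRCRnxLXDhsGyhC?=';
my $decoded = b_encoded_to_euc($subject);

# Show the result as hex bytes; the ASCII prefix is left untouched.
print join(' ', map { sprintf '%02X', ord } split //, $decoded), "\n";
```

This sketch does not join adjacent encoded words separated by whitespace and does not handle Q-encoding, so it covers only the simple case described here.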

One strange thing is that the charset name shows up with two j's,
iso-2022-jjp, and I don't know why at the moment.

Makoto Fujiwara, Chiba Japan

--- /home/makoto/mhonarc.pl-1.23.8.2	Sun Apr 27 18:07:50 2003
+++ mhonarc.pl	Mon Apr 28 13:39:23 2003
@@ -31,6 +31,8 @@
 require 'gfilter.pl';
 require 'html.pl';
 require 'mailnews.pl';
+use MIME::Base64;
+use Jcode;
 
 #
 # This pattern specifies MHonArc's file names.
@@ -155,15 +157,24 @@
     my $mha_head     = shift;
 
     if ($mha_head =~ /<!--X-Subject: ([^-]+) -->/) {
+#	print "   (1) ",$1,"\n";
 	my $subject = uncommentize($1);
+	$subject = base64toeuc($subject);
 	1  while ($subject =~ s/\A\s*(re|sv|fwd|fw)[\[\]\d]*[:>-]+\s*//i);
 	$subject =~ s/\A\s*\[[^\]]+\]\s*//;
+#	print "   (2) ",$subject,"\n";
 	$fields->{'subject'} = $subject;
     }
     if ($mha_head =~ /<!--X-From-R13: ([^-]+) -->/) {
-	$fields->{'from'} = mrot13(uncommentize($1));
+	my $from = uncommentize($1);
+	$from = mrot13($from);
+	$from = base64toeuc($from);
+	$fields->{'from'} = $from;
     } elsif ($mha_head =~ /<!--X-From: ([^-]+) -->/) {
-	$fields->{'from'} = uncommentize($1);
+	my $from = uncommentize($1);
+	$from = mrot13($from);
+	$from = base64toeuc($from);
+	$fields->{'from'} = $from;
     }
     if ($mha_head =~ /<!--X-Message-Id: ([^-]+) -->/) {
 	$fields->{'message-id'} = '&lt;' . uncommentize($1). '&gt;';
@@ -173,6 +184,12 @@
     }
 }
 
+sub base64toeuc { 
+	my $str = shift;
+	$str =~ s/=\?iso-2022-jjp\?b\?([0-9A-Za-z\+\/\=]+)\?=/MIME::Base64::decode($1)/egi;
+	Jcode::convert(\$str,'euc');
+	return $str;
+}
 sub uncommentize {
     my($txt) = $_[0];
     $txt =~ s/&#(\d+);/pack("C",$1)/ge;
