Namazu-users-en(old)
Malformed UTF-8 character ...
- From: Earl Hood <earl@xxxxxxxxxxxx>
- Date: Wed, 05 May 2004 17:11:33 -0500
- X-ml-name: namazu-users-en
- X-mail-count: 00498
Namazu version: 2.0.13
Perl version: 5.8.4
OS: Linux 2.4.21-4.ELsmp #1 SMP Fri Oct 3 17:52:56 EDT 2003 i686 i686 i386 GNU/Linux
Running mknmz generates the following message repeatedly:
Malformed UTF-8 character (unexpected continuation byte 0xa4, with no preceding
start byte) in pattern match (m//) at /usr/local/share/namazu/filter/mailnews.pl
line 216, <GEN5> line 71.
...
Figuring it was caused by the LANG environment variable, I explicitly
set LANG to en_US (it defaulted to en_US.UTF-8), but that did not fix it.
Maybe I should try en_US.ISO-8859-1?
To suppress the message I added a "use bytes" pragma to mailnews.pl
to avoid Perl doing any character processing:
--- mailnews.pl.20040505 2004-05-05 14:52:23.000000000 -0700
+++ mailnews.pl 2004-05-05 14:53:56.000000000 -0700
@@ -209,6 +209,7 @@ sub mailnews_citation_filter ($$) {
$$contref = "";
my $i = 0;
for my $line (@tmp) {
+ use bytes;
# Complete excluding is impossible. I tnink it's good enough.
# Process only first five paragrahs.
# And don't handle the paragrah which has five or longer lines.
I put the pragma just within the block that was generating the
warnings.
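As a minimal sketch of the idea (not the actual mailnews.pl code), the
following shows a lexically scoped "use bytes" around a match on raw
octets. The sub name classify_line and the two-octet pattern 0xA4 0xA2
are assumptions made for illustration; only the 0xa4 byte comes from
the warning above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the filter's pattern: two raw octets,
# 0xA4 0xA2, written as escapes instead of literal 8-bit characters.
sub classify_line {
    my ($line) = @_;
    use bytes;    # lexically scoped: byte semantics until the end of this sub
    return $line =~ /^\xa4\xa2/ ? 'citation' : 'body';
}

print classify_line("\xa4\xa2 quoted text"), "\n";
print classify_line("plain text"), "\n";
```

Because the pragma is lexical, code outside the sub keeps its normal
character semantics, which is why placing it inside the loop body is
less invasive than putting it at the top of the file.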
I'm unsure if this is the best fix, but since mailnews.pl contains
8-bit values in a regex, something should be done to avoid Perl
trying to interpret the octets under a character encoding.
It may be better to make the code conditional on the language
setting, i.e., have a different regex for each supported locale.
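A rough sketch of what that could look like, assuming a lookup table
keyed on the language part of LANG. The table %CITATION_RE, the sub
citation_re_for, and both patterns are hypothetical and are not taken
from mailnews.pl.

```perl
use strict;
use warnings;

# Hypothetical table: language code => citation pattern.
# 8-bit values are written as \x escapes so the source stays 7-bit clean.
my %CITATION_RE = (
    ja => qr/^\xa4\xa2/,
    en => qr/^(?:>|\S+ wrote:)/,
);

# Pick a pattern from a locale string such as "ja_JP.eucJP" or "en_US";
# fall back to the "en" pattern for unknown or unset locales.
sub citation_re_for {
    my ($locale) = @_;
    my $code = 'en';
    if (defined $locale && $locale =~ /^([a-z]{2})(?:[_.@]|$)/) {
        $code = $1 if exists $CITATION_RE{$1};
    }
    return $CITATION_RE{$code};
}

my $re = citation_re_for($ENV{LANG});
print "citation line\n" if "> quoted text" =~ $re;
```

Only the pattern selection depends on the locale here; the matching
loop itself would stay the same for every language.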
--ewh