namazu-dev(ring)
Re: mail from the author of xpdf
Satoru Takabayashi <satoru-t@xxxxxxxxxxxxxxxxxx> wrote:
>A mail like the following has arrived. I have to study for an
>exam today, so I will reply tomorrow or later.
I have sent the following reply. Please let me know if there
are any problems with it.
Thank you for your email with the enquiry below.
"Derek B. Noonburg" <derekn@xxxxxxxxxxx> wrote:
>I received email from Arumugam-san asking about using your Namazu search
>software to index and search PDF files. I'm the author of xpdf, which
>includes a program called pdftotext that extracts the text from PDF
>files. Currently, xpdf can display Japanese text, but pdftotext cannot
>extract it (pdftotext only handles 8-bit fonts).
Wow, that's good news not only for me but for all Japanese
UNIX users struggling to handle PDF files in text processing!
>It should not be too hard for me to add support for Japanese text to
>pdftotext. One thing I need to know is: what encoding does Namazu use
>for Japanese text? PDF files use Adobe Japan1-2 (and variations)
>internally. I already have a mapping from Japan1-2 to JIS X 0208-1983.
>Is this useful? Also, is there some way of distinguishing 8-bit and
>16-bit characters in the same text file?
Namazu uses a tool called NKF[*1] for reading Japanese texts.
NKF can handle Japanese texts encoded in ISO-2022-JP (RFC
1468), EUC-JP (Extended UNIX Code), and Shift_JIS (which was
created by Microsoft).
*1. <ftp://ftp.ie.u-ryukyu.ac.jp/pub/software/kono/nkf171.shar>
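NKF itself is a standalone external tool, but the kind of conversion it performs can be sketched with Python's built-in codecs (a Python illustration only; it is not part of Namazu or NKF, which the letter does not show):

```python
# Re-encode Japanese text among the three encodings NKF handles.
# (Illustrative only; Namazu actually invokes the external NKF tool.)
text = "日本語"  # "Japanese language"

sjis = text.encode("shift_jis")    # as written by Microsoft systems
jis = text.encode("iso2022_jp")    # as used in mail (RFC 1468)

# Convert Shift_JIS input to EUC-JP, Namazu's internal encoding.
eucjp = sjis.decode("shift_jis").encode("euc_jp")
print(eucjp)  # b'\xc6\xfc\xcb\xdc\xb8\xec'
```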
The internal encoding of Namazu is EUC-JP. I chose EUC-JP
because it is very easy for Perl to handle Japanese texts in
it. (ISO-2022-JP and Shift_JIS are cumbersome to handle.)
If you want to remove all Japanese characters from a text,
you can do it like this (in Perl):
# $content holds the text, with Japanese characters encoded in EUC-JP
# remove all Japanese characters (each is a two-byte sequence)
$content =~ s/[\xa1-\xfe][\xa1-\xfe]//g;
In short, the regex "[\xa1-\xfe][\xa1-\xfe]" matches one
Japanese character, which takes 16 bits in the text. In an
EUC-JP encoded text, the charset for Japanese characters is
JIS X 0208-1983, and the MSB of each of their bytes is set
to 1. Since EUC-JP never uses a lone 8-bit character in the
range [\x80-\xff], and its single-byte characters are
limited to the ASCII charset in the range [\x00-\x7f], you
can easily distinguish 8-bit ASCII characters from 16-bit
JIS X 0208-1983 characters.
In other words, EUC-JP is an encoding constructed by the
following rules:
* For 8-bit characters, EUC-JP uses the ASCII charset, which
  takes 8 bits and has the code range [\x00-\x7f].
* For 16-bit characters, EUC-JP uses the JIS X 0208-1983
  charset, which takes 16 bits and sets the MSB of each
  byte to 1, so the code range is [\xa1-\xfe].
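The two rules above can be sketched as a small scanner (a Python illustration under these simplified rules; the function name and tags are my own, not code from Namazu):

```python
def classify_eucjp(data: bytes):
    """Split a simplified EUC-JP byte string into characters,
    tagging each as 8-bit ASCII or 16-bit JIS X 0208-1983."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:                    # rule 1: ASCII, one byte
            chars.append(("ascii", data[i:i + 1]))
            i += 1
        elif 0xA1 <= b <= 0xFE:          # rule 2: JIS X 0208, two bytes
            chars.append(("jisx0208", data[i:i + 2]))
            i += 2
        else:
            raise ValueError("byte outside the simplified EUC-JP rules")
    return chars

# "AB" followed by the kanji 亜 (0xB0 0xA1 in EUC-JP)
print(classify_eucjp(b"AB\xb0\xa1"))
# [('ascii', b'A'), ('ascii', b'B'), ('jisx0208', b'\xb0\xa1')]
```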
Strictly speaking, there are additional rules for encoding
JIS X 0201 (the so-called Hankaku-Kana) and JIS X 0212 (the
so-called Hojo-Kanji). But those are rarely used and,
moreover, difficult to follow, so I think you can ignore
them. (I always do.)
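For completeness, those ignored rules use escape bytes: a half-width kana (JIS X 0201) follows a 0x8E prefix, and a Hojo-Kanji (JIS X 0212) follows a 0x8F prefix. A hedged Python sketch of the resulting character lengths (my own illustration, not code from Namazu):

```python
def char_len_eucjp(data: bytes, i: int) -> int:
    """Byte length of the EUC-JP character starting at offset i,
    including the rarely used JIS X 0201 / JIS X 0212 escapes."""
    b = data[i]
    if b <= 0x7F:
        return 1    # ASCII
    if b == 0x8E:
        return 2    # SS2 + one byte: half-width kana (JIS X 0201)
    if b == 0x8F:
        return 3    # SS3 + two bytes: Hojo-Kanji (JIS X 0212)
    return 2        # JIS X 0208-1983 pair

# half-width katakana ｱ is 0x8E 0xB1 in EUC-JP
print(char_len_eucjp(b"\x8e\xb1", 0))  # 2
```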
If you want accurate information on Japanese text
processing, I recommend that you consult the book:
* CJKV Information Processing : Chinese, Japanese, Korean & Vietnamese
<http://www.oreilly.com/catalog/cjkvinfo/noframes.html>
and check out its author's webpage:
* Ken Lunde's Home Page
<http://www.ora.com/people/authors/lunde/>
Anyway, I know of another tool that extracts text from a PDF
file. PDF2TXT, written in Perl by <ishida@xxxxxxxxxxxxxxx>,
can handle Japanese texts. I think it will be helpful for
learning how to support Japanese texts. You can get PDF2TXT
from the following URI.
* /pub/person/ishida/freeware/pdf2txt directory
<ftp://paprika.noc.intec.co.jp/pub/person/ishida/freeware/pdf2txt/>
If you have any questions, please feel free to email me.
Regards,
-- Satoru Takabayashi
>I'm Cc'ing mimasa-san. (I'd like his advice.)
What has mimasa-san been doing lately? If you are free, why
not come to the party on June 5? (Though I suppose there is
no such thing as free time in mimasa-san's work ;-)