Namazu-users-ja (old)
Behavior with File::MMagic
- From: Taiji.Can@xxxxxxxxxxxxxxxxxxx
- Date: Thu, 24 Jul 2003 09:55:59 +0900
- X-ml-name: namazu-users-ja
- X-mail-count: 03401
This is Can.
On a RedHat 9 server, mknmz dumps core while (or perhaps just before) parsing
PDF files, so mknmz cannot run to completion.
The same files were handled without problems on Solaris 8, so I am puzzled.
The two setups are:
          RedHat 9     Solaris 8
Perl      5.8.0        5.6.1
xpdf      2.02pl1      2.00
namazu    2.0.12       2.0.12
The files in question are:
% file *
5968-5161E.pdf: Macintosh MacBinary data, type "PDF " (Portable Document Format), creator "CARO"
5968-5162E.pdf: PDF document, version 1.2
5968-5163E.pdf: Macintosh MacBinary data, type "PDF " (Portable Document Format), creator "CARO"
I tested them with the script that came out of the exchange between
Takabayashi-san and Nokubi-san around 2000 about file type checking.
% cat File
#! /usr/bin/perl -w
use strict;
use FileHandle;        # FileHandle is used directly below
use File::MMagic;
use Compress::Zlib;

for my $filename (@ARGV) {
    my $mm = new File::MMagic;
    # Extra patterns for RFC-style plain text
    $mm->addSpecials("text/plain; x-type=rfc",
                     "^Network Working Group",
                     "^Request for Comments:",
                     "^Obsoletes:",
                     "^Category:",
                     "^Updates:");
    # Extra patterns for TeX sources
    $mm->addSpecials("application/x-tex",
                     '^\\\\document(style|class)');
    $mm->addFileExts('\\.tex$', 'application/x-tex');

    # Read the whole file and guess its type from the contents
    my $fh = new FileHandle "< $filename";
    my $cont = join('', <$fh>);
    my $type = $mm->checktype_contents($cont);

    if ($type =~ /^application\/x-gzip/) {
        {
            # Skip the gzip header by hand: magic and method (3 bytes),
            # flags (1 byte), then mtime, extra flags and OS (6 bytes)
            my $offset = 0;
            $offset += 3;
            my $flags = unpack('C', substr($cont, $offset, 1));
            $offset += 1;
            $offset += 6;
            $cont = substr($cont, $offset);
            $cont = substr($cont, 2) if ($flags & 0x04);    # extra field (length bytes only)
            $cont =~ s/^[^\0]*\0// if ($flags & 0x08);      # original file name
            $cont =~ s/^[^\0]*\0// if ($flags & 0x10);      # comment
            $cont = substr($cont, 2) if ($flags & 0x02);    # header CRC
        }
        # Inflate the raw deflate stream and check the type again
        my $x = inflateInit(-WindowBits => - MAX_WBITS());
        my ($inf, $stat) = $x->inflate($cont);
        $cont = $inf if $stat == Z_OK or $stat == Z_STREAM_END;
        $type = $mm->checktype_contents($cont);
        print "Compressed:";
    }
    print "$filename: $type\n";
}
The result is:
manager:/home/manager# File *
5968-5161E.pdf: text/plain
5968-5162E.pdf: application/pdf
5968-5163E.pdf: text/plain
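I do not know why this leads to a core dump afterwards, but I have confirmed
that pdftotext handles these files without problems as long as they are
recognized as PDF.
What should I do to work around this?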
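
One possible workaround, given only as a rough sketch: the two misdetected files are reported by file(1) as MacBinary data, which means a 128-byte MacBinary header sits in front of the actual PDF data, so the "%PDF-" signature is not at offset 0. The helper below, strip_macbinary, is a hypothetical name and not part of Namazu or File::MMagic. It assumes the usual MacBinary header layout (byte 0 is 0, byte 1 is the filename length, bytes 74 and 82 are 0, bytes 83 to 86 hold the big-endian data fork length) and returns the data fork, which could then be passed to checktype_contents. Whether this also avoids the core dump is untested.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the data fork if $cont looks like a
# MacBinary-wrapped file, otherwise return $cont unchanged.
sub strip_macbinary {
    my ($cont) = @_;
    return $cont if length($cont) < 128;

    # Assumed MacBinary header layout (128 bytes in total)
    my ($zero, $namelen) = unpack('C C', $cont);
    my $b74     = unpack('C', substr($cont, 74, 1));
    my $b82     = unpack('C', substr($cont, 82, 1));
    my $datalen = unpack('N', substr($cont, 83, 4));

    # Not a plausible MacBinary header: leave the contents alone
    return $cont unless $zero == 0 && $namelen >= 1 && $namelen <= 63
                     && $b74 == 0 && $b82 == 0;
    return $cont if $datalen == 0 || 128 + $datalen > length($cont);

    # The data fork starts right after the header
    return substr($cont, 128, $datalen);
}

# Small self-check with a synthetic header (made-up test data)
my $pdf    = "%PDF-1.2\n%%EOF\n";
my $header = "\0" x 128;
substr($header, 1, 1)  = chr(4);                  # filename length
substr($header, 2, 4)  = "test";                  # filename
substr($header, 65, 4) = "PDF ";                  # file type
substr($header, 69, 4) = "CARO";                  # creator
substr($header, 83, 4) = pack('N', length $pdf);  # data fork length
my $wrapped = $header . $pdf;

print substr(strip_macbinary($wrapped), 0, 5), "\n";   # prints "%PDF-"
```

In the test script above, this would amount to calling strip_macbinary on $cont before checktype_contents. The file must be read in binary mode for the offsets to be reliable.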
For reference, I am also attaching the output of
perl -MFile::MMagic -e '$m = new File::MMagic; print "$File::MMagic::VERSION\n"; $m->check_magic();'
below.
0 string =BZh application/x-bzip2
0 string =#VRML V1.0 ascii model/vrml
0 string =#VRML V2.0 utf8 model/vrml
0 short =51966
>2 short =47806 application/java
0 string =.snd
>12 belong =1 audio/basic
>12 belong =2 audio/basic
>12 belong =3 audio/basic
>12 belong =4 audio/basic
>12 belong =5 audio/basic
>12 belong =6 audio/basic
>12 belong =7 audio/basic
>12 belong =23 audio/x-adpcm
0 lelong =6583086
>12 lelong =1 audio/x-dec-basic
>12 lelong =2 audio/x-dec-basic
>12 lelong =3 audio/x-dec-basic
>12 lelong =4 audio/x-dec-basic
>12 lelong =5 audio/x-dec-basic
>12 lelong =6 audio/x-dec-basic
>12 lelong =7 audio/x-dec-basic
>12 lelong =23 audio/x-dec-adpcm
8 string =AIFF audio/x-aiff
8 string =AIFC audio/x-aiff
8 string =8SVX audio/x-aiff
0 string =MThd audio/unknown
0 string =CTMF audio/unknown
0 string =SBI audio/unknown
0 string =Creative Voice File audio/unknown
0 string =RIFF
>8 string =WAVE audio/x-wav
0 string =/* XPM image/x-xbm
0 string =/* text/plain
0 string =// text/plain
0 string =^_\235 application/x-compress
0 string =^_\213 application/x-gzip
0 string =^_^^ application/octet-stream
0 short =7967 application/octet-stream
0 short =8191 application/octet-stream
0 string =\377^_ application/octet-stream
0 short =51973 application/octet-stream
0 string =<MakerFile application/x-frame
0 string =<MIFFile application/x-frame
0 string =<MakerDictionary application/x-frame
0 string =<MakerScreenFon application/x-frame
0 string =<MML application/x-frame
0 string =<Book application/x-frame
0 string =<Maker application/x-frame
0 string =<HEAD text/html
0 string =<head text/html
0 string =<TITLE text/html
0 string =<title text/html
0 string =<html text/html
0 string =<HTML text/html
0 string =<!-- text/html
0 string =<h1 text/html
0 string =<H1 text/html
0 string =P1 image/x-portable-bitmap
0 string =P2 image/x-portable-greymap
0 string =P3 image/x-portable-pixmap
0 string =P4 image/x-portable-bitmap
0 string =P5 image/x-portable-greymap
0 string =P6 image/x-portable-pixmap
0 string =IIN1 image/x-niff
0 string =MM image/tiff
0 string =II image/tiff
0 string =GIF94z image/unknown
0 string =FGF95a image/unknown
0 string =PBF image/unknown
0 string =GIF image/gif
0 beshort =65496 image/jpeg
0 string =BM image/bmp
0 string =\211PNG image/png
0 string =;; text/plain
0 string =^J( application/x-elc
0 string =;ELC^S^@^@^@ application/x-elc
0 string =Relay-Version: message/rfc822
0 string =#! rnews message/rfc822
0 string =N#! rnews message/rfc822
0 string =Forward to message/rfc822
0 string =Pipe to message/rfc822
0 string =Return-Path: message/rfc822
0 string =Path: message/news
0 string =Xref: message/news
0 string =From: message/rfc822
0 string =Article message/news
0 string =\3767^@# application/msword
0 string =\333\245-^@^@^@ application/msword
0 string =%! application/postscript
0 string =^D%! application/postscript
0 string =%PDF- application/pdf
38 string =Spreadsheet application/x-sc
0 string =\367^B application/x-dvi
0 string =\input texinfo text/x-texinfo
0 string =This is Info file text/x-info
0 leshort =759 application/x-dvi
0 string ={\rtf application/rtf
0 string =^@^@^A\263 video/mpeg
0 byte =1 video/unknown
0 byte =2 video/unknown
0 string =DOC
>43 byte =20 application/ichitaro4
>144 string =JDASH application/ichitaro4
0 string =DOC
>43 byte =21 application/ichitaro5
0 string =DOC
>43 byte =22 application/ichitaro6
2080 string =Microsoft Excel 5.0 Worksheet application/excel
2114 string =Biff5 application/excel
0 string =\224\246. application/msword
0 belong =834535424 application/msword
0 string =PO^Q` application/msword
0 string =\320\317^Q\340\241\261^Z\341
>546 string =bjbj application/msword
>546 string =jbjb application/msword
512 string =R^@o^@o^@t^@ ^@E^@n^@t^@r^@y application/msword
2080 string =Microsoft Word 6.0 Document application/msword
2080 string =Documento Microsoft Word 6 application/msword
2112 string =MSWordDoc application/msword
0 string =\320\317^Q\340\241\261^Z\341 application/msword
0 belong =435 video/mpeg
0 belong =442 video/mpeg
0 beshort &65504 audio/mpeg
0 string =MOVI video/quicktime
4 string =moov video/quicktime
4 string =mdat video/quicktime
128 string =PE^@^@ application/octet-stream
0 string =PE^@^@ application/octet-stream
0 string =LZ application/octet-stream
0 string =MZ
>24 string =@ application/octet-stream
0 string =MZ
>30 string =Copyright 1989-1990 PKWARE Inc. application/x-zip
0 string =MZ
>30 string =PKLITE Copr. application/x-zip
0 string =MZ
>36 string =LHa's SFX application/x-lha
0 string =MZ
>36 string =LHA's SFX application/x-lha
0 string =MZ application/octet-stream
2 string =-lh
>6 string =- application/x-lha
0 string =PK application/x-zip
257 string =ustar^@ application/x-tar
257 string =ustar ^@ application/x-gtar
1.12
--
ADVANTEST corp.
Taiji.Can@xxxxxxxxxxxxxxxxxxx