Namazu-devel-ja(旧)
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: HTML splitting
Satoru Takabayashi <satoru-t@xxxxxxxxxxxxxxxxxx> wrote:
>だけです。ひとまず「HTMLファイルを処理する際に適切に分割する」
>部分を独立のプログラムとして書いてみます。
さっそく、いい加減なツールを作りました。お試しください。
% ls
htmlsplit manual.html
% perl htmlsplit < manual.html
% ls
TOP.html mailutime.html query-notes.html
bnamazu.html manual.html query-or.html
cgi.html mknmz-option.html query-phrase.html
components.html mknmz.html query-regex.html
default-index.html mknmzrc.html query-substring.html
doc-filter.html namazu-option.html query-term.html
form-idxname.html namazu.html query.html
form-idxnames.html namazurc.html rfnmz.html
form-lang.html nmzgrep.html setting.html
form-subquery.html query-and.html template.html
form.html query-field.html tools.html
gcnmz.html query-grouping.html vfnmz.html
htmlsplit query-not.html
% lynx TOP.html
# manual.html 以外ではテストしていません
-- Satoru Takabayashi
#! /usr/local/bin/perl
use strict;
use FileHandle;
my $Header = << 'EOS';
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>${subject}</title>
</head>
<body>
<h1>${subject}</h1>
<hr>
EOS
my $Footer = << 'EOS';
<hr>
<address>
${author}
</address>
</body>
</html>
EOS
my $cont = join '', <>;
my $Title = get_title(\$cont);
my $Author = get_author(\$cont);
my $Name = "TOP";
$cont =~ s/(<a\s[^>]*href=(["']))#(.+?)(\2[^>]*>)/$1$3.html$4/g;
$cont =~ s/\G(.+?)<a\s[^>]*name=(["'])(.+?)\2[^>]*>/
write_partial_file($1, $3)/sgex;
write_partial_file($cont, "");
sub get_title ($) {
my ($contref) = @_;
my $title = undef;
if ($$contref =~ s!<TITLE[^>]*>([^<]+)</TITLE>!!i) {
$title = $1;
$title =~ s/\s+/ /g;
$title =~ s/^\s+//;
$title =~ s/\s+$//;
} else {
$title = "no title";
}
return $title;
}
sub get_author ($) {
my ($contref) = @_;
my $author = "unknown";
# <LINK REV=MADE HREF="mailto:ccsatoru@xxxxxxxxxxxxxxxxxx">
if ($$contref =~ m!<LINK\s[^>]*?HREF=([\"\'])mailto:(.*?)\1\s*>!i) { #"
$author = $2;
} elsif ($$contref =~ m!.*<ADDRESS[^>]*>([^<]*?)</ADDRESS>!i) {
my $tmp = $1;
# $tmp =~ s/\s//g;
if ($tmp =~ /\b([\w\.\-]+\@[\w\.\-]+(?:\.[\w\.\-]+)+)\b/) {
$author = $1;
}
}
return $author;
}
sub write_partial_file($$) {
my ($cont, $name) = @_;
my $fname = "$Name.html";
# print STDERR "$fname\n";
my $fh = new FileHandle;
$fh->open(">$fname") || die "$fname: $!";
my $header = $Header;
$header =~ s/\$\{subject\}/$Title: [$Name]/g;
print $fh $header;
print $fh $cont;
my $footer = $Footer;
$footer =~ s/\$\{author\}/$Author/g;
print $fh $footer;
$Name = $name;
}