namazu-ml(ring)
Re: Comment
Koji Kishi <kis@xxxxxxxxxxxxxx> wrote:
>I noticed this with Namazu v1.3.0.11 (it may have behaved this way before too):
>when the HTML contains code like the following,
(snip)
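>everything from "=3 ) { return true;}" on line 7 onward seems to be treated as body text.
>I wonder whether a comment is being taken to run from
> <!-- to the next >
>and no further.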
Ah, sorry about that. Namazu strips HTML rather crudely with regular
expressions. For the time being, find this function in mknmz:
sub erase_html_tags ($) {
    my ($contents) = @_;
    1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
}
Change that function to the following:
sub erase_html_tags ($) {
    my ($contents) = @_;
    $$contents =~ s/<!--.*?-->//gs; # add this line
    1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
}
I think that will work around the problem (though not completely).
Version 2.0, currently in development, already does this.
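
To see what the added line changes, here is a small self-contained sketch.
The real tag_to_space_or_null from mknmz is not shown in this message, so
the version below is a hypothetical stand-in that turns every tag into a
space. The sample HTML is also made up to resemble the reported case, and
the _old/_new function names are only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical stand-in for mknmz's tag_to_space_or_null (not shown in
# this message): every tag simply becomes a single space.
sub tag_to_space_or_null { return ' ' }

# Original behaviour: only <...> runs without a nested < or > are removed.
sub erase_html_tags_old {
    my ($contents) = @_;
    1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
}

# Patched behaviour: drop <!-- ... --> first, then remove the tags.
sub erase_html_tags_new {
    my ($contents) = @_;
    $$contents =~ s/<!--.*?-->//gs;
    1 while ($$contents =~ s/<\/?([^<>]*)>/tag_to_space_or_null($1)/ge);
}

# Made-up sample: a script hidden inside an HTML comment.
my $html = <<'EOT';
<script><!--
if ( a >=3 ) { return true;}
// --></script>
<p>body</p>
EOT

my ($old, $new) = ($html, $html);
erase_html_tags_old(\$old);
erase_html_tags_new(\$new);
print "old: $old\n";    # script text after the first ">" leaks through
print "new: $new\n";    # the whole comment is gone before tags are removed

# A case the patch does not fix: ">" inside a quoted attribute value.
my $tricky = q{<IMG SRC='foo.gif' ALT="> ">};
erase_html_tags_new(\$tricky);
print "tricky: $tricky\n";    # part of the ALT value is left behind
```

With the old function the first ">" inside the comment ends the match, so
the rest of the script is indexed as text. The patched one avoids that, but
the last case shows why the fix is only partial.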
Incidentally, it is well known that HTML tags cannot be removed
accurately with regular expressions alone.
From: Tom Christiansen <tchrist@xxxxxxxxxxxx>
Newsgroups: comp.lang.perl.misc
Subject: Re: Can't Match Multi-Line Pattern
Date: Fri, 7 Aug 1998 22:38:08 JST
Message-ID: <6qf000$8b4$1@xxxxxxxxxxxxxxxxxxxxxx>
| Question: Assuming $_ contains HTML, which of
| the following substitutions will remove all tags in it?
| Type: Regular Expressions, WWW
| Difficulty: 6/7 (Hard)
|
| Answer: You can't do that.
| Correct: Yes.
| Why: If it weren't for HTML comments, improperly formatted
| HTML, and tags with interesting data like <SCRIPT>, you
| could do this. Alas, you cannot. It takes a lot
| more smarts, and quite frankly, a real parser.
|
|
| Answer: s/<.*>//g;
| Correct: No.
| Why: As written, the dot will not cross newline boundaries, and the
| star is being too greedy. If you add a /s, then yes,
| it will remove all tags -- and a great deal else besides.
|
| Answer: s/<.*?>//gs;
| Correct: No.
| Why: It is easy to construct a tag that will cause this to fail,
| such as the common `<IMG SRC='foo.gif' ALT="> ">' tag.
|
| Answer: s/<\/?[A-Z]\w*(?:\s+[A-Z]\w*(?:\s*=\s*(?:(["']).*?\1|[\w-.]+))?)*\s*>//gsix;
| Correct: No.
| Why: For a good deal of HTML, this will actually work, but
| it will fail on cases with annoying comments, poorly formatted
| HTML, and tags like <SCRIPT> and <STYLE>, which can contain
| things like `while (<FH>) {}' without those being counted
| as tags. Comments that will annoy you include
| <!-- <foo bar = "-->">
| which will remove characters when it shouldn't; it's just
| a comment followed by `">'. And even something like this:
| <!-- <foo bar = "-->
| Most browsers will get right, but the substitution will not.
| And if you have improper HTML, you get into even more
| trouble, like this:
| <foo bar = "bleh" @>
| text text text
| <foo bar = "bleh">
| in which case the .*? will gobble up way more than you
| thought it would.
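
As a rough illustration of the "more smarts" the quoted article asks for,
the sketch below walks the input with a small hand-written scanner instead
of a single substitution. It is not what Namazu does and not a real HTML
parser. strip_html_sketch is a hypothetical name, and the rules (comments
end at the first "-->", SCRIPT and STYLE bodies are dropped, a ">" inside a
quoted attribute value does not end a tag) are simplifying assumptions.
Entities are not decoded, and badly broken markup will still fool it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Namazu): strip markup by scanning the
# input piece by piece with \G-anchored matches.
sub strip_html_sketch {
    my ($html) = @_;
    my $text = '';
    pos($html) = 0;
    while (pos($html) < length $html) {
        if ($html =~ /\G<!--.*?(?:-->|\z)/gcs) {
            # Comment: skip to the first "-->", or to the end of input.
        }
        elsif ($html =~ /\G<(script|style)\b.*?<\/\1\s*>/gcsi) {
            # Element whose content is code or style, not document text.
            $text .= ' ';
        }
        elsif ($html =~ /\G<\/?[A-Za-z](?:"[^"]*"|'[^']*'|[^'">])*>/gc) {
            # Tag: quoted attribute values may contain ">".
            $text .= ' ';
        }
        elsif ($html =~ /\G([^<]+|<)/gc) {
            # Plain text, or a stray "<" that does not open anything.
            $text .= $1;
        }
    }
    return $text;
}

# Inputs taken from the quoted article, plus a bare "<" in running text.
for my $sample (
    q{<IMG SRC='foo.gif' ALT="> ">},
    q{<!-- <foo bar = "-->">},
    q{<script>while (<FH>) {}</script>after},
    q{a < b},
) {
    printf "%-40s => [%s]\n", $sample, strip_html_sketch($sample);
}
```

Each branch consumes at least one character, so the loop always advances.
For anything beyond a sketch, a dedicated parser module is the safer choice.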
-- Satoru Takabayashi