问题:RegEx匹配XHTML自包含标签以外的打开标签

我需要匹配所有这些开始标记:

<p>
<a href="foo">

但不是这些:

<br />
<hr class="foo" />

我想到了这个,想确保我做对了。我只捕获a-z

<([a-z]+) *[^/]*?>

我相信它说:

  • 找到小于号,然后
  • 找到(并捕获)a-z一次或多次,然后
  • 找到零个或多个空格,然后
  • 找到任何字符零次或多次,贪婪,除了/,然后
  • 找到大于...

我有权利吗?更重要的是,您怎么看?

标签:html,regex,xhtml

Q: RegEx match open tags except XHTML self-contained tags

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right? And more importantly, what do you think?

回答1:

您无法使用正则表达式解析[X] HTML。因为正则表达式无法解析HTML。正则表达式不是可用于正确解析HTML的工具。正如我之前在这里多次回答HTML和Regex问题一样,使用正则表达式将不允许您使用HTML。正则表达式是一种工具,不够复杂,无法理解HTML所采用的结构。 HTML不是常规语言,因此无法通过常规表达式进行解析。正则表达式查询无法将HTML分解为有意义的部分。有很多次了,但是没有得到我。甚至Perl使用的增强的不规则正则表达式也无法完成HTML解析任务。你永远不会让我崩溃。 HTML是一种足够复杂的语言,无法通过正则表达式进行解析。甚至Jon Skeet也无法使用正则表达式解析HTML。每次您尝试使用正则表达式解析HTML时,这个邪恶的孩子都会哭泣处女之血,俄罗斯黑客会伪装您的Web应用程序。用正则表达式解析HTML会使灵魂陷入生活领域。 HTML和正则表达式可以像爱情,婚姻和仪式杀婴一样一起使用。

无法容纳它为时已晚。正则表达式和HTML共同作用于同一个概念空间中,将破坏您的思维,就像水腻腻一样。如果您使用正则表达式解析HTML,那么您就是在屈服于他们及其亵渎神明的方式,这使我们所有人都为不愿在基本多语言平面中表达其名字的人付出辛劳。 HTML + regexp将在您观察的同时液化知觉的神经,使您的心灵在恐怖的冲击中枯萎。基于Rege̿̔̉x的HTML解析器是杀死StackOverflow的癌症为时已晚,为时已晚,我们无法挽救儿童的混乱确保正则表达式将消耗所有活组织(除了HTML,它不能,如先前所预言的那样:尊敬的上帝,我们将使用regex解析HTML来挽救人类的祸患,这使人类注定要遭受可怕的酷刑和安全漏洞,使用rege x作为一种处理HTML的工具在这个世界和腐败实体(例如SGML实体,但更腐败)的可怕领域之间建立了联系,仅仅是对世界的一瞥。 HTML的reg ex解析器将立即将aprgraphmer的意识传达给不断尖叫的他,他来了 ,农药sl 正则表达式感染会吞噬您的HT ML解析器,像Visual Basic这样的应用和存在的所有时间只会更糟他来了 请不要担心,̕hrai 破坏所有的宽容,HTML标记请不要担心。 > uid p ain,律动的歌曲re 会话解析 将消灭sp中的世俗男人的声音。在这里我可以看到你能看到``我很美丽''我的人的谎言inalsnuf fing the all is LOŚ͖̩͇̗̪̏̈́TA A LL ISL OST th ponyy他来了他走了 es he co <罢工> 我 st he ich 或渗透到我的脸上,我的脸,我的脸,上帝啊! b> O‐ON Θ停下<*>一个g̶͑̾̾l͖͉̗̩̳̟̍ͫͥͨe̠̅s͎a̧͈͖r̽̾̈́͒͑e n ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ T O͇̹̺ͅƝ̴ȳ̳TH̘ Ë͖́̉ ͠P̯͍̭Ó̚N̐Y̸̡̪̯ͨ͊̽̅̾̎ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝ S̨̥̫͎̭ͯ̿̔̀ͅ


您是否尝试过使用XML解析器?


主持人说明

此信息已被锁定,以防止对其内容进行不适当的编辑。该帖子看起来与预期的完全一样-内容没有问题。请不要标记它以引起我们的注意。

A1:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes , the pestilent sl ithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​re ssion parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮om es he co me s the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ


Have you tried using an XML parser instead?


Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

回答2:

虽然任意仅带有正则表达式的HTML是不可能的,但有时使用它们来解析有限的已知的HTML集合是适当的。

如果您想从一小部分HTML页面中抓取数据,然后填充到数据库中,则正则表达式可能会正常工作。例如,我最近想获得我从议会网站上获得的澳大利亚联邦代表的姓名,政党和地区。这是一项有限的一次性工作。

正则表达式对我来说效果很好,并且设置起来非常快。

A2:

While arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.

回答3:

我认为这里的缺陷在于HTML是 Chomsky Type 2语法(无上下文语法)< / a>,而RegEx是 Chomsky Type 3语法(常规语法)。由于类型2语法从根本上比类型3语法复杂(请参见乔木斯基层次结构),使用RegEx解析XML在数学上是不可能的。

但是很多人会尝试,有些甚至会宣称成功-但是直到其他人发现错误并完全把你弄糟为止。

A3:

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse XML with RegEx.

But many will try, some will even claim success - but until others find the fault and totally mess you up.

回答4:

别听这些家伙。如果将任务分解成较小的部分,则您完全可以 使用正则表达式来解析上下文无关的语法。您可以使用按顺序执行这些操作的脚本来生成正确的模式:

  1. 解决停止问题。
  2. 广场。
  3. 计算O(log n)或以下的旅行商问题。如果不止如此,您将用完RAM,引擎将挂起。
  4. 该模式将很大,因此请确保您有一个无损压缩随机数据的算法。
  5. 几乎在那里-将整个事物除以零。轻松自在。

我自己还没有完成最后一部分,但是我知道我已经接近了。由于某种原因,它总是抛出CthulhuRlyehWgahnaglFhtagnException s,所以我将其移植到VB 6并使用OnErrorResumeNext。一旦调查了刚刚在墙上打开的这扇奇怪的门,我将更新代码。嗯。

P.S。皮埃尔·德·费马特(Pierre de Fermat)也想出了办法,但是他写的利润不足以容纳代码。

A4:

Don't listen to these guys. You totally can parse context-free grammars with regex if you break the task into smaller pieces. You can generate the correct pattern with a script that does each of these in order:

  1. Solve the Halting Problem.
  2. Square a circle.
  3. Work out the Traveling Salesman Problem in O(log n) or less. If it's any more than that, you'll run out of RAM and the engine will hang.
  4. The pattern will be pretty big, so make sure you have an algorithm that losslessly compresses random data.
  5. Almost there - just divide the whole thing by zero. Easy-peasy.

I haven't quite finished the last part myself, but I know I'm getting close. It keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions for some reason, so I'm going to port it to VB 6 and use On Error Resume Next. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

P.S. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

回答5:

免责声明:如果有选择,请使用解析器。那就是...

这是我使用(!)来匹配HTML标签的正则表达式:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

这可能并不完美,但是我通过很多 HTML运行了这段代码。请注意,它甚至会捕获显示在网络上的奇怪内容,例如

我想使其与自包含标签不匹配,您要么要使用科比的负面印象:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

或仅将是否合并

致反对者:这是实际产品中的有效代码。我怀疑任何阅读此页面的人都会得到这样的印象:在HTML上使用正则表达式在社会上是可以接受的。

注意事项:我应该注意到,在存在CDATA块,注释以及脚本和样式元素的情况下,此正则表达式仍会崩溃。好消息是,您可以摆脱使用正则表达式的那些人...

A5:

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...

回答6:

有些人会告诉你地球是圆形的(或者,如果他们想使用奇怪的单词,也许地球是扁圆球体)。他们在撒谎。

有人会告诉你正则表达式不应该是递归的。他们限制了你。他们需要征服您,并且通过让您无知来做到这一点。

您可以生活在现实中,也可以服用红色药丸。

像元帅勋爵(他是元帅.NET班的亲戚吗?),我见过 逆反 基于堆栈的正则表达式,并返回 权力 您无法想象的知识。是的,我认为有一个或两个老人在保护他们,但是他们正在电视上看足球,所以这并不困难。

我认为XML情况很简单。 regEx(使用.NET语法)在base64中进行了压缩和编码,以使您更容易理解,这应该是这样的:

7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=

要设置的选项是RegexOptions.ExplicitCapture。您要查找的捕获组是ELEMENTNAME。如果捕获组ERROR不为空,则说明存在解析错误,并且正则表达式已停止。

如果您在将其转换为可读的正则表达式时遇到问题,这应该会有所帮助:

static string FromBase64(string str)
{
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream()) {
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
            ds.CopyTo(msOut);
        }

        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}

如果您不确定,不,我不是在开玩笑(但也许我在撒谎)。它将起作用。我已经建立了大量的单元测试来对其进行测试,甚至使用了一致性测试(的一部分) 。这是一个标记器,而不是成熟的解析器,因此它将仅将XML拆分为其组件标记。它不会解析/集成DTD。

哦...如果您想要正则表达式的源代码,请使用一些辅助方法:

regex以标记化xml 完整的正则表达式

A6:

There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.

There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse Stack Based Regex-Verse and returned with powers knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:

7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=

The options to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.

If you have problems reconverting it to a human-readable regex, this should help:

static string FromBase64(string str)
{
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream()) {
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
            ds.CopyTo(msOut);
        }

        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

regex to tokenize an xml or the full plain regex

回答7:

在外壳程序中,您可以使用 HTML .wikipedia.org/wiki/Sed" rel="noreferrer"> sed :

  1. Turing.sed
  2. 编写HTML解析器(作业)
  3. ???
  4. 利润!

相关(为什么不应该使用正则表达式匹配):

A7:

In shell, you can parse HTML using sed:

  1. Turing.sed
  2. Write HTML parser (homework)
  3. ???
  4. Profit!

Related (why you shouldn't use regex match):

回答8:

我同意解析XML和尤其是HTML 的正确工具是解析器,而不是正则表达式引擎。但是,就像其他人指出的那样,有时使用正则表达式会更快,更容易,并且只要您知道数据格式,就可以完成工作。

Microsoft实际上有一小节 .NET Framework中正则表达式的最佳做法并专门讨论考虑输入源

正则表达式确实有局限性,但是您是否考虑了以下内容?

.NET框架在正则表达式方面是唯一的,因为它支持平衡组定义

由于这个原因,我相信您可以使用正则表达式解析XML。但是请注意,它必须是有效的XML (浏览器非常宽容HTML,并且允许HTML内使用错误的XML语法)。这是可能的,因为"平衡组定义"将允许正则表达式引擎充当PDA。

上面引用的第1条语录:

.NET正则表达式引擎

如上所述,不能通过正则表达式描述适当平衡的构造。但是,.NET正则表达式引擎提供了一些允许识别平衡构造的构造。

  • (? ) -将具有名称组的捕获结果推送到捕获堆栈中。
  • (?<-group>)-从捕获堆栈中弹出名称组最高的捕获。
  • (?(group)yes|no)-如果存在名称为group的组,则匹配yes,否则不匹配。

这些构造允许.NET正则表达式通过本质上允许堆栈操作的简单版本(即push,pop和empty)来模拟受限的PDA。简单的操作分别相当于递增,递减和与零比较。这使.NET正则表达式引擎可以识别上下文无关语言的子集,尤其是那些只需要简单计数器的语言。反过来,这使非传统的.NET正则表达式可以识别各个适当平衡的构造。

考虑以下正则表达式:

(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
   <!-- .*? -->                  |
   <[^>]*/>                      |
   (?<opentag><(?!/)[^>]*[^/]>)  |
   (?<-opentag></[^>]*[^/]>)     |
   [^<>]*
)*
(?(opentag)(?!))

使用标志:

  • 单行
  • IgnorePatternWhitespace(如果您折叠正则表达式并删除所有空白,则不必这样做
  • IgnoreCase(不必要)

正则表达式解释(内联)

(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?>                                        # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |          # match xml / html comment
   <[^>]*/>                     |          # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |          # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |          # pop closing xml tag
   [^<>]*                                  # something between tags
)*                                         # match as many xml tags as possible
(?(opentag)(?!))                           # ensure no 'opentag' groups are on stack

您可以在 A上进行尝试更好的.NET正则表达式测试器

我使用了以下示例源:

<html>
<body>
<div>
   <br />
   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>
</div>
</body>
</html>

这找到了匹配项:

   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>

尽管它实际上是这样显示的:

<ul id="matchMe" type="square">           <li>stuff...</li>           <li>more stuff</li>           <li>               <div>                    <span>still more</span>                    <ul>                         <li>Another &gt;ul&lt;, oh my!</li>                         <li>...</li>                    </ul>               </div>           </li>        </ul>

最后,我非常喜欢Jeff Atwood的文章:解析Cthulhu方式HTML 。有趣的是,它引用了这个问题的答案,目前该问题的投票已超过4k。

A8:

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.

  • (?<group>) - pushes the captured result on the capture stack with the name group.
  • (?<-group>) - pops the top most capture with the name group off the capture stack.
  • (?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.

Consider the following regular expression:

(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
   <!-- .*? -->                  |
   <[^>]*/>                      |
   (?<opentag><(?!/)[^>]*[^/]>)  |
   (?<-opentag></[^>]*[^/]>)     |
   [^<>]*
)*
(?(opentag)(?!))

Use the flags:

  • Singleline
  • IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace)
  • IgnoreCase (not necessary)

Regular Expression Explained (inline)

(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?>                                        # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |          # match xml / html comment
   <[^>]*/>                     |          # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |          # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |          # pop closing xml tag
   [^<>]*                                  # something between tags
)*                                         # match as many xml tags as possible
(?(opentag)(?!))                           # ensure no 'opentag' groups are on stack

You can try this at A Better .NET Regular Expression Tester.

I used the sample source of:

<html>
<body>
<div>
   <br />
   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>
</div>
</body>
</html>

This found the match:

   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>

although it actually came out like this:

<ul id="matchMe" type="square">           <li>stuff...</li>           <li>more stuff</li>           <li>               <div>                    <span>still more</span>                    <ul>                         <li>Another &gt;ul&lt;, oh my!</li>                         <li>...</li>                    </ul>               </div>           </li>        </ul>

Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way. Funny enough, it cites the answer to this question that currently has over 4k votes.

回答9:

我建议使用 QueryPath 来解析PHP中的XML和HTML。它的语法基本上与jQuery相同,只是在服务器端。

A9:

I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.

回答10:

虽然您无法使用正则表达式解析HTML的答案是正确的,但它们不适用于此处。 OP只想用正则表达式解析一个HTML标记,而这可以通过正则表达式来完成。

建议的正则表达式是错误的,

<([a-z]+) *[^/]*?>

如果您向正则表达式添加内容,则通过回溯可以强制它匹配>[^/]等愚蠢的东西。还要注意, *[^/]* 是多余的,因为[^/]*也可以匹配空格。

我的建议是

<([a-z]+)[^>]*(?<!/)>

在Perl正则表达式中,(? 是负向后看。它显示为"一个<,然后是一个单词,然后是不是>的任何内容,最后一个可能不是/,然后是>"。

请注意,这允许使用之类的东西(就像原始的正则表达式一样),因此,如果您想要更严格的限制,则需要构建一个正则表达式以匹配用空格分隔的属性对。

A10:

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, by backtracking it can be forced to match silly things like <a >>, [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.

回答11:

尝试:

<([^\s]+)(\s[^>]*?)?(?<!/)>

它与您的相似,但是最后一个> 不能在斜杠后,并且也可以接受h1

A11:

Try:

<([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not be after a slash, and also accepts h1.

回答12:

中国古代的战略家,一般哲学家孙子说:

据说,如果您认识自己的敌人并认识自己,那么您就可以赢得一百场战斗而不会遭受任何损失。如果您只了解自己而不是对手,那么您可能会赢或输。如果您既不认识自己,也不认识敌人,那么您将永远危害自己。

在这种情况下,您的敌人是HTML,而您本身就是正则表达式。您甚至可能是使用不规则正则表达式的Perl。懂HTML。了解你自己。

我写了一个描述HTML本质的句。

HTML has
complexity exceeding
regular language.

我还撰写了一个hai句,描述了Perl中正则表达式的性质。

The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>

A12:

Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:

It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.

I have composed a haiku describing the nature of HTML.

HTML has
complexity exceeding
regular language.

I have also composed a haiku describing the nature of regex in Perl.

The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>

回答13:

<?php
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

$html = '
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}

输出:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

基本上,只需定义自动关闭的元素节点名称,将整个html字符串加载到DOM库中,抓取所有元素,遍历并过滤掉那些非自动关闭的元素并对其进行操作。

我确定您现在已经知道不应为此使用正则表达式。

A13:

<?php
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

$html = '
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them.

I'm sure you already know by now that you shouldn't use regex for this purpose.

回答14:

我不知道您对此有确切的需求,但是如果您还使用.NET,则不能使用 Html Agility Pack

节选:

这是一个.NET代码库,可让您解析"网络外" HTML文件。解析器对格式错误的"真实世界" HTML非常宽容。

A14:

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

回答15:

您希望第一个> 不以/开头。在此处中查找有关操作方法的详细信息。称为负向后看。

但是,在此示例文档中,纯真的实现最终会匹配

<foo><bar/></foo>

您能否提供更多有关您要解决的问题的信息?您是否正在以编程方式遍历标签?

A15:

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naïve implementation of that will end up matching <bar/></foo> in this example document

<foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programatically?

回答16:

W3C以伪正则表达式形式解释了解析:
W3C链接

按照QNameSAttribute的var链接获得更清晰的图片。
在此基础上,您可以创建一个很好的正则表达式,可以处理诸如剥离标签之类的事情。

A16:

The W3C explains parsing in a pseudo regexp form:
W3C Link

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

回答17:

如果您需要PHP:

PHP DOM 功能除非格式正确,否则无法正常工作。不管它们对全人类的使用有多好。

simplehtmldom 很好,但是我发现它有点毛病,而且占用了大量内存[Will在大页面上崩溃。]

我从未使用过查询路径,因此无法评论其有用性。

另一个可以尝试的方法是我的 DOMParser ,它对资源的需求很少,我一直很开心地使用它一阵子。简单易学,功能强大。

对于Python和Java,发布了类似的链接。

对于低俗人士-我仅在XML解析器证明无法承受实际使用时编写类。宗教否决只会阻止发布有用的答案-请在问题的角度范围内。

A17:

If you need this for PHP:

The PHP DOM functions won't work properly unless it is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is is quite memory heavy [Will crash on large pages.]

I have never used querypath, so can't comment on its usefulness.

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

回答18:

这是解决方案:

<?php
// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';

// a string to parse:
$string = 'Hello, try clicking <a href="#paragraph">here</a>
    <br/>and check out.<hr />
    <h2>title</h2>
    <a name ="paragraph" rel= "I\'m an anchor"></a>
    Fine, <span title=\'highlight the "punch"\'>thanks<span>.
    <div class = "clear"></div>
    <br>';

// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);

// print the result:
print_r($matches[0]);
?>

要进行深入测试,我输入了如下字符串自动关闭标签:




我还输入了以下标签:

  1. 一个属性
  2. 多个属性
  3. 将哪个值绑定到单引号双引号
  4. 当分隔符为双引号时,
  5. 包含单引号的属性,反之亦然
  6. " unpretty"属性在" ="符号之前,之后,前后都有一个空格。

如果您在上面的概念验证中发现不起作用的地方,可以通过分析代码来提高自己的技能。

我忘记了用户的问题是避免解析自动关闭标签。在这种情况下,模式更简单,变成这样:

$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';

用户@ridgerunner注意到该模式不允许未加引号的属性没有值的属性。在这种情况下,微调为我们带来了以下模式:

$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';

了解模式

如果有人有兴趣了解有关该模式的更多信息,请提供以下内容:

  1. 第一个子表达式(\ w +)与标记名称匹配
  2. 第二个子表达式包含属性的模式。由以下人员组成:
    1. 一个或多个空格\ s +
    2. 属性名称(\ w +)
    3. 零个或多个空格\ s *(是否可能,此处留空)
    4. " ="符号
    5. 再次
    6. 零个或多个空格
    7. 属性值的分隔符,单引号或双引号('|")。在模式中,单引号被转义,因为它与PHP字符串分隔符重合。此子表达式用括号捕获,因此可以再次引用以解析属性的关闭,这就是为什么它非常重要的原因。
    8. 该属性的值,与几乎匹配:在这种特定语法中,RegExp引擎使用贪婪匹配(星号后的问号)启用了类似于"先行查找"的运算符,该运算符除了该子表达式之后的内容之外,都匹配
    9. 有趣之处在于:\ 4部分是 backreference运算符,它引用模式中之前定义的子表达式,在这种情况下,我指的是第四子表达式,这是找到的第一个属性定界符
    10. 零个或多个空格\ s *
    11. 属性子表达式到此结束,并指定由星号指定的零个或多个可能出现的事件。
  3. 然后,由于标签可能以">"符号前的空格结尾,因此零个或多个空格与\ s *子模式匹配。
  4. 要匹配的标签可能以简单的">"符号或可能的XHTML闭包结尾,该闭包使用其前的斜杠((||>))。由于斜杠与正则表达式定界符重合,因此当然可以逃脱。

小技巧:由于我没有提供任何转义的HTML特殊字符,因此为了更好地分析此代码,有必要查看生成的源代码。

A18:

Here's the solution:

<?php
// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';

// a string to parse:
$string = 'Hello, try clicking <a href="#paragraph">here</a>
    <br/>and check out.<hr />
    <h2>title</h2>
    <a name ="paragraph" rel= "I\'m an anchor"></a>
    Fine, <span title=\'highlight the "punch"\'>thanks<span>.
    <div class = "clear"></div>
    <br>';

// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);

// print the result:
print_r($matches[0]);
?>

To test it deeply, I entered in the string auto-closing tags like:

  1. <hr />
  2. <br/>
  3. <br>

I also entered tags with:

  1. one attribute
  2. more than one attribute
  3. attributes which value is bound either into single quotes or into double quotes
  4. attributes containing single quotes when the delimiter is a double quote and vice versa
  5. "unpretty" attributes with a space before the "=" symbol, after it and both before and after it.

Should you find something which does not work in the proof of concept above, I am available in analyzing the code to improve my skills.

<EDIT> I forgot that the question from the user was to avoid the parsing of self-closing tags. In this case the pattern is simpler, turning into this:

$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';

The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:

$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';

</EDIT>

Understanding the pattern

If someone is interested in learning more about the pattern, I provide some line:

  1. the first sub-expression (\w+) matches the tag name
  2. the second sub-expression contains the pattern of an attribute. It is composed by:
    1. one or more whitespaces \s+
    2. the name of the attribute (\w+)
    3. zero or more whitespaces \s* (it is possible or not, leaving blanks here)
    4. the "=" symbol
    5. again, zero or more whitespaces
    6. the delimiter of the attribute value, a single or double quote ('|"). In the pattern, the single quote is escaped because it coincides with the PHP string delimiter. This sub-expression is captured with the parentheses so it can be referenced again to parse the closure of the attribute, that's why it is very important.
    7. the value of the attribute, matched by almost anything: (.*?); in this specific syntax, using the greedy match (the question mark after the asterisk) the RegExp engine enables a "look-ahead"-like operator, which matches anything but what follows this sub-expression
    8. here comes the fun: the \4 part is a backreference operator, which refers to a sub-expression defined before in the pattern, in this case, I am referring to the fourth sub-expression, which is the first attribute delimiter found
    9. zero or more whitespaces \s*
    10. the attribute sub-expression ends here, with the specification of zero or more possible occurrences, given by the asterisk.
  3. Then, since a tag may end with a whitespace before the ">" symbol, zero or more whitespaces are matched with the \s* subpattern.
  4. The tag to match may end with a simple ">" symbol, or a possible XHTML closure, which makes use of the slash before it: (/>|>). The slash is, of course, escaped since it coincides with the regular expression delimiter.

Small tip: to better analyze this code it is necessary looking at the source code generated since I did not provide any HTML special characters escaping.

回答19:

每当我需要快速从HTML文档中提取内容时,我都会使用Tidy将其转换为XML,然后使用XPath或XSLT来获取所需的内容。您的情况是这样的:

//p/a[@href='foo']

A19:

Whenever I need to quickly extract something from an HTML document, I use Tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this:

//p/a[@href='foo']

回答20:

我以前使用过一个名为 HTMLParser 的开源工具。它旨在以各种方式解析HTML,并且很好地达到了目的。它可以将HTML解析为不同的treenode,并且您可以轻松地使用其API从节点中获取属性。检查一下,看看是否可以帮到您。

A20:

I used a open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML as different treenode and you can easily use its API to get attributes out of the node. Check it out and see if this can help you.

回答21:

我喜欢用正则表达式解析HTML。我不会尝试解析故意损坏的白痴HTML。这段代码是我的主要解析器(Perl版):

$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print

它称为htmlsplit,将HTML分成几行,每行上有一个标签或文本块。然后可以使用其他文本工具和脚本进一步处理这些行,例如 grep sed ,Perl等。我什至没有在开玩笑:)享受。

如果您希望处理巨大的网页,那么很简单就可以将我的Slurp-Perthing-first-Perl脚本重新绑定到不错的流式处理中。但这不是真的必要。

我敢打赌我会为此而感到沮丧。

HTML拆分


与我的期望相反,它获得了一些好评,所以我将建议一些更好的正则表达式:

/(<.*?>|[^<]+)\s*/g    # get tags and text
/(\w+)="(.*?)"/g       # get attibutes

它们对于XML / XHTML很有用。

有一些细微的变化,它可以处理杂乱的HTML ...或先转换HTML-> XHTML。


编写正则表达式的最好方法是 Lex / Yacc 样式,而不是不透明的单线或注释过的多行怪物。我还没有在这里做;这些人几乎不需要它。

A21:

I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):

$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print

It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.

I bet I will get downvoted for this.

HTML Split


Against my expectation this got some upvotes, so I'll suggest some better regular expressions:

/(<.*?>|[^<]+)\s*/g    # get tags and text
/(\w+)="(.*?)"/g       # get attibutes

They are good for XML / XHTML.

With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.


The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.

回答22:

这是一个基于PHP的解析器,它使用一些不合规则的正则表达式来解析HTML。作为该项目的作者,我可以告诉您可以使用正则表达式解析HTML,但是效率不高。如果您需要服务器端解决方案(如我对 wp-Typography WordPress插件所做的),这有效。

A22:

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.

回答23:

有一些不错的正则表达式,可以在BBCode 此处替换HTML。对于所有反对者,请注意,他并没有完全解析HTML,只是为了清理HTML。他可能有能力杀死他简单的"解析器"无法理解的标签。

例如:

$store =~ s/http:/http:\/\//gi;
$store =~ s/https:/https:\/\//gi;
$baseurl = $store;

if (!$query->param("ascii")) {
    $html =~ s/\s\s+/\n/gi;
    $html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
}

$html =~ s/\n//gi;
$html =~ s/\r\r//gi;
$html =~ s/$baseurl//gi;
$html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
$html =~ s/<p>/\n\n/gi;
$html =~ s/<br(.*?)>/\n/gi;
$html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
$html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
$html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
$html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
$html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
$html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
$html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
$html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<link(.*?)>//gi;
$html =~ s/<li(.*?)>(.*?)<\/li>/\[\*]$2/gi;
$html =~ s/<ul(.*?)>/\[list]/gi;
$html =~ s/<\/ul>/\[\/list]/gi;
$html =~ s/<div>/\n/gi;
$html =~ s/<\/div>/\n/gi;
$html =~ s/<td(.*?)>/ /gi;
$html =~ s/<tr(.*?)>/\n/gi;

$html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
$html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
$html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
$html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;

$html =~ s/<head>(.*?)<\/head>//sgmi;
$html =~ s/<object>(.*?)<\/object>//sgmi;
$html =~ s/<script(.*?)>(.*?)<\/script>//sgmi;
$html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
$html =~ s/<title>(.*?)<\/title>//sgmi;
$html =~ s/<!--(.*?)-->/\n/sgmi;

$html =~ s/\/\//\//gi;
$html =~ s/http:\//http:\/\//gi;
$html =~ s/https:\//https:\/\//gi;

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
$html =~ s/\r\r//gi;
$html =~ s/\[img]\//\[img]/gi;
$html =~ s/\[url=\//\[url=/gi;

A23:

There are some nice regexes for replacing HTML with BBCode here. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

For example:

$store =~ s/http:/http:\/\//gi;
$store =~ s/https:/https:\/\//gi;
$baseurl = $store;

if (!$query->param("ascii")) {
    $html =~ s/\s\s+/\n/gi;
    $html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
}

$html =~ s/\n//gi;
$html =~ s/\r\r//gi;
$html =~ s/$baseurl//gi;
$html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
$html =~ s/<p>/\n\n/gi;
$html =~ s/<br(.*?)>/\n/gi;
$html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
$html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
$html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
$html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
$html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
$html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
$html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
$html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<link(.*?)>//gi;
$html =~ s/<li(.*?)>(.*?)<\/li>/\[\*]$2/gi;
$html =~ s/<ul(.*?)>/\[list]/gi;
$html =~ s/<\/ul>/\[\/list]/gi;
$html =~ s/<div>/\n/gi;
$html =~ s/<\/div>/\n/gi;
$html =~ s/<td(.*?)>/ /gi;
$html =~ s/<tr(.*?)>/\n/gi;

$html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
$html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
$html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
$html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;

$html =~ s/<head>(.*?)<\/head>//sgmi;
$html =~ s/<object>(.*?)<\/object>//sgmi;
$html =~ s/<script(.*?)>(.*?)<\/script>//sgmi;
$html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
$html =~ s/<title>(.*?)<\/title>//sgmi;
$html =~ s/<!--(.*?)-->/\n/sgmi;

$html =~ s/\/\//\//gi;
$html =~ s/http:\//http:\/\//gi;
$html =~ s/https:\//https:\/\//gi;

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
$html =~ s/\r\r//gi;
$html =~ s/\[img]\//\[img]/gi;
$html =~ s/\[url=\//\[url=/gi;

回答24:

关于用于解析(x)HTML的RegExp方法的问题,回答所有有关某些限制的人的答案是:自以来,您尚未受过足够的训练来统治这种强大武器的力量NOBODY 在这里谈到了递归

与RegExp无关的一位同事通知了我这个讨论,这肯定不是网络上有关这个古老而又热门话题的第一个讨论。

在阅读了一些帖子之后,我要做的第一件事是在该线程中寻找"?R"字符串。第二个是搜索"递归"。
不,圣牛,找不到匹配项。
由于没有人提到解析器所基于的主要机制,所以我很快意识到没人明白这一点。

如果(x)HTML解析器需要递归,则没有递归的RegExp解析器不足以实现此目的。这是一个简单的构造。

RegExp的黑人技巧很难掌握,因此,在尝试测试我们的个人解决方案以一只手捕获整个Web时,也许还有其他可能性……对此很确定:)

这是魔术图案:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";

只需尝试一下。
它是用PHP字符串编写的,因此" s"修饰符使类包含换行符。
这是我在一月份写的 PHP手册样本示例。 :参考

(请注意,在该注释中,我错误地使用了" m"修饰符;尽管RegExp引擎已将其丢弃,但由于没有使用^或$定位符,因此应将其删除)。

现在,我们可以从更明智的角度讨论这种方法的局限性:

  1. 根据RegExp引擎的特定实现,递归可能会限制已解析的嵌套模式的数量,但这取决于所使用的语言
  2. 尽管(x)损坏的HTML不会导致严重错误,但尚未对其进行消毒。

无论如何,这只是一个RegExp模式,但它揭示了开发许多强大实现的可能性。
我编写了此模式,以为模板引擎的递归下降解析器提供动力。内置在我的框架中,无论是在执行时间还是在内存使用方面,性能都非常出色(与使用相同语法的其他模板引擎无关)。

A24:

About the question of the RegExp methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague notified me this discussion, which is not certainly the first on the web about this old and hot topic.

After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";

Just try it.
It's written as a PHP string, so the "s" modifier makes classes include newlines.
Here's a sample note on the PHP manual I wrote on January: Reference

(Take care, in that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the RegExp engine, since no ^ or $ anchorage was used).

Now, we could speak about the limits of this method from a more informed point of view:

  1. according to the specific implementation of the RegExp engine, recursion may have a limit in the number of nested patterns parsed, but it depends on the language used
  2. although corrupted (x)HTML does not drive into severe errors, it is not sanitized.

Anyhow it is only a RegExp pattern, but it discloses the possibility to develop of a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performances are really great, both in execution times or in memory usage (nothing to do with other template engines which use the same syntax).

回答25:

正如许多人已经指出的那样,HTML不是一种常规语言,因此很难解析。我的解决方案是使用整洁的程序将其转换为常规语言,然后使用XML解析器使用结果。为此有很多不错的选择。我的程序是使用Java和 jtidy 库编写的,将HTML转换为XML,然后通过Jaxen将xpath转换为结果

A25:

As many people have already pointed out, HTML is not a regular language which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.

回答26:

<\s*(\w+)[^/>]*>

解释的部分:

<:起始字符

\s*:标签名称前可能有空格(丑陋,但可能)。

(\w+):标签可以包含字母和数字(h1)。好吧,\w也匹配'_',但是我猜并没有伤害。如果好奇,请改用([a-zA-Z0-9] +)。

[^/>]*:除> /之外的所有内容,直到关闭>

> :关闭>

不相关

对于那些低估正则表达式的人来说,它们仅与正则语言一样强大:

a n ba n ba n 不是常规的,甚至不是上下文无关的,可以与^(a+)b \ 1b \ 1 $

反引用 FTW

A26:

<\s*(\w+)[^/>]*>

The parts explained:

<: starting character

\s*: it may have whitespaces before tag name (ugly but possible).

(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious use ([a-zA-Z0-9]+) instead.

[^/>]*: anything except > and / until closing >

>: closing >

UNRELATED

And to fellows who underestimate regular expressions saying they are only as powerful as regular languages:

anbanban which is not regular and not even context free, can be matched with ^(a+)b\1b\1$

Backreferencing FTW!

回答27:

如果您只是想查找这些标签(没有解析的野心),请尝试以下正则表达式:

/<[^/]*?>/g

我在30秒内写下了它,并在此处进行了测试: http://gskinner.com/RegExr/

它与您提到的标签类型匹配,而忽略了您想忽略的标签类型。

A27:

If you're simply trying to find those tags (without ambitions of parsing) try this regular expression:

/<[^/]*?>/g

I wrote it in 30 seconds, and tested here: http://gskinner.com/RegExr/

It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.

回答28:

在我看来,您正在尝试匹配末尾没有" /"的标记。试试这个:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>

A28:

It seems to me you're trying to match tags without a "/" at the end. Try this:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>

回答29:

确实,在编程时,处理HTML时通常最好使用专用的解析器和API而不是正则表达式,尤其是在精度至关重要的情况下(例如,如果您的处理可能会带来安全隐患)。但是,我不认为教条认为XML样式的标记绝对不能使用正则表达式来处理。在某些情况下,正则表达式是完成这项工作的理想工具,例如在文本编辑器中进行一次性编辑,修复损坏的XML文件或处理看起来像但不完全是XML的文件格式时。有一些问题需要注意,但并不是不可克服的,甚至不一定是相关的。

在以下情况下,像<([[^>"']|"[^"]*"|'[^']*')*>这样的简单正则表达式通常就足够了我刚才提到的那些。考虑到所有因素,这是一个幼稚的解决方案,但是它确实允许属性值中使用未编码的> 符号。如果要查找table标记,可以将其改编为 "']|"[[^"]*"|'[^'] *')*>

仅是为了了解更高级的HTML正则表达式是什么,下面的代码在模拟真实浏览器行为和HTML5解析算法方面做得相当不错:

</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

以下内容符合XML标记的相当严格的定义(尽管它不能解决XML名称中允许的完整Unicode字符集):

<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>

当然,这些并不能说明周围的环境和一些极端情况,但是即使您确实愿意,也可以处理此类事情(例如,通过在另一个正则表达式的匹配项之间进行搜索)。

一天结束时,即使工作正好是正则表达式,也要使用最合适的工具。

A29:

It's true that when programming it's usually best to use dedicated parsers and APIs instead of regular expressions when dealing with HTML, especially if accuracy is paramount (e.g., if your processing might have security implications). However, I don’t ascribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, fixing broken XML files, or dealing with file formats that look like but aren’t quite XML. There are some issues to be aware of, but they're not insurmountable or even necessarily relevant.

A simple regex like <([^>"']|"[^"]*"|'[^']*')*> is usually good enough, in cases such as those I just mentioned. It's a naive solution, all things considered, but it does correctly allow unencoded > symbols in attribute values. If you're looking for, e.g., a table tag, you could adapt it as </?table\b([^>"']|"[^"]*"|'[^']*')*>.

Just to give a sense of what a more "advanced" HTML regex would look like, the following does a fairly respectable job of emulating real-world browser behavior and the HTML5 parsing algorithm:

</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

The following matches a fairly strict definition of XML tags (although it doesn't account for the full set of Unicode characters allowed in XML names):

<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>

Granted, these don't account for surrounding context and a few edge cases, but even such things could be dealt with if you really wanted to (e.g., by searching between the matches of another regex).

At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.

回答30:

尽管为此目的使用正则表达式不适当且无效,但有时正则表达式可以为简单的匹配问题提供快速解决方案,并且在我看来,对于琐碎的工作使用正则表达式并不那么麻烦。

权威博客文章关于匹配最里面的HTML元素史蒂文·莱维森(Steven Levithan)。

A30:

Although it's not suitable and effective to use regular expressions for that purpose sometimes regular expressions provide quick solutions for simple match problems and in my view it's not that horrbile to use regular expressions for trivial works.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.

回到顶部