Notes on code snippets using BeautifulSoup

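I'm sure I wrote these once before, but apparently I lost the code, so I rewrote it. If I keep notes on Hatena, maybe I won't lose it this time? (No, just take proper backups, me.)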

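What is BeautifulSoup?

An HTML and XML parser. It is the kind that builds a tree-shaped data structure in memory. It is not event-driven like xml.sax in the standard library. The closest standard-library counterparts to BeautifulSoup are xml.etree and xml.dom.

The project site is http://www.crummy.com/software/BeautifulSoup/ .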
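
To make the tree-building versus event-driven distinction concrete, here is a minimal sketch using only the standard library. It is written for Python 3 (an assumption; the snippets in this post are Python 2), and the names SOURCE, texts_from_tree, TextCollector, and texts_from_events are made up for illustration.

```python
import xml.sax
from xml.etree import ElementTree

SOURCE = "<root><p>foo</p><p>bar</p></root>"


def texts_from_tree(source):
    """Tree style: parse the whole input first, then walk the finished tree."""
    root = ElementTree.fromstring(source)
    return [p.text for p in root.iter("p")]


class TextCollector(xml.sax.ContentHandler):
    """Event style: the parser calls back as each piece of input is seen."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def characters(self, content):
        # May be called more than once for a single run of text.
        self.texts.append(content)


def texts_from_events(source):
    handler = TextCollector()
    xml.sax.parseString(source.encode("utf-8"), handler)
    return handler.texts
```

With the tree style you can go back and look at any node after parsing; with the event style you only see each piece once, in document order, and have to keep your own state.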

Text extraction

This wraps an example from the BeautifulSoup documentation in a function.

def collect_string(tag):
    return u''.join(tag.findAll(text=True))

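When you want to do something more elaborate, use the find* methods or the *Generator methods to pull out the contents one at a time and process them as you go. recursiveChildGenerator is one example. Implementing something similar with it looks like this.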

def collect_string(tag):
    for next_tag in tag.recursiveChildGenerator():
        if isinstance(next_tag, unicode):
            yield next_tag

Usage example.

>>> from BeautifulSoup import BeautifulSoup
>>> source = u'<html><body><p>foo</p><p>bar<br />baz</p></body></html>'
>>> soup = BeautifulSoup(source)
>>> print u''.join(collect_string(soup.body))
foobarbaz

Use this as a base and refine it as needed. For instance, to turn br tags into newlines:

def collect_string(tag):
    for next_tag in tag.recursiveChildGenerator():
        if isinstance(next_tag, unicode):
            yield next_tag
        elif getattr(next_tag, 'name', None) == u'br':
            yield u'\n'

>>> from BeautifulSoup import BeautifulSoup
>>> source = u'<html><body><p>foo</p><p>bar<br />baz</p></body></html>'
>>> soup = BeautifulSoup(source)
>>> print u''.join(collect_string(soup.body))
foobar
baz

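Replacing recursiveChildGenerator with nextGenerator and raising the level of abstraction would probably get you to an event-driven parser, but over-engineering it is likely to be wasted effort. If you need a real event-driven parser, sgmllib (removed in Python 3), HTMLParser (html.parser in Python 3), xml.sax (XML only), or xml.parsers.expat (XML only) will do. After all, BeautifulSoup itself is built on top of these*1.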
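
For comparison, the br-to-newline text extraction can be sketched directly on an event-driven parser. This is only a rough illustration, assuming Python 3 and its html.parser module; TextExtractor and extract_text are hypothetical names, not part of BeautifulSoup or the standard library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text as the parser reports it, turning br tags into newlines."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing "<br />", because the default
        # handle_startendtag delegates to handle_starttag.
        if tag == "br":
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(source):
    parser = TextExtractor()
    parser.feed(source)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.chunks)
```

Calling extract_text on the same source as in the usage example above should give the same two lines of text, without ever building a tree.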

Parsing tables

First, a version that ignores rowspan and colspan.

def parse_table(tbody):
    u"""Split a table into rows and columns.

    rowspan and colspan are ignored."""

    for tr in tbody.findAll('tr', recursive=False):
        yield iter(tr.findAll(['td', 'th'], recursive=False))

The argument is a tag that has tr tags as direct children, such as tbody, thead, or tfoot. For a table that omits tbody, pass the table tag itself.

>>> from BeautifulSoup import BeautifulSoup
>>> source = u"""
... <html><body>
... <table>
...   <thead>
...     <tr><td>a</td><td>b</td><td>c</td></tr>
...   </thead>
...   <tbody>
...     <tr><td>d</td><td>e</td><td>f</td></tr>
...     <tr><td colspan="2">g</td><td rowspan="2">h</td></tr>
...     <tr><td>i</td><td>j</td></tr>
...   </tbody>
... </table>
... </body></html>
... """
>>> soup = BeautifulSoup(source)
>>> for cols in parse_table(soup.tbody):
...   for col in cols:
...     print col,
...   print
...
<td>d</td> <td>e</td> <td>f</td>
<td colspan="2">g</td> <td rowspan="2">h</td>
<td>i</td> <td>j</td>

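parse_table does not deal with the small differences between tables, such as multiple tbody elements, an omitted tbody, or wanting to read thead and tfoot as well. The policy is to leave that to the caller. For example, to read thead too: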

>>> from itertools import chain, imap
>>> for cols in chain.from_iterable(
...     imap(parse_table, (soup.thead, soup.tbody))):
...   for col in cols:
...     print col,
...   print
...
<td>a</td> <td>b</td> <td>c</td>
<td>d</td> <td>e</td> <td>f</td>
<td colspan="2">g</td> <td rowspan="2">h</td>
<td>i</td> <td>j</td>

Here is a parse_table that takes rowspan and colspan into account.

def parse_table(tbody):
    u"""Split a table into rows and columns."""

    # First, count the columns using the first row
    table_cols = 0
    for td in tbody.find('tr').findAll(['td', 'th'], recursive=False):
        table_cols += int(td.get('colspan', 1))

    # Per column, keep [remaining repeat count, cell to repeat]
    old = [[0, None] for i in xrange(table_cols)]

    # Walk the rows
    for tr in tbody.findAll('tr', recursive=False):
        # Walk the columns
        cols = []
        col = 0
        tds = tr.findAll(['td', 'th'], recursive=False)
        while col < table_cols:
            if old[col][0]:
                td = old[col][1]
                old[col][0] -= 1
            else:
                td = tds.pop(0)
                rowspan = int(td.get('rowspan', 1))
                rowspan_m1 = rowspan - 1
                if rowspan_m1:
                    old[col] = [rowspan_m1, td]

            colspan = int(td.get('colspan', 1))
            for i in xrange(colspan):
                cols.append(td)
                col += 1
        yield iter(cols)

>>> from BeautifulSoup import BeautifulSoup
>>> source = u"""
... <html><body>
... <table>
...   <thead>
...     <tr><td>a</td><td>b</td><td>c</td></tr>
...   </thead>
...   <tbody>
...     <tr><td>d</td><td>e</td><td>f</td></tr>
...     <tr><td colspan="2">g</td><td rowspan="2">h</td></tr>
...     <tr><td>i</td><td>j</td></tr>
...   </tbody>
... </table>
... </body></html>
... """
>>> soup = BeautifulSoup(source)
>>> for cols in parse_table(soup.tbody):
...   for col in cols:
...     print col,
...   print
...
<td>d</td> <td>e</td> <td>f</td>
<td colspan="2">g</td> <td colspan="2">g</td> <td rowspan="2">h</td>
<td>i</td> <td>j</td> <td rowspan="2">h</td>

Combined with the text extraction above, it looks like this.

>>> for cols in parse_table(soup.tbody):
...   for col in cols:
...     print u''.join(collect_string(col)),
...   print
...
d e f
g g h
i j h
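
The span bookkeeping in parse_table can also be tried without BeautifulSoup. The following is a rough sketch of the same idea on plain tuples, assuming Python 3; expand_spans, ROWS, and the (label, rowspan, colspan) tuple layout are hypothetical stand-ins for the td/th tags.

```python
def expand_spans(rows):
    """Expand rowspan/colspan so every row has one entry per grid column.

    rows is a list of rows; each row is a list of
    (label, rowspan, colspan) tuples.
    """
    # The column count comes from the colspans of the first row.
    width = sum(colspan for _, _, colspan in rows[0])
    # Per column: [rows still to fill, cell to repeat]
    pending = [[0, None] for _ in range(width)]

    grid = []
    for row in rows:
        cells = list(row)
        out = []
        col = 0
        while col < width:
            if pending[col][0]:
                # A cell from an earlier row still covers this column.
                cell = pending[col][1]
                pending[col][0] -= 1
            else:
                cell = cells.pop(0)
                if cell[1] > 1:
                    pending[col] = [cell[1] - 1, cell]
            # Repeat the cell once per column it spans.
            out.extend([cell[0]] * cell[2])
            col += cell[2]
        grid.append(out)
    return grid


# Same shape as the tbody in the example above.
ROWS = [
    [("d", 1, 1), ("e", 1, 1), ("f", 1, 1)],
    [("g", 1, 2), ("h", 2, 1)],
    [("i", 1, 1), ("j", 1, 1)],
]
```

Like parse_table, this sketch trusts the first row for the column count and raises IndexError if a later row has too few cells.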

*1: Up to BeautifulSoup 3.0.8 it is built on sgmllib; the 3.1 series is built on HTMLParser.

銀月の符号

Diary and miscellaneous notes of an apprentice Python user
