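Using BeautifulSoup on Google App Engine

Overview

I want to parse and manipulate HTML on GAE, so I install a module called BeautifulSoup and use it.

BeautifulSoup is not available in the default GAE environment, so the BeautifulSoup file is uploaded together with the app.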
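
A minimal sketch of one way to make an uploaded copy importable, assuming the file is kept in a subdirectory of the app rather than at its top level. The helper name add_vendor_dir and the directory name lib are hypothetical and do not come from GAE or BeautifulSoup.

```python
import os
import sys


def add_vendor_dir(app_root, dirname="lib"):
    """Put <app_root>/<dirname> at the front of sys.path and return its path.

    "lib" is only an assumed directory name for bundled modules.
    """
    vendor = os.path.join(os.path.abspath(app_root), dirname)
    # Insert only once so repeated calls do not grow sys.path.
    if vendor not in sys.path:
        sys.path.insert(0, vendor)
    return vendor
```

After calling add_vendor_dir on the app's directory, a BeautifulSoup.py placed in lib can be imported with the usual from BeautifulSoup import BeautifulSoup. If BeautifulSoup.py sits directly in the app's top-level directory, as in the copy step of this post, a plain import should work without touching sys.path.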

Environment

System : iMac.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386

python : Python 2.5.5

Download

Download and extract the archive

fetch 'http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.0.8.tar.gz'
tar -xzf BeautifulSoup-3.0.8.tar.gz

Install

Move into the extracted directory and run the following command:

$ python setup.py install
running install
running build
running build_py
running install_lib
running install_egg_info
Removing /opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/BeautifulSoup-3.0.8-py2.5.egg-info

Copy BeautifulSoup.py into the GAE directory

cp /opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/BeautifulSoup.py  ~/work/GAEFolder

Note: the path of the install directory differs depending on the environment.
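
Since the install path differs per machine, one way to find the file to copy is to ask Python where it imported the module from. The following is a small sketch; module_source_path is a hypothetical helper name, and it only handles top-level modules such as BeautifulSoup.

```python
import os


def module_source_path(module_name):
    """Return the .py path that a top-level module was imported from."""
    module = __import__(module_name)
    path = os.path.abspath(module.__file__)
    # Python 2 may report the compiled .pyc/.pyo file; map it back to the source.
    root, ext = os.path.splitext(path)
    if ext in (".pyc", ".pyo"):
        path = root + ".py"
    return path
```

Calling module_source_path("BeautifulSoup") with the interpreter that ran setup.py should give the path to pass to cp, assuming BeautifulSoup 3.x is installed for that interpreter.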

$ sudo easy_install BeautifulSoup
Searching for BeautifulSoup
Reading http://pypi.python.org/simple/BeautifulSoup/
Reading http://www.crummy.com/software/BeautifulSoup/
Reading http://www.crummy.com/software/BeautifulSoup/download/
Best match: BeautifulSoup 3.1.0.1
Downloading http://www.crummy.com/software/BeautifulSoup/download/3.1.x/BeautifulSoup-3.1.0.1.tar.gz
Processing BeautifulSoup-3.1.0.1.tar.gz
Running BeautifulSoup-3.1.0.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Y3wWtw/BeautifulSoup-3.1.0.1/egg-dist-tmp-PKKJzH
zip_safe flag not set; analyzing archive contents...
/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/setuptools/command/bdist_egg.py:422: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  symbols = dict.fromkeys(iter_symbols(code))
Adding BeautifulSoup 3.1.0.1 to easy-install.pth file

Installed /Library/Python/2.6/site-packages/BeautifulSoup-3.1.0.1-py2.6.egg
Processing dependencies for BeautifulSoup
Finished processing dependencies for BeautifulSoup

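Note: if easy_install is set up, the single command above downloads and installs the package, with no manual download from the site. In this log it picked BeautifulSoup 3.1.0.1 and installed it under Python 2.6, not the 3.0.8 and Python 2.5 combination used in the manual steps.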

BeautifulSoup usage example

Extract every a tag from the mobile version of a Hatena page (d.hatena.ne.jp)

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
# Import BeautifulSoup (3.x module layout)
from BeautifulSoup import BeautifulSoup

opener = urllib2.build_opener()
url = "http://d.hatena.ne.jp/yutakikuchi/"

# Set a mobile User-Agent so the mobile page is served
opener.addheaders = [('User-Agent', 'SoftBank/1.0/912T/TJ001[/Serial] Browser/NetFront/3.3 Profile/MIDP-2.0 Configuration/CLDC-1.1')]

# Fetch the page (uncomment the print to show the raw HTML)
result = opener.open(url).read()
#print result

# Parse with BeautifulSoup and print every a tag
soup = BeautifulSoup(result)
linklist = soup.findAll('a')
for link in linklist:
    print link

Scraping results

(omitted)
<a href="/yutakikuchi/mobile">Happy Hacker WebEngineerのTechﾌﾞﾛｸﾞ</a>
<a href="./aboutmobile">yutakikuchi</a>
<a href="/yutakikuchi/archivemobile">記事の一覧</a>
(omitted)
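
The href values in the output above are relative, so they need to be resolved against the page URL before they can be fetched. Below is a minimal sketch using the standard library's urljoin; absolute_links is a hypothetical helper name, and the href strings are copied by hand from the output rather than read from the parsed tags.

```python
try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, keeping the original order."""
    return [urljoin(base_url, href) for href in hrefs]


page_url = "http://d.hatena.ne.jp/yutakikuchi/"
hrefs = ["/yutakikuchi/mobile", "./aboutmobile", "/yutakikuchi/archivemobile"]
for link in absolute_links(page_url, hrefs):
    print(link)
```

A path starting with / replaces the whole path of the page URL, while ./aboutmobile is resolved relative to the /yutakikuchi/ directory.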