Using MongoDB from Python

MongoDB

Features

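A document-oriented database written in C++ by the US company 10gen.
The bundled mongo shell is driven by JavaScript; drivers such as pymongo provide access from other languages.
A document-oriented database lets you store documents with a free-form data structure.
Each MongoDB database is independent of the others.
It is schemaless: the structure of the data does not have to be defined in advance.
A database holds one or more collections, and each collection consists of documents (objects).
A collection is a group of documents. It corresponds to a table in an RDBMS.
A document is a stored piece of data (an object).
Storing objects in an RDBMS tends to complicate the program, whereas a document-oriented database stores the object as it is.
CouchDB is another document-oriented database.
It is designed to scale out more easily than an RDBMS.
Data is stored internally in a binary format called BSON, a key-value data structure based on JSON.
Strings in BSON are encoded as UTF-8.
Indexes can be defined.
Sharding, which splits the data across multiple databases, is supported.
There is no JOIN, so use an RDBMS when you need to join tables.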
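
To make the database / collection / document hierarchy concrete, here is a minimal sketch that models it with plain Python dicts and lists. It does not talk to MongoDB at all: the names `database`, `field_names`, and the sample documents are made up for illustration, and JSON text is only a loose stand-in for the binary BSON form.

```python
import json

# A toy model only: one "database" holding named "collections",
# each collection being a list of "documents" (plain dicts).
database = {
    "posts": [
        {"author": "Mike", "text": "My first blog post!"},
        # Schemaless: a document in the same collection may use different keys.
        {"author": "Eliot", "title": "MongoDB is fun", "tags": ["mongodb"]},
    ],
    "users": [
        {"name": "Mike"},
    ],
}


def field_names(collection):
    """Return the sorted list of keys used across all documents."""
    names = set()
    for document in collection:
        names.update(document.keys())
    return sorted(names)


# JSON text encoded as UTF-8, as a rough stand-in for the stored form.
encoded = json.dumps(database["posts"][1], sort_keys=True).encode("utf-8")
```

Calling `field_names(database["posts"])` lists every key that appears in at least one document, which shows that no single table definition constrains the collection.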

BSON data structure

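Every MongoDB document holds a unique key named _id.
By default, _id is a BSON ObjectId, which is a 12-byte value.
_id is unique within a collection; when the collection has an index on _id, uniqueness is enforced.
You can specify _id yourself; if you do not, one is assigned automatically.
The 12-byte value is designed to be unique in practically all cases.
ObjectId(), provided in the MongoDB shell, generates an id. It can also build one from a hexadecimal string, as in ObjectId('e4d1a14651d0f018e0000004').
An ObjectId consists of a 4-byte timestamp (epoch seconds), a 3-byte machine ID, a 2-byte process ID, and a 3-byte counter, 12 bytes in total.
The timestamp and counter fields are stored big endian, so a byte-by-byte comparison of ids generated by one process yields ascending order.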
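
The layout described above can be checked by hand. The following is a minimal sketch using only the standard library, not the driver's own ObjectId class; `split_object_id` is a hypothetical helper name, and the sample ids are the ones that appear in the pymongo output later in this article. The machine and process ID fields are returned as raw hex because their byte order is not stated here.

```python
import binascii
import datetime
import struct


def split_object_id(hex_string):
    """Split a 24-character hex ObjectId into its four documented fields."""
    raw = binascii.unhexlify(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    # 4-byte big-endian timestamp (seconds since the Unix epoch).
    (timestamp,) = struct.unpack(">I", raw[0:4])
    # 3-byte big-endian counter: pad to 4 bytes before unpacking.
    (counter,) = struct.unpack(">I", b"\x00" + raw[9:12])
    epoch = datetime.datetime(1970, 1, 1)
    return {
        "timestamp": timestamp,
        "created_utc": epoch + datetime.timedelta(seconds=timestamp),
        "machine": binascii.hexlify(raw[4:7]).decode("ascii"),
        "pid": binascii.hexlify(raw[7:9]).decode("ascii"),
        "counter": counter,
    }


first = split_object_id("4e4aa5ac651d0f0200000000")
second = split_object_id("4e4aa5ac651d0f0200000001")
```

The two ids share the same timestamp, machine, and process fields and differ only in the counter, which is what one would expect from two saves issued by the same script within the same second.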

install

The steps below are for Mac OS X.
~~1. port install~~ I gave up on port install because it took too long.

~~$ sudo port install mongodb~~

1. download and install
Downloads - MongoDB

$ tar xzf mongodb-osx-x86_64-1.8.2.tar
$ sudo cp -R mongodb-osx-x86_64-1.8.2/ /usr/local/mongodb/

2. make directory

$ sudo mkdir -p /data/db

3. start up

$ sudo /usr/local/mongodb/bin/mongod

4. connection

$ sudo /usr/local/mongodb/bin/mongo
Password:
MongoDB shell version: 1.8.2
connecting to: test
> db.foo.save( { a: 1 } )
> db.foo.find()
{ "_id" : ObjectId("4e4934874764ff0cf48a480c"), "a" : 1 }

webtool

http://:28017/

An admin web tool is provided.
Its default port is 28017, which is the mongod port (27017) plus 1000.

pymongo

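Features

The driver for working with MongoDB from Python.
It can be installed with easy_install.
API documentation is published on the project's documentation page.
Character encoding between Python and MongoDB is converted automatically.
- Python generally handles text as Unicode internally, while BSON stores strings as UTF-8, so some care is needed. PyMongo appears to perform the encode/decode step automatically.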
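
To show what that automatic conversion amounts to, here is a minimal sketch of the encode/decode step done by hand. It covers only the character encoding, not the rest of the BSON string format, and the helper names `to_utf8_bytes` and `from_utf8_bytes` are hypothetical, not part of pymongo.

```python
def to_utf8_bytes(text):
    """Encode a Unicode string into UTF-8 bytes, the encoding BSON uses."""
    return text.encode("utf-8")


def from_utf8_bytes(raw):
    """Decode UTF-8 bytes back into a Unicode string."""
    return raw.decode("utf-8")


# Six katakana characters spelling "document", written with escapes
# so the snippet does not depend on the source file's encoding.
original = u"\u30c9\u30ad\u30e5\u30e1\u30f3\u30c8"
stored = to_utf8_bytes(original)
restored = from_utf8_bytes(stored)
```

Each of these characters takes three bytes in UTF-8, so the stored form is longer than the character count suggests, but the round trip returns the original string unchanged.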

install

$ sudo easy_install pymongo

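Note: packages installed with easy_install are stored as egg files under site-packages.

Setting environment variables

Do this only if needed. If the path to pymongo is not configured, the import fails with an error. Since everything here is run from the command line, add an export line to .zshrc.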

$ vim ~/.zshrc
export PYTHONPATH=/Library/Python/2.6/site-packages/:$PYTHONPATH
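
Before editing PYTHONPATH, it can help to confirm whether the import really fails and where Python is looking. This is a minimal sketch using only the standard library; `can_import` is a hypothetical helper name.

```python
import sys


def can_import(module_name):
    """Return True if the named module can be imported on the current path."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if not can_import("pymongo"):
    # Show where Python looked, which helps when fixing PYTHONPATH.
    sys.stderr.write("pymongo not found; sys.path is:\n")
    for entry in sys.path:
        sys.stderr.write("  %s\n" % entry)
```

If the site-packages directory that holds the pymongo egg is missing from the printed list, that is the directory to add to PYTHONPATH.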

Running the pymongo sample code

With mongodb running, run the following sample code file using the python command.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from pymongo import Connection

# connect to the local mongod on the default port
connect = Connection('localhost', 27017)

# get the test database
db = connect.test
# this notation also works: db = connect['test']

# print the database name
print "db name is = "
print db.name

# get the foo collection
collect = db.foo
# this notation also works: collect = db['foo']

# save three documents
collect.save({'x':10})
collect.save({'x':8})
collect.save({'x':11})

# fetch a single document (the first one stored)
print "find_one = "
print collect.find_one()

# fetch every document in the collection
print "find = "
for data in collect.find():
    print data

# fetch the documents that match a condition
print "find_query = "
for data in collect.find({'x':10}):
    print data

pymongo output

db name is = 
test
find_one = 
{u'x': 10, u'_id': ObjectId('4e4aa5ac651d0f0200000000')}
find =
{u'x': 10, u'_id': ObjectId('4e4aa5ac651d0f0200000000')}
{u'x': 8, u'_id': ObjectId('4e4aa5ac651d0f0200000001')}
{u'x': 11, u'_id': ObjectId('4e4aa5ac651d0f0200000002')}
{u'x': 10, u'_id': ObjectId('4e4aa631651d0f0206000000')}
{u'x': 8, u'_id': ObjectId('4e4aa631651d0f0206000001')}
{u'x': 11, u'_id': ObjectId('4e4aa631651d0f0206000002')}
(omitted)
find_query = 
{u'x': 10, u'_id': ObjectId('4e4aa5ac651d0f0200000000')}
{u'x': 10, u'_id': ObjectId('4e4aa631651d0f0206000000')}
{u'x': 10, u'_id': ObjectId('4e4aa63e651d0f0208000000')}
{u'x': 10, u'_id': ObjectId('4e4d1a14651d0f018e000000')}

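As the output shows, documents with identical key-value pairs can be stored more than once. Each save without an _id gets a fresh ObjectId built from the timestamp, machine ID, process ID, and counter, so the documents remain distinct even though their other fields match.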
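
The behaviour can be imitated with a toy in-memory model. This is only a sketch under stated assumptions: `toy_save` is a hypothetical function, a dict stands in for the collection, and a simple integer counter stands in for a generated ObjectId. It mirrors the rule that a document saved without _id is added as new, while a document saved with an existing _id replaces the stored one.

```python
import itertools

# Stand-in for ObjectId generation: every call yields a new value.
_id_source = itertools.count()


def toy_save(collection, document):
    """Store a copy of document, assigning a fresh _id when none is given."""
    stored = dict(document)
    if "_id" not in stored:
        stored["_id"] = next(_id_source)
    # The same _id overwrites the old entry; a new _id adds an entry.
    collection[stored["_id"]] = stored
    return stored["_id"]


collection = {}
first_id = toy_save(collection, {"x": 10})
second_id = toy_save(collection, {"x": 10})        # same content, new _id
toy_save(collection, {"_id": first_id, "x": 99})   # same _id, replaces
```

After these three calls the toy collection holds two entries, not three: the duplicate content was kept because its _id differed, and the third call replaced the first entry.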

pymongo tutorial

Let's look a little more closely at what the PyMongo tutorial covers.

1. Creating a Connection
The simplest form.

from pymongo import Connection
connection = Connection()

Specifying the host and port.

connection = Connection('localhost', 27017)

2. Getting a Database
Get test_database.

db = connection.test_database

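Dictionary-style access also works, and is needed for a name such as test-database that is not a valid Python attribute name.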

db = connection['test-database']

3. Getting a Collection
Get test_collection.

collection = db.test_collection

Dictionary-style access also works here, for example for a name containing a hyphen.

collection = db['test-collection']

4. Defining a Document
Define a JSON-style object. A document can hold native Python types; they are converted to BSON types automatically when it is saved.

import datetime
post = {"author": "Mike",
        "text": "My first blog post!",
        "tags": ["mongodb", "python", "pymongo"],
        "date": datetime.datetime.utcnow()}

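As one concrete case of that automatic conversion: BSON stores a date as a 64-bit count of milliseconds since the Unix epoch in UTC, so anything finer than a millisecond is dropped. The sketch below reproduces that conversion by hand with the standard library. The helper names `to_bson_millis` and `from_bson_millis` are hypothetical, and the code assumes naive datetimes that already represent UTC.

```python
import calendar
import datetime


def to_bson_millis(value):
    """Convert a naive UTC datetime to milliseconds since the Unix epoch."""
    seconds = calendar.timegm(value.utctimetuple())
    return seconds * 1000 + value.microsecond // 1000


def from_bson_millis(millis):
    """Convert epoch milliseconds back to a naive UTC datetime."""
    epoch = datetime.datetime(1970, 1, 1)
    return epoch + datetime.timedelta(milliseconds=millis)


sample = datetime.datetime(2009, 11, 12, 11, 14, 0, 123456)
millis = to_bson_millis(sample)
restored = from_bson_millis(millis)
```

Here `restored` matches `sample` down to the millisecond, while the trailing 456 microseconds are lost, which is worth keeping in mind when comparing a datetime read back from the database with the one that was saved.
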
5. Inserting a Document
Insert the document into a collection using the insert method.

posts = db.posts
posts.insert(post)

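Once the first document has been inserted, the collection exists in the database, which can be confirmed by listing the collection names.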

db.collection_names()

6. Finding a single Document
Use find_one. It returns a single document that matches the query, or None when nothing matches.

posts.find_one()

A query can be passed to find_one. If the condition is wrong and matches nothing, no result is returned.

posts.find_one({"author": "Mike"})

7. Bulk Insert
As with an RDBMS, bulk inserts are possible: pass a list of documents to insert.

new_posts = [{"author": "Mike",
              "text": "Another post!",
              "tags": ["bulk", "insert"],
              "date": datetime.datetime(2009, 11, 12, 11, 14)},
             {"author": "Eliot",
              "title": "MongoDB is fun",
              "text": "and pretty easy too!",
              "date": datetime.datetime(2009, 11, 10, 10, 45)}]
posts.insert(new_posts)

8. Getting more than one Document
Use the find method.

for post in posts.find():
    print post

As with find_one, a query can be specified.

for post in posts.find({"author": "Mike"}):
    print post

9. Getting a count
Get the number of documents in the collection.

posts.count()

Get the number of documents that match a query.

posts.find({"author": "Mike"}).count()

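10. Range conditions and sort
Search with a condition built from a variable (here, dates earlier than d via $lt) and sort the results by a given key.

d = datetime.datetime(2009, 11, 12, 12)
for post in posts.find({"date": {"$lt": d}}).sort("author"):
    print post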
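
For readers new to the query syntax, the meaning of a $lt condition followed by a sort can be sketched in plain Python over a list of dicts. This is an in-memory analogy, not how the server evaluates queries; `find_lt_sorted` and `posts_data` are hypothetical names, and the third sample document is made up so that one entry falls outside the range.

```python
import datetime

posts_data = [
    {"author": "Mike", "date": datetime.datetime(2009, 11, 12, 11, 14)},
    {"author": "Eliot", "date": datetime.datetime(2009, 11, 10, 10, 45)},
    {"author": "Mike", "date": datetime.datetime(2009, 11, 13, 9, 0)},
]


def find_lt_sorted(documents, field, upper_bound, sort_key):
    """Keep documents whose field is strictly below upper_bound, then sort."""
    matched = [doc for doc in documents
               if field in doc and doc[field] < upper_bound]
    return sorted(matched, key=lambda doc: doc[sort_key])


d = datetime.datetime(2009, 11, 12, 12)
result = find_lt_sorted(posts_data, "date", d, "author")
authors = [doc["author"] for doc in result]
```

Only the two documents dated before noon on 2009-11-12 survive the filter, and they come back ordered by author, so Eliot precedes Mike.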

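11. index
Use an index to speed up queries. The explain method reports how a query is executed, including whether an index is used and how many entries were scanned.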

posts.find({"date": {"$lt": d}}).sort("author").explain()["cursor"]
posts.find({"date": {"$lt": d}}).sort("author").explain()["nscanned"]

An index can be created with create_index.

from pymongo import ASCENDING, DESCENDING
posts.create_index([("date", DESCENDING), ("author", ASCENDING)])
posts.find({"date": {"$lt": d}}).sort("author").explain()["cursor"]
posts.find({"date": {"$lt": d}}).sort("author").explain()["nscanned"]
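
Why the scanned count drops once an index exists can be illustrated with a sorted list and binary search. This is a conceptual sketch only and does not reproduce MongoDB's actual index structure; the function names and the integer date keys are made up for illustration.

```python
import bisect

# Made-up date keys (YYYYMMDD), already in sorted order like index entries.
dates = [20091110, 20091112, 20091113, 20091115, 20091120]


def scan_without_index(values, upper_bound):
    """Full scan: every value is examined. Returns (matches, scanned)."""
    matches = [v for v in values if v < upper_bound]
    return len(matches), len(values)


def scan_with_index(sorted_values, upper_bound):
    """Sorted index: binary search finds the end of the matching prefix,
    so only matching entries are visited. Returns (matches, scanned)."""
    end = bisect.bisect_left(sorted_values, upper_bound)
    return end, end


without_index = scan_without_index(dates, 20091113)
with_index = scan_with_index(dates, 20091113)
```

Both approaches find the same two matches, but the full scan looks at all five entries while the indexed lookup visits only the two that qualify, which is the kind of difference the nscanned value makes visible.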

Next time

Next time I would like to look a little more closely at MongoDB's internal structure and configuration.

Y's note

On web technology, product management, and business management
