
Web就活日記

A blog about love, dreams, and life

Text-mining the dialogue of 「魔法少女まどか☆マギカ」 (Puella Magi Madoka Magica) with the Apache Mahout machine learning library

Machine Learning / Natural Language Processing / Hadoop

Mahout in Action

Index

  • Information & Links
  • Apache Mahout
    • About Apache Mahout
    • Mahout has machine learning libraries
    • Mahout Download / Setting
  • Madmagi Words
  • Clustering Theory
    • TF/IDF
    • K-Means
    • Canopy Clustering
  • Word Vector
  • Clustering
  • Graph Display
    • Required JAR
    • Sample Graph Image

Apache Mahout

About Apache Mahout

Apache Mahout: Scalable machine learning and data mining
Mahout is a scalable machine learning library that runs on Apache Hadoop. It provides machine learning algorithms such as collaborative filtering, user recommendation, and k-means clustering, making them easy to use.

Mahout has machine learning libraries

Mahout provides the following machine learning libraries.

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance java collections (previously colt collections)
  • A vibrant community
  • and many more cool stuff to come by this summer thanks to Google summer of code

Mahout Download / Setting

http://ftp.meisei-u.ac.jp/mirror/apache/dist/mahout/
Download and install Mahout from the mirror above. The instructions here assume Hadoop is already installed; for Hadoop setup, see "CentOSでHadoopを使ってみる - Yuta.Kikuchiの日記". Simply extracting the Mahout archive should be enough to run it, but install Maven if it is needed. The environment used here is CentOS 5.7. Running Mahout requires the JAVA_HOME, HADOOP_HOME, and HADOOP_CONF_DIR environment variables, so add them to .zshrc.

$ cat /etc/redhat-release 
CentOS release 5.7 (Final)

$ wget http://ftp.meisei-u.ac.jp/mirror/apache/dist/mahout/0.6/mahout-distribution-0.6.tar.gz
$ tar -xzf mahout-distribution-0.6.tar.gz
$ file mahout-distribution-0.6/bin/mahout # the launcher script
mahout-distribution-0.6/bin/mahout: Bourne-Again shell script text executable
$ cd mahout-distribution-0.6
$ ls -al
-rw-r--r--  1 yuta yuta    39588  2月  1 22:30 LICENSE.txt
-rw-r--r--  1 yuta yuta     1888  2月  1 22:30 NOTICE.txt
-rw-r--r--  1 yuta yuta     1200  2月  1 22:30 README.txt
drwxr-xr-x  2 yuta yuta     4096  4月  8 19:15 bin
drwxr-xr-x  3 yuta yuta     4096  4月  8 19:15 buildtools
drwxr-xr-x  2 yuta yuta     4096  2月  1 22:29 conf
drwxr-xr-x  3 yuta yuta     4096  4月  8 19:15 core
drwxr-xr-x  3 yuta yuta     4096  4月  8 19:15 distribution
drwxr-xr-x  6 yuta yuta     4096  4月  8 19:15 docs
drwxr-xr-x  5 yuta yuta     4096  4月  8 19:15 examples
drwxr-xr-x  3 yuta yuta     4096  4月  8 19:15 integration
drwxr-xr-x  2 yuta yuta     4096  4月  8 19:15 lib
-rw-r--r--  1 yuta yuta 11190212  2月  1 22:31 mahout-core-0.6-job.jar
-rw-r--r--  1 yuta yuta  1662876  2月  1 22:31 mahout-core-0.6.jar
-rw-r--r--  1 yuta yuta 23593299  2月  1 22:33 mahout-examples-0.6-job.jar
-rw-r--r--  1 yuta yuta   379461  2月  1 22:33 mahout-examples-0.6.jar
-rw-r--r--  1 yuta yuta   284781  2月  1 22:32 mahout-integration-0.6.jar
-rw-r--r--  1 yuta yuta   288914  2月  1 22:30 mahout-math-0.6.jar
drwxr-xr-x  3 yuta yuta     4096  4月  8 19:15 math

# add the following to ~/.zshrc
export JAVA_HOME=/usr/java/default/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
export PATH=$HADOOP_HOME/bin:$PATH

Madmagi Words

Scrape the dialogue of 魔法少女まどか☆マギカ. The scraped lines are tokenized in two ways, word segmentation with NLTK and morphological analysis (MA) with MeCab, and both results are stored on Hadoop HDFS.

Scraping
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys,re,urllib,urllib2
urls = ( 'http://www22.atwiki.jp/madoka-magica/pages/170.html',
         'http://www22.atwiki.jp/madoka-magica/pages/175.html',
         'http://www22.atwiki.jp/madoka-magica/pages/179.html',
         'http://www22.atwiki.jp/madoka-magica/pages/180.html',
         'http://www22.atwiki.jp/madoka-magica/pages/200.html',
         'http://www22.atwiki.jp/madoka-magica/pages/247.html',
         'http://www22.atwiki.jp/madoka-magica/pages/244.html',
         'http://www22.atwiki.jp/madoka-magica/pages/249.html',
         'http://www22.atwiki.jp/madoka-magica/pages/250.html',
         'http://www22.atwiki.jp/madoka-magica/pages/252.html',
         'http://www22.atwiki.jp/madoka-magica/pages/241.html',
         'http://www22.atwiki.jp/madoka-magica/pages/254.html'
         )   
f = open( './madmagi.txt', 'w' )
opener = urllib2.build_opener()
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22'
referer = 'http://www22.atwiki.jp/madoka-magica/'
opener.addheaders = [( 'User-Agent', ua ),( 'Referer', referer )]
# compile the patterns once and search each page only once
contents_pat = re.compile( r'<div class="contents".*?>((.|\n)*?)</div>', re.M )
line_pat = re.compile( r'「(.*?)」', re.M )
for url in urls:
    content = opener.open( url ).read()
    matched = contents_pat.search( content )
    if matched is not None:
        data = matched.group()
        for line in line_pat.findall( data ):
            f.write( line + "\n" )
f.close()

Word MA

Segment the text into space-separated words.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import nltk
from nltk.corpus.reader import *
from nltk.corpus.reader.util import *
from nltk.text import Text

jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]')
jp_chartype_tokenizer = nltk.RegexpTokenizer(u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[^ぁ-んァ-ンー\u4e00-\u9FFF]+)')
data = PlaintextCorpusReader( './', r'madmagi.txt',
                              encoding='utf-8',
                              para_block_reader=read_line_block,
                              sent_tokenizer=jp_sent_tokenizer,
                              word_tokenizer=jp_chartype_tokenizer )

# save the segmented words to a file
f = open( './word.txt', 'w' )
for i in data.words():
    f.write( i + " " )
f.close()

MeCab MA

Perform word segmentation (wakati-gaki) with MeCab.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import MeCab
mecab = MeCab.Tagger('-Ochasen')

data = open( './madmagi.txt' ).read()

f = open( './ma.txt', 'w' )

# walk MeCab's node list, writing each surface form separated by spaces
node = mecab.parseToNode( data )
while node:
    f.write( node.surface + " " )
    node = node.next
f.close()

HDFS PUT

Put the extracted Word MA / MeCab MA data onto HDFS.

$ alias hdfs='hadoop dfs'
$ hdfs -mkdir madmagi_in
$ hdfs -put data/ma.txt madmagi_in/
$ hdfs -put data/word.txt madmagi_in/
$ hdfs -lsr madmagi_in
-rw-r--r--   1 yuta supergroup     104440 2012-03-26 01:16 /user/yuta/madmagi_in/ma.txt
-rw-r--r--   1 yuta supergroup     101266 2012-03-26 01:16 /user/yuta/madmagi_in/word.txt

Clustering Theory

Before running Mahout, let's briefly review the underlying theory.

TF/IDF

tf-idf - Wikipedia
A term-weighting scheme used in information retrieval. TF (term frequency) measures how often a term occurs in a document, while IDF (inverse document frequency) discounts terms that occur in many documents; together they measure a term's importance.
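To make the two factors concrete, here is a toy sketch of tf-idf (my own illustrative example with made-up words; Mahout's actual weighting adds smoothing and normalization on top of this basic form):

```python
import math

def tf_idf(term, doc, corpus):
    # TF: relative frequency of the term within this document
    tf = doc.count(term) / float(len(doc))
    # IDF: terms that occur in few documents get a larger weight
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / float(df))
    return tf * idf

corpus = [["soul", "gem", "witch"],
          ["witch", "barrier"],
          ["soul", "gem", "grief", "seed"]]

# "grief" occurs in only one document, so it outweighs the common "witch"
print(tf_idf("grief", corpus[2], corpus))
print(tf_idf("witch", corpus[0], corpus))
```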

K-Means

K平均法 - Wikipedia
K-means is a simple clustering algorithm that iteratively recomputes cluster centers to converge on more accurate clusters. The basic flow is:

  1. Assign each node to a random cluster.
  2. Compute the center of each cluster.
  3. Reassign each node to its nearest center.
  4. Recompute the cluster centers, and repeat until the clusters no longer change (or for a chosen number of iterations).
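As an illustration, the steps above can be sketched in a few lines of Python (a hypothetical 1-D toy, not Mahout's MapReduce implementation):

```python
import random

def kmeans(points, k, iters=20):
    # step 1: choose k distinct points as the initial centers
    centers = random.sample(points, k)
    for _ in range(iters):
        # step 3: assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # steps 2 and 4: recompute each center as the mean of its cluster
        new_centers = [sum(c) / float(len(c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # clusters stopped changing
            break
        centers = new_centers
    return centers, clusters

points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.9]
centers, clusters = kmeans(points, 2)
```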
Canopy Clustering

Canopy Clustering
A quick note on Canopy clustering as well. Canopy takes a maximum radius T1 and a minimum radius T2 around a cluster center point and treats the points falling within them as the same cluster (canopy).
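A rough sketch of that idea (a hypothetical 1-D toy, not Mahout's code): any remaining point can become a canopy center, points within T1 of the center join its canopy, and points within the tighter T2 are removed from the pool of future centers.

```python
def canopy(points, t1, t2, dist=lambda a, b: abs(a - b)):
    assert t1 > t2, "T1 (loose radius) must be larger than T2 (tight radius)"
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)  # pick any remaining point as a center
        # every point within the loose radius T1 belongs to this canopy
        members = [p for p in points if dist(p, center) < t1]
        canopies.append((center, members))
        # points within the tight radius T2 cannot start a new canopy
        candidates = [p for p in candidates if dist(p, center) >= t2]
    return canopies

canopies = canopy([1.0, 1.2, 5.0, 5.1, 9.0], t1=2.0, t2=1.0)
```

Note that canopies may overlap (a point can belong to several), which is why Canopy is cheap and often used only as a pre-clustering step to seed k-means.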

Word Vector

Convert the MeCab MA results into vectors with Mahout. (I won't walk through the Word MA data here, but it should cluster the same way as the MeCab MA data.) The conversion uses Mahout's seqdirectory and seq2sparse commands. First, look at the help for each.

$ bin/mahout seqdirectory -h
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                             Path to job input directory.   
  --output (-o) output                           The directory pathname for     
                                                 output.                        
  --overwrite (-ow)                              If present, overwrite the      
                                                 output directory before        
                                                 running job                    
  --chunkSize (-chunk) chunkSize                 The chunkSize in MegaBytes.    
                                                 Defaults to 64                 
  --fileFilterClass (-filter) fileFilterClass    The name of the class to use   
                                                 for file parsing. Default:     
                                                 org.apache.mahout.text.PrefixAd
                                                 ditionFilter                   
  --keyPrefix (-prefix) keyPrefix                The prefix to be prepended to  
                                                 the key                        
  --charset (-c) charset                         The name of the character      
                                                 encoding of the input files.   
                                                 Default to UTF-8               
  --help (-h)                                    Print out help                 
  --tempDir tempDir                              Intermediate output directory  
  --startPhase startPhase                        First phase to run             
  --endPhase endPhase                            Last phase to run              

$ bin/mahout seq2sparse -h  
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Usage:                                                                          
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize           
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma      
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>      
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>        
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]       
Options                                                                         
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default       
                                      Value: 2                                  
  --analyzerName (-a) analyzerName    The class name of the analyzer            
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB  
  --output (-o) output                The directory pathname for output.        
  --input (-i) input                  Path to job input directory.              
  --minDF (-md) minDF                 The minimum document frequency.  Default  
                                      is 1                                      
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors   
                                      to be used, expressed in times the        
                                      standard deviation (sigma) of the         
                                      document frequencies of these vectors.    
                                      Can be used to remove really high         
                                      frequency terms. Expressed as a double    
                                      value. Good value to be specified is 3.0. 
                                      In case the value is less then 0 no       
                                      vectors will be filtered out. Default is  
                                      -1.0.  Overrides maxDFPercent             
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.    
                                      Can be used to remove really high         
                                      frequency terms. Expressed as an integer  
                                      between 0 and 100. Default is 99.  If     
                                      maxDFSigma is also set, it will override  
                                      this value.                               
  --weight (-wt) weight               The kind of weight to use. Currently TF   
                                      or TFIDF                                  
  --norm (-n) norm                    The norm to use, expressed as either a    
                                      float or "INF" if you want to use the     
                                      Infinite norm.  Must be greater or equal  
                                      to 0.  The default is not to normalize    
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood      
                                      Ratio(Float)  Default is 1.0              
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.        
                                      Default Value: 1                          
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to  
                                      create (2 = bigrams, 3 = trigrams, etc)   
                                      Default Value:1                           
  --overwrite (-ow)                   If set, overwrite the output directory    
  --help (-h)                         Print out help                            
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should  
                                      be SequentialAccessVectors. If set true   
                                      else false                                
  --namedVector (-nv)                 (Optional) Whether output vectors should  
                                      be NamedVectors. If set true else false   
  --logNormalize (-lnorm)             (Optional) Whether output vectors should  
                                      be logNormalize. If set true else false

seqdirectory converts the text file into a SequenceFile, and seq2sparse turns the SequenceFile into vectors. The important seq2sparse options are summarized below.

option                  description
minSupport              minimum number of occurrences per record
minDF                   minimum number of documents a term must appear in
maxDFPercent            drop terms that appear in more than this percentage of documents
maxNGramSize            maximum number of words to consider as a compound term (n-gram)
minLLR                  minimum log-likelihood ratio for accepting an n-gram
sequentialAccessVector  emit vectors in a sequential-access file format
namedVector             attach a name to each vector

$ bin/mahout seqdirectory \
--input madmagi_in/ma.txt \
--output madmagi_out_ma/seq \
-c UTF-8
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 08:46:20 INFO common.AbstractJob: Command line arguments: {--charset=UTF-8, --chunkSize=64, --endPhase=2147483647, --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter, --input=madmagi_in/ma.txt, --keyPrefix=, --output=madmagi_out_ma/seq, --startPhase=0, --tempDir=temp}
12/05/03 08:46:25 INFO driver.MahoutDriver: Program took 5835 ms (Minutes: 0.09725)

$ bin/mahout seq2sparse \
--input madmagi_out_ma/seq \
--output madmagi_out_ma/vector \
--minSupport 10 \
--minDF 20 \
--maxDFPercent 40 \
--maxNGramSize 3

Each run produces several kinds of output files.

$ hdfs -ls madmagi_out_ma/vector
Found 7 items
drwxr-xr-x   - yuta supergroup          0 2012-05-03 09:32 /user/yuta/madmagi_out_ma/vector/df-count
-rw-r--r--   1 yuta supergroup      17602 2012-05-03 09:30 /user/yuta/madmagi_out_ma/vector/dictionary.file-0
-rw-r--r--   1 yuta supergroup      17613 2012-05-03 09:32 /user/yuta/madmagi_out_ma/vector/frequency.file-0
drwxr-xr-x   - yuta supergroup          0 2012-05-03 09:31 /user/yuta/madmagi_out_ma/vector/tf-vectors
drwxr-xr-x   - yuta supergroup          0 2012-05-03 09:33 /user/yuta/madmagi_out_ma/vector/tfidf-vectors
drwxr-xr-x   - yuta supergroup          0 2012-05-03 09:29 /user/yuta/madmagi_out_ma/vector/tokenized-documents
drwxr-xr-x   - yuta supergroup          0 2012-05-03 09:30 /user/yuta/madmagi_out_ma/vector/wordcount

Use seqdumper (or vectordump) to inspect the generated SequenceFiles. This shows the tf-idf vector values and the word counts. In the word counts, characteristic terms such as グリーフシード (Grief Seed), ソウルジェム (Soul Gem), ワルプルギス (Walpurgis), and マミ (Mami) stand out with high counts.

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma/vector/tfidf-vectors/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 10:38:17 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma/vector/tfidf-vectors/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma/vector/tfidf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: /ma.txt: Value: {867:-6.465347766876221,866:-7.082433700561523,865:-9.58966064453125,
864:-15.704275131225586,863:-16.735137939453125,862:-9.369179725646973,861:-6.465347766876221,860:-15.299805641174316,859:-9.80518627166748,
858:-8.674174308776855,857:-6.780914306640625,856:-6.465347766876221,855:-8.674174308776855,854:-10.818595886230469,853:-7.918401718139648,

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma/vector/wordcount/ngrams/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 10:41:25 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma/vector/wordcount/ngrams/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma/vector/wordcount/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable

Key: アイツ: Value: 12.0
Key: アタシ: Value: 30.0
Key: アンタ: Value: 30.0
Key: エネルギー: Value: 12.0
Key: キュゥ: Value: 13.0
Key: グリーフシード: Value: 10.0
Key: ソウルジェム: Value: 17.0
Key: ダメ: Value: 15.0
Key: バカ: Value: 16.0
Key: ホント: Value: 10.0
Key: マミ: Value: 20.0
Key: ワルプルギス: Value: 10.0

The negative tf-idf values in the seq2sparse output above look wrong, so I adjust the seq2sparse options and run it again.

$ bin/mahout seq2sparse \
--input madmagi_out_ma/seq \
--output madmagi_out_ma_test/vector \
--maxDFPercent 40 \
--maxNGramSize 6 \
--sequentialAccessVector \
--namedVector

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 12:58:43 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: /ma.txt: Value: /ma.txt:{0:0.4339554011821747,1:0.4339554011821747,2:0.811856210231781,3:2.1479697227478027,4:5.765240669250488,5:0.6861437559127808,6:10.023337364196777,7:1.0177156925201416,8:6.868295669555664,9:2.1479697227478027,
10:4.653653144836426,11:2.744575023651123,12:8.564437866210938,13:5.699537754058838,14:4.01262092590332,15:1.4716144800186157,16:4.572004318237305,17:1.3722875118255615,18:4.713962554931641,
19:1.708484172821045,20:5.341354846954346,21:2.0584311485290527

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma_test/vector/wordcount/ngrams/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:06:39 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma_test/vector/wordcount/ngrams/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma_test/vector/wordcount/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable

Key: アイツ: Value: 12.0
Key: アタシ: Value: 30.0
Key: アンタ: Value: 30.0
Key: イレギュラー: Value: 2.0
Key: インキュベーター: Value: 3.0
Key: ウザ: Value: 3.0
Key: ウゼェ: Value: 2.0
Key: エネルギー: Value: 12.0
Key: エントロピー: Value: 2.0
Key: オイ: Value: 5.0
Key: オイッ: Value: 2.0
Key: カッコ: Value: 3.0
Key: キュゥ: Value: 13.0
Key: キュウ: Value: 2.0
Key: クラス: Value: 2.0
Key: グリーフシード: Value: 10.0
Key: コイツ: Value: 5.0
Key: ゴメン: Value: 2.0
Key: ゼロ: Value: 2.0
Key: ソウルジェム: Value: 17.0
Key: ゾンビ: Value: 2.0
Key: タツヤ: Value: 2.0
Key: ダメ: Value: 15.0
Key: チッ: Value: 2.0
Key: ッ: Value: 5.0
Key: テメェ: Value: 9.0
Key: ナメ: Value: 3.0
Key: ハッ: Value: 3.0
Key: バカ: Value: 16.0
Key: バランス: Value: 2.0
Key: バレ: Value: 2.0
Key: ベテラン: Value: 2.0
Key: ホント: Value: 10.0
Key: ママ: Value: 2.0
Key: マミ: Value: 20.0
Key: ミス: Value: 3.0
Key: モノ: Value: 2.0
Key: リハビリ: Value: 4.0
Key: リンゴ: Value: 2.0
Key: ルール: Value: 3.0
Key: ワルプルギス: Value: 10.0

Clustering

Below, I run Canopy and K-means clustering, using the tf-idf values as the vectors, and then inspect the extracted clusters with the clusterdump command. First, check the help for canopy and kmeans.
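The runs below pass CosineDistanceMeasure rather than the default SquaredEuclidean. As a reminder, cosine distance compares only the direction of two term vectors and ignores their magnitude (an illustrative sketch, not Mahout's class):

```python
import math

def cosine_distance(a, b):
    # 1 - cos(theta): ~0 for vectors pointing the same way, 1 for orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: ~0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 1.0
```

This makes documents with similar word distributions cluster together even when their lengths differ, which suits tf-idf text vectors better than squared Euclidean distance.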

$ bin/mahout canopy -h
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                                    Path to job input       
                                                        directory.              
  --output (-o) output                                  The directory pathname  
                                                        for output.             
  --distanceMeasure (-dm) distanceMeasure               The classname of the    
                                                        DistanceMeasure.        
                                                        Default is              
                                                        SquaredEuclidean        
  --t1 (-t1) t1                                         T1 threshold value      
  --t2 (-t2) t2                                         T2 threshold value      
  --t3 (-t3) t3                                         T3 (Reducer T1)         
                                                        threshold value         
  --t4 (-t4) t4                                         T4 (Reducer T2)         
                                                        threshold value         
  --clusterFilter (-cf,-clusterFilter) clusterFilter    Cluster filter          
                                                        suppresses small        
                                                        canopies from mapper    
  --overwrite (-ow)                                     If present, overwrite   
                                                        the output directory    
                                                        before running job      
  --clustering (-cl)                                    If present, run         
                                                        clustering after the    
                                                        iterations have taken   
                                                        place                   
  --method (-xm) method                                 The execution method to 
                                                        use: sequential or      
                                                        mapreduce. Default is   
                                                        mapreduce               
  --help (-h)                                           Print out help          
  --tempDir tempDir                                     Intermediate output     
                                                        directory               
  --startPhase startPhase                               First phase to run      
  --endPhase endPhase                                   Last phase to run   

$ bin/mahout kmeans -h
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                           Path to job input directory.     
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --clusters (-c) clusters                     The input centroids, as Vectors. 
                                               Must be a SequenceFile of        
                                               Writable, Cluster/Canopy.  If k  
                                               is also specified, then a random 
                                               set of vectors will be selected  
                                               and written out to this path     
                                               first                            
  --numClusters (-k) k                         The k in k-Means.  If specified, 
                                               then a random selection of k     
                                               Vectors will be chosen as the    
                                               Centroid and written to the      
                                               clusters input path.             
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --clustering (-cl)                           If present, run clustering after 
                                               the iterations have taken place  
  --method (-xm) method                        The execution method to use:     
                                               sequential or mapreduce. Default 
                                               is mapreduce                     
  --help (-h)                                  Print out help                   
  --tempDir tempDir                            Intermediate output directory    
  --startPhase startPhase                      First phase to run               
  --endPhase endPhase                          Last phase to run     

Canopy
$ bin/mahout canopy \
--input madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000 \
--output madmagi_out_ma_test/canopy \
--t1 0.9 \
--t2 0.8 \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:44:35 INFO common.AbstractJob: Command line arguments: {--distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000, --method=mapreduce, --output=madmagi_out_ma_test/canopy, --startPhase=0, --t1=0.8, --t2=0.7, --tempDir=temp}
12/05/03 13:44:35 INFO canopy.CanopyDriver: Build Clusters Input: madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000 Out: madmagi_out_ma_test/canopy Measure: org.apache.mahout.common.distance.CosineDistanceMeasure@177d59d4 t1: 0.8 t2: 0.5
12/05/03 13:44:39 INFO input.FileInputFormat: Total input paths to process : 1
12/05/03 13:44:40 INFO mapred.JobClient: Running job: job_201205030840_0073
12/05/03 13:44:41 INFO mapred.JobClient:  map 0% reduce 0%
12/05/03 13:44:54 INFO mapred.JobClient:  map 100% reduce 0%
12/05/03 13:45:05 INFO mapred.JobClient:  map 100% reduce 33%
12/05/03 13:45:07 INFO mapred.JobClient:  map 100% reduce 100%
12/05/03 13:45:11 INFO mapred.JobClient: Job complete: job_201205030840_0073

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma_test/canopy/clusters-0-final/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:10:46 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma_test/canopy/clusters-0-final/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma_test/canopy/clusters-0-final/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.canopy.Canopy
Key: C-0: Value: C-0: {0:0.4339554011821747,1:0.4339554011821747,2:0.811856210231781,3:2.1479697227478027,4:5.765240669250488,5:0.6861437559127808,6:10.023337364196777,7:1.0177156925201416,8:6.868295669555664,9:2.1479697227478027,10:4.653653144836426,
11:2.744575023651123,12:8.564437866210938,13:5.699537754058838,14:4.01262092590332,15:1.4716144800186157,16:4.572004318237305,17:1.3722875118255615,18:4.713962554931641,19:1.708484172821045,20:5.341354846954346,21:2.0584311485290527,
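The centroid dumped above is a sparse vector of `index:weight` pairs; each index refers to a term in `dictionary.file-0` (the clusterdump output further down shows index 0 is "10", 3 is "ぁ", 4 is "あ", and so on). As a small illustration of that decoding — using a hypothetical in-memory dictionary rather than the real SequenceFile — here is a Python sketch:

```python
def decode_vector(dump, dictionary):
    """Turn a Mahout sparse-vector dump ('index:weight,...') into {term: weight}."""
    result = {}
    for pair in dump.split(","):
        idx, weight = pair.split(":")
        result[dictionary[int(idx)]] = float(weight)
    return result

# Hypothetical in-memory dictionary; the real index->term mapping lives in the
# dictionary.file-0 SequenceFile. Entries taken from the clusterdump output.
dictionary = {0: "10", 1: "100", 2: "々", 3: "ぁ", 4: "あ"}

dump = "0:0.4339554011821747,3:2.1479697227478027,4:5.765240669250488"
print(decode_vector(dump, dictionary))
# {'10': 0.4339554011821747, 'ぁ': 2.1479697227478027, 'あ': 5.765240669250488}
```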

$ bin/mahout clusterdump \
--seqFileDir madmagi_out_ma_test/canopy/clusters-0-final \
--dictionary  madmagi_out_ma_test/vector/dictionary.file-0 \
--dictionaryType sequencefile \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
--numWords 100

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:47:14 INFO common.AbstractJob: Command line arguments: {--dictionary=madmagi_out_ma_test/vector/dictionary.file-0, --dictionaryType=sequencefile, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --numWords=100, --outputFormat=TEXT, --seqFileDir=madmagi_out_ma_test/canopy/clusters-0-final, --startPhase=0, --tempDir=temp}
C-0{n=1 c=[10:0.434, 100:0.434, 々:0.812, ぁ:2.148, あ:5.765, ぃ:0.686, い:10.023, ぅ:1.018, う:6.868, ぇ:2.148, え:4.654, お:2.745, か:8.564, が:5.700, き:4.013, ぎ:1.472, く:4.572, ぐ:1.372, け:4.714, げ:1.708, こ:5.341, ご:2.058, さ:5.814, ざ:0.921, し:7.183, じ:3.906, す:4.252, ず:2.081, せ:2.959, ぜ:0.752, そ:5.540, ぞ:0.970, た:8.090, だ:7.548, ち:5.226, っ:8.993, つ:3.319, づ:1.018, て:9.144, で:6.044, と:6.531, ど:4.976, な:10.237, に:7.196, ぬ:0.812, ね:4.852, の:9.206, は:7.084, ば:2.727, ぱ:0.868, ひ:1.302, び:1.106, ふ:1.302, ぶ:1.018, へ:0.921, べ:1.867, ほ:2.604, ぼ:0.686, ぽ:0.531, ま:5.252,
(略)
Top Terms: 
		な                                       =>  10.237117767333984
		い                                       =>  10.023337364196777
		の                                       =>   9.205584526062012
		て                                       =>    9.14400863647461
		っ                                       =>   8.993457794189453
		ん                                       =>   8.597356796264648
		か                                       =>   8.564437866210938
		た                                       =>   8.089515686035156
		だ                                       =>   7.547581672668457
		に                                       =>   7.196336269378662
		し                                       =>   7.183239936828613
		は                                       =>    7.08424711227417
		う                                       =>   6.868295669555664
		も                                       =>   6.659483432769775
		る                                       =>   6.588409423828125
		と                                       =>  6.5309929847717285
		ら                                       =>   6.296092987060547
		っ て                                     =>   6.137056350708008
(略)
		そ う                                     =>  2.5489108562469482
		っ と                                     =>  2.5489108562469482
		か ち                                     =>  2.5116984844207764
		て い                                     =>  2.5116984844207764
		魔 法                                     =>  2.5116984844207764
		か ち ゃ                                   =>  2.4928841590881348
		か ち ゃ ん                                 =>  2.4928841590881348
		さ や か ち                                 =>  2.4928841590881348
		さ や か ち ゃ                               =>  2.4928841590881348
		さ や か ち ゃ ん                             =>  2.4928841590881348
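The `--t1`/`--t2` options used above are Canopy's loose and tight distance thresholds. As a rough standalone sketch of the algorithm itself (not Mahout's implementation), in Python:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the same idea as Mahout's CosineDistanceMeasure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def canopy(points, t1, t2, dist=cosine_distance):
    """One-pass canopy clustering: points within t1 of a center join its canopy,
    points within t2 are removed from the candidate pool. Requires t1 > t2."""
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)      # arbitrary choice of next canopy center
        members = [center]
        remaining = []
        for p in candidates:
            d = dist(center, p)
            if d < t1:
                members.append(p)       # close enough to belong to this canopy
            if d >= t2:
                remaining.append(p)     # far enough to seed/join other canopies
        candidates = remaining
        canopies.append((center, members))
    return canopies

# Two near-parallel vectors and one orthogonal one, with the t1/t2 used above.
vectors = [(1.0, 0.1), (0.9, 0.2), (-0.1, 1.0)]
for center, members in canopy(vectors, t1=0.9, t2=0.8):
    print(center, "->", len(members), "member(s)")
```

With cosine distance the thresholds live in [0, 2], which is why values like 0.9/0.8 are reasonable here; canopies may overlap by design.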
K-Means

We now feed the clusters extracted by Canopy into K-Means as initial seeds. Although --numClusters is set to 10, only a single cluster is produced. One likely cause is that the input here is a single document vector (the clusteredPoints dump below contains only /ma.txt), and K-Means can never produce more non-empty clusters than there are input points. I am still investigating and will update this section once the cause is confirmed.
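For intuition, here is a generic Lloyd-style K-Means sketch in Python (not Mahout's implementation): with a single input vector, any number of seed centers collapses to one non-empty cluster.

```python
def euclidean(a, b):
    """Plain Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, centers, dist=euclidean, max_iter=40):
    """Lloyd-style k-means; empty clusters are dropped each iteration."""
    for _ in range(max_iter):
        buckets = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            buckets[nearest].append(p)
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*b))
                       for b in buckets if b]        # drop empty clusters
        if new_centers == centers:                   # converged
            break
        centers = new_centers
    return centers

# One input point, three seed centers: everything collapses to one cluster.
print(kmeans([(1.0, 2.0)], [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]))
# [(1.0, 2.0)]
```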

$ bin/mahout kmeans \
--input madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000 \
--output madmagi_out_ma_test/kmeans \
--clusters madmagi_out_ma_test/canopy/clusters-0-final \
--maxIter 40 \
--numClusters 10 \
--convergenceDelta 0.01 \
--clustering \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure

$ bin/mahout clusterdump \
--seqFileDir madmagi_out_ma_test/kmeans/clusters-1-final \
--dictionary madmagi_out_ma_test/vector/dictionary.file-0 \
--pointsDir madmagi_out_ma_test/kmeans/clusteredPoints \
--dictionaryType sequencefile \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
--numWords 100

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 14:03:20 INFO common.AbstractJob: Command line arguments: {--dictionary=madmagi_out_ma_test/vector/dictionary.file-0, --dictionaryType=sequencefile, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --numWords=100, --outputFormat=TEXT, --pointsDir=madmagi_out_ma_test/kmeans/clusteredPoints, --seqFileDir=madmagi_out_ma_test/kmeans/clusters-1-final, --startPhase=0, --tempDir=temp}
VL-0{n=1 c=[10:0.434, 100:0.434, 々:0.812, ぁ:2.148, あ:5.765, ぃ:0.686, い:10.023, ぅ:1.018, う:6.868, ぇ:2.148, え:4.654, お:2.745, か:8.564, が:5.700, き:4.013, ぎ:1.472, く:4.572, ぐ:1.372, け:4.714, げ:1.708, こ:5.341, ご:2.058, さ:5.814, ざ:0.921, し:7.183, じ:3.906, す:4.252, ず:2.081, せ:2.959, ぜ:0.752, そ:5.540, ぞ:0.970, た:8.090, だ:7.548, ち:5.226, っ:8.993, つ:3.319, づ:1.018, て:9.144, で:6.044, と:6.531, ど:4.976, な:10.237, に:7.196, ぬ:0.812, ね:4.852, の:9.206, は:7.084, ば:2.727, ぱ:0.868, ひ:1.302, び:1.106, ふ:1.302, ぶ:1.018, へ:0.921, べ:1.867, ほ:2.604, ぼ:0.686, ぽ:0.531, ま:5.252,
(略)

	Top Terms: 
		な                                       =>  10.237117767333984
		い                                       =>  10.023337364196777
		の                                       =>   9.205584526062012
		て                                       =>    9.14400863647461
		っ                                       =>   8.993457794189453
		ん                                       =>   8.597356796264648
		か                                       =>   8.564437866210938
		た                                       =>   8.089515686035156
		だ                                       =>   7.547581672668457
		に                                       =>   7.196336269378662
		し                                       =>   7.183239936828613
		は                                       =>    7.08424711227417
		う                                       =>   6.868295669555664
		も                                       =>   6.659483432769775
		る                                       =>   6.588409423828125
		と                                       =>  6.5309929847717285
		ら                                       =>   6.296092987060547
		っ て                                     =>   6.137056350708008
(略)
		そ う                                     =>  2.5489108562469482
		っ と                                     =>  2.5489108562469482
		か ち                                     =>  2.5116984844207764
		て い                                     =>  2.5116984844207764
		魔 法                                     =>  2.5116984844207764
		か ち ゃ                                   =>  2.4928841590881348
		か ち ゃ ん                                 =>  2.4928841590881348
		さ や か ち                                 =>  2.4928841590881348
		さ や か ち ゃ                               =>  2.4928841590881348
		さ や か ち ゃ ん                             =>  2.4928841590881348
	Weight : [props - optional]:  Point:
	1.0 : [distance=0.0]: /ma.txt = [10:0.434, 100:0.434, 々:0.812, ぁ:2.148, あ:5.765, ぃ:0.686, い:10.023, ぅ:1.018, う:6.868, ぇ:2.148, え:4.654, お:2.745, か:8.564, が:5.700, き:4.013, ぎ:1.472, く:4.572, ぐ:1.372, け:4.714, げ:1.708, こ:5.341, ご:2.058, さ:5.814, ざ:0.921, し:7.183, じ:3.906, す:4.252, ず:2.081, せ:2.959, ぜ:0.752, そ:5.540, ぞ:0.970, た:8.090, だ:7.548, ち:5.226, っ:8.993, つ:3.319, づ:1.018, て:9.144, で:6.044, と:6.531, ど:4.976, な:10.237, に:7.196, ぬ:0.812, ね:4.852, の:9.206, は:7.084, ば:2.727, ぱ:0.868, ひ:1.302, び:1.106, ふ:1.302, ぶ:1.018, へ:0.921, べ:1.867, ほ:2.604,

Graph Display

Required JAR

Jar File Download examples (example source code) Organized by topic はてなブックマーク - Jar File Download examples (example source code) Organized by topic
Raw clustering results are hard to grasp visually, so let's render a GUI graph. Graph rendering requires a few extra JAR files, which can be searched for and downloaded from the site above. After repeated trial and error, the following JARs turned out to be necessary.

  • com.google.common_1.0.0.201004262004.jar
  • google-collections-1.0.jar
  • uncommons-maths-1.2.jar
$ pwd
/home/yuta/work/src/mahout/mahout-distribution-0.6
$ wget http://www.java2s.com/Code/JarDownload/uncommons/uncommons-maths-1.2.jar.zip
$ wget http://www.java2s.com/Code/JarDownload/com.google/com.google.common_1.0.0.201004262004.jar.zip
$ wget http://www.java2s.com/Code/JarDownload/google-collections/google-collections-1.0.jar.zip
$ for f in *.zip; do unzip "$f"; done

Furthermore, HADOOP_CLASSPATH must be exported or a Java error occurs, so set it as follows. It is convenient to put these lines in .zshrc.

export JAVA_HOME=/usr/java/default/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
export PATH=$HADOOP_HOME/bin:$PATH
export MAHOUT_HOME=/home/yuta/work/src/mahout/mahout-distribution-0.6
export HADOOP_CLASSPATH=$MAHOUT_HOME/mahout-math-0.6.jar:$MAHOUT_HOME/mahout-core-0.6.jar:$MAHOUT_HOME/commons-cli-2.0-mahout.jar:$MAHOUT_HOME/mahout-integration-0.6.jar:$MAHOUT_HOME/google-collections-1.0.jar:$MAHOUT_HOME/uncommons-maths-1.2.jar:$MAHOUT_HOME/com.google.common_1.0.0.201004262004.jar:$MAHOUT_HOME/lib/mahout-collections-1.0.jar
Sample Graph Image

Once the JAR files and environment variables are in place, running the following commands displays a sample K-Means graph image.

$ hadoop jar /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6.jar org.apache.mahout.clustering.display.DisplayClustering
$ hadoop jar /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6.jar \
org.apache.mahout.clustering.display.DisplayKMeans



DisplayKMeans.java

Let's fetch mahout-examples-0.6-sources.jar and look at how the sample clustering is implemented. It turns out the demo simply generates random sample data (via org.apache.mahout.common.RandomUtils and the DisplayClustering class) and displays it. If this can be adapted to the Madmagi words, we should be able to output a graph image of our own data. I'll continue with that in the next post.

$ wget http://mirrors.ibiblio.org/maven2/org/apache/mahout/mahout-examples/0.6/mahout-examples-0.6-sources.jar
$ unzip mahout-examples-0.6-sources.jar
$ vi org/apache/mahout/clustering/display/DisplayKMeans.java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.clustering.display;

import java.awt.Graphics;
import java.awt.Graphics2D;
import java.io.IOException;
import java.util.Collection;
import java.util.List;

import com.google.common.collect.Lists;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.ClusterClassifier;
import org.apache.mahout.clustering.ClusterIterator;
import org.apache.mahout.clustering.ClusteringPolicy;
import org.apache.mahout.clustering.KMeansClusteringPolicy;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.RandomUtils;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class DisplayKMeans extends DisplayClustering {

  DisplayKMeans() {
    initialize();
    this.setTitle("k-Means Clusters (>" + (int) (significance * 100) + "% of population)");
  }
  
  public static void main(String[] args) throws Exception {
    DistanceMeasure measure = new ManhattanDistanceMeasure();
    Path samples = new Path("samples");
    Path output = new Path("output");
    Configuration conf = new Configuration();
    HadoopUtil.delete(conf, samples);
    HadoopUtil.delete(conf, output);
    
    RandomUtils.useTestSeed();
    DisplayClustering.generateSamples();
    writeSampleData(samples);
    boolean runClusterer = false;
    if (runClusterer) {
      int numClusters = 3;
      runSequentialKMeansClusterer(conf, samples, output, measure, numClusters);
    } else {
      int maxIterations = 10;
      runSequentialKMeansClassifier(conf, samples, output, measure, maxIterations);
    }
    new DisplayKMeans();
  }
  
  private static void runSequentialKMeansClassifier(Configuration conf,
                                                    Path samples,
                                                    Path output,
                                                    DistanceMeasure measure,
                                                    int numClusters) throws IOException {
    Collection<Vector> points = Lists.newArrayList();
    for (int i = 0; i < numClusters; i++) {
      points.add(SAMPLE_DATA.get(i).get());
    }
    List<Cluster> initialClusters = Lists.newArrayList();
    int id = 0;
    for (Vector point : points) {
      initialClusters.add(new org.apache.mahout.clustering.kmeans.Cluster(
          point, id++, measure));
    }
    ClusterClassifier prior = new ClusterClassifier(initialClusters);
    Path priorClassifier = new Path(output, "clusters-0");
    writeClassifier(prior, conf, priorClassifier);
    
    int maxIter = 10;
    ClusteringPolicy policy = new KMeansClusteringPolicy();
    new ClusterIterator(policy).iterateSeq(samples, priorClassifier, output, maxIter);
    for (int i = 1; i <= maxIter; i++) {
      ClusterClassifier posterior = readClassifier(conf, new Path(output, "classifier-" + i));
      CLUSTERS.add(posterior.getModels());
    }
  }
  
  private static void runSequentialKMeansClusterer(Configuration conf,
                                                   Path samples,
                                                   Path output,
                                                   DistanceMeasure measure,
                                                   int maxIterations)
    throws IOException, InterruptedException, ClassNotFoundException {
    Path clusters = RandomSeedGenerator.buildRandom(conf, samples, new Path(
        output, "clusters-0"), 3, measure);
    double distanceThreshold = 0.001;
    KMeansDriver.run(samples, clusters, output, measure, distanceThreshold,
        maxIterations, true, true);
    loadClusters(output);
  }
  
  // Override the paint() method
  @Override
  public void paint(Graphics g) {
    plotSampleData((Graphics2D) g);
    plotClusters((Graphics2D) g);
  }
}
