by shigemk2

For now I only write about technical things

Loading Hatena Counter logs into Treasure Data, mk2: bulk importing the CSV file as-is

Learning the Treasure Client Command-Line Reference with Real Data: 1. Data Import - doryokujin's blog

I basically followed the steps in the post linked above as-is.

# Create the table
$ td table:create test shigemk2_bulk
Table 'test.shigemk2_bulk' is created.
# Create the bulk import session
$ td import:create session_shigemk2 test shigemk2_bulk
Bulk import session 'session_shigemk2' is created.
# Prepare the data for import, treating the first line as the header. The prepared files can be reused to run the import repeatedly.
$ td import:prepare 101-2014-02.csv --format csv --column-header --time-column 'time' -o ./parts/

Preparing sources
  Output dir   : ./parts/
  Source     : 101-2014-02.csv (13842646 bytes)

Converting '101-2014-02.csv'...
sample row: {"time":0,"device":"1366x768","browser":"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/32.0.1700.102 Safari\/537.36","unknown":24,"language":"ja,en-US;q=0.8,en;q=0.6","referer":"http:\/\/zenback.itmedia.co.jp\/contents","ip":"xxx.xx.xxx.xxx"}

Prepare status:
  Source    : 101-2014-02.csv
    Status          : SUCCESS
    Read lines      : 37881
    Valid rows      : 37880
    Invalid rows    : 0
    Converted Files : ./parts/101-2014-02_csv_0.msgpack.gz (2084235 bytes)


Next steps:
  => execute following 'td import:upload' command. if the bulk import session is not created yet, please create it with 'td import:create <session> <database> <table>' command.
     $ td import:upload <session> './parts/101-2014-02_csv_0.msgpack.gz'
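
The prepare status above reports 37881 lines read and 37880 valid rows; the one-line difference is presumably the header row consumed by `--column-header`. As a rough sanity check before running `td import:prepare`, something like the following could confirm that the header contains the column passed to `--time-column` and count the data rows. `check_csv` is a hypothetical helper, not part of `td`, and it assumes a comma-separated header with no embedded newlines in any field.

```shell
# Hypothetical helper (not part of td): check that the CSV header contains
# the given column, then print the number of data rows after the header.
check_csv() {
  file=$1
  column=$2
  # Read the header row, dropping carriage returns and double quotes.
  header=$(head -n 1 "$file" | tr -d '\r"')
  case ",$header," in
    *",$column,"*) ;;
    *) echo "column '$column' not found in header of $file" >&2; return 1 ;;
  esac
  # Count the lines after the header. awk also counts a final line that
  # has no trailing newline, and prints 0 for a header-only file.
  tail -n +2 "$file" | awk 'END { print NR }'
}
```

For the file in this post the call would be `check_csv 101-2014-02.csv time`, and the printed count can be compared with the "Valid rows" figure.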
# Upload the data. At this stage the data is only uploaded to the session; it is not yet stored in the table.
$ td import:upload session_shigemk2 './parts/101-2014-02_csv_0.msgpack.gz'
Uploading prepared sources
  Session name : session_shigemk2
  Source     : ./parts/101-2014-02_csv_0.msgpack.gz (2084235 bytes)

Uploading ./parts/101-2014-02_csv_0.msgpack.gz (2084235 bytes)...

Upload status:
  Source  : ./parts/101-2014-02_csv_0.msgpack.gz
    Status          : SUCCESS
    Part name       : 101-2014-02_csv_0_msgpack_gz
    Size            : 2084235
    Retry count     : 0


Next Steps:
  => execute 'td import:perform session_shigemk2'.

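# Store the data. This took quite a while (about 51 minutes, going by the start and finish times in the job log below).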
$ td import:perform session_shigemk2
Job 9279134 is queued.
Use 'td job:show [-w] 9279134' to show the status.
$ td job:show -w 9279134
JobID       : 9279134
Status      : running
Type        : bulk_import_perform
Database    : test
queued...
  started at 2014-04-03T22:41:06Z
  14/04/03 22:41:11 INFO log.MLog: MLog clients using log4j logging.
  14/04/03 22:41:11 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
  14/04/03 22:41:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  14/04/03 22:41:17 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
  finished at 2014-04-03T23:32:15Z
Use '-v' option to show detailed messages.
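
The job log above only shows the `started at` and `finished at` timestamps, so the running time has to be worked out by hand. A minimal sketch of that arithmetic follows. It assumes a `date` that accepts `-d` with a `YYYY-MM-DD hh:mm:ss` string (GNU coreutils or BusyBox; the BSD/macOS `date` does not), and the function names are made up for this example.

```shell
# Convert a UTC timestamp such as 2014-04-03T22:41:06Z to epoch seconds.
# Assumes GNU or BusyBox date; the "T" and "Z" are stripped first.
to_epoch() {
  ts=$(printf '%s' "$1" | sed 's/T/ /; s/Z$//')
  date -u -d "$ts" +%s
}

# Print the number of seconds between two UTC timestamps.
elapsed_seconds() {
  echo $(( $(to_epoch "$2") - $(to_epoch "$1") ))
}

# Timestamps copied from the job log above.
secs=$(elapsed_seconds 2014-04-03T22:41:06Z 2014-04-03T23:32:15Z)
echo "$secs seconds ($((secs / 60)) min $((secs % 60)) s)"
```

For this job the difference is 3069 seconds, a little over 51 minutes.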
# Commit the data
$ td import:commit session_shigemk2
Bulk import session 'session_shigemk2' started to commit.
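
The steps above could be strung together roughly as follows. This is only a sketch: `bulk_import` is a hypothetical wrapper, it uses just the `td` subcommands and options that appear in this post, and it assumes that `td` exits with a non-zero status on failure and that `import:perform` prints a `Job <id> is queued.` line as in the output above.

```shell
# Sketch: the steps from this post as one POSIX sh function.
# Usage: bulk_import <database> <table> <session> <csv-file> [parts-dir]
bulk_import() {
  db=$1; table=$2; session=$3; csv=$4; parts_dir=${5:-./parts}

  td table:create "$db" "$table" || return 1
  td import:create "$session" "$db" "$table" || return 1

  # The first CSV line is the header; the 'time' column holds the timestamp.
  td import:prepare "$csv" --format csv --column-header \
    --time-column time -o "$parts_dir/" || return 1

  # Upload every prepared part file found in the output directory.
  for part in "$parts_dir"/*.msgpack.gz; do
    [ -e "$part" ] || { echo "no prepared files in $parts_dir" >&2; return 1; }
    td import:upload "$session" "$part" || return 1
  done

  # import:perform only queues a job, so pull the job ID out of the
  # "Job <id> is queued." line and wait for that job before committing.
  out=$(td import:perform "$session") || return 1
  printf '%s\n' "$out"
  job_id=$(printf '%s\n' "$out" |
    sed -n 's/^Job \([0-9][0-9]*\) is queued\.$/\1/p')
  [ -n "$job_id" ] || { echo "could not find the job ID" >&2; return 1; }
  td job:show -w "$job_id" || return 1

  td import:commit "$session"
}
```

With the names used in this post the call would be `bulk_import test shigemk2_bulk session_shigemk2 101-2014-02.csv`. The function stops at the first failing step, so rerunning it against a table or session that already exists would need those two steps skipped.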