2017-06-16

patriot-workflow-scheduler pr 70

Ruby

Having to add % when searching for jobs was not very intuitive, so I fixed it.

2017-06-16

React v15.6.1

React

React DOM fixes

Fix a crash on iOS Safari. (@jquense in #9960)
Don’t add px to custom CSS property values. (@TrySound in #9966)

Release v15.6.1 · facebook/react · GitHub

2017-06-15

memo Operating Presto and its surrounding ecosystem at DMM #prestodb

presto

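System architecture
Roles of each component
Making use of Presto resources
Summary
Batch environment
- hadoop
- Other systems
- No ETL is run against Hadoop (digdag, Sqoop, and the like are used, though)
- Mostly routine, fixed-form queries
presto
- Analytics cluster, version 0.170
- 1 coordinator, 20 workers
- A query that checks running queries is executed every 5 minutes
- Query logs are shipped via fluentd so they can be viewed in Kibana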
redash
- 1.0.1+b2845
- Google OAuth integration
Apache Zeppelin
- Notebook for analysts
- https://zeppelin.apache.org
kibana
- Usage monitoring
multi-coordinator
- https://prestodb.io/docs/current/installation/deployment.html
Batch
- Retries are implemented to account for coordinator switchover
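A retry that tolerates a coordinator switchover might look roughly like the sketch below. This is not the implementation from the talk: `run_with_retry`, the `submit` callable, and the idea of passing a list of coordinator URLs are all hypothetical names introduced here for illustration, and only connection failures are retried.

```python
import time
from typing import Callable, Sequence


def run_with_retry(
    submit: Callable[[str, str], list],
    coordinators: Sequence[str],
    sql: str,
    attempts_per_coordinator: int = 2,
    backoff_seconds: float = 0.0,
) -> list:
    """Try each coordinator in turn, retrying on connection failures.

    `submit(coordinator_url, sql)` is a hypothetical callable that runs the
    query and raises ConnectionError when the coordinator is unreachable.
    """
    last_error = None
    for coordinator in coordinators:
        for attempt in range(attempts_per_coordinator):
            try:
                return submit(coordinator, sql)
            except ConnectionError as error:
                last_error = error
                # Wait a little longer after each failed attempt.
                time.sleep(backoff_seconds * (attempt + 1))
    raise RuntimeError("all coordinators failed") from last_error
```

A batch job would wrap each query submission in this helper so that a switchover shows up as a short delay rather than a failed run.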
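When using WITH in a CREATE statement, writing the CREATE statement first makes the WITH clause easier to use
Slow query detection and cancellation with Slack + Presto
- Run the query-check query every 5 minutes
- slack + hubot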
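The detection half of this setup can be sketched as a filter over the rows returned by the periodic check query. Everything named here is an assumption for illustration: the `RunningQuery` row shape, the 30-minute threshold, and the alert text are not from the talk, and the actual cancel call and Slack posting are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class RunningQuery:
    # Hypothetical shape of one row returned by the periodic check query.
    query_id: str
    user: str
    started: datetime


def find_slow_queries(
    queries: Iterable[RunningQuery],
    now: datetime,
    threshold: timedelta = timedelta(minutes=30),  # assumed threshold
) -> List[RunningQuery]:
    """Return queries that have been running longer than `threshold`."""
    return [q for q in queries if now - q.started > threshold]


def format_alert(query: RunningQuery, now: datetime) -> str:
    """Build the text a chat bot could post for one slow query."""
    minutes = int((now - query.started).total_seconds() // 60)
    return f"slow query {query.query_id} by {query.user}: running for {minutes} min"
```

Run every 5 minutes, the output of `find_slow_queries` would feed both the chat notification and whatever cancel mechanism is in place.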
Analytics Presto cluster
Batch analytics cluster

2017-06-15

memo Presto - Me, Yahoo, and sometimes Teradata #prestodb

Presto AWS

yahoo

multi big data company
- hadoop
- rdb
- nosql
- object storage
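Interactively analyze data spread across various storage systems → presto
Ad submission system: system summary
- Retired NFS
- The story of building Yahoo's distributed object storage in Go
- EOSL (end of service life) problem
- Moved the file format to columnar (ORC); file sizes shrank dramatically
Modernizing with presto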
- redash
- fluentd
- kafka
- orc
- presto
- hive
- object storage
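Report output time was cut drastically
Converting files to ORC is tedious, which is painful
- Painful if you are not used to Java
- ORC files can be generated with Presto's INSERT, but we want Presto's resources focused on SELECT
The coordinator cannot be made redundant
Two or more coordinators cannot be registered in one cluster
Zero-downtime operation is hard
Network design is very important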

Surrounding tools: OSS introduction

presto-audit
- Keeps query logs from being lost on restart
- Gets query history via the system connector
- To be published on GitHub soon
presto-admin
- A Fabric-like tool

presto

A good product as long as you use it the right way
You do have to think about how to operate it
Easy to introduce

2017-06-15

memo Amazon Athena, a Presto-based managed query service #prestodb

Presto AWS

Overview

AWS

Introduction to Athena + introduction to AWS services + Athena best practices

Athena

A service for running SQL against data in S3
re:Invent 2016
Not yet available in the Tokyo region

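Evolution of data analytics platforms

1985 Data warehouse
2006 Hadoop clusters
2009 Decoupled EMR clusters
2012 Cloud DWH Redshift
2016 Clusterless: just submit a query!
Serverless
Fast queries
Queries run directly against S3, so no loading is needed
Billed on the data scanned by queries (no charge if you don't use it)
Queries can be submitted from JDBC / API / CLI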

Expected Athena use cases

Use case
Data
User
Keep accumulating data in S3 and query it when the need arises

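Architecture

Presto-based, though not fully compatible with Presto
- There are several small differences
Equivalent to reading S3 data with the Hive connector on Presto on EMR
Athena has no float or time types
Covered in the AWS blog
- Partitioning
- Columnar formats
- Data compression
- You can still query plain, naive JSON dropped in as-is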

I/O constraints

Works with S3 in any region
Crossing regions incurs additional transfer charges and time
Supports encryption

Pricing

$5 per TB
- Cheap when you want to fire off lots of light queries
- Compressing raw CSV lowers the cost (billing is based on file size)
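As a rough worked example of the per-TB pricing, the sketch below estimates cost from bytes scanned. The function name, the 200 GB sample size, the 4x compression ratio, and the choice of 1 TB = 2**40 bytes are assumptions made here; any per-query minimums or rounding that the real service may apply are not modeled.

```python
def athena_scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate query cost from bytes scanned at a flat per-TB rate."""
    bytes_per_tb = 1024 ** 4  # assumption: 1 TB is counted as 2**40 bytes
    return bytes_scanned / bytes_per_tb * usd_per_tb


raw_csv = 200 * 1024 ** 3      # 200 GB of uncompressed CSV (made-up size)
compressed = raw_csv // 4      # assumption: compression shrinks it to 1/4

print(athena_scan_cost_usd(raw_csv))      # cost of scanning the raw CSV
print(athena_scan_cost_usd(compressed))   # cost of scanning the compressed copy
```

Because the charge follows the bytes read, the compressed copy costs a quarter as much to scan in this example.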

demo

Extracting CloudTrail data (failed)

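Updates

Support for new SerDes
OpenCSVSerDe support
- Lets you specify the quote character
Improved scan speed when there are many partitions
Improved MSCK REPAIR TABLE speed
Ohio region
Encryption of input and output data
JDBC driver
alter table
LZO support
Query-related operations are now available via the API
The AWS SDK and AWS CLI were also updated to support the API
Performance improvements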

API

Query execution
Named queries
- Save frequently used queries

Third-party product support

Tableau

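AWS data analytics services, including Athena

Why Athena was released
- The goal was not just to put data in storage, but to run the processing you need when you need it
- Separation of data store and data processing
- The right processing method for each use
- S3 at the center
ETL: EMR, Glue (not yet officially released)
Data warehouse: Redshift (for recurring workloads)
BI: QuickSight
Machine learning: EC2, Batch, EMR
Ad hoc queries: Athena (for casually extracting irregular, one-off data)
Run models in parallel by the thousands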

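What Athena is not suited for

Full scans: EMR
Retries: EMR
Multi-stage ETL: EMR, Glue
Subqueries, JOINs, long-running queries: Redshift
Athena is serverless, so some things are outside the user's control
- Number of nodes at peak times
- Presto parameter settings
- Pinning the Presto version
- Pricing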

Presto on EMR

Cluster build / operation / tuning
Cut costs with instance fleets and Spot blocks

Redshift + Spectrum

Spectrum: read data in S3 directly from Redshift
For mostly heavy workloads on hot data, Redshift is the better choice
Data fetched via Spectrum can be combined with Redshift data in a single query

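Case study: NASDAQ

They ran Presto on-premises, but it was painful because of sizing and similar issues
But the file size is too large to put everything in S3
Hot data goes in Redshift
Older data uses Presto on EMR x BI
Dataxu
Japan Taxi
- Uses Athena while combining it with digdag/embulk and the like to transform and shape data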

Summary

Athena: a Presto-based interactive query service for data on S3
Many feature additions and improvements
Athena is not a cure-all, so use the right tool for the job

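Q&A

The only protocol is s3 (s3a and the like cannot be used)
Athena is read-only (so consistency issues do not arise)
If the use case is serving heavy reports, use Spectrum
If you want to look at data ad hoc, use Athena
- We are investing in both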

2017-06-15

memo Presto at Treasure Data #prestodb

Presto

1500 users
150000 queries

Treasure Data as hosted Presto as a service

aws(us-east)
aws(tokyo)
IDCF
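Multi-tenancy clusters
S3 + PlazmaDB: an attempt to speed up file scans by adding indexes
storage format: MPC (MessagePack)
- A column can hold data of multiple types (a column's type can be converted from int to string without modifying the column data)
200GB JVM memory per node
- Memory needs cannot be calculated until you actually operate it, so provision generous memory up front
- When memory runs out and queries go into waiting for memory, other queries stop running
- Because major GC becomes more likely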
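Complaints that Presto is slower than usual
Only about 20% use TD's scheduling
About 85% are scheduled by third-party-style tools
The challenge is how to capture service level objectives (SLOs)
SLOs are documented as an FAQ
Statistics are available for queries scheduled in TD
Queries scheduled by third parties have to be checked
- Implicit SLOs are defined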
Event-driven as the approach
- Query logs from Presto are sent via fluentd to TD and analyzed
- Because it is schemaless, the table has never once been recreated
- About 200 million query logs
Query event logs
- table scan
- split
- query
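About 85% of queries are scheduled
- But it is unclear what queries are submitted and from where
- Convert specific syntax into simple strings to make queries easier to understand
- Lets you gather statistics on what kinds of queries are submitted and how often
- Query variability is measured with CoV (coefficient of variation = MAD / median)
- MAD: median absolute deviation
- SLO violation: a violation shows whether a query is taking longer than usual to run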
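The CoV = MAD / median measure can be computed as in the sketch below. The function names and the sample runtimes are made up for illustration; how TD actually groups queries and which thresholds it uses is not covered in these notes.

```python
from statistics import median
from typing import Sequence


def median_absolute_deviation(values: Sequence[float]) -> float:
    """MAD: the median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CoV as defined in the notes: MAD divided by the median."""
    center = median(values)
    if center == 0:
        raise ValueError("median is zero, CoV is undefined")
    return median_absolute_deviation(values) / center


# Made-up runtimes (seconds) of the same scheduled query on different days.
steady = [10, 12, 11, 13, 60]   # one outlier barely moves a median-based CoV
erratic = [5, 10, 20, 40, 80]
print(coefficient_of_variation(steady))
print(coefficient_of_variation(erratic))
```

Because both numerator and denominator are medians, a single slow run does not dominate the result the way it would with a standard-deviation-based measure.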
Typical bottlenecks
- Too much S3 access
- Single-node operations
  - order by
  - window
  - count distinct, even though it is a single-node operation
- etc.
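A custom resource manager for Presto has been implemented
- To keep queries from using too many resources
- create split resource manager
- Limits the resources a query uses on a node
presto ops robot
- Sends metrics to Datadog via JMX and, for nodes that raise alerts, kills the processes of heavy queries, performs a graceful shutdown, and so on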
S3 Access Performance
- S3 GETs have 30ms-50ms of latency (not low)
- S3 sometimes returns 500 errors
- S3 latency can become the bottleneck
- Custom I/O manager: sends as many requests as possible together
presto stella: plazma storage optimizer
- Data arrives late, creating time lags
- S3 data becomes fragmented
- S3 is also used for merging files
- The number of files matters a lot
new direction explored by presto
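- From schema design that requires a DBA (database administrator) to schema design by data providers
- prestobase proxy: a proxy for accessing Presto (implemented on top of Scala's Finagle)
  - For example, Presto has no user authentication feature
  - Makes heavy use of DI (airframe)
  - Session management is built in from the start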
optimizing query results transfer in prestobase
- application/x-msgpack
- Objects returned by Presto are packed into msgpack
- Faster queries can be expected
modules
- prestobase-proxy
- prestobase-agent
- prestobase-vcr
- prestobase-codec
- prestobase-hq
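- prestobase-conductor
- Some of these might be open-sourced
bridging gaps between sql and programming language
- Existing approach: ORM
- Toward SQL First
Write a query and Scala code (case classes, functions, etc.) is generated automatically
- xerial/sbt-sql
- Lets you write code and queries while detecting when the pipeline breaks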

Challenges

How to process enormous amounts of data
- Incremental processing
  - digdag as the solution
  - Presto query pipelines can be built just by writing YAML
MessageFrame
- tabular data format
- layer-0
  - Split into two or more parts
  - Holds column metadata
- layer-n

Summary

managing implicit SLOs
- Presto Fluentd TD Presto
- With digdag and similar tools

2017-06-15

memo Presto Updates to 0.178 #prestodb

Presto

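Changes up to 0.178

TD

Recap: a platform for extracting and visualizing data
Presto handles the extraction side
- Distributed SQL engine
- Scales easily
- SQL-like interface
- Lets data analysts who cannot code analyze large-scale data
- TD uses Presto as its data analytics platform
4 million queries per month
6 PB of data per month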

How TD uses Presto

presto-client-ruby (the Ruby client is used)
presto coordinator
PostgreSQL (PlazmaDB)
presto worker
S3 (actual data)

Production mainly runs 0.152, with an upgrade to 0.178 underway

New features up to 0.178

lambda expression
filtered aggregation
and more

lambda expression

Closure-like expressions can be written in Presto SQL

At FB, this was added in response to requests to process arrays held in multiple columns

SELECT filter(ARRAY [],  x -> true); -- []
SELECT reduce(ARRAY [],  0, (s, x) -> s + x, s -> s); -- 0

filtered aggregation

Reduces the need for subqueries

   select sum(a) filter (where a > 0)
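
To make the semantics concrete, here is a small Python model of what a filtered sum computes. This is an analogy, not Presto code: `filtered_sum` is a name invented here, `None` stands in for SQL NULL, and the sample values are made up.

```python
from typing import Callable, Iterable, Optional


def filtered_sum(
    values: Iterable[Optional[int]],
    predicate: Callable[[int], bool],
) -> Optional[int]:
    """Mimic `sum(a) FILTER (WHERE ...)`: NULLs and non-matching rows are skipped."""
    kept = [v for v in values if v is not None and predicate(v)]
    # A SQL sum over no rows is NULL, modeled here as None.
    return sum(kept) if kept else None


column_a = [3, -1, None, 4]
print(filtered_sum(column_a, lambda a: a > 0))   # only positive values are summed
print(filtered_sum(column_a, lambda a: a < 0))   # a second filter over the same rows
```

Several such filtered aggregates can sit side by side in one select list, which is why the separate subqueries per condition become unnecessary.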

validate mode in explain

Check syntax with EXPLAIN

explain (type validate) select ...

compressed exchange

Workers exchange data with each other in compressed form (LZ4 by default)

complex grouping operation

Achieves the equivalent of union all + group by

```sql
select host, path, code, avg(size)
from www_access
group by grouping sets(
   (host),
   (path),
   (host,code)
);
```

- Written like this query, the union is no longer needed.
    - What took three scans with the existing query now takes one
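
Why one scan is enough can be illustrated with a small Python model that updates every grouping set while reading each row once. This is only a sketch of the idea, not how Presto executes the query; `grouping_sets_avg` and the sample rows are made up here.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Row = Dict[str, object]
Key = Tuple[Tuple[str, ...], Tuple[object, ...]]


def grouping_sets_avg(
    rows: Sequence[Row],
    grouping_sets: Sequence[Tuple[str, ...]],
    value_column: str,
) -> Dict[Key, float]:
    """Average `value_column` for every grouping set in a single pass over rows.

    Each result key is (grouping columns, their values for that group).
    """
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:                      # one scan of the input
        for columns in grouping_sets:     # update every grouping set per row
            key = (columns, tuple(row[c] for c in columns))
            totals[key][0] += row[value_column]
            totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


rows = [
    {"host": "a", "path": "/x", "code": 200, "size": 10},
    {"host": "a", "path": "/y", "code": 404, "size": 30},
    {"host": "b", "path": "/x", "code": 200, "size": 20},
]
result = grouping_sets_avg(rows, [("host",), ("path",), ("host", "code")], "size")
print(result[(("host",), ("a",))])   # average size for host "a"
```

The union all version would run three separate group by queries, each reading the table again.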

### new functions and more

- arrays_overlap(x,y), array_except(x,y)
- codepoint()

### Misc

- INT can be used as an alias for INTEGER

### future works

- Presto meetup at FB HQ

- disk spill
- warning framework
     - Emits warnings for features such as those deprecated in a previous version
- cost based optimizer

### caution

- deprecated.legacy-order-by
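    - Some queries fail because of the move to SQL-standard behavior
- deprecated.legacy-map-subscript
    - For the Map type, when a key has no value, this flag gets you by
- 0.179
    - deprecated.legacy-order-by
        - Combining complex grouping operations with legacy_order_by makes queries error out

See the 0.179 release notes!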

2017-06-15

how-many

Emacs Lisp

how-many is an interactive compiled Lisp function in ‘replace.el’.

(how-many REGEXP &optional RSTART REND INTERACTIVE)

Print and return number of matches for REGEXP following point. When called from Lisp and INTERACTIVE is omitted or nil, just return the number, do not print it; if INTERACTIVE is t, the function behaves in all respects as if it had been called interactively.

If REGEXP contains upper case characters (excluding those preceded by ‘\’) and ‘search-upper-case’ is non-nil, the matching is case-sensitive.

Second and third arg RSTART and REND specify the region to operate on.

Interactively, in Transient Mark mode when the mark is active, operate on the contents of the region. Otherwise, operate from point to the end of (the accessible portion of) the buffer.

This function starts looking for the next match from the end of the previous match. Hence, it ignores matches that overlap a previously found match.

Returns the number of matches.
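
For comparison outside Emacs, the rule that each search resumes from the end of the previous match (so overlapping matches are not counted) can be shown in Python. This is an analogy only: Python's `re` syntax differs from Emacs regexps, and `count_matches` is a name made up here.

```python
import re


def count_matches(pattern: str, text: str, start: int = 0) -> int:
    """Count non-overlapping matches of `pattern` in `text` from `start` on."""
    return sum(1 for _ in re.compile(pattern).finditer(text, start))


# "aa" occurs at offsets 0, 1 and 2 of "aaaa", but scanning resumes after
# each match, so only the matches at offsets 0 and 2 are counted.
print(count_matches("aa", "aaaa"))
```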

2017-06-15

split-string

Emacs Lisp

split-string is a compiled Lisp function in ‘subr.el’.

(split-string STRING &optional SEPARATORS OMIT-NULLS TRIM)

Split STRING into substrings bounded by matches for SEPARATORS.

The beginning and end of STRING, and each match for SEPARATORS, are splitting points. The substrings matching SEPARATORS are removed, and the substrings between the splitting points are collected as a list, which is returned.

Splits a string into a list.

2017-06-15

buffer-substring

Emacs Lisp

buffer-substring is a built-in function in ‘C source code’.

(buffer-substring START END)

Return the contents of part of the current buffer as a string. The two arguments START and END are character positions; they can be in either order. The string returned is multibyte if the buffer is multibyte.

This function copies the text properties of that part of the buffer into the result string; if you don’t want the text properties, use ‘buffer-substring-no-properties’ instead.

Converts buffer contents to a string (the range is specified with start and end).

Argument names are not very consistent: sometimes beg, sometimes start.

2017-06-15

search Emacs function

Emacs Lisp

f1 f or describe-function

2017-06-15

delete-duplicate-lines

Emacs

delete-duplicate-lines is an interactive autoloaded compiled Lisp function in ‘sort.el’.

(delete-duplicate-lines BEG END &optional REVERSE ADJACENT KEEP-BLANKS INTERACTIVE)

Delete all but one copy of any identical lines in the region. Non-interactively, arguments BEG and END delimit the region. Normally it searches forwards, keeping the first instance of each identical line. If REVERSE is non-nil (interactively, with a C-u prefix), it searches backwards and keeps the last instance of each repeated line.

Identical lines need not be adjacent, unless the argument ADJACENT is non-nil (interactively, with a C-u C-u prefix). This is a more efficient mode of operation, and may be useful on large regions that have already been sorted.

If the argument KEEP-BLANKS is non-nil (interactively, with a C-u C-u C-u prefix), it retains repeated blank lines.

Returns the number of deleted lines. Interactively, or if INTERACTIVE is non-nil, it also prints a message describing the number of deletions.

Deletes duplicate lines (the region can be specified with beg and end).

2017-06-15

with-temp-buffer

Emacs

with-temp-buffer is a Lisp macro in ‘subr.el’.

(with-temp-buffer &rest BODY)

Create a temporary buffer, and evaluate BODY there like ‘progn’. See also ‘with-temp-file’ and ‘with-output-to-string’.

Creates a temporary buffer.

2017-06-15

point-max

Emacs

point-max is a built-in function in ‘C source code’.

(point-max)

Return the maximum permissible value of point in the current buffer. This is (1+ (buffer-size)), unless narrowing (a buffer restriction) is in effect, in which case it is less.

Gets the maximum position in the buffer.

2017-06-15

point-min

Emacs

point-min is a built-in function in ‘C source code’.

(point-min)

Return the minimum permissible value of point in the current buffer. This is 1, unless narrowing (a buffer restriction) is in effect.

Gets the minimum position in the buffer.