[BigQuery] Apply ML using SQL with BigQuery ML

Choose the right model type for your structured data use case

bigquery-ml

When doing ML on structured data, you need to find the appropriate model type
It depends on which activity you want to perform
Supervised ML falls into the 3 categories below
1. forecast
  - linear regression
2. classify
  - logistic regression (binary or multi-class)
3. recommend
  - matrix factorization
More complex models are also possible
- deep neural network
- decision tree
- random forests
After choosing an appropriate model, you need high-quality training data to train it
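
In BigQuery ML the choice of model type comes down to one option value. The sketch below is a minimal illustration, not a definitive recipe: the dataset, model, and table names are hypothetical, and the training table is assumed to have a numeric column named label.

```sql
-- Hypothetical names: mydataset.forecast_model, mydataset.training_data.
-- Assumes training_data has a numeric column called label.

-- 1. forecast: linear regression on a numeric label
CREATE OR REPLACE MODEL mydataset.forecast_model
OPTIONS (model_type = 'linear_reg') AS
SELECT *
FROM mydataset.training_data;

-- Other model_type values for the tasks listed above:
--   2. classify  : 'logistic_reg'
--   3. recommend : 'matrix_factorization'
--   deep neural network : 'dnn_classifier', 'dnn_regressor'
--   decision trees      : 'boosted_tree_classifier', 'boosted_tree_regressor'
--   random forests      : 'random_forest_classifier', 'random_forest_regressor'
```

Some of these model types need additional options (for example, matrix factorization needs the user, item, and rating columns identified), so check the CREATE MODEL reference before switching types.
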
Machine Learning
- Machine : an algorithm or tool such as linear regression
- Learning : insights, i.e. the model captures the relationship between known and unknown data (what we call modeling)

bigquery-ml-2

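A record or row is 1 instance or observation
The example above contains 8 instances
The label is the correct answer, known from historical data
- ex) how much these particular customers spent
- for future data, the label is unknown or missing
So you need a model, training, and prediction
For example, if we know which customers spent a lot of time on the website and made transactions, and those customers turned out to have high LTV (lifetime value) in terms of revenue, we can use this data to make predictions
- use linear regression for the prediction
The label can also be a binary value
- ex) high-value customer or not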
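
The two kinds of label described above map to two model types. The following is a sketch under stated assumptions: mydataset.customers and its columns time_on_site, transactions, and ltv are hypothetical, and the cutoff of 1000 for a "high-value" customer is an arbitrary illustrative threshold.

```sql
-- Hypothetical table: mydataset.customers(time_on_site, transactions, ltv).

-- Numeric label: how much a customer spends (linear regression).
CREATE OR REPLACE MODEL mydataset.ltv_model
OPTIONS (model_type = 'linear_reg') AS
SELECT
  time_on_site,
  transactions,
  ltv AS label              -- the historically known answer
FROM mydataset.customers
WHERE ltv IS NOT NULL;      -- train only on rows where the answer is known

-- Binary label: high-value customer or not (logistic regression).
CREATE OR REPLACE MODEL mydataset.high_value_model
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  time_on_site,
  transactions,
  IF(ltv >= 1000, 1, 0) AS label   -- 1000 is an assumed cutoff, not from the source
FROM mydataset.customers
WHERE ltv IS NOT NULL;
```
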
The columns of the table are called features, or potential features
- the inputs to the model
Understanding the quality of each column in the table, and working with other teams to find more features, is one of the hardest parts of an ML project
Feature Engineering : the process of combining or transforming features
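
In BigQuery ML, feature engineering can be written directly in the SELECT that feeds the model. A minimal sketch, assuming a hypothetical table mydataset.customers with columns total_spend, transactions, time_on_site, first_visit (a TIMESTAMP), and ltv:

```sql
-- Hypothetical table and columns; the expressions are only examples of
-- "combine" and "transform", not a recommended feature set.
SELECT
  -- combine: average order value from two raw columns
  -- (SAFE_DIVIDE returns NULL instead of failing when transactions = 0)
  SAFE_DIVIDE(total_spend, transactions) AS avg_order_value,
  -- transform: compress a long-tailed numeric column
  LN(1 + time_on_site) AS log_time_on_site,
  -- transform: day of week as a string so it is treated as categorical
  CAST(EXTRACT(DAYOFWEEK FROM first_visit) AS STRING) AS first_visit_day,
  ltv AS label
FROM mydataset.customers;
```

A query like this can be used as the training query of a CREATE MODEL statement, so the engineered columns become the model's features.
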

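To work on an ML model, data scientists usually extract data piece by piece from the data lake into an IPython Notebook and handle it with a data handling framework such as pandas, which takes a great deal of time
To build a custom model, all of the data has to be preprocessed/transformed and all of the feature engineering has to be done
Then the model is built and trained with a library such as TensorFlow
Whenever more features are needed or performance has to be improved, the process above must be repeated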

Standard SQL and UDFs within the ML queries
Linear Regression (Forecasting)
Binary and Multi-class Logistic Regression (Classification)
Model evaluation functions for standard metrics, including ROC and precision-recall curves
Model weight inspection
Feature distribution analysis through standard functions
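
As an illustration of the evaluation functions mentioned above, the sketch below reads ROC points from a trained binary logistic regression model and derives precision from the returned counts. It assumes mydataset.mymodel already exists and is a binary classifier.

```sql
-- Assumes mydataset.mymodel is a trained binary logistic regression model.
SELECT
  threshold,
  recall,                    -- true positive rate
  false_positive_rate,
  -- precision is not returned directly, so derive it from the counts
  SAFE_DIVIDE(true_positives, true_positives + false_positives) AS precision
FROM ML.ROC_CURVE(MODEL mydataset.mymodel)
ORDER BY threshold;
```

Plotting recall against false_positive_rate gives the ROC curve, and precision against recall gives the precision-recall curve.
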

bigquery-ml-process

Label : alias a column as 'label', or specify a column in OPTIONS using input_label_cols
Feature : passed through to the model as part of your SQL SELECT statement
- SELECT * FROM ML.FEATURE_INFO(MODEL mydataset.mymodel)
Model : an object created in BigQuery
Model Types : Linear Regression, Logistic Regression
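
The label and feature conventions above can be sketched as follows. The table mydataset.customers and its columns are hypothetical; here the label column keeps its own name and is identified through input_label_cols instead of being aliased to label.

```sql
-- Hypothetical table: mydataset.customers(time_on_site, transactions, is_high_value).
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['is_high_value']   -- which selected column is the label
) AS
SELECT
  time_on_site,      -- feature
  transactions,      -- feature
  is_high_value      -- label (named in input_label_cols above)
FROM mydataset.customers;

-- Inspect the features the model was trained on.
SELECT *
FROM ML.FEATURE_INFO(MODEL mydataset.mymodel);
```
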

CREATE OR REPLACE MODEL <dataset>.<name>
OPTIONS (model_type='<type>') AS
<training dataset>

Training Progress : SELECT * FROM ML.TRAINING_INFO(MODEL mydataset.mymodel)
Inspect Weights : SELECT * FROM ML.WEIGHTS(MODEL mydataset.mymodel)
Evaluation : SELECT * FROM ML.EVALUATE(MODEL mydataset.mymodel)
Prediction : SELECT * FROM ML.PREDICT(MODEL mydataset.mymodel, (<query>))
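
Put together, the four steps above might look like the sketch below. It assumes mydataset.mymodel was trained on features time_on_site and transactions with a label column named label; the tables mydataset.customers_eval and mydataset.new_customers are hypothetical.

```sql
-- Training progress: loss per iteration.
SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL mydataset.mymodel)
ORDER BY iteration;

-- Inspect weights: one row per model input.
SELECT processed_input, weight
FROM ML.WEIGHTS(MODEL mydataset.mymodel);

-- Evaluation on held-out rows (hypothetical table with the same columns as training).
SELECT *
FROM ML.EVALUATE(MODEL mydataset.mymodel, (
  SELECT time_on_site, transactions, label
  FROM mydataset.customers_eval));

-- Prediction for rows whose label is unknown (hypothetical table).
-- The output keeps the input columns and adds predicted_label.
SELECT *
FROM ML.PREDICT(MODEL mydataset.mymodel, (
  SELECT time_on_site, transactions
  FROM mydataset.new_customers));
```

When ML.EVALUATE is called without a query, it reports metrics computed on the evaluation data that was set aside during training.
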