How to Use Machine Learning Models to Make Predictions Directly from Snowflake
Often we are faced with a scenario (as I was recently) where the model a data scientist deployed runs on a schedule, whether that’s once an hour, once a day, or once a week… you get the point. However, there are times when out-of-schedule results are required to make decisions for a meeting or analysis.
With that being said, there are a few ways to get out-of-schedule predictions…
Getting Out-of-schedule Predictions
Consequently, even though the data scientist and co. could implement a batch prediction application for others to use in out-of-schedule cases, it would be more intuitive to bring non-technical users closer to the model itself and give them the power to run predictions from SQL.
Bridging the gap between running predictions and SQL on Snowflake
Inspired by Amazon Aurora Machine Learning, I spent a couple of days thinking about how to bridge this gap, and put together an architecture and build that allows non-technical users to perform batch prediction from the comfort of SQL. It all lives within Snowflake, using Stored Procedures, Snowpipe, Streams and Tasks, together with SageMaker’s batch prediction job (Batch Transform), to create a batch inference data pipeline.
Snowflake Machine Learning - Architectural Design
Architectural diagram of the build
Unloading onto S3 — Use of Stored Procedure
Flow diagram of Unloading onto S3
Creating the input table
In order to make a call to Batch Transform, the user needs to create an input table that contains the data for the model plus the mandatory fields: predictionid, a uuid for the job; record_seq, a unique identifier for each input row; and a prediction column of NULLs, which is the target of interest.
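As a rough sketch, the input table could look like the following; only the three mandatory fields come from the pipeline's contract, while the feature columns here are hypothetical stand-ins for the real model inputs:

```sql
-- A minimal sketch of the input table. Only predictionid, record_seq and
-- prediction are mandated by the pipeline; the feature columns are
-- hypothetical stand-ins for the real model inputs.
CREATE OR REPLACE TABLE hotel_cancellation (
    predictionid VARCHAR,  -- uuid shared by every row of this prediction job
    record_seq   NUMBER,   -- unique identifier for each input row
    lead_time    NUMBER,   -- example feature (hypothetical)
    adr          FLOAT,    -- example feature (hypothetical)
    prediction   VARCHAR   -- target of interest, NULL until results come back
);
```

When populating it, a single uuid (e.g. from UUID_STRING()) is shared across all rows of the job, while record_seq can come from a ROW_NUMBER() over the source data.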
Input Data: hotel_cancellation
Unloading onto S3
The call_ml_prediction Stored Procedure takes in a user-defined job name and an input table name. Calling it unloads the file (using the predictionid as the file name) onto the S3 bucket under the /input path and creates an entry in the prediction_status table. From there, Batch Transform is called to predict on the input data.
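To make the flow concrete, here is what the call could look like; the job name is user-defined, and the prediction_status column layout shown is an assumption based on the description:

```sql
-- Kick off a batch prediction job from SQL (job name is user-defined).
CALL call_ml_prediction('hotel_cancellation_job', 'hotel_cancellation');

-- The procedure records the job in prediction_status, e.g. (columns assumed):
-- predictionid  | job_name               | status
-- 5fe01c5e-...  | hotel_cancellation_job | Submitted
```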
To ensure there aren’t multiple requests being submitted, only one job is able to run at a time. For simplicity, I also ensured only a single file is unloaded onto S3, but Batch Transform can handle multiple input files.
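Under the hood, the unload itself could be a single COPY INTO onto an external stage; @ml_s3_stage is an assumed stage name, and SINGLE = TRUE enforces the one-file constraint just described:

```sql
-- Unload the input table as one CSV named after the job's predictionid
-- (@ml_s3_stage is an assumed external stage over the S3 bucket).
COPY INTO @ml_s3_stage/input/5fe01c5e.csv
FROM (SELECT * FROM hotel_cancellation)
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
SINGLE = TRUE
OVERWRITE = TRUE;
```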
Prediction status table
Prediction — Use of SageMaker Batch Transform
Flow diagram of Triggering Batch Transform
Triggering SageMaker Batch Transform
Once the data is unloaded onto the S3 bucket’s /input path, a Lambda is fired which makes a call to SageMaker Batch Transform to read in the input data and output inferences to the /sagemaker path.
If you’re familiar with Batch Transform, you can set the input_filter, join and output_filter to your liking for the output prediction file.
Batch Transform Output
Once Batch Transform completes, it outputs the result as a .csv.out file in the /sagemaker path. Another Lambda is fired which copies and renames the file as a .csv to the /snowflake path, where SQS is set up for Snowpipe auto-ingest.
The Result — Use of Snowpipe, Stream and Task
Flow diagram of piping the data into Snowflake
Ingestion through Snowpipe
Once the data is dropped onto the /snowflake path, it is inserted into the prediction_result table via Snowpipe. For simplicity, since SageMaker Batch Transform maintains the order of the predictions, the row number was used as the identifier to join back to the input table. You could also do this postprocessing step within Batch Transform itself.
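A minimal Snowpipe definition for this step could look like the following sketch (the stage name and CSV format are assumptions); the pipe’s notification channel is the SQS queue that the S3 event notification on /snowflake points at:

```sql
-- Auto-ingest pipe: load any new result file under /snowflake into
-- prediction_result (stage name and file format are assumptions).
CREATE OR REPLACE PIPE prediction_result_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO prediction_result
FROM @ml_s3_stage/snowflake/
FILE_FORMAT = (TYPE = CSV);

-- SHOW PIPES exposes the notification_channel (an SQS ARN) to configure
-- on the S3 bucket's event notification.
SHOW PIPES LIKE 'prediction_result_pipe';
```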
Streaming the data and triggering Tasks
A stream is created on the prediction_result table which populates prediction_result_stream after Snowpipe delivers the data. This stream, specifically system$stream_has_data('prediction_result_stream'), is used by the scheduled task populate_prediction_result to call the stored procedure populate_prediction_result, which populates the prediction data on the hotel_cancellation table, but only when the stream has data. The unique identifier, predictionid, is also set as a task session variable.
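A sketch of the stream and the gating task, with the warehouse name and schedule as assumptions:

```sql
-- Stream that records new rows landing in prediction_result via Snowpipe.
CREATE OR REPLACE STREAM prediction_result_stream
  ON TABLE prediction_result;

-- Scheduled task that only fires when the stream has data, calling the
-- stored procedure of the same name (warehouse and schedule are assumed).
CREATE OR REPLACE TASK populate_prediction_result
  WAREHOUSE = ml_wh
  SCHEDULE  = '1 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('PREDICTION_RESULT_STREAM')
AS
  CALL populate_prediction_result();

ALTER TASK populate_prediction_result RESUME;  -- tasks start out suspended
```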
The Result from the Batch Transform
Completing the job
At the end of the job, after populate_prediction_result completes, the next task update_prediction_status uses the task session variable to update the prediction status from Submitted to Completed. This concludes the entire “Using SQL to run Batch Prediction” pipeline.
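The status update can then hang off the first task as a child task; referencing the session variable with $predictionid is one way to mirror the description above, not necessarily the exact mechanism used in the original build:

```sql
-- Child task: runs after populate_prediction_result succeeds and closes
-- out the job using the predictionid session variable set upstream.
CREATE OR REPLACE TASK update_prediction_status
  WAREHOUSE = ml_wh
  AFTER populate_prediction_result
AS
  UPDATE prediction_status
     SET status = 'Completed'
   WHERE predictionid = $predictionid;

ALTER TASK update_prediction_status RESUME;
```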
Updated prediction status
Doing it better
Snowflake provides a lot of power through Snowpipe, Streams, Stored Procedures and Tasks to create data pipelines that can be used for different applications. When combined with SageMaker, users are able to send inputs directly from Snowflake and interact with the prediction results.
Nonetheless, there are a few wishlist items that would improve the whole experience.
I hope you find this article useful and enjoyed the read.
About Me
I love writing Medium articles and sharing my ideas and learnings with everyone. My day-to-day job involves helping businesses build scalable cloud and data solutions, and trying new food recipes. Feel free to connect with me for a casual chat; just let me know you’re from Medium.
— Jeno Yamma
Translated from: https://towardsdatascience.com/using-machine-learning-models-to-make-prediction-directly-from-snowflake-2471b2f71b68