Automating Your ML Models Like a Pro Using Airflow, SAS Viya, and Docker
Ok, here’s a scenario: You’re the lone data scientist/ML Engineer in your consumer-focused unicorn startup, and you have to build a bunch of models for a variety of different business use cases. You don’t have time to sit around and sulk about the nitty-gritty details of any one model. So you’ve got choices to make. Decisions. Decisions that make you move fast, learn faster, and yet build for resilience all while gaining a unique appreciation for walking the talk. IF you do this right (even partly), you end up becoming walking gold for your company. A unicorn at a unicorn 😃. Why? Because, you put the customer feedback you observed through their data-trail back to work for your company, instead of letting it rot in the dark rooms of untapped logs and data dungeons (a.k.a. databases). These micro-decisions you enable matter. They eventually add up to push your company beyond the inflection point that is needed for exponential growth.
So, that is where we start from. And build. We’ll assume we can choose tech that simplifies everything for us, yet lets us automate all we want. When in doubt, we’ll simplify — remove, until the effort is rationalized against the outcome, so we avoid over-engineering things. That is exactly what I’ve done for us here — so we don’t get stuck in analysis/choice paralysis.
Note: everything we use here is assumed to be running on Docker unless mentioned otherwise. With that in mind, we’ll use …
Apache Airflow for orchestrating our workflow: Airflow has quickly become the de facto standard for authoring, scheduling, monitoring, and managing workflows — especially data pipelines. We know that at least 350 companies in the broader tech industry use Airflow today, along with a variety of executors and operators, including Kubernetes and Docker.
The usual suspects in the Python ecosystem: for glue code, data engineering, etc. The one notable addition is vaex, for quickly processing large Parquet files and doing some data prep work (there’s a small sketch of this a little further down, alongside the CAS session setup).
Viya in a container & Viya as an Enterprise Analytics Platform (EAP): SAS Viya is an exciting technology platform that can be used to quickly build business-focused capabilities on top of the foundational analytical and AI models that SAS produces. We’ll use two flavors of SAS Viya — one as a container for building and running our models, and another running on virtual machine(s) that acts as the enterprise analytics platform the rest of our organization uses to perform analytics, consume reports, track and monitor models, etc. For our specific use case, we’ll use the SAS platform’s autoML capabilities via the DataSciencePilot action set so that we can go full auto-mode on our problem.
SAS Model Manager to inventory, track, & deploy models: This is the model management component on the Viya Enterprise Analytics Platform that we’ll use to eventually push the model out into the wild for scoring.
Now that we’ve lined up all the basic building blocks, let’s address the business problem: we’re required to build a churn detection service so that our fictitious unicorn can detect potential churners and follow up with some remedial course of action to keep them engaged, instead of trying to reactivate them after the window of opportunity lapses. Because we plan to use Viya’s DataSciencePilot action set for training our model, we can simply prep the data and pass it off to the dsautoml action, which, as it turns out, is just a regular method call using the python-swat package. If you have access to Viya, you should try this out if you haven’t already.
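To make that concrete, here is a minimal sketch of the data prep and session setup, using vaex for the prep step and python-swat for the CAS session. The file name, column names, host, port, and credentials below are all placeholders for illustration, not values from the original notebook:

import swat
import vaex

# Lazily open a large Parquet file; vaex memory-maps it rather than loading it all into RAM.
df = vaex.open("customer_events.parquet")

# Light prep: filter out unusable rows and derive a simple spend feature (lazy expressions).
df = df[df["tenure_months"] > 0]
df["avg_monthly_spend"] = df["total_spend"] / df["tenure_months"]

# Materialize only the columns we need for modeling.
model_df = df[["customer_id", "tenure_months", "avg_monthly_spend", "CHURN"]].to_pandas_df()

# Connect to the CAS server running inside the Viya container (connection details are placeholders).
sess = swat.CAS("sas-viya-programming", 5570, "sasdemo", "sasdemo")

# Load the action set that exposes the dsautoml action.
sess.loadactionset("dataSciencePilot")

# Upload the prepped table into CAS memory; 'out' is what gets handed to dsautoml below.
out = sess.upload_frame(model_df, casout=dict(name="CHURN_PREP", replace=True))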
Also, if you didn’t pick up on it yet, we’re trying to automate everything including the model (re-)training process for models developed with autoML. We want to do this at a particular cadence as we fully expect to create fresh/new models whenever possible to keep up with the changing data. So automating autoML. Like Inception…😎
Anyway, remember: you’re the lone warrior in the effort to spawn artificial intelligence and release it into the back-office services clan that attacks emerging customer data and provides relevant micro-decisions. There’s not much time to waste, so let’s start.
We’ll use a little Makefile to start our containers (see below) — it just runs a small script that starts the containers by setting the right params and passing the right flags when ‘docker run’ is called. Nothing extraordinary, but it gets the job done.
Start the containers for model development

Now, just like that, we’ve got our containers live and kicking. Once we have our notebook environment, we can call autoML via the dsautoml action after loading our data. The syntactic specifics of this action are available here. Very quickly, a sample method call looks like this:
# sess is the session context for the CAS session.
sess.datasciencepilot.dsautoml(
    table = out,
    target = "CHURN",
    inputs = effect_vars,
    transformationPolicy = {"missing": True, "cardinality": True,
                            "entropy": True, "iqv": True,
                            "skewness": True, "kurtosis": True, "outlier": True},
    modelTypes = ["decisionTree", "GRADBOOST"],
    objective = "AUC",
    sampleSize = 20,
    topKPipelines = 10,
    kFolds = 2,
    transformationout = dict(name="TRANSFORMATION_OUT", replace=True),
    featureout = dict(name="FEATURE_OUT", replace=True),
    pipelineout = dict(name="PIPELINE_OUT", replace=True),
    savestate = dict(modelNamePrefix='churn_model', replace=True)
)
I’ve placed the entire notebook in this repo for you to take a look, so worry not! This particular post isn’t about the specifics of dsautoml. If you are looking for a good intro to automl, you can head over here. You’ll be convinced.
As you will see, SAS DataSciencePilot (autoML) provides fully automated capabilities covering multiple actions, including automatic feature generation via the feature machine, which resolves the transformations needed and then uses those features to construct multiple pipelines in a full-on leaderboard challenge. Additionally, the dsautoml method call produces two binary files: one capturing the feature transformations that were performed, and another for the top model. This means we get the score code for the champion model and the feature transformations, so we can deploy them easily into production. This is VERY important. In a commercial use case such as this one, model deployment is more important than development.
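As a hedged illustration of what those binaries buy us: the analytic store (astore) files written by savestate can be used to score fresh data directly in CAS. The table and astore names below are guesses based on the prefix used above; list your session tables to confirm what dsautoml actually produced in your run.

# Continuing with the same session: score new data with the saved astore binary.
sess.loadactionset("aStore")

# Names are illustrative; check sess.tableinfo() for the exact astore table names
# produced under the 'churn_model' prefix.
sess.astore.score(
    table  = dict(name="NEW_CUSTOMERS"),
    rstore = dict(name="churn_model_1_SCORE"),
    out    = dict(name="CHURN_SCORES", replace=True)
)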
If your models don’t get deployed, even the best of them perish doing nothing. And when that is the deal, even a 1-year old will pick something over nothing.
What your response SHOULDN’T be to “how many ML models do you actually deploy?”

This mandates us to always choose tools and techniques that meet the ask, and potentially increase the range of deployable options while avoiding re-work. In other words, your tool and model should be able to meet the acceptable scoring SLA of the workload for the business case. And you should know this before you start writing a single line of code. If this doesn’t happen, then any code we write is wasteful and serves no purpose other than satisfying personal fancies.
So, now that we have a way to automatically train these models on our data, let’s get this autoML process deployed for automatic retraining. This is where Airflow will help us immensely. Why? When we hand off “retraining” to production, a bunch of new requirements pop up, such as:
- Error handling — How many times to retry? What happens if there is a failure?
- Quick and easy access to consolidated logs
- Task Status Tracking
- Ability to re-process historic data due to upstream changes
- Execution Dependencies on other processes: For example, process Y needs to run after process X, but what if X does not finish on time?
- Tracing Changes in the Automation Process Definition Files
Airflow handles all of the above elegantly. And not just that! We can quickly set up Airflow on containers and run it with docker-compose using this repo. Obviously, you can edit the Dockerfile or the compose file as you see fit. Once again, I’ve edited these files to suit my needs and dropped them in this repo so you can follow along if you need to. At this point, when you run docker-compose you should see Postgres and the Airflow webserver running.
Next, let’s look at the Directed Acyclic Graph (DAG) we’ll use to automatically rebuild this churn detection model weekly. Don’t worry, this DAG is also provided in the same repo.
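For orientation, here is a stripped-down sketch of what such a weekly retraining DAG could look like. The task names, callables, and schedule are illustrative stubs rather than the repo’s actual code, and import paths differ slightly between Airflow 1.10 and 2.x:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in Airflow 2.x


# Stub callables standing in for the real steps: data prep, dsautoml training,
# registering the champion in SAS Model Manager, and notification.
def prep_data(**kwargs):
    pass

def run_dsautoml(**kwargs):
    pass

def register_champion(**kwargs):
    pass

def send_notification(**kwargs):
    pass


default_args = {
    "owner": "ml",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

dag = DAG(
    dag_id="churn_automl_retrain",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@weekly",   # rebuild the churn model once a week
    catchup=False,
)

prep = PythonOperator(task_id="prep_data", python_callable=prep_data, dag=dag)
train = PythonOperator(task_id="run_dsautoml", python_callable=run_dsautoml, dag=dag)
register = PythonOperator(task_id="register_champion", python_callable=register_champion, dag=dag)
notify = PythonOperator(task_id="send_notification", python_callable=send_notification, dag=dag)

prep >> train >> register >> notify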
ML DAG set up to run weekly

Now, we’ll click into the graph view and walk through what the DAG is trying to accomplish, step by step.
Airflow DAG for automating autoML and registering models to SAS Model Manager

And that’s it! Our process is ready to be put to the test!
A sample post-run Gantt chart

When the process finishes successfully, all the tasks should report success and the Gantt chart view in Airflow should resolve to something that looks like the one above (execution times will obviously be different). And just like that, we’ve gotten incredibly close to the finish line.
We’ve just automated our entire training process, including saving our models for deployment and sending emails out whenever the DAG is run. If you look back, our original goal was to deploy these models as consumable services. We could’ve easily automated that part as well, but our choice of technology (SAS Model Manager in this case) allows us to add additional touch points, if you so desire. It normally makes sense to have a human-in-the-middle “push button” step before engaging in model publish activities, because it builds in a buffer if upstream processes go wonky for reasons like crappy data, sudden changes in baseline distributions, etc. More importantly, pushing models to production should actively bring conscious human mindfulness to the activity. Surely, we wouldn’t want an out-of-sight process impacting the business wildly. Doing this ‘human-in-the-middle’ step also significantly reduces the need to engage in post-hoc explanations when backtesting comes to the fore.
Ok, let’s see how all of this works real quick:
Deploying our autoML model

Notice that SAS Model Manager is able to take the model artifacts and publish them as a module in a micro analytic service, where models can be consumed via scoring endpoints. And just like that, you are able to flip the switch on your models and make them respond to requests for inference.
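As a rough sketch of what consuming one of those scoring endpoints could look like from Python (the hostname, module name, step name, token, and input fields are all placeholders, so check your own deployment for the real values):

import requests

host = "https://viya.example.com"            # placeholder Viya host
token = "<OAuth access token>"               # obtained separately from the SAS Logon service

# Hypothetical call to a published module's scoring step on SAS Micro Analytic Service.
resp = requests.post(
    f"{host}/microanalyticScore/modules/churn_model/steps/score",
    headers={"Authorization": f"Bearer {token}",
             "Content-Type": "application/json"},
    json={"inputs": [{"name": "tenure_months", "value": 3},
                     {"name": "avg_monthly_spend", "value": 42.5}]},
)

print(resp.json())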
There’s obviously no CI/CD component here just yet. That’s intentional. I didn’t want to overcomplicate this post since all we have is just a model here. I’ll come back and write a follow up on that topic on another day, at a later time, with another app. But for now, let’s rejoice in how much we’ve managed to get done automagically with Airflow & SAS Viya in containers.
Through thoughtful intelligent automation of mundane routines, using properly selected technology components, you can now make yourself available to focus on more exciting, cooler, higher order projects, while still making an ongoing impact in your unicorn organization through your models. Your best life is now. So why wait, when you can automate? 🤖
Connect with Sathish on LinkedIn
Original post: https://medium.com/swlh/automating-your-ml-models-like-a-pro-using-airflow-sas-viya-docker-6abe324d9072