How to Connect Jupyter Notebook to Remote Spark Clusters and Run Spark Jobs Every Day?
As a data scientist, you are developing notebooks that use Spark to process data too large to fit on your laptop. What would you do? This is not a trivial problem.
Let’s start with the most naive solution, one that does not require installing anything on your laptop.
The problem with “No notebook” is that the developer experience in the Spark shell is unacceptable:
The second option is “Local notebook”: you downsample the data and pull it to your laptop (downsampling: if you have 100 GB of data on your clusters, you reduce it to 1 GB without losing too much important information). Then you can process the data in a local Jupyter notebook.
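As a rough sketch of such a downsampling job run once on the cluster (the paths and the 1% fraction below are made up for illustration; adjust them to your data):

import org.apache.spark.sql.SparkSession

// Runs on the cluster: read the full dataset, keep a ~1% sample, and write it somewhere
// small enough to copy to a laptop. Paths and fraction are hypothetical.
val spark = SparkSession.builder().appName("Downsample").getOrCreate()
val events = spark.read.parquet("hdfs:///data/events")                // ~100 GB on the cluster
val sample = events.sample(withReplacement = false, fraction = 0.01)  // roughly 1 GB
sample.write.mode("overwrite").parquet("hdfs:///data/events_sample")  // copy this output locally
spark.stop()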
This creates a few new painful problems:
Ok, “No notebook” and “Local notebook” are obviously not the best approaches. What if your data team has access to the cloud, e.g. AWS? Yes, AWS provides Jupyter notebooks on its EMR clusters and in SageMaker. The notebook server is accessed through the AWS web console and is ready to use as soon as the clusters are ready.
This approach is called “Remote notebook on a cloud”.
AWS EMR with Jupyter Notebook (image by AWS)
The problems of “Remote notebook on the cloud” are:
This approach, ironically, is the most popular one among data scientists who have access to AWS. This can be explained by the principle of least effort: it provides one-click access to remote clusters, so data scientists can focus on their machine learning models, visualization, and business impact without spending too much time on the clusters themselves.
Besides “No notebook”, “Local notebook”, and “Remote notebook on Cloud”, there are options that point Spark on a laptop to remote Spark clusters. The code is written in a local notebook and sent to a remote Spark cluster. This approach is called “Bridge local & remote spark”.
You can set the remote master when you create the SparkSession:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSample")
  .master("spark://123.456.789:7077")  // address of the remote standalone master
  .getOrCreate()
The problems are
It only works when Spark is deployed as Standalone, not on YARN. If your Spark cluster is deployed on YARN, then you have to copy the configuration files (/etc/hadoop/conf) from the remote cluster to your laptop and restart your local Spark, assuming you have already figured out how to install Spark on your laptop (see the sketch after this list).
If you have multiple Spark clusters, then you have to switch back and forth by copying configuration files. If the clusters are ephemeral on the cloud, this easily becomes a nightmare.
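If you do go down the configuration-copying path, a minimal sketch of pointing a local Spark at the remote YARN cluster looks like the following. It assumes HADOOP_CONF_DIR (and YARN_CONF_DIR) on the laptop already point at the copied /etc/hadoop/conf before Spark starts; this is a sketch of the mechanism, not a recommendation:

import org.apache.spark.sql.SparkSession

// Sketch only: with HADOOP_CONF_DIR pointing at a local copy of the remote cluster's
// /etc/hadoop/conf, "yarn" resolves to the remote ResourceManager.
val spark = SparkSession.builder()
  .appName("SparkOnRemoteYarn")
  .master("yarn")
  .getOrCreate()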
“Bridge local & remote spark” does not work for most data scientists. Luckily, we can switch our attention back to the Jupyter notebook. There is a Jupyter notebook kernel called “Sparkmagic” which can send your code to a remote cluster, with the assumption that Livy is installed on the remote Spark clusters. This assumption is met by all cloud providers, and it is not hard to install on in-house Spark clusters with the help of Apache Ambari.
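Under the hood, Sparkmagic talks to the cluster through Livy’s REST API: it creates a session and posts statements to it. As a rough, hand-rolled illustration of that interaction (the livy-host address is a placeholder, 8998 is Livy’s default port, and Sparkmagic normally does all of this for you behind notebook cells):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Illustration of the Livy REST calls that Sparkmagic makes on your behalf.
val client = HttpClient.newHttpClient()
def post(path: String, json: String): String = {
  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://livy-host:8998" + path))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(json))
    .build()
  client.send(request, HttpResponse.BodyHandlers.ofString()).body()
}

// Create an interactive Scala session on the remote cluster...
println(post("/sessions", """{"kind": "spark"}"""))
// ...then submit a statement to it (here, session 0). In practice you would parse the session id
// from the first response and poll GET /sessions/{id}/statements/{n} until the output is ready.
println(post("/sessions/0/statements", """{"code": "spark.range(100).count()"}"""))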
Sparkmagic architecture
It seems “Sparkmagic” is the best solution at this point, so why is it not the most popular one? There are two reasons:
To solve problem 2, Sparkmagic introduces Docker containers that are ready to use. Docker containers have indeed solved some of the installation issues, but they also introduce new problems for data scientists:
The discussion of Docker containers will stop here; another article explaining how to make Docker containers actually work for data scientists will be published in a few days.
To summarize, we have two categories of solutions:
Despite installation and connection issues, “Sparkmagic” is the recommended solution. However, there are often other unsolved issues that reduce productivity and hurt developer experience:
Let’s go over the current solutions:
Set up a remote Jupyter server and SSH tunneling (reference). This definitely works, but it takes time to set up, and the notebooks live on the remote server.
Let’s reframe the problems:
The solution implemented by Bayesnote (a new open-source notebook project, https://github.com/Bayesnote/Bayesnote) follows this principle:
These ideas are implemented by the “auto self-deployment” feature of Bayesnote. In the development phase, the only required input from data scientists is authentication information, such as an IP and password. Bayesnote then deploys itself to the remote servers and starts listening for socket messages. Code is sent to the remote server and the results come back to the user.
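The toy sketch below only illustrates the general pattern of an agent listening on a socket for code and replying with a result; it is not Bayesnote’s actual implementation, and the port is arbitrary:

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

// Toy illustration of the send-code / return-result pattern, not Bayesnote's code:
// accept one connection, read one line of "code", and reply with an acknowledgement.
// A real agent would execute the code against Spark and stream results back.
val server = new ServerSocket(9999)
val socket = server.accept()
val code = Source.fromInputStream(socket.getInputStream).getLines().next()
val out = new PrintWriter(socket.getOutputStream, true)
out.println(s"received ${code.length} characters of code")
socket.close()
server.close()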
Bayesnote: auto self-deployment
In the operation phase, a YAML file is specified and Bayesnote runs the notebooks on remote servers, gets the finished notebooks back, and sends status updates to email or Slack.
Workflow YAML by Bayesnote
(Users will configure this by filling out forms rather than YAML files, and the dependencies between notebooks will be visualized nicely.)
The (partial) implementation can be found on GitHub: https://github.com/Bayesnote/Bayesnote
Free data scientists from tooling issues so they can be happy and productive in their jobs.
Translated from: https://towardsdatascience.com/how-to-connect-jupyter-notebook-to-remote-spark-clusters-and-run-spark-jobs-every-day-2c5a0c1b61df