How Apache NiFi works: surf on your dataflow, don’t drown in it
by François Paupier
Introduction
Picture a raging torrent of water. Your application deals with a stream of data that is just as wild. Routing data from one storage system to another, applying validation rules, and addressing questions of data governance and reliability in a Big Data ecosystem are hard to get right if you do it all by yourself.
Good news: you don’t have to build your dataflow solution from scratch. Apache NiFi has got your back!
At the end of this article, you’ll be a NiFi expert, ready to build your data pipeline.
What I will cover in this article:
- What Apache NiFi is, in which situations you should use it, and the key concepts to understand in NiFi.
What I won’t cover:
- Installation, deployment, monitoring, security, and administration of a NiFi cluster.
For your convenience, here is the table of contents; feel free to go straight to where your curiosity takes you. If you’re a NiFi first-timer, I advise going through this article in the indicated order.
Table of Contents
I. What is Apache NiFi?
- Defining NiFi
- Why use NiFi?
II. Apache NiFi under the microscope
- FlowFile
- Processor
- Process Group
- Connection
- Flow Controller
Conclusion and call to action
What is Apache NiFi?
On the website of the Apache NiFi project, you can find the following definition:
An easy to use, powerful, and reliable system to process and distribute data.
Let’s analyze the keywords there.
Defining NiFi
Process and distribute data
That’s the gist of NiFi. It moves data between systems and gives you tools to process this data.
NiFi can deal with a great variety of data sources and formats. You take data in from one source, transform it, and push it to a different data sink.
Easy to use
Processors (the boxes) linked by connections (the arrows) create a flow. NiFi offers a flow-based programming experience.
NiFi makes it possible to understand, at a glance, a set of dataflow operations that would take hundreds of lines of source code to implement.
Consider the pipeline below:
To translate the data flow above into NiFi, you go to the NiFi graphical user interface, drag and drop three components onto the canvas, and that’s it. It takes two minutes to build.
Now, if you write code to do the same thing, it’s likely to take several hundred lines to achieve a similar result.
Code doesn’t capture the essence of the pipeline the way a flow-based approach does. NiFi is more expressive for building a data pipeline; it’s designed to do that.
Powerful
NiFi provides many processors out of the box (293 in NiFi 1.9.2). You’re standing on the shoulders of a giant. Those standard processors handle the vast majority of use cases you may encounter.
NiFi is highly concurrent, yet its internals encapsulate the associated complexity. Processors offer you a high-level abstraction that hides the inherent complexity of parallel programming. Processors run simultaneously, and you can spawn multiple threads of a processor to cope with the load.
Concurrency is a computing Pandora’s box that you don’t want to open. NiFi conveniently shields the pipeline builder from the complexities of concurrency.
Reliable
The theory backing NiFi is not new; it has solid theoretical anchors. It’s similar to models like SEDA.
For a dataflow system, one of the main topics to address is reliability. You want to be sure that data sent somewhere is actually received.
NiFi achieves a high level of reliability through multiple mechanisms that keep track of the state of the system at any point in time. Those mechanisms are configurable, so you can make the appropriate tradeoffs between the latency and throughput required by your applications.
NiFi tracks the history of each piece of data with its lineage and provenance features. It makes it possible to know what transformations happen to each piece of information.
The data lineage solution proposed by Apache NiFi proves to be an excellent tool for auditing a data pipeline. Data lineage features are essential to bolster confidence in big data and AI systems in a context where transnational actors such as the European Union propose guidelines to support accurate data processing.
Why use NiFi?
First, I want to make it clear I’m not here to evangelize NiFi. My goal is to give you enough elements so you can make an informed decision on the best way to build your data pipeline.
It’s useful to keep in mind the four Vs of big data when sizing your solution.
- Volume: At what scale do you operate? In order of magnitude, are you closer to a few gigabytes or hundreds of petabytes?
- Variety: How many data sources do you have? Is your data structured? If yes, does the schema vary often?
- Velocity: What is the frequency of the events you process? Is it credit card payments? Is it a daily performance report sent by an IoT device?
- Veracity: Can you trust the data? Or do you need to apply multiple cleaning operations before manipulating it?
NiFi seamlessly ingests data from multiple data sources and provides mechanisms to handle different schemas in the data. Thus, it shines when there is a high variety in the data.
NiFi is particularly valuable if data is of low veracity, since it provides multiple processors to clean and format the data.
With its configuration options, NiFi can address a broad range of volume/velocity situations.
An increasing list of applications for data routing solutions
New regulations, the rise of the Internet of Things, and the flow of data it generates emphasize the relevance of tools such as Apache NiFi.
Microservices are trendy. In those loosely coupled services, the data is the contract between the services. NiFi is a robust way to route data between those services.
The Internet of Things brings a multitude of data to the cloud. Ingesting and validating data from the edge to the cloud poses a lot of new challenges that NiFi can efficiently address (primarily through MiNiFi, the NiFi project for edge devices).
New guidelines and regulations are put in place to readjust the Big Data economy. In this context of increasing monitoring, it is vital for businesses to have a clear overview of their data pipeline. NiFi data lineage, for example, can be helpful on the path towards compliance with regulations.
Bridge the gap between big data experts and the others
As you can see from the user interface, a dataflow expressed in NiFi is an excellent way to communicate about your data pipeline. It can help members of your organization become more knowledgeable about what’s going on in the data pipeline.
Is an analyst asking why this data arrives here that way? Sit together and walk through the flow. In five minutes, you give someone a strong understanding of the Extract, Transform, and Load (ETL) pipeline.
Do you want feedback from your peers on a new error handling flow you created? NiFi makes it a design decision to treat error paths as being just as likely as valid outcomes. Expect the flow review to be shorter than a traditional code review.
Should you use it? Yes, no, maybe?
NiFi brands itself as easy to use. Still, it is an enterprise dataflow platform. It offers a complete set of features, of which you may only need a small subset. Adding a new tool to the stack is not a trivial decision.
If you are starting from scratch and manage a small amount of data from trusted data sources, you may be better off setting up your own Extract, Transform, and Load (ETL) pipeline. Maybe change data capture from a database and some data preparation scripts are all you need.
On the other hand, if you work in an environment with existing big data solutions in use (be it for storage, processing, or messaging), NiFi integrates well with them and is more likely to be a quick win. You can leverage the out-of-the-box connectors to those other Big Data solutions.
It’s easy to be hyped by new solutions. List your requirements and choose the solution that answers your needs as simply as possible.
Now that we have seen the big picture of Apache NiFi, let’s take a look at its key concepts and dissect its internals.
Apache NiFi under the microscope
“NiFi is boxes-and-arrows programming” may be OK to communicate the big picture. However, if you have to operate NiFi, you may want to understand a bit more about how it works.
In this second part, I explain the critical concepts of Apache NiFi with diagrams. This black box model won’t be a black box to you afterward.
Unboxing Apache NiFi
When you start NiFi, you land on its web interface. The web UI is the blueprint on which you design and control your data pipeline.
In NiFi, you assemble processors linked together by connections. In the sample dataflow introduced previously, there are three processors.
The NiFi canvas user interface is the framework in which the pipeline builder works.
Making sense of NiFi terminology
To express your dataflow in NiFi, you must first master its language. No worries, a few terms are enough to grasp the concept behind it.
The black boxes are called processors, and they exchange chunks of information named FlowFiles through queues that are named connections. Finally, the Flow Controller is responsible for managing the resources between those components.
Let’s take a look at how this works under the hood.
FlowFile
In NiFi, the FlowFile is the information packet moving through the processors of the pipeline.
A FlowFile comes in two parts:
- Attributes, which are key/value pairs. For example, the file name, file path, and a unique identifier are standard attributes.
- Content, a reference to the stream of bytes that composes the FlowFile content.
The FlowFile does not contain the data itself. That would severely limit the throughput of the pipeline.
Instead, a FlowFile holds a pointer that references data stored at some place in the local storage. This place is called the Content Repository.
To access the content, the FlowFile claims the resource from the Content Repository. The latter keeps track of the exact disk offset where the content is located and streams it back to the FlowFile.
Not all processors need to access the content of the FlowFile to perform their operations. For example, aggregating the content of two FlowFiles doesn’t require loading their content into memory.
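To make the attribute/content split concrete, here is a minimal Python sketch. It is not NiFi code: `ContentClaim`, `FlowFile`, `ContentRepository`, and `route_on_extension` are hypothetical names, and an in-memory `bytearray` stands in for files on disk. It only illustrates that a FlowFile carries attributes plus a pointer, and that a routing decision can be taken without reading the content.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ContentClaim:
    """Pointer to bytes held by the content repository (hypothetical layout)."""
    container: str  # which storage container holds the bytes
    offset: int     # where this FlowFile's content starts
    length: int     # how many bytes belong to it


@dataclass
class FlowFile:
    """Attributes plus a reference to the content, never the content itself."""
    attributes: Dict[str, str] = field(default_factory=dict)
    claim: Optional[ContentClaim] = None


class ContentRepository:
    """In-memory stand-in for the on-disk Content Repository."""

    def __init__(self) -> None:
        self._containers: Dict[str, bytearray] = {}
        self.reads = 0  # counts how often content is actually loaded

    def write(self, container: str, data: bytes) -> ContentClaim:
        buffer = self._containers.setdefault(container, bytearray())
        claim = ContentClaim(container, offset=len(buffer), length=len(data))
        buffer.extend(data)
        return claim

    def read(self, claim: ContentClaim) -> bytes:
        self.reads += 1
        buffer = self._containers[claim.container]
        return bytes(buffer[claim.offset:claim.offset + claim.length])


def route_on_extension(flowfile: FlowFile) -> str:
    """Pick a route from the attributes only; the content is never loaded."""
    filename = flowfile.attributes.get("filename", "")
    return "csv" if filename.endswith(".csv") else "other"


repo = ContentRepository()
report = FlowFile({"filename": "report.csv"},
                  repo.write("container-0", b"a,b\n1,2\n"))
print(route_on_extension(report), repo.reads)  # routing triggered no read
print(repo.read(report.claim))                 # content is fetched on demand
```

The `reads` counter is only there to show that attribute-based decisions leave the stored bytes untouched.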
When a processor modifies the content of a FlowFile, the previous data is kept. NiFi uses copy-on-write: it modifies the content while copying it to a new location. The original information is left intact in the Content Repository.
Example
Consider a processor that compresses the content of a FlowFile. The original content remains in the Content Repository, and a new entry is created for the compressed content.
The Content Repository finally returns the reference to the compressed content. The FlowFile is updated to point to the compressed data.
The drawing below sums up the example with a processor that compresses the content of FlowFiles.
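The compression example can also be sketched in a few lines of Python. This is an illustration under assumptions, not NiFi’s implementation: the `ContentRepository` here is a plain dictionary, claims are integer keys, and `compress_content` is a hypothetical stand-in for a compression processor.

```python
import zlib
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class FlowFile:
    filename: str
    claim: int  # key of the content in the repository


class ContentRepository:
    """Append-only store: existing entries are never overwritten."""

    def __init__(self) -> None:
        self._blobs: Dict[int, bytes] = {}

    def put(self, data: bytes) -> int:
        claim = len(self._blobs)  # next free key
        self._blobs[claim] = data
        return claim

    def get(self, claim: int) -> bytes:
        return self._blobs[claim]


def compress_content(flowfile: FlowFile, repo: ContentRepository) -> FlowFile:
    """Copy-on-write: write the new content elsewhere, then repoint."""
    original = repo.get(flowfile.claim)
    new_claim = repo.put(zlib.compress(original))  # new entry, old one untouched
    return replace(flowfile, claim=new_claim)      # same attributes, new pointer


repo = ContentRepository()
before = FlowFile("logs.txt", repo.put(b"error " * 1000))
after = compress_content(before, repo)
print(len(repo.get(before.claim)), len(repo.get(after.claim)))
```

Because the original entry survives, an earlier version of the content can still be read through the old claim.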
Reliability
NiFi claims to be reliable, but how does that work in practice? The attributes of all the FlowFiles currently in use, as well as the reference to their content, are stored in the FlowFile Repository.
At every step of the pipeline, a modification to a FlowFile is first recorded in the FlowFile Repository, in a write-ahead log, before it is performed.
For each FlowFile that currently exists in the system, the FlowFile Repository stores:
- The FlowFile attributes
- A pointer to the content of the FlowFile located in the Content Repository
- The state of the FlowFile, for example, which queue the FlowFile belongs to at this instant.
The FlowFile Repository gives us the most current state of the flow; thus it’s a powerful tool to recover from an outage.
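The write-ahead idea fits in a short Python sketch. Everything here is an assumption made for illustration: the `FlowFileRepository` class, the JSON record format, and the list that plays the role of an append-only file do not reflect NiFi’s actual on-disk format. The point is only the ordering (log first, apply second) and the recovery by replay.

```python
import json
from typing import Dict, List


class FlowFileRepository:
    """Toy write-ahead log keeping the latest state of each FlowFile."""

    def __init__(self) -> None:
        self.log: List[str] = []          # stands in for an append-only file
        self.state: Dict[str, dict] = {}  # latest known record per FlowFile id

    def update(self, flowfile_id: str, attributes: Dict[str, str],
               claim: str, queue: str) -> None:
        record = {"id": flowfile_id, "attributes": attributes,
                  "claim": claim, "queue": queue}
        self.log.append(json.dumps(record))  # 1. record the change first
        self.state[flowfile_id] = record     # 2. only then apply it

    @classmethod
    def recover(cls, log: List[str]) -> "FlowFileRepository":
        """Rebuild the latest state by replaying the log after an outage."""
        repo = cls()
        for line in log:
            record = json.loads(line)
            repo.state[record["id"]] = record
        repo.log = list(log)
        return repo


repo = FlowFileRepository()
repo.update("ff-1", {"filename": "a.csv"}, claim="claim-0", queue="P1->P2")
repo.update("ff-1", {"filename": "a.csv"}, claim="claim-1", queue="P2->P3")

restored = FlowFileRepository.recover(repo.log)  # simulate a restart
print(restored.state["ff-1"]["queue"])
```

Only the last record per FlowFile matters after recovery, which matches the "most current state" role described above.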
NiFi provides another tool to track the complete history of all the FlowFiles in the flow: the Provenance Repository.
Provenance Repository
Every time a FlowFile is modified, NiFi takes a snapshot of the FlowFile and its context at this point. The name for this snapshot in NiFi is a Provenance Event. The Provenance Repository records Provenance Events.
Provenance enables us to retrace the lineage of the data and build the full chain of custody for every piece of information processed in NiFi.
On top of offering the complete lineage of the data, the Provenance Repository also makes it possible to replay the data from any point in time.
Wait, what’s the difference between the FlowFile Repository and the Provenance Repository?
The ideas behind the FlowFile Repository and the Provenance Repository are quite similar, but they don’t address the same issue.
- The FlowFile Repository is a log that contains only the latest state of the in-use FlowFiles in the system. It is the most recent picture of the flow and makes it possible to recover from an outage quickly.
- The Provenance Repository, on the other hand, is more exhaustive since it tracks the complete life cycle of every FlowFile that has been in the flow.
If the FlowFile Repository gives you only the most recent picture of the system, the Provenance Repository gives you a collection of photos: a video. You can rewind to any moment in the past, investigate the data, and replay operations from a given time. It provides a complete lineage of the data.
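The "video versus latest picture" contrast can be sketched as an append-only list of events. The classes and method names below (`ProvenanceEvent`, `ProvenanceRepository`, `lineage`, `claim_at`) are hypothetical, and the event type strings are illustrative labels rather than an exhaustive or authoritative list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProvenanceEvent:
    sequence: int     # position of the event in the history
    flowfile_id: str
    event_type: str   # illustrative label, e.g. "RECEIVE"
    component: str    # processor that produced the event
    claim: str        # content reference valid at that moment


class ProvenanceRepository:
    """Keeps every event ever recorded, unlike a latest-state store."""

    def __init__(self) -> None:
        self._events: List[ProvenanceEvent] = []

    def record(self, flowfile_id: str, event_type: str,
               component: str, claim: str) -> None:
        self._events.append(ProvenanceEvent(
            len(self._events), flowfile_id, event_type, component, claim))

    def lineage(self, flowfile_id: str) -> List[ProvenanceEvent]:
        """Full ordered history of one FlowFile."""
        return [e for e in self._events if e.flowfile_id == flowfile_id]

    def claim_at(self, flowfile_id: str, step: int) -> str:
        """Content reference at a past step: the starting point for a replay."""
        return self.lineage(flowfile_id)[step].claim


prov = ProvenanceRepository()
prov.record("ff-1", "RECEIVE", "GetFile", "claim-0")
prov.record("ff-1", "CONTENT_MODIFIED", "CompressContent", "claim-1")
prov.record("ff-1", "SEND", "PutFile", "claim-1")

for event in prov.lineage("ff-1"):
    print(event.sequence, event.event_type, event.component, event.claim)
```

Replaying from a past step is possible here only because old claims still point to content that copy-on-write left intact.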
FlowFile Processor
A processor is a black box that performs an operation. Processors have access to the attributes and the content of the FlowFile to perform all kinds of actions. They enable you to perform many operations: data ingress, standard data transformation and validation tasks, and saving this data to various data sinks.
NiFi comes with many processors when you install it. If you don’t find the perfect one for your use case, it’s still possible to build your own processor. Writing custom processors is outside the scope of this blog post.
Processors are high-level abstractions that fulfill one task. This abstraction is very convenient because it shields the pipeline builder from the inherent difficulties of concurrent programming and the implementation of error handling mechanisms.
Processors expose an interface with multiple configuration settings to fine-tune their behavior.
The properties of those processors are the last link between NiFi and the business reality of your application requirements.
The devil is in the details, and pipeline builders spend most of their time fine-tuning those properties to match the expected behavior.
Scaling
For each processor, you can specify the number of concurrent tasks you want to run simultaneously. This way, the Flow Controller allocates more resources to this processor, increasing its throughput. Processors share threads. If one processor requests more threads, other processors have fewer threads available to execute. Details on how the Flow Controller allocates threads are available here.
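As a rough analogy for concurrent tasks on a shared thread pool, consider the Python sketch below. It is a simplification under assumptions: `run_processor` and `process_batch` are hypothetical helpers, and the way NiFi actually schedules tasks is more involved. It shows one pool shared by all processors, with `concurrent_tasks` deciding how many slices of work a single processor submits at once.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_batch(on_trigger: Callable[[str], str], batch: List[str]) -> List[str]:
    """Work done by one concurrent task: apply the processor logic to its slice."""
    return [on_trigger(item) for item in batch]


def run_processor(pool: ThreadPoolExecutor,
                  on_trigger: Callable[[str], str],
                  flowfiles: List[str],
                  concurrent_tasks: int) -> List[str]:
    """Submit `concurrent_tasks` slices of the queue to the shared pool."""
    slices = [flowfiles[i::concurrent_tasks] for i in range(concurrent_tasks)]
    futures = [pool.submit(process_batch, on_trigger, batch) for batch in slices]
    results: List[str] = []
    for future in futures:
        results.extend(future.result())
    return results


# A single pool shared by every processor: threads taken by one processor
# are not available to the others while its tasks are running.
with ThreadPoolExecutor(max_workers=4) as shared_pool:
    upper = run_processor(shared_pool, str.upper, ["a", "b", "c", "d", "e"],
                          concurrent_tasks=2)
    print(sorted(upper))
```

Raising `concurrent_tasks` for one processor increases its share of the pool, which is the tradeoff described above.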
Horizontal scaling. Another way to scale is to increase the number of nodes in your NiFi cluster. Clustering servers makes it possible to increase your processing capability using commodity hardware.
Process Group
This one is straightforward now that we’ve seen what processors are.
A bunch of processors put together with their connections can form a process group. You add an input port and an output port so it can receive and send data.
Process groups are an easy way to create new processors based on existing ones.
Connections
Connections are the queues between processors. These queues allow processors to interact at differing rates. Connections can have different capacities, just as water pipes come in different sizes.
Because processors consume and produce data at different rates depending on the operations they perform, connections act as buffers of FlowFiles.
There is a limit on how much data can be in the connection. Similarly, when your water pipe is full, you can’t add water anymore, or it overflows.
In NiFi, you can set limits on the number of FlowFiles and the size of their aggregated content going through the connections.
What happens when you send more data than the connection can handle?
If the number of FlowFiles or the quantity of data goes above the defined threshold, backpressure is applied. The Flow Controller won’t schedule the previous processor to run again until there is room in the queue.
Let’s say you have a limit of 10 000 FlowFiles between two processors, P1 and P2. At some point, the connection has 7 000 elements in it. That’s OK since the limit is 10 000. P1 can still send data through the connection to P2.
Now let’s say that P1 sends 4 000 new FlowFiles to the connection. 7 000 + 4 000 = 11 000 → We go above the connection threshold of 10 000 FlowFiles.
The limits are soft limits, meaning they can be exceeded. However, once they are, the previous processor, P1, won’t be scheduled until the connection goes back below its threshold value of 10 000 FlowFiles.
This simplified example gives the big picture of how backpressure works.
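The numbers from the example above can be replayed with a small Python sketch. The `Connection` class and `can_schedule_upstream` function are hypothetical, and only the FlowFile count threshold is modeled (the data size threshold would work the same way). Note that `enqueue` never rejects anything: the limit is soft, and it only influences whether the upstream processor gets scheduled again.

```python
from collections import deque
from typing import Deque, Iterable, List


class Connection:
    """Queue between two processors with a soft object-count threshold."""

    def __init__(self, object_threshold: int) -> None:
        self.object_threshold = object_threshold
        self.queue: Deque[int] = deque()

    def enqueue(self, flowfiles: Iterable[int]) -> None:
        self.queue.extend(flowfiles)  # soft limit: nothing is ever rejected

    def dequeue(self, count: int) -> List[int]:
        return [self.queue.popleft() for _ in range(min(count, len(self.queue)))]

    def backpressure_applied(self) -> bool:
        return len(self.queue) >= self.object_threshold


def can_schedule_upstream(connection: Connection) -> bool:
    """The upstream processor (P1) runs only while there is room in the queue."""
    return not connection.backpressure_applied()


conn = Connection(object_threshold=10_000)
conn.enqueue(range(7_000))
print(len(conn.queue), can_schedule_upstream(conn))  # 7 000 queued, P1 may run

conn.enqueue(range(4_000))                           # P1 sends 4 000 more
print(len(conn.queue), can_schedule_upstream(conn))  # 11 000 queued, P1 paused

conn.dequeue(1_001)                                  # P2 drains the queue
print(len(conn.queue), can_schedule_upstream(conn))  # 9 999 queued, P1 resumes
```

The check happens before P1 is scheduled, not while it writes, which is why the queue can temporarily hold 11 000 elements.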
You want to set up connection thresholds appropriate to the Volume and Velocity of the data to handle. Keep in mind the four Vs.
The idea of exceeding a limit may sound odd. When the number of FlowFiles or the associated data goes beyond the threshold, a swap mechanism is triggered.
For another example on backpressure, this mail thread can help.
Prioritizing FlowFiles
The connections in NiFi are highly configurable. You can choose how you prioritize FlowFiles in the queue to decide which one to process next.
Among the available options, there is, for example, the First In First Out (FIFO) order. However, you can even use an attribute of your choice from the FlowFile to prioritize incoming packets.
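Here is a minimal sketch of the two orderings mentioned above, using Python’s `heapq`. The `PrioritizedQueue` class and the two key functions are hypothetical, and the choice of an attribute called `priority`, where a lower number means "process first", is an assumption made for this example.

```python
import heapq
from itertools import count
from typing import Callable, Dict, List, Tuple

FlowFile = Dict[str, str]  # attributes only, enough for ordering


def fifo(flowfile: FlowFile) -> int:
    """Same key for everyone: arrival order alone decides."""
    return 0


def by_priority_attribute(flowfile: FlowFile) -> int:
    """Lower `priority` value goes first (assumption for this sketch)."""
    return int(flowfile["priority"])


class PrioritizedQueue:
    def __init__(self, key: Callable[[FlowFile], int]) -> None:
        self._key = key
        self._heap: List[Tuple[int, int, FlowFile]] = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def put(self, flowfile: FlowFile) -> None:
        heapq.heappush(self._heap,
                       (self._key(flowfile), next(self._arrival), flowfile))

    def get(self) -> FlowFile:
        return heapq.heappop(self._heap)[2]


incoming = [{"filename": "a", "priority": "5"},
            {"filename": "b", "priority": "1"},
            {"filename": "c", "priority": "5"}]

queue = PrioritizedQueue(by_priority_attribute)
for flowfile in incoming:
    queue.put(flowfile)
print([queue.get()["filename"] for _ in incoming])
```

Swapping `by_priority_attribute` for `fifo` turns the same queue back into a plain first-in, first-out buffer.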
Flow Controller
The Flow Controller is the glue that brings everything together. It allocates and manages threads for processors. It’s what executes the dataflow.
Also, the Flow Controller makes it possible to add Controller Services.
Those services facilitate the management of shared resources like database connections or cloud services provider credentials. Controller services are daemons. They run in the background and provide configuration, resources, and parameters for the processors to execute.
For example, you may use an AWS credentials provider service to make it possible for your processors to interact with S3 buckets without having to worry about the credentials at the processor level.
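The idea of a shared service can be sketched as plain dependency injection. Nothing below talks to AWS or mirrors NiFi’s API: `CredentialsService` and `PutObjectProcessor` are hypothetical classes, and the keys and bucket names are placeholders. The sketch shows that credentials are configured once and that each processor only holds a reference to the service.

```python
from typing import Dict, Tuple


class CredentialsService:
    """Hypothetical controller service holding credentials in one place."""

    def __init__(self, access_key: str, secret_key: str) -> None:
        self._access_key = access_key
        self._secret_key = secret_key
        self.enabled = False

    def enable(self) -> None:
        self.enabled = True

    def credentials(self) -> Tuple[str, str]:
        if not self.enabled:
            raise RuntimeError("controller service is not enabled")
        return self._access_key, self._secret_key


class PutObjectProcessor:
    """Hypothetical processor: knows its bucket, not the credentials."""

    def __init__(self, bucket: str, credentials_service: CredentialsService) -> None:
        self.bucket = bucket
        self.credentials_service = credentials_service

    def on_trigger(self, flowfile: Dict[str, str]) -> str:
        access_key, _secret = self.credentials_service.credentials()
        # A real processor would upload here; this sketch only describes the call.
        return f"{access_key} -> s3://{self.bucket}/{flowfile['filename']}"


service = CredentialsService("demo-access-key", "demo-secret")  # placeholders
service.enable()

raw = PutObjectProcessor("raw-data", service)
clean = PutObjectProcessor("clean-data", service)  # same service, shared
print(raw.on_trigger({"filename": "a.csv"}))
print(clean.on_trigger({"filename": "a.csv"}))
```

Rotating the credentials then means updating one service instead of every processor that writes to a bucket.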
Just like with processors, a multitude of controller services are available out of the box.
You can check out this article for more content on the controller services.
Conclusion and call to action
In the course of this article, we discussed NiFi, an enterprise dataflow solution. You now have a strong understanding of what NiFi does and how you can leverage its data routing features for your applications.
If you’re reading this, congrats! You now know more about NiFi than 99.99% of the world’s population.
Practice makes perfect. You have mastered all the concepts required to start building your own pipeline. Keep it simple; make it work first.
Here is a list of exciting resources I drew on, together with my work experience, to write this article.
Resources
The bigger picture
Because designing data pipelines in a complex ecosystem requires proficiency in multiple areas, I highly recommend the book Designing Data-Intensive Applications by Martin Kleppmann. It covers the fundamentals.
A cheat sheet with all the references quoted in Martin’s book is available on his GitHub repo.
This cheat sheet is a great place to start if you already know what kind of topic you’d like to study in depth and you want to find quality materials.
Alternatives to Apache NiFi
Other dataflow solutions exist.
Open source:
- StreamSets is similar to NiFi; a good comparison is available on this blog
Most of the existing cloud providers offer dataflow solutions. Those solutions integrate easily with other products you use from this cloud provider. At the same time, they solidly tie you to a particular vendor.
- Azure Data Factory, a Microsoft solution
- IBM has its InfoSphere DataStage
- Amazon offers a tool named Data Pipeline
- Google offers its Dataflow
- Alibaba Cloud offers a service, DataWorks, with similar features
NiFi related resources
The official NiFi documentation and especially the NiFi In-depth section are gold mines.
Registering to the NiFi users mailing list is also a great way to stay informed; for example, this conversation explains back-pressure.
Hortonworks, a big data solutions provider, has a community website full of engaging resources and how-tos for Apache NiFi.
- This article goes in depth about connectors, heap usage, and back pressure.
- This one shares dimensioning best practices when deploying a NiFi cluster.
The NiFi blog distills a lot of insights on NiFi usage patterns as well as tips on how to build pipelines.
Claim Check pattern explained
The theory behind Apache NiFi is not new; SEDA, referenced in the NiFi documentation, is extremely relevant.
- Matt Welsh. Berkeley. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services [online]. Retrieved: 21 Apr 2019, from http://www.mdw.la/papers/seda-sosp01.pdf
Translated from: https://www.freecodecamp.org/news/nifi-surf-on-your-dataflow-4f3343c50aa2/