Telltale: Netflix Application Monitoring Simplified
By Andrei Ushakov, Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell
Our Telltale Vision
An alert fires and you get paged in the middle of the night. A metric crossed a threshold. You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? When was the last time somebody adjusted our alert thresholds? Maybe it’s due to an upstream or downstream service?” This is a critical application so you drag yourself out of bed, open your laptop, and start poring through dashboards for more info. You’re not yet convinced there’s a real problem but you’re also aware that the clock is ticking as you dig through a mountain of data looking for clues.
Healthy Netflix services are essential to member joy. When you sit down to watch “Tiger King” you expect it to just play. Over the years we’ve learned from on-call engineers about the pain points of application monitoring: too many alerts, too many dashboards to scroll through, and too much configuration and maintenance. Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet.
So we built Telltale.
Telltale combines a variety of data sources to create a holistic view of an application’s health. Telltale learns what constitutes typical health for an application, no alert tuning required. And because we know what’s healthy, we can let application owners know when their services are trending towards unhealthy.
Metrics are a key part of understanding application health. But sometimes you can have too many metrics, too many graphs, and too many dashboards. Telltale shows only the relevant data from the application plus that of upstream and downstream services. We use colors to indicate severity (users can opt to have Telltale display numbers in addition to colors) so users can tell, at a glance, the state of their application’s health. We also highlight interesting broader events such as regional traffic evacuations and nearby deployments, information that is vital to understanding health holistically. Especially during an incident.
That is our Telltale vision. It exists today and monitors the health of over 100 Netflix production-facing applications.
[Figure: An application lives in an ecosystem.]
The Application Health Model
A microservice doesn’t live in isolation. It usually has dependencies, talks to other services, and lives in different AWS regions. The call graph above is a relatively simple one; they can be much deeper, with dozens of services involved. An application is part of an ecosystem that can be subtly influenced by property changes or radically altered by region-wide events. The launch of a canary can affect an application, as can an upstream or downstream deployment.
Telltale uses a variety of signals from multiple sources to assemble a constantly evolving model of the application’s health:
- Atlas time series metrics.
- Regional traffic evacuations.
- Mantis real-time streaming data.
- Infrastructure change events.
- Canary launches and deployments.
- The health of upstream and downstream services.
- Client metrics and QoE changes.
- Alerts triggered by our alerting platform.
Different signals have different levels of importance to an application’s health. For example, a latency increase is less critical than an error rate increase, and some error codes are less critical than others. A canary launch two layers downstream might not be as significant as a deployment immediately upstream. A regional traffic shift means one region ends up with zero traffic while another region has double. You can imagine the impact that has on metrics. A metric’s meaning determines how we should interpret it.
Telltale takes all those factors into consideration when constructing its view of application health.
The application health model is the heart of Telltale.
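The article does not say how Telltale actually weighs its signals, so the following is only a minimal sketch of the idea that signals differ in importance and that distant changes matter less. The `Signal` fields, the `hop_decay` parameter, and all the weights are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    severity: float  # 0.0 (healthy) to 1.0 (clearly unhealthy)
    weight: float    # hypothetical importance of this kind of signal
    hops: int = 0    # distance in the call graph; 0 is the application itself


def health_score(signals, hop_decay=0.5):
    """Weighted average of severities; higher means less healthy.

    Each signal's weight is multiplied by hop_decay once per hop, so a
    canary two layers away counts less than a deployment next door.
    """
    total = 0.0
    norm = 0.0
    for s in signals:
        w = s.weight * (hop_decay ** s.hops)
        total += w * s.severity
        norm += w
    return total / norm if norm else 0.0


signals = [
    Signal("error_rate", severity=0.8, weight=3.0),
    Signal("latency", severity=0.4, weight=1.0),
    Signal("downstream_canary", severity=1.0, weight=1.0, hops=2),
]
score = health_score(signals)
```

In this sketch the error rate dominates the score because it carries the largest weight, while the canary two hops away contributes only a quarter of its nominal weight.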
Intelligent Monitoring
Every service operator knows the difficulty of alert tuning. Set thresholds too low and you get a deluge of spurious alerts. So you overcompensate and relax the tuning to the point of missing important health warnings. The end result is a lack of trust in alerts. Telltale is built on the premise that you shouldn’t have to constantly tune configuration.
We make setup and configuration easy for application owners by providing curated and managed signal packs. These packs are combined into application profiles to address most common service types. Telltale automatically tracks dependencies between services to build the topology used in the application health model. Signal packs and topology detection keep configuration up-to-date with minimal effort. Those who want a more hands-on approach can still do manual configuration and tuning.
No single algorithm can account for the wide variety of signals we use. So, instead, we employ a mix of algorithms including statistical, rule-based, and machine learning. We’ll do a future Netflix Tech Blog article focused on our algorithms. Telltale also has analyzers to detect long-term trends or memory leaks. Intelligent monitoring means results our users can trust. It means a faster time to detection and a faster time to resolution during an incident.
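The specific algorithms are left for a future post, so the sketch below is not Telltale's method. It only illustrates how a statistical check and a rule-based check might be combined, using a plain z-score against recent history and a fixed error-rate ceiling. The function names and thresholds are assumptions.

```python
from statistics import mean, stdev


def zscore_anomaly(history, current, threshold=3.0):
    """Statistical check: is `current` far from the baseline in `history`?"""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


def error_rate_rule(error_rate, limit=0.05):
    """Rule-based check: a hard ceiling that does not depend on history."""
    return error_rate > limit


def is_unhealthy(latency_history, current_latency, error_rate):
    """Flag the service if either detector fires."""
    return zscore_anomaly(latency_history, current_latency) or error_rate_rule(error_rate)


latency_history = [100, 102, 98, 101, 99]  # made-up latencies in milliseconds
```

The baseline comes from the data itself, which is the sense in which no threshold has to be hand-tuned for the statistical check; the rule covers cases where any value above a ceiling is unacceptable regardless of history.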
Intelligent Alerting
Intelligent monitoring yields intelligent alerting. Telltale creates an issue when it detects a health problem in your application’s ecosystem. Teams can opt in to alerting via Slack, email, or PagerDuty (all powered by our internal alerting system). If the issue is caused by an upstream or downstream system then Telltale’s context-aware routing alerts that team instead. Intelligent alerting also means a team receives a single notification; alert storms are a thing of the past.
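As a rough illustration of context-aware routing and single notifications, the sketch below routes an issue to the owner of the suspected cause and folds repeat alerts into the already open issue. The `owners` mapping, the service and team names, and the `Notifier` class are hypothetical; the article does not describe the real implementation.

```python
def route_issue(app, suspected_cause, owners):
    """Pick the team to notify.

    If the suspected cause is a known service, notify its owner;
    otherwise fall back to the owner of the affected application.
    """
    target = suspected_cause if suspected_cause in owners else app
    return owners[target]


class Notifier:
    """Sends one notification per open issue and folds later updates into it."""

    def __init__(self):
        self.open_issues = {}  # (team, app) -> list of messages

    def notify(self, team, app, message):
        """Return True if a new notification went out, False if it was folded in."""
        key = (team, app)
        if key in self.open_issues:
            self.open_issues[key].append(message)
            return False
        self.open_issues[key] = [message]
        return True


owners = {"api": "team-api", "db-proxy": "team-data"}  # made-up services and teams
notifier = Notifier()
team = route_issue("api", "db-proxy", owners)
first = notifier.notify(team, "api", "error rate elevated")
second = notifier.notify(team, "api", "latency elevated")
```

Here the second alert does not page anyone again; it is recorded as an update on the existing issue.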
[Figure: An example of a Telltale notification in Slack.]
When a problem strikes, it’s essential to have the right information. Our Slack alerts also start a thread containing only the most relevant context about the incident. This includes the signals that Telltale identified as unhealthy and the reasons why. The right context provides a better understanding of the application’s current state so the on-call engineer can return it to health.
Incidents evolve and have their own lifecycle, so updates are essential. Are things getting better or worse? Are there new signals or events to consider? Telltale updates the Slack thread as the current incident unfolds. The thread is marked Resolved upon return to healthy state so users know, at a glance, which incidents are ongoing and which have been successfully remediated.
But these Slack threads aren’t just for Telltale. Teams use them to share additional data, observations, theories, and discussion about the incident. Incident data and discussion all in one thread makes for shared understanding, faster resolution, and easier post-incident analysis.
We strive to improve the quality of Telltale alerts. One way to do that is to learn from our users. So we provide feedback buttons right in the Slack message. Users can tell us to suppress future occurrences of an alert. Or provide a reason for why an alert isn’t actionable. Intelligent alerting means alerts our users can trust.
[Figure: An example of the details found in a Telltale notification in Slack.]
Why Is My Service Unhealthy?
A wide variety of signals, knowledge of the application’s ecosystem, and correlation of signals across multiple services help Telltale detect the possible causes of an application’s degraded health: an outlier instance, a canary or deployment by a dependent service, an unhealthy database, or just a spike in traffic. Highlighting possible causes saves valuable time during an incident.
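One of the causes named above, an outlier instance, lends itself to a small example. The sketch below flags instances whose metric sits far from the fleet median, measured in units of the median absolute deviation (MAD). This is a common robust-statistics technique chosen here for illustration, not something the article attributes to Telltale, and the instance IDs and the cutoff `k` are made up.

```python
from statistics import median


def outlier_instances(metric_by_instance, k=5.0):
    """Return instance IDs whose value is more than k MADs from the median."""
    values = list(metric_by_instance.values())
    if not values:
        return []
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All typical values are identical, so anything different stands out.
        return [i for i, v in metric_by_instance.items() if v != med]
    return [i for i, v in metric_by_instance.items() if abs(v - med) / mad > k]


error_rates = {"i-1": 0.010, "i-2": 0.012, "i-3": 0.011, "i-4": 0.300}
suspects = outlier_instances(error_rates)
```

Using the median rather than the mean keeps one badly behaving instance from distorting the baseline it is compared against.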
Incident Management
[Figure: An example of a Telltale incident summary.]
When Telltale sends an alert it also creates a snapshot that has references to the unhealthy signals. As new information arrives, it’s added to this snapshot. This simplifies the post-incident review process for many teams. When it’s time to review past issues, the Application Incident Summary feature shows all aspects of recent issues in a single place including key metrics like total downtime and MTTR (Mean Time To Resolution). We want to help our teams see larger patterns of incidents so they can improve overall service availability.
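The two summary metrics mentioned here are simple to compute once each incident has a detection time and a resolution time. The sketch below assumes that representation and that incidents do not overlap, so that summing durations gives total downtime; neither assumption comes from the article.

```python
from datetime import datetime, timedelta


def incident_stats(incidents):
    """Compute (total downtime, MTTR) from (detected_at, resolved_at) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    total = sum(durations, timedelta())
    mttr = total / len(durations) if durations else timedelta()
    return total, mttr


incidents = [
    (datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 1, 0, 30)),  # 30 minutes
    (datetime(2020, 1, 2, 3, 0), datetime(2020, 1, 2, 4, 30)),  # 90 minutes
]
total_downtime, mttr = incident_stats(incidents)
```

With these two made-up incidents, total downtime is two hours and MTTR is one hour.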
[Figure: The cluster view groups similar incidents.]
Deployment Monitoring
Telltale’s application health model and intelligent monitoring have proven so powerful that we’re also using them for safer deployments. We start with Spinnaker, our open source delivery platform. As Spinnaker slowly rolls out a new build we use Telltale to continuously monitor the health of the instances running the new build. Continuous monitoring means a deployment stops and rolls back at the first sign of a problem. It means deployment problems have a smaller blast radius and a shorter duration.
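The control flow described here, roll forward in steps and stop at the first unhealthy reading, can be sketched in a few lines. This is not Spinnaker's or Telltale's API: `rollout`, the traffic percentages, and the `is_healthy` callback are hypothetical stand-ins for the real integration.

```python
def rollout(steps, is_healthy):
    """Advance through rollout steps, stopping at the first sign of trouble.

    steps: increasing percentages of the fleet running the new build.
    is_healthy: callable taking the current percentage and returning a bool.
    Returns ("completed", last_step) or ("rolled_back", failing_step).
    """
    for percent in steps:
        if not is_healthy(percent):
            return ("rolled_back", percent)
    return ("completed", steps[-1] if steps else 0)


# A made-up health check that starts failing once half the fleet is updated.
result = rollout([5, 25, 50, 100], lambda percent: percent < 50)
```

Because the check runs at every step, a bad build in this sketch never reaches the full fleet, which is the smaller blast radius the paragraph refers to.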
Continuous Improvement
Operating microservices in a complex ecosystem is challenging. We’re thrilled that Telltale’s intelligent monitoring and alerting helps our service operators improve availability, reduce toil, and sleep better at night. But we’re not done. We’re constantly exploring new algorithms to improve the accuracy of our alerts. We’ll write more about that in a future Netflix Tech Blog post. We’re also evaluating improvements to our application health model. We believe there’s useful information in service log and trace data, and that there are benefits to employing higher resolution metrics. We’re looking forward to collaborating with our platform team on building out those new features. Getting new applications onto Telltale has been a white-glove treatment, which doesn’t scale well; we can definitely improve our self-service UI. And we know there are better heuristics to help pinpoint what’s affecting your service health.
Telltale is application monitoring simplified.
A healthy Netflix service enables us to entertain the world. Correlating disparate signals to model health in real time is challenging. Add in thousands of streaming device types, an ever-evolving architecture, and a growing content production ecosystem and the problem becomes fascinating. If you’re passionate about observability then come talk to us.
Source: https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba