Want to Master Your Data? Here’s Why You Should Care About Metadata
Whether you are a data scientist, a data engineer, or simply someone enthusiastic about data, understanding your data is one thing you don’t want to overlook. We usually think of data as numbers, text, or images, but data is more than that.
We should consider data as an independent entity. Data can introduce itself, tell stories, and reveal trends. To reach those outcomes, you must first understand your data: not only how it was formed and where it came from, but also how it will change over time and how usable it is. Some of this information is what we call metadata.
Why is metadata so important? And why must we master metadata before we master data? Today I’ll show you how we can leverage metadata in our data business.
What is metadata, exactly?
According to Wikipedia, metadata is “data that provides information about other data”. It’s “data about data”. That sounds straightforward, doesn’t it? All data contains information about a specific thing. For metadata, that specific thing is other data.
However, the definition of metadata itself varies. It can be the name of the dataset, creation information, or the statistical distribution of data points. It can be anything related to the properties of the data. In principle, then, all data should carry its own metadata, but in practice that is not always the case.
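To make those kinds of metadata concrete, here is a minimal Python sketch. The `DatasetMetadata` class, the `describe` helper, the field names, and the sample file name are all hypothetical and chosen for illustration; they do not come from any particular library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from statistics import mean


@dataclass
class DatasetMetadata:
    """Hypothetical container for a few kinds of metadata about one dataset."""
    name: str                # the name of the dataset
    created_at: datetime     # creation information
    source: str              # where the data came from
    row_count: int           # how many records the dataset holds
    column_stats: dict = field(default_factory=dict)  # statistical summaries


def describe(name, source, rows, numeric_column):
    """Build a metadata record from a list of dict rows."""
    values = [r[numeric_column] for r in rows if r.get(numeric_column) is not None]
    stats = {}
    if values:
        stats = {"min": min(values), "max": max(values), "mean": mean(values)}
    return DatasetMetadata(
        name=name,
        created_at=datetime.now(timezone.utc),
        source=source,
        row_count=len(rows),
        column_stats={numeric_column: stats},
    )


# Illustrative data: one grade is missing and is excluded from the statistics.
rows = [{"grade": 12.0}, {"grade": 15.5}, {"grade": None}, {"grade": 9.5}]
meta = describe("student_grades", "registrar_export.csv", rows, "grade")
print(meta.row_count, meta.column_stats)
```

The record says nothing new about any individual student; it only describes the dataset as a whole, which is what makes it metadata.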
Data without metadata is always incomplete.
[Figure: Types of metadata. Credit to the author.]
We use data hoping to extract useful insights and to understand what it describes. Metadata helps us assess data integrity, verify the source of truth, and maintain stable data quality.
[Figure: An example of an email’s metadata. Credit to the author.]
However, in some cases, data users ignore the effect of metadata. They view it as mere labels that bring little value to the table. Next, we’ll see how metadata relates to another critical aspect of data: data quality.
Data quality
Again, Wikipedia says: “Data quality refers to the state of qualitative or quantitative pieces of information.” In general, data is said to have high quality when “it fits the intended use case regardless of data users”.
Data is a valuable source of information, but nobody wants to use a piece of crap. The more you want to extract from your data, the more its quality matters. In the world of Big Data, data quality also becomes a bottleneck.
[Figure: Photo by Markus Winkler on Unsplash.]
As data grows bigger, so does metadata. We are not used to handling large amounts of metadata. It needs special treatment because we must consider it as both data and not data at the same time: metadata is not an independent piece of information but rather an attachment to our data. We can extend that attachment into an assessment of data quality.
Data is a valuable source of information, but nobody wants to use a piece of crap
In a common effort to cultivate high data quality in Big Data pipelines, tech companies are paying a lot of attention to this newish subject. From anomaly detection to automatic alerting systems, we want to limit the impact of erroneous data as much as possible. We can’t do this without data comprehension or, more precisely, without metadata.
Data quality shows up in many aspects, but most often in the correctness of values. Imagine you plot a histogram of university students’ grades for a semester. The histogram is a statistical representation of those values, and it describes your data. It becomes metadata. From it you can read the distribution of the grades and then conclude whether the data fits your use case.
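The grade histogram can be sketched in a few lines of plain Python. This is only an illustration: the `histogram` helper is hypothetical, the grades are made up, and the 0 to 20 grading scale is an assumption.

```python
from collections import Counter


def histogram(values, bin_width, low, high):
    """Count values per bin [start, start + bin_width); `high` goes in the last bin."""
    n_bins = int((high - low) / bin_width)
    counts = Counter()
    for v in values:
        if not low <= v <= high:
            continue  # out-of-range values are skipped here; they deserve their own check
        index = min(int((v - low) // bin_width), n_bins - 1)
        counts[index] += 1
    return [(low + i * bin_width, counts.get(i, 0)) for i in range(n_bins)]


# Made-up semester grades on an assumed 0-20 scale.
grades = [4.5, 8, 9.5, 10, 11, 11.5, 12, 12, 13, 14.5, 16, 20]
for start, count in histogram(grades, bin_width=5, low=0, high=20):
    print(f"{start:>4}-{start + 5:<4} {'#' * count}")
```

The list of (bin start, count) pairs is the metadata; the text bars are just one way to look at it.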
[Link: Using Histograms to Understand Your Data]
There are many questions to ask about data values beforehand. Are those values stable over time? Are there any outliers? If so, what should we do with them? By answering these questions, we extract insights not about the information itself but about the data. We can create metadata, useful metadata. That’s just a first step in assessing data quality via metadata. In the next section, we’ll take a closer look at how we can leverage the metadata we generate.
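One common way to answer the outlier question is Tukey’s interquartile-range rule. The article does not prescribe a method, so treat the sketch below as one possible choice; the `iqr_outliers` helper and the sample grades are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up grades; 95 looks like a data-entry error on a 0-20 scale.
grades = [10, 11, 11.5, 12, 12, 12.5, 13, 13.5, 14, 95]
print(iqr_outliers(grades))
```

What to do with a flagged value (drop it, correct it, or keep it and annotate it) remains a decision for whoever owns the data.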
How to leverage metadata
Some people might be overwhelmed by the various statistical representations we can extract from a dataset. Others might simply ignore that additional information, thinking it is useless. It’s true that we don’t need to draw a histogram every time we work with data, but it helps. To leverage insightful metadata, data users must first answer three important questions:
What: What do you want to verify about the quality of your data? Some data requires strict stability, while other data needs attention to whether it is correct. For each kind of data, we adapt the information extracted as metadata: statistical distribution, trends over time, discrepancies, and so on. This is what we call the metadata strategy. We are limited in storage and human resources when working with both data and metadata. Therefore, we must think carefully about where to focus.
How: How do we measure data quality? These actions follow the metadata strategy. We could choose to measure the whole database, some tables, or a specific set of columns. Typical measures include the total number of values, the maximum and minimum length of a string, and the proportion of missing data. What we decide to measure depends on how we use the data to produce outcomes.
When: Data changes over time. When we extract insights via metadata, we are tracking those transitions. When do we track the metadata? Every day? Every hour? Every quarter? It depends on how much granularity is sufficient to address data quality. We adapt our measurements to how quickly the data can change. For example, stock market data needs to be tracked every minute or even every second. Weather data changes every hour, while aerospace data can take months or years to shift.
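The measures named under “How” (total number of values, string lengths, proportion of missing data) can be computed per column with a small helper. This is a sketch under assumptions: `profile_column` is a hypothetical function, the table is a plain list of dicts, and empty strings are counted as missing.

```python
def profile_column(rows, column):
    """Compute a few quality measures for one column of a list-of-dicts table."""
    values = [row.get(column) for row in rows]
    # Assumption: None and the empty string both count as missing.
    present = [v for v in values if v not in (None, "")]
    lengths = [len(str(v)) for v in present]
    return {
        "column": column,
        "total": len(values),
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
    }


# Made-up rows with two missing email addresses.
rows = [
    {"email": "ana@example.com"},
    {"email": ""},
    {"email": "bo@example.org"},
    {"email": None},
]
print(profile_column(rows, "email"))
```

Running the same profile on a schedule that matches how fast the data changes (every minute, every day, every quarter) is what turns these numbers into a history you can track.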
Metadata has a long history, but we have only recently recognized its contribution to data management, and especially to data quality. Metadata itself can’t change the outcomes of data, but it adds a security and management layer between our raw data and its usage. You might even be using metadata to discover your data without realizing it.
Data quality might seem insignificant when your data is small, but it becomes critical when working with larger volumes. Metadata helps us keep track of that growth and make sure the data evolves as it should. By failing to leverage metadata, we fail to understand our data.
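Keeping track of how data evolves can be as simple as comparing two metadata snapshots taken at different times. The sketch below is illustrative only: `drift_alerts`, the metric names, and the 10% tolerance are assumptions, not a recommended threshold.

```python
def drift_alerts(previous, current, tolerance=0.10):
    """Compare two metric snapshots and report relative changes above `tolerance`."""
    alerts = []
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None:
            alerts.append(f"{metric}: missing from current snapshot")
        elif old == 0:
            # A relative change is undefined from zero, so report any change at all.
            if new != 0:
                alerts.append(f"{metric}: changed from 0 to {new}")
        elif abs(new - old) / abs(old) > tolerance:
            alerts.append(f"{metric}: {old} -> {new}")
    return alerts


# Made-up snapshots: the share of missing values jumped between the two days.
yesterday = {"row_count": 10_000, "missing_ratio": 0.02, "mean_grade": 12.4}
today = {"row_count": 10_150, "missing_ratio": 0.09, "mean_grade": 12.3}
print(drift_alerts(yesterday, today))
```

A single tolerance for every metric keeps the example short; in practice each metric would likely need its own threshold, set by how the data is used.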
What should I do with metadata?
If you wish to master your data, you should start treating metadata systematically. Based on the framework we have seen above, choose a data strategy that suits you. There’s nothing fancy about it yet. It starts with how you wish to use your data and how you control the quality of that usage. Everything starts with a goal.
There’s one phase in the ETL process called Exploratory Data Analysis (EDA). I find it an interesting way to learn more about the statistical aspects of your data, and it seems close to what we would like to know via metadata.
I always see my data scientist and data analyst friends start with EDA before doing anything with their raw data. So I figured it must be an important step, and I wondered how it’s linked to my metadata framework. The two turn out to have quite a lot in common.
First comes the purpose. The “exploratory” part of EDA coincides with the discovery objective of metadata. Second, both look at the statistical side of data to evaluate its future usage. With all that said, EDA is a must-have step because of its similarity to metadata-based assessment of data quality.
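A first EDA pass often boils down to a handful of descriptive statistics, which are the same numbers a metadata-based quality check would store. The sketch below uses only Python’s standard `statistics` module; the `summarize` helper and the grades are made up for illustration.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return basic descriptive statistics for a list of numbers (needs >= 2 values)."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),
    }


# Made-up grades for one course.
grades = [8, 9.5, 10, 11, 12, 12, 13, 14.5, 16]
for name, value in summarize(grades).items():
    print(f"{name:>6}: {value:.2f}")
```

Saving this dictionary alongside the dataset, instead of only printing it during exploration, is one way to let EDA feed the metadata framework.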
You have the data strategy and the data evaluation; now it’s time to decide how to proceed with all that information. How the data will be used determines whether it is sound and trustworthy in the eyes of data quality control.
Key takeaways:
- Build your data strategy based on data usability
- Apply an EDA (Exploratory Data Analysis) to evaluate the data
- Decide whether you have solid confidence in your data
Conclusion
I’ve shared some of my points of view on metadata. For me, it has as much value as the data itself. Those who take advantage of that value are the ones who understand their data. It’s easier to misuse something we don’t comprehend. Metadata gives us a clearer view of the data and, beyond that, of its quality, integrity, and usability.
My name’s Nam Nguyen, and I write (mostly) about Big Data. Enjoy your reading? Follow me on Medium and Twitter for more updates.
Translated from: https://towardsdatascience.com/want-to-master-your-data-heres-why-you-should-care-about-metadata-8fcd7754c3b8