How to Estimate Data Collection Costs for Your Data Science Project
(Note: All opinions are my own)
Introduction
Data collection is the first and most fundamental step in any Data Science or Analytics project. Every later activity, from data analysis to model deployment, relies on it.
With APIs and Cloud Computing now pervasive, I am increasingly interested in maximizing the efficiency and level of automation of data collection for both work and personal projects.
In the latter category, I have been interested in collecting data from online home-rental platforms in the UK market (Zoopla, RightMove, OnTheMarket, and similar), with the aim of extracting image and text data for use in machine learning models. Use cases include predicting a property’s price, extracting key features from image data to infer a listing’s true value, and processing customer reviews with NLP techniques.
In what follows, I discuss how to go about:
Identifying the most critical data sources
Estimating data collection costs, should you want to put your solution to commercial use
I gave the article a broad scope. It touches on the market and regulatory considerations that arise when collecting data for potentially commercial purposes, as well as the more technical considerations of working with APIs, since there are multiple layers to this very interesting topic.
I hope the key points below prove useful in setting up the data collection block of your current and future Data Science projects, whatever your industry focus.
Do your market research & identify your key data sources
Online home-rental platforms are two-sided markets, driven by supply-side and demand-side agents: on the supply side, homeowners looking to rent out a property, either directly or through a real-estate agent; on the demand side, individuals looking to rent. In such markets you will find the most data, in both quantity and quality, on the platforms that drive the majority of traffic from both sides of a given market.
In this sense, you need to identify the platforms that hold the majority of market power, as they attract the most eyeballs. Knowing how overall traffic and data volume are distributed across the market is very useful if you want to pull large amounts of data over time without having to integrate multiple data streams from smaller market players.
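As a rough illustration of this reasoning, the sketch below ranks platforms by traffic share and returns the smallest set whose combined share reaches a chosen target. The platform names, the share figures, and the `platforms_covering` helper are all hypothetical and are not real market data; in practice you would plug in estimates from your own market research.

```python
def platforms_covering(shares, target=0.8):
    """Return the fewest platforms (largest first) whose combined share reaches `target`.

    `shares` maps a platform name to its estimated share of traffic (0 to 1).
    """
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    selected, covered = [], 0.0
    for name, share in ranked:
        if covered >= target:
            break
        selected.append(name)
        covered += share
    return selected, covered


# Hypothetical traffic shares, for illustration only (not real market data).
example_shares = {
    "Platform A": 0.45,
    "Platform B": 0.30,
    "Platform C": 0.10,
    "Platform D": 0.08,
    "Platform E": 0.07,
}

selected, covered = platforms_covering(example_shares, target=0.8)
print(selected, round(covered, 2))
```

With a skewed distribution like this one, a handful of platforms is enough to reach the target, which is the case in which focusing on the largest players pays off.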
In the UK’s online home rental market, the majority of traffic and listings is distributed among the top one to five players. Those companies (the left of the curve in the below illustrative distribution) are therefore the ones on which you want to focus your data collection efforts.
[Figure: illustrative distribution of market share across platforms. Source: Mode.com]
This is of course a double-edged sword, as the big players you will be sourcing from have high leverage when it comes to entering data-sharing agreements, which allows them to:
1) act as de facto gatekeepers to a particular market and set their own data usage policies, especially in a less regulated market
2) charge more for the same unit of data volume when entering data-sharing agreements
3) effectively monitor potential competitive threats to their core business from startups that require access to their data and are thus more dependent on their services
At the same time, given a skewed distribution of market share and in the absence of enforced competition regulation, this is where the true value of the data resides. Aspiring Data Science teams that want to get their hands on this data therefore need to pay a price to cover the majority of the market and access high-volume, high-quality data points.
N.B. For non-commercial or research purposes, you are probably OK just scraping data off these websites, although the activity is not always appreciated when done at high frequency and volume. This is purely a practical consideration: I do not encourage web scraping on websites which have policies against it, and you are always better off respecting the terms and conditions of the data provider.
Always look for APIs first
Once you have identified the main data sources, your first step is to look through their developer resources and figure out:
Whether they have an active API from which you can pull the data you need
What their overall data-sharing terms and conditions (T&Cs) are
Zoopla, for example, has an API page, which can be used to retrieve a few features and listings data. Zoopla’s API has not been updated in a while and has apparently drawn criticism, previously documented on Medium, but this is the type of information you want to look for when comparing different data sources.
Moving on to RightMove, their official website directs you to their Data Services page. They do not seem to have or authorize any official API at the time of writing. OnTheMarket.com does not seem to have an API either.
Checking the main players is incredibly useful for determining the next steps in your data collection strategy. If you find an active API, you can get some sample data and decide:
Whether the data volume and quality are enough for your application
Whether you are in violation of their T&Cs
Whether you want to get in touch with the data providers (see next steps) to submit a formal data request and obtain further, hopefully richer, datasets
Whether to move on to smaller players in the market which may give you enough data (via their own API) to start off with, such as other aggregators like Nestoria, which does provide one
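One way to make the "is the data quality enough" decision less subjective is to measure how often the fields you care about are actually populated in the sample you pulled. The sketch below is a minimal example, assuming the sample has already been loaded as a list of dictionaries. The field names (`price`, `description`, `image_urls`), the records themselves, and the 70% threshold are hypothetical and do not reflect any provider's real response format.

```python
def field_completeness(records, required_fields):
    """Return, for each required field, the fraction of records with a non-empty value."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in required_fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "", [])) / total
        for field in required_fields
    }


# Hypothetical sample listings, for illustration only.
sample_listings = [
    {"price": 1200, "description": "Two-bed flat", "image_urls": ["a.jpg"]},
    {"price": 950, "description": "Studio", "image_urls": []},
    {"price": None, "description": "Three-bed house", "image_urls": ["b.jpg", "c.jpg"]},
    {"price": 1500, "description": "One-bed flat"},
]

completeness = field_completeness(sample_listings, ["price", "description", "image_urls"])

# Flag the fields that fall below a (hypothetical) 70% completeness threshold.
weak_fields = [field for field, ratio in completeness.items() if ratio < 0.7]
print(completeness, weak_fields)
```

Fields that come back sparsely populated are good candidates to raise with the provider in a data request, or a reason to look at an alternative source.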
No matter the case, do not skip this step as it provides very valuable information, even if you are not immediately given access to what you need.
Don’t be afraid to get in touch with data providers and discuss potential data-sharing agreements
In my case, I decided to dig a bit deeper and tentatively got in touch with RightMove and Zoopla via email and LinkedIn, searching for people in Analytics roles and reaching out to the most promising contacts.
I recommend doing this, as you can often find people on the other side who are interested in supporting developers and hearing about interesting use cases. You may also uncover information which you did not previously notice while reading through the documentation.
In my case, I found RightMove to be very restrictive about the use of their data, and the only thing I really obtained from them was a cold shoulder. The same went for Zoopla, which merely referred me back to their existing API, whose data richness I doubted after testing it briefly with a Python script.
At this point, I decided to search online to identify applications and platforms which already made use of data coming from either one of the two main providers, and see if I could extract further information on how they had done so and potentially at what cost.
I could have also doubled down on Zoopla & RightMove and decided to propose a data-sharing agreement, but as a single individual, how much leverage would I realistically possess in such a conversation?
In similar cases, when you are trying to decide where and how to collect your data, I suggest you either:
Take your time researching the market and the various data providers, and give yourself as many potential data sources as possible, which will also allow you to compare their costs against the budget you are willing to allocate to your project
Take your time to establish a relationship with your few providers of choice (if they do not have a clear-cut API, as in this case) and extract as much pricing and other information from them as possible, while being very transparent about the use you plan to make of their data (research, commercial, personal, etc.)
Leverage the expertise of others who have collected the data before you
After having identified your main data sources and having checked for APIs and their usage potential, you’d also want to reach out to other market players who are exploiting those same data sources and see if you can uncover further insights.
I found this to be an incredibly useful little step for getting high-quality contextual information on data collection costs.
For example, I found a great website, PropertyData, which cites the same data sources I was looking for, so I immediately sent an email using their contact form.
To my surprise, the founder himself replied. He mentioned how much one provider was charging PropertyData for the data they needed, and confirmed that they had not been able to convince another provider to share its data at any price point. This matched my own negative experience when reaching out to the providers via email and LinkedIn.
-(below is the extract from the email response I got from PropertyData, sanitised where possible for confidentiality reasons)-
“We pay Source 1 £XX per month. That did the trick to get us what we needed!
Source 2, no amount of money makes them interested!
PropertyData”
This is great information as:
It gives you an actual figure from which to extrapolate data collection costs for similar providers, in the absence of any API or published price points.
It gives you a further indication of which data sources might be more feasible to work with and which ones you might avoid altogether, using the experience of others as a compass.
I always recommend taking the time to reach out to those who have done it before and simply asking. You might get surprisingly positive and helpful responses in return!
Run your estimations and check financial and technical feasibility
By this point, you should have collected all the information needed to calculate the monthly running costs for data collection, which can be estimated by:
(Number of data sources * Avg. Monthly Subscription Costs of API/Data Agreement)
On top of this, you might want to factor in any Cloud Computing resources. These will depend on your data collection scripts and on the amount of processing (driven by run time and data size) needed to get your data into your data lake or data warehouse for later processing and analysis.
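The estimate above can be sketched in a few lines of Python. This is a minimal example rather than a full costing model: the `estimate_monthly_cost` helper, the source names, the fees, the cloud cost, and the budget are all hypothetical placeholders to be replaced with the figures you collected.

```python
def estimate_monthly_cost(subscription_costs, cloud_cost=0.0):
    """Estimate monthly data collection costs.

    `subscription_costs` maps each data source to its monthly fee (API or data agreement).
    `cloud_cost` is the estimated monthly spend on cloud computing resources.
    """
    n_sources = len(subscription_costs)
    avg_subscription = sum(subscription_costs.values()) / n_sources if n_sources else 0.0
    # Number of data sources * average monthly subscription cost
    data_cost = n_sources * avg_subscription
    return {
        "n_sources": n_sources,
        "avg_subscription": avg_subscription,
        "data_cost": data_cost,
        "cloud_cost": cloud_cost,
        "total": data_cost + cloud_cost,
    }


# Hypothetical monthly figures in GBP, for illustration only.
example_costs = {"Source A": 200.0, "Source B": 100.0}
estimate = estimate_monthly_cost(example_costs, cloud_cost=50.0)

monthly_budget = 400.0  # hypothetical budget
within_budget = estimate["total"] <= monthly_budget
print(estimate, within_budget)
```

Keeping the per-source fees separate, rather than entering a single average, makes it easy to see which provider dominates the total and what happens if you drop one.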
Aside from the numbers, at this point you should also develop a sense of the overall technical feasibility of the approach given your project setup, and of whether it makes sense to proceed or to pivot your data collection strategy completely.
In summary
A sound data collection methodology can get your data science project up and running in the best way, securing the best possible data at the best possible price given your knowledge of the market and the data providers available.
If you can:
Conduct solid market research and identify the best quality sources
Thoroughly check for existing APIs and their (usually) rich documentation
Additionally reach out to data providers to discuss potential data requests and their willingness to assist you
Further increase your knowledge base by asking people and companies who have been given access to the data before you
Get a fair estimate of how much time and money you are realistically going to spend to capture all the data you need
You can greatly increase your chances of developing a sound approach to data collection and of getting great data in an efficient way. Thanks for reading!
Translated from: https://towardsdatascience.com/how-to-estimate-data-collection-costs-for-your-data-science-project-8938ca9acc5f