Using Open Source Data and Machine Learning to Predict Ocean Temperatures
In this tutorial, we’re going to show you how to take open source data from the National Oceanic and Atmospheric Administration (NOAA), clean it, and forecast future temperatures using no-code machine learning methods.
This particular data comes from the Harmful Algal BloomS Observation System (HABSOS). There are several interesting questions to ask of this data, chief among them: what is the relationship between algal blooms and water temperature fluctuations? For this tutorial, we’re going to start with a basic question: can we predict what temperatures will be over the next five months?
The first part of this tutorial deals with acquiring and cleaning the dataset. There are a lot of approaches to this; what is shown below is just one approach. Further, if your dataset is already clean, you can skip all that “data engineering” and jump straight into no-code AI bliss :)
Step 1: Download & Clean the Data
First, we download the data from the HABSOS site linked above. For convenience, we are posting the file here as well.
This CSV has 21 columns, which we discovered with this bash command.
$ awk -F',' '{print NF}' habsos_20200310.csv | sort -nu | tail -n 1
21
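If you prefer to check in Python instead, a quick look at the header gives the same answer. This is only a sketch; it assumes the CSV sits in your working directory under the same file name.

import pandas as pd

# Read only the header row so we can list the column names without loading the full file
header = pd.read_csv('habsos_20200310.csv', nrows=0)
print(len(header.columns))   # 21
print(list(header.columns))  # includes sample_date, sample_depth, and water_temp, among others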
We’ll explore the rest of the data in subsequent tutorials, but of these 21 columns, the only ones we’re interested in for now are:
- sample_date
- sample_depth
- water_temp
In addition to only needing a subset of the columns in the data, there are other issues to deal with in order to get the data ready for analysis. We need to:
- Remove rows with NaN values (i.e. empty values) in the water_temp column,
- Select only the measurements made at a depth of 0.5 meters (to remove temperature variability due to ocean depth), and
- Regularize the data periods by turning the datetime values into date values.
import pandas as pd
from datetime import datetime as dt

df = pd.read_csv('habsos_20200310.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
pd.set_option('display.max_rows', None)
# Get only the columns we care about
dfSub = df[['sample_date', 'sample_depth', 'water_temp']]
# Remove the NaN values
dfClean = dfSub.dropna()
# Select 0.5 m depth measurements only (string comparison, since everything was read as unicode)
dfClean2 = dfClean.loc[dfClean['sample_depth'] == '0.5'].copy()
# Keep only the date part of the datetime strings
dfClean2['sample_date'] = dfClean2['sample_date'].str.split(expand=True)[0]
dfClean2.to_csv(r'/PATH/TO/YOUR/OUTPUT/out.csv', index=False)
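A quick sanity check on the cleaned file helps confirm the filters behaved as expected; this step is optional, and the exact row count will depend on the download you start from.

check = pd.read_csv(r'/PATH/TO/YOUR/OUTPUT/out.csv')
print(check.shape)                     # rows that survived the NaN and depth filters
print(check['sample_depth'].unique())  # should contain only 0.5
print(check.head())                    # eyeball the date format before moving on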
There’s another big problem with this data: on certain days, there are multiple sensor readings; on other days, there are no sensor readings. Sometimes there are entire months without readings.
These problems are quicker to address in spreadsheets by using pivot tables. And, now that we have reduced the size of the data with the preceding Python script, we are able to load it into a Google Sheet.
What we ended up doing was making a pivot table for each month of each year (1954 to 2020) and taking the median water temperature for that month. We used the median instead of the average in case there were wild outlier measurements that might skew our summarized data.
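If you would rather skip the spreadsheet, roughly the same monthly-median summary can be produced in pandas. The sketch below is illustrative only; it assumes the out.csv written above and that pandas can parse its date strings.

import pandas as pd

df = pd.read_csv(r'/PATH/TO/YOUR/OUTPUT/out.csv')
# Parse the date strings; errors='coerce' turns anything unparseable into NaT, which we drop
df['sample_date'] = pd.to_datetime(df['sample_date'], errors='coerce')
df = df.dropna(subset=['sample_date'])

# Median (not mean) water temperature per calendar month, to blunt wild outlier readings
monthly = (
    df.set_index('sample_date')['water_temp']
      .astype(float)
      .resample('M')
      .median()
)
print(monthly.tail())  # months with no readings show up as NaN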
Our results are available for viewing in the third tab of this Google Sheet.
Let’s take those results and bring them into Monument!
Step 2: Chart the Data & Use No-Code Machine Learning to Generate a Forecast
To chart the data, we’re first going to load it into Monument (www.monument.ai). Monument is an artificial intelligence/machine learning platform that allows you to use advanced algorithms without touching a line of code.
First, we’re going to import our freshly cleaned data into Monument as a CSV file. In the INPUT tab, you’ll see the data as it exists in the source file on the top and the data as it will be imported into Monument on the bottom. If you’re satisfied with how it will be imported, click OK in the bottom right.
Load the data!
When you click OK, you’ll be brought into the MODEL tab. You can drag the “data pills” from the far left into the COLS(X) and ROWS(Y) areas to chart the data. You will clearly see the gaps in the data, where there were months with no temperature readings.
Monument’s algorithms can handle missing data.
This data has a visually recognizable pattern: it resembles a sine wave. In general, and especially when data has a repetitive pattern, it’s good to start an analysis with AutoRegression (AR). AR is one of the more “primitive” algorithms, but it often learns obvious patterns quickly.
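Monument applies AR without any code, but if you want a feel for what an autoregressive fit does to a seasonal series like this one, here is a rough open-source analogue using statsmodels. It is a sketch only: it reuses the monthly series built earlier, fills the gap months by interpolation, and makes no claim to match Monument’s AR implementation or settings.

from statsmodels.tsa.ar_model import AutoReg

# Fill the gap months so the AR model sees a regular monthly series
series = monthly.interpolate(limit_direction='both')

# Twelve lags let the model look back over a full annual cycle
fit = AutoReg(series, lags=12).fit()

# In-sample predictions show how closely the fit tracks the training period
in_sample = fit.predict(start=12, end=len(series) - 1)
print(in_sample.tail())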
When we apply AR to the water temperature data by dragging it into the chart, we see a spiked divergence from the actual historical data early in the training period, but the algorithm quickly gets a handle on what is happening in the dataset.
By the end of the training data, it almost perfectly overlays onto the training set. When an algorithm does a good job anticipating known historical data in the training period, it can be an indication that the algorithm will do well forecasting the future. (However, a concern is “overfitting,” which we will explore in future articles.)
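One tool-agnostic way to keep an eye on overfitting is to hold back the most recent months and check whether the model still tracks data it never saw. A minimal sketch, reusing the interpolated series from the snippet above:

# Train on everything except the last 12 months, then forecast the held-out year
train, test = series.iloc[:-12], series.iloc[-12:]
holdout_fit = AutoReg(train, lags=12).fit()
pred = holdout_fit.predict(start=len(train), end=len(train) + len(test) - 1)
print(abs(pred.values - test.values).mean())  # mean absolute error on the held-out year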
Off to a good start!
Now, let’s try a Dynamic Linear Model (DLM). DLM is a slightly more complex algorithm; let’s see if it gets us even better results. When we drag DLM into the chart, we notice immediately that something seems off: DLM appears out of sync with the training data. It has trouble anticipating where the peaks and troughs are in the historical data.
Uh oh…
If we zoom in by dragging the windowing widget below the chart and mute the AR results by clicking the color box above the chart, the effect is even more pronounced. The historical data and DLM are out of sync, so it’s unlikely that the forecasted results beyond the historical data will be reliable.
Not looking good…
Let’s try Time-Varying AutoRegression (TVAR). It looks like it produces results similar to AR.
Looking good.
Now, let’s try Long Short-Term Memory (LSTM). This one is way off! An LSTM often produces great results for “noisier” data that has less regular patterns. However, on highly patterned data like this dataset, it has trouble.
There are ways to improve the performance of the LSTM (and any algorithm) by adjusting the algorithm’s parameters, but we already have algorithms performing well, so it doesn’t seem worth the effort.
The LSTM has forsaken us…
Now, let’s zoom in to see what we are working with by using the windowing widget at the bottom of the chart. Let’s also click the circles icon in the top right of Monument and select “forecast” to remove the training period and show only the prediction.
TVAR had looked good when zoomed out, but up close all of the other algorithms agree with one another while TVAR diverges. Let’s drop TVAR.
TVAR does not look so good up close.
Let’s bring back “training+forecast,” remove everything but AR, and apply the Gaussian Dynamic Boltzmann Machine (G-DyBM). Things are looking pretty good now :)
The sweet spot.
Let’s flip over to the OUTPUT tab and scroll to the bottom to see our forecasts. Because we made our data periods monthly, p1, p2, p3, p4, and p5 are Month-1, Month-2, Month-3, Month-4, and Month-5 into the future.
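For comparison with Monument’s p1 through p5 columns, the same five-months-ahead horizon can be produced from the statsmodels sketch above. The numbers are illustrative only and will not match Monument’s output exactly.

# Forecast five steps past the end of the series: the analogue of p1 through p5
forecast = fit.predict(start=len(series), end=len(series) + 4)
print(forecast)  # one median-temperature forecast per future month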
In this tutorial, we took open source data from the internet, cleaned it, loaded it into Monument, and — in minutes! — used advanced data science methods to get forecasts for future median monthly water temperatures in the Gulf of Mexico at a depth of 0.5 meters.
You can download the .mai file of our results from this link.
In the next tutorial, we’ll look deeper at the error rates for each of the algorithms we tried above and discuss why we might select one algorithm over another. We’ll also calculate the standard deviation for the outliers and discuss why this is important.
Translated from: https://medium.com/swlh/using-open-source-data-machine-learning-to-predict-ocean-temperatures-2c8d65165665