How To: Automate Live Data To Your Website with Python
This tutorial will be helpful for people who have a website that hosts live data on a cloud service but are unsure how to completely automate updating that data, so the website becomes hassle-free. For example: I host a website that shows Texas COVID case counts by county in an interactive dashboard, but every day I had to run a script to download the Excel file from the Texas COVID website, clean the data, update the pandas data frame used to create the dashboard, upload the updated data to the cloud service I was using, and reload my website. This was annoying, so I used the steps in this tutorial to make my live data website totally automated.
I will only be going over how to do this using the cloud service pythonanywhere, but these steps can be transferred to other cloud services. Another thing to note is that I am new to building and maintaining websites, so please feel free to correct me or give me constructive feedback on this tutorial. I will assume that you have basic knowledge of Python, Selenium for web scraping, and bash commands, and that you have your own website. Let's go through the steps of automating live data to your website:
I will not be going through some of the code shown below, because much of it is the same code from my last tutorial on how to create and automate an interactive dashboard using Python, found here. Let's get started!
1. web scraping with selenium using a cloud service
So, in your cloud service of choice (mine being pythonanywhere), open up a python3.7 console. I will show the code in chunks, but all of it can be combined into one script, which is what I have done. Also, you will have to change all the file paths in the code to your own for it to work.
from pyvirtualdisplay import Display
from selenium import webdriver
import time

with Display():
    # these Firefox preferences allow selenium to download files
    # without a save-file dialog popping up
    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                           "application/octet-stream,application/vnd.ms-excel")

    # we can now start Firefox and it will run inside the virtual display
    browser = webdriver.Firefox(firefox_profile=profile)

    # put the rest of our selenium code in a try/finally
    # to make sure we always clean up at the end
    try:
        browser.get('https://www.dshs.texas.gov/coronavirus/additionaldata/')
        # locate the download link on the html page and click on it to download
        link = browser.find_element_by_xpath('/html/body/form/div[4]/div/div[3]/div[2]/div/div/ul[1]/li[1]/a')
        link.click()
        # wait 30 seconds to allow Firefox to download the file
        time.sleep(30)
        print(browser.title)
    finally:
        browser.quit()
In the chunk of code above, I open up a Firefox browser within pythonanywhere using their pyvirtualdisplay library. No new browser window will pop up on your computer since it is running in the cloud. This means you should test the script on your own computer first, without the Display() context manager, because error handling will be difficult within the cloud server. (Note that the original code used Chrome-style options with Firefox; download preferences for Firefox belong in a FirefoxProfile, as shown above.) I then download an .xlsx file from the Texas COVID website, and it is saved in the /tmp directory within pythonanywhere. To access /tmp, just click on the first "/" in the Files tab, which precedes the home directory button. This is all done within a try/finally block, so after the script runs, we close the browser and do not use any more CPU time on the server. Another thing to note is that pythonanywhere only supports one version of selenium, 2.53.6. You can downgrade to this version of selenium using the following bash command:
pip3.7 install --user selenium==2.53.6

2. converting downloaded data in a .part file to .xlsx file
import shutil
import glob
import os

# locate the most recently downloaded .xlsx.part file
list_of_files = glob.glob('/tmp/*.xlsx.part')
latest_file = max(list_of_files, key=os.path.getmtime)
print(latest_file)

# locate the old .xlsx file(s) in the dir we want to store the new xlsx file in
list_of_files = glob.glob('/home/tsbloxsom/mysite/get_data/*.xlsx')
print(list_of_files)

# delete the old xlsx file(s) so if we download a new xlsx file
# with the same name we do not get an error while moving it
for file in list_of_files:
    print("deleting old xlsx file:", file)
    os.remove(file)

# move new data into the data dir, renaming .xlsx.part to .xlsx
shutil.move("{}".format(latest_file), "/home/tsbloxsom/mysite/get_data/covid_dirty_data.xlsx")
When you download .xlsx files in pythonanywhere, they are stored as .xlsx.part files. After some research, I found that .part files are normally created when a download is stopped before completing. These .part files cannot be opened with typical tools, but there is an easy trick around this problem. In the above code, I automate moving the new data into, and deleting the old data from, my cloud directories. The part to notice is that when I move the .xlsx.part file, I save it as a .xlsx file. This converts it: when you open the new .xlsx file, it has all the live data, which means my script did download the complete .xlsx file and pythonanywhere simply added a .part suffix to it. Weird, but hey, it works.
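To convince yourself that this "conversion" is nothing more than a rename, here is a self-contained sketch using throwaway files in a temporary directory (the file names below are made up for illustration):

```python
import os
import shutil
import tempfile

# simulate the pythonanywhere situation: a completed download
# that was left on disk with a .part suffix
tmp_dir = tempfile.mkdtemp()
part_path = os.path.join(tmp_dir, "covid_data.xlsx.part")
with open(part_path, "wb") as f:
    f.write(b"pretend spreadsheet bytes")

# moving the file to a name without the .part suffix is the whole trick
final_path = os.path.join(tmp_dir, "covid_dirty_data.xlsx")
shutil.move(part_path, final_path)

print(os.path.exists(final_path))  # the .xlsx now exists
print(os.path.exists(part_path))   # the .part is gone
```

The bytes on disk never change; only the name does, which is why the resulting .xlsx opens normally.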
當您在pythonanywhere中下載.xlsx文件時,它們將存儲為.xlsx.part文件。 經過研究,這些.part文件是在您停止下載完成時引起的。 這些.part文件無法使用典型工具打開,但是可以解決此問題。 在上面的代碼中,我自動移動了新數據并刪除了云目錄中的舊數據。 需要注意的部分是,當我移動.xlsx.part文件時,我將其另存為.xlsx文件。 這會神奇地進行轉換,當您打開這個新的.xlsx文件時,它具有所有實時數據,這意味著我的腳本確實下載了完整的.xlsx文件,但是pythonanywhere向該文件中添加了.part很奇怪,但嘿,它起作用了。
3. re-loading your website using the os python package
import pandas as pd
import re

# read in the newest .xlsx file
list_of_files = glob.glob('/home/tsbloxsom/mysite/get_data/*.xlsx')
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)
df = pd.read_excel("{}".format(latest_file), header=None)

# print out latest COVID data datetime and notes
date = re.findall("- [0-9]+/[0-9]+/[0-9]+ .+", df.iloc[0, 0])
print("COVID cases latest update:", date[0][2:])
print(df.iloc[1, 0])
#print(str(df.iloc[262:266, 0]).lstrip().rstrip())

# drop non-data rows
df2 = df.drop([0, 1, 258, 260, 261, 262, 263, 264, 265, 266, 267])

# clean column names
df2.iloc[0, :] = df2.iloc[0, :].apply(lambda x: x.replace("\r", ""))
df2.iloc[0, :] = df2.iloc[0, :].apply(lambda x: x.replace("\n", ""))
df2.columns = df2.iloc[0]
clean_df = df2.drop(df2.index[0])
clean_df = clean_df.set_index("County Name")
clean_df.to_csv("/home/tsbloxsom/mysite/get_data/Texas county COVID cases data clean.csv")

df = pd.read_csv("Texas county COVID cases data clean.csv")

# convert df into a time series where rows are each date and clean up
df_t = df.T
df_t.columns = df_t.iloc[0]
df_t = df_t.iloc[1:]
df_t = df_t.iloc[:, :-2]

# next let's convert the index to a datetime; must clean up dates first
def clean_index(s):
    s = s.replace("*", "")
    s = s[-5:]
    s = s + "-2020"
    #print(s)
    return s

df_t.index = df_t.index.map(clean_index)
df_t.index = pd.to_datetime(df_t.index)

# initialize df with three columns: Date, Case Count, and County
anderson = df_t.T.iloc[0, :]
ts = anderson.to_frame().reset_index()
ts["County"] = "Anderson"
ts = ts.rename(columns={"Anderson": "Case Count", "index": "Date"})

# this while loop adds all counties to the above ts so we can input it into plotly
x = 1
while x < 254:
    new_ts = df_t.T.iloc[x, :]
    new_ts = new_ts.to_frame().reset_index()
    new_ts["County"] = new_ts.columns[1]
    new_ts = new_ts.rename(columns={new_ts.columns[1]: "Case Count", "index": "Date"})
    ts = pd.concat([ts, new_ts])
    x += 1

ts.to_csv("/home/tsbloxsom/mysite/data/time_series_plotly.csv")
time.sleep(5)

# reload website with updated data
os.utime('/var/www/tsbloxsom_pythonanywhere_com_wsgi.py')
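As an aside, the Anderson-column-plus-while-loop reshape above is a classic wide-to-long conversion, and pandas can do the same thing in a single melt call. The miniature frame below is made up (two counties, two dates) just to show the shape:

```python
import pandas as pd

# miniature stand-in for the cleaned wide table: one row per county,
# one column per date (the real one has 254 counties)
wide = pd.DataFrame(
    {"County Name": ["Anderson", "Bexar"],
     "03-04-2020": [0, 1],
     "03-05-2020": [2, 3]}
)

# melt turns the date columns into rows: one (County, Date, Case Count)
# row per county per date, which is the format plotly wants
ts = wide.melt(id_vars="County Name",
               var_name="Date",
               value_name="Case Count").rename(columns={"County Name": "County"})
print(ts)
```

The while loop works fine too; melt just avoids the repeated concats.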
Most of the above code I explained in my last post, which deals with cleaning Excel files using pandas for input into a plotly dashboard. The most important line for this tutorial is the very last one. The os.utime function updates the access and modified times of a file. When you call it on your Web Server Gateway Interface (WSGI) file, pythonanywhere will reload your website!
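You can see what os.utime actually does with a quick stand-in file (the path here is a temporary file, not the real WSGI file, which on pythonanywhere lives at a path like /var/www/yourusername_pythonanywhere_com_wsgi.py):

```python
import os
import tempfile
import time

# create a throwaway stand-in for the WSGI file
fd, path = tempfile.mkstemp(suffix="_wsgi.py")
os.close(fd)

before = os.path.getmtime(path)
time.sleep(1.1)  # make sure the clock moves past the old mtime

# calling os.utime with no times argument stamps the file with "now";
# pythonanywhere watches the WSGI file's mtime and reloads the web app
os.utime(path)

after = os.path.getmtime(path)
print(after > before)
os.remove(path)
```

Touching the file's timestamps is all the reload trigger amounts to; it is the same effect as running `touch` on the WSGI file in a bash console.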
4. scheduling a python script to run every day in pythonanywhere
[Image by yours truly: the pythonanywhere Tasks tab]

Now for the easy part! After you combine the above code into one .py file, you can make it run every day or hour using pythonanywhere's Tasks tab. All you do is copy and paste the bash command, with the full directory path, that you would use to run the .py file into the bar in the image above and hit the create button! You should first test the .py file using a bash console to see that it runs correctly. But now you have a fully automated data-scraping script that your website can use to display daily or hourly updated data, without you having to push a single button!
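For reference, the command you paste into the Tasks tab is just the same one you would run in a bash console; something like this (the path and file name are hypothetical, use your own):

```shell
python3.7 /home/yourusername/mysite/get_data/update_covid_data.py
```

pythonanywhere will then run that command on the daily or hourly schedule you pick.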
If you have any questions or critiques please feel free to say so in the comments and if you want to follow me on LinkedIn you can!
Originally published at: https://towardsdatascience.com/how-to-automate-live-data-to-your-website-with-python-f22b76699674