Data Pipeline for Data Science, Part 2: TMDb API Data Crawler
Distributed TMDb API Data Download using AWS Lambda.
Wanna hear occasional rants about Tensorflow, Keras, DeepLearning4J, Python and Java?
Join me on twitter @ twitter.com/hudsonmendes!
Taking Machine Learning models to production is a battle, and there I share my learnings (and my sorrows), so we can learn together!
數(shù)據(jù)科學(xué)系列的數(shù)據(jù)管道 (Data Pipeline for Data Science Series)
This is a large tutorial that we have tried to keep conveniently small for the occasional reader; it is divided into the following parts:
Part 1: Problem/Solution Fit
Part 2: TMDb Data “Crawler”
Part 3: Infrastructure As Code
(soon available) Part 4: Airflow & Data Pipelines
(soon available) Part 5: DAG, Film Review Sentiment Classifier Model
(soon available) Part 6: DAG, Data Warehouse Building
(soon available) Part 7: Scheduling and Landings
The Problem: Linking IMDb ids and TMDb ids
This project has the following problem statement:
Data Analysts must be able to produce reports on demand, as well as run roll-up and drill-down queries into what the Review Sentiment is for both IMDb films and IMDb actors/actresses, based on their TMDb Film Reviews; and the Sentiment Classifier must be our own.
Looking at the TMDb API specification, we find that they have links to the IMDb Film ids:
Source TMDb: https://developers.themoviedb.org/3/getting-started/external-ids

However, the “find” endpoint only supports one ID per request:
Source TMDb: https://developers.themoviedb.org/3/find/find-by-id

Given that we need as many links as we can get between IMDb films and TMDb films, from which we will pull the reviews, we will need to call the API a very large number of times.
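To make the one-id-per-request limitation concrete, here is a minimal sketch of resolving a single IMDb id through the /find endpoint. The helper names and error handling are our own illustration, not the project's code:

```python
import json
import urllib.parse
import urllib.request

TMDB_API = 'https://api.themoviedb.org/3'


def find_url(imdb_id: str, api_key: str) -> str:
    # the /find endpoint accepts exactly one external id per call
    query = urllib.parse.urlencode({
        'api_key': api_key,
        'external_source': 'imdb_id'})
    return f'{TMDB_API}/find/{imdb_id}?{query}'


def find_tmdb_movie(imdb_id: str, api_key: str):
    # one HTTP round-trip per IMDb id: this is what forces
    # millions of requests for millions of links
    with urllib.request.urlopen(find_url(imdb_id, api_key)) as response:
        payload = json.load(response)
    results = payload.get('movie_results', [])
    return results[0] if results else None
```

Each IMDb id costs a full round-trip, which is exactly why the total volume of requests becomes the bottleneck.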
Technical Challenge: Network Latency
Photo by Brett Sayles from Pexels

The TMDb API does indeed have a “fast, consistent and reliable way to get third party data”, as they claim in their documentation.
However, every time we request an endpoint, we are subject to network latency.
Our millions of requests (7,167,768 of them) would amount to approximately 35 days of serial downloading.
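A quick back-of-the-envelope check shows where that figure comes from, assuming an average round-trip latency of roughly 420 ms per request (the latency number is our assumption, not a measured value):

```python
requests_total = 7_167_768
latency_seconds = 0.42  # assumed average round-trip per request

# if every request waits for the previous one, latencies simply add up
serial_seconds = requests_total * latency_seconds
serial_days = serial_seconds / (60 * 60 * 24)
print(f'{serial_days:.1f} days')  # roughly 35 days when run serially
```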
Solution: Parallelism & Distributed Data Download
Network latency only hurts us if we request the data serially; in simpler terms, it only matters if the second request must wait until the first request has completely finished.
The terminology for the options we have varies greatly, but in simple terms, the approaches we could take are:
CPU-level Parallelism: based on a multithreading (or similar) approach, where all the connections are handled by a single computer's networking infrastructure, and the requests are parallelised by the CPU.
Distributed Machine Parallelism: based on multiple machines, each one with a different networking infrastructure, running requests independently.
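The first approach can be sketched with Python's thread pool. This is illustrative only; the fetch function below is a stand-in for the real HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(movie_id: str) -> str:
    # stand-in for the real HTTP request to the TMDb API
    return f'downloaded {movie_id}'


movie_ids = ['tt0133093', 'tt0234215', 'tt0242653']

# all requests run inside one process, over one machine's network stack
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, movie_ids))
```

This hides latency per thread, but every connection still flows through a single machine, which is where the constraints below come from.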
Given the vast number of requests we are going to make (millions), here are some constraints that we face when trying to download the data:
- A large number of threads (millions)
- A large number of connections to the API (millions)
These constraints basically lead us to the logical choice of Distributed Machine Parallelism to solve our problem.
Infrastructure: Serverless with AWS Lambda
We don't want to be setting up tons of servers to do the work for us just for this task, especially because after finishing this job we want to tear down our whole infrastructure.
For that reason, the obvious choice is Cloud Computing.
Each of our download tasks can be coded to be a simple “download one film JSON”; in other words, a single “download” can be a simple function.
AWS Lambda: https://aws.amazon.com/lambda/

That function is simple enough to be run by AWS Lambda, which removes our need to set up an entire server.
We give AWS Lambda a Python function; it spins up a number of servers and executes the function for us.
Given our requirements, our function could be something as simple as the following:
```python
import os
import json

from pipeline import IMDb, TMDb
from infra import Config


def lambda_handler(event, context):
    """
    Downloads 'movies' for a particular {year}, with names that
    start with a particular {initial} character.

    Parameters
    ----------
    - event  : 'Records' have the messages received from SQS (full body)
    - context: lambda context wrapper

    Message Body
    ------------
    - year   : the year for which movies will be downloaded
    - initial: the first non-blank character of the name of the movie
    """
    config = Config()
    imdb = IMDb(
        bucket_name=config.get_datalake_bucket_name())
    tmdb = TMDb(
        bucket_name=config.get_datalake_bucket_name(),
        api_key=config.get_tmdb_api_key())
    for record in event['Records']:
        body = json.loads(record['body'])
        year = int(body['year'])
        initial = body['initial']
        print(f'Lambda, processing partition ({year}, {initial})')
        imdb_movies_stream = imdb.get_movie_refs_stream(
            year=year,
            initial=initial)
        tmdb_movie_and_reviews_generator = tmdb.get_movies_related_to(
            imdb_movies_stream=imdb_movies_stream)
        processed_count = 0
        for tmdb_movie, tmdb_reviews in tmdb_movie_and_reviews_generator:
            tmdb_movie.save()
            tmdb_reviews.save()
            processed_count += 1
        print(f'Lambda, completed processing {processed_count}')
    return {
        'statusCode': 200,
        'body': json.dumps(body)
    }
```
The full source code for that module can be found here:
AWS Lambda Function: Installing
To make our life easier, we wrapped our entire code in a click CLI program:
```python
import os
import json

import click


@click.group()
def cli():
    """
    Command line group, allowing us to run `python tdd` commands
    in the root folder of this repository.
    """
    pass


@cli.command()
@click.option('--datalake_bucket_name', prompt='DataLake, Bucket Name', help='The S3 BucketName to which you will dump your files', default='hudsonmendes-datalake')
@click.option('--tmdb_api_key', prompt='TMDB, API Key', help='Find it in https://www.themoviedb.org/settings/api', default=lambda: os.environ.get('TMDB_API_KEY', None))
def development(datalake_bucket_name: str, tmdb_api_key: str):
    """
    Setup the development environment locally, with the required configuration.
    `python tdd development --datalake_bucket_name [bucket_name] --tmdb_api_key [api_key]`
    """
    from infra import Config
    Config().update(
        datalake_bucket_name=datalake_bucket_name,
        tmdb_api_key=tmdb_api_key)


@cli.command()
@click.option('--year', prompt='IMDB, Year', default=2004, help='Year of movies that will be downloaded')
@click.option('--initial', prompt='IMDB, Initial', default='AD', help='First letter of the films that will be downloaded')
@click.option('--queue_name', prompt='AWS SQS, Queue', default='hudsonmendes-tmdb-downloader-queue', help='The name of the queue to which we will send the message')
def simulate(year: int, initial: str, queue_name: str):
    """
    Simulates the system by sending a one-off message to the SQS queue,
    so that the Lambda Function can pick it up and we can evaluate that
    the whole system is functioning.
    """
    import boto3
    messages = [{'year': year, 'initial': initial}]
    sqs = boto3.resource('sqs')
    queue = sqs.get_queue_by_name(QueueName=queue_name)
    for message in messages:
        body = json.dumps(message)
        queue.send_message(MessageBody=body)


@cli.command()
@click.option('--year', prompt='IMDB, Year', default=2004, help='Year of movies that will be downloaded')
@click.option('--initial', prompt='IMDB, Initial', default='AD', help='First letter of the films that will be downloaded')
def download(year: int, initial: str):
    """
    Invokes the lambda_function manually for a one-off download.
    Can be used for debug purposes.
    """
    import lambda_function
    event = {'Records': [{'body': json.dumps({'year': year, 'initial': initial})}]}
    lambda_function.lambda_handler(event=event, context=None)


@cli.command()
@click.option('--lambda_name', prompt='AWS Lambda, Function Name', default='hudsonmendes-tmdb-downloader-lambda', help='The name of the function to which we will deploy')
@click.option('--queue_name', prompt='AWS SQS, Queue', default='hudsonmendes-tmdb-downloader-queue', help='The name of the queue to which we will send the message')
@click.option('--datalake_bucket_name', prompt='DataLake, Bucket Name', help='The S3 BucketName to which you will dump your files', default='hudsonmendes-datalake')
def deploy(lambda_name, queue_name, datalake_bucket_name):
    """
    Deploy the system into lambda, creating everything that is necessary to run.
    """
    from infra import Deploy
    Deploy(
        lambda_name=lambda_name,
        queue_name=queue_name,
        datalake_bucket_name=datalake_bucket_name).deploy()


if __name__ == "__main__":
    cli()
```
That allows us to run our code using an elegant CLI tool, with the following commands:
```
# from the root of the git repository
python tdd development   # runs our development setup
python tdd download      # downloads one year of data = one job
python tdd deploy        # installs the lambda into AWS Lambda
```
The python tdd deploy command then deploys our code into Lambda.
Running our Distributed API Data Downloader
Our AWS Lambda configuration waits for messages in an AWS SQS queue. Whenever messages arrive there, AWS starts spinning up instances of our Lambda function to consume them.
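For reference, the event the Lambda handler receives from an SQS trigger looks roughly like this (simplified; real SQS events carry additional metadata such as messageId and receipt handles):

```python
import json

# simplified shape of an SQS-triggered Lambda event
event = {
    'Records': [
        {'body': json.dumps({'year': 2004, 'initial': 'AD'})},
    ]
}

# the handler parses each record body back into a partition spec
body = json.loads(event['Records'][0]['body'])
```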
In order to launch our download process, we must then send messages to AWS SQS, and we do so by running our Jupyter Notebook called “launch_fleet.ipynb”.
Github @hudsonmendes: “launch_fleet.ipynb”

This notebook will then schedule messages that partition the download jobs by initial (e.g. “TH” for The Matrix) and year.
Each one of these download jobs must take less than 15 minutes (which is the maximum setting for the Lambda function timeout).
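A minimal sketch of what such a launch notebook does, building one message per (year, initial) partition. The year range and two-letter prefixes here are illustrative; the real values live in launch_fleet.ipynb:

```python
import itertools
import string


def build_partition_messages(years, initials):
    # one SQS message per (year, initial) download partition
    return [{'year': year, 'initial': initial}
            for year, initial in itertools.product(years, initials)]


# two-letter prefixes (e.g. 'TH' for The Matrix) keep each partition
# small enough to finish within the 15-minute Lambda timeout
initials = [a + b for a, b in
            itertools.product(string.ascii_uppercase, repeat=2)]
messages = build_partition_messages(range(2000, 2005), initials)

# each message would then be sent to the queue, e.g.:
#   queue.send_message(MessageBody=json.dumps(message))
```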
After running
You should have all the results dropped into the "datalake_bucket_name" that you have defined in your notebook.
Once that is done, we are ready to start building the infrastructure for our data pipeline.
In Summary
By looking at the requirements, we:

- decided that, for this task, going Serverless with AWS Lambda was the best option;
- wrote our lambda_function in Python.

We are now ready to create our infrastructure using Infrastructure as Code, with Python.
Next Steps
In the next article, Part 3: Infrastructure As Code, we will deep dive into how we used AWS Lambda to crawl the TMDb Films information (using their API) with parallelisation.
Source Code
Find the end-to-end solution source code at https://github.com/hudsonmendes/nanodataeng-capstone.
Wanna keep in Touch? Twitter!
I’m Hudson Mendes (@hudsonmendes), coder, 36, husband, father, Principal Research Engineer, Data Science @ AIQUDO, Voice To Action.
I’ve been on the Software Engineering road for 19+ years, and occasionally publish rants about Tensorflow, Keras, DeepLearning4J, Python & Java.
Join me there, and I will keep you in the loop with my daily struggle to get ML Models to Production!
翻譯自: https://medium.com/@hudsonmendes/data-pipeline-for-data-science-part-2-tmdb-api-data-crawler-d07bc9e6dde6
tmdb數(shù)據(jù)集
總結(jié)
以上是生活随笔為你收集整理的tmdb数据集_数据科学第2部分的数据管道tmdb api数据搜寻器的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: matlab db5是什么小波,3、代码
- 下一篇: 对抗训练+FGSM, FGM理解与详解