TMDB 5000 Movie Dataset
Metadata on ~5,000 movies from TMDb
What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?
This is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.
We have removed the original version of this dataset per a DMCA takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.
The good news is that:
- You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.
- The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.
- Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot-checked, it didn't line up with either the credits order or IMDB's stars order.
- The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.
- Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDB entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.
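Because the cast is now stored in credits order, the three top-billed actors that the old dataset exposed can be approximated from the full cast list. The following is a minimal sketch, not the porting kernel's actual code. It assumes each cast entry is a JSON object with `name` and `order` fields; those field names, the helper `top_billed`, and the sample data are illustrative assumptions.

```python
import json


def top_billed(cast_json, n=3):
    """Return the first n cast names in credits order.

    Assumes each entry has 'name' and 'order' keys (an assumption about
    the cast format, not something stated in the dataset description).
    """
    cast = json.loads(cast_json)
    # Entries without an 'order' value are pushed to the end.
    cast.sort(key=lambda member: member.get("order", float("inf")))
    return [member["name"] for member in cast[:n]]


# Hypothetical sample, deliberately out of order.
sample_cast = json.dumps([
    {"name": "Actor C", "order": 2},
    {"name": "Actor A", "order": 0},
    {"name": "Actor D", "order": 3},
    {"name": "Actor B", "order": 1},
])

print(top_billed(sample_cast))  # ['Actor A', 'Actor B', 'Actor C']
```

Sorting explicitly on `order` rather than trusting list position keeps the result stable even if the entries were shuffled during loading.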
Data Source Transfer Details
- Several of the new columns contain JSON. You can save a bit of time by porting the load data functions from this kernel.
- Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDB shows the runtime of the original version.
- There's now a separate file containing the full credits for both the cast and crew.
- All fields are filled out by users, so don't expect them to agree on keywords, genres, ratings, or the like.
- Your existing kernels will continue to render normally until they are re-run.
- If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.
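The two points about JSON columns and the separate credits file can be sketched with the standard library alone. This is an illustration under stated assumptions, not the load functions from the referenced kernel: the JSON column names (`genres`, `cast`, `crew`), the `movie_id` join key in the credits file, the `job` field on crew entries, and the in-memory sample files are all hypothetical stand-ins for the real CSVs.

```python
import csv
import io
import json

# Assumed names of the JSON-encoded columns; adjust to the real files.
JSON_COLUMNS = ("genres", "cast", "crew")


def load_rows(handle, json_columns=JSON_COLUMNS):
    """Read a CSV file object and decode the JSON-encoded columns."""
    rows = []
    for row in csv.DictReader(handle):
        for col in json_columns:
            if col in row:
                # Empty cells become empty lists instead of raising.
                row[col] = json.loads(row[col]) if row[col] else []
        rows.append(row)
    return rows


def join_credits(movies, credits):
    """Attach cast and crew to each movie (movies.id == credits.movie_id)."""
    credits_by_id = {c["movie_id"]: c for c in credits}
    joined = []
    for movie in movies:
        credit = credits_by_id.get(movie["id"], {})
        joined.append({**movie,
                       "cast": credit.get("cast", []),
                       "crew": credit.get("crew", [])})
    return joined


def make_csv(header, records):
    """Build an in-memory CSV so the sketch runs without the real files."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(records)
    buf.seek(0)
    return buf


movies_csv = make_csv(
    ["id", "title", "genres"],
    [["1", "Example Film", json.dumps([{"id": 10, "name": "Drama"}])]],
)
credits_csv = make_csv(
    ["movie_id", "cast", "crew"],
    [["1",
      json.dumps([{"name": "Actor A", "order": 0}]),
      json.dumps([{"name": "Person B", "job": "Director"}])]],
)

movies = join_credits(load_rows(movies_csv), load_rows(credits_csv))
genre_names = [g["name"] for g in movies[0]["genres"]]
directors = [p["name"] for p in movies[0]["crew"] if p.get("job") == "Director"]
print(genre_names, directors)  # ['Drama'] ['Person B']
```

With the real dataset, the same two functions would be called on opened file handles instead of the in-memory samples; a pandas-based loader would work just as well and is likely what most kernels use.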
New columns:
- homepage
- id
- original_title
- overview
- popularity
- production_companies
- production_countries
- release_date
- spoken_languages
- status
- tagline
- vote_average
Lost columns:
- actor_1_facebook_likes
- actor_2_facebook_likes
- actor_3_facebook_likes
- aspect_ratio
- cast_total_facebook_likes
- color
- content_rating
- director_facebook_likes
- facenumber_in_poster
- movie_facebook_likes
- movie_imdb_link
- num_critic_for_reviews
- num_user_for_reviews