Customer Preferences in the Age of the Platform Business (with the Help of AI)
Marketing and product teams are tasked with understanding customers. To do so, they look at customer preferences — motivations, expectations and inclinations — which in combination with customer needs drive their purchasing decisions.
In my years as a data scientist I learned that customers — their preferences and needs — rarely (or never?) fall into simple objective buckets or segmentations we use to make sense of them. Instead, customer preferences and needs are complex, intertwined and constantly changing.
While understanding customers is already challenging enough, many modern digital businesses don’t know much about their products either. They operate digital platforms to facilitate the exchange between producers and consumers. The digital platform business model creates markets and communities with network effects that allow their users to interact and transact. The platform business does not control their inventory via a supply chain like linear businesses do.
(Image: Mohamed Hassan, Pixabay)

A good way to describe the platform business is that it does not own the means of production but instead creates the means of connection. Examples of platform businesses are Amazon, Facebook, YouTube, Twitter, Ebay, AirBnB, a property portal like Zillow, and aggregator businesses like travel booking websites. Over the last few decades, platform businesses have come to dominate the economy.
How can we use AI to make sense of our customers and products in the age of the platform business?
This blog post is a continuation of my previous discussion on the new gold standard of behavioural data in Marketing:
In this blog post we use a more advanced Deep Neural Network to model customers and products.
The Neural Network Architecture
(Icons: ProSymbols, lastspark, Juan Pablo Bravo)

We use a deep Neural Network with the following elements:
Encoder: takes input data describing products or customers and maps it into Feature Embeddings. (An embedding is a projection of some input into another, more convenient representation space.)
Comparator: combines customer and product feature embeddings into a Preferences Tensor.
Predictor: turns the preferences into a predicted purchase propensity.
We use the neural network to predict product purchases as the target, since we know that purchase decisions are driven by a customer’s preferences and needs. The encoders therefore learn to extract those preferences and needs from customer behavioural data and from customer and product attributes.
We can analyse and cluster the learned customer and product features to derive a data-driven segmentation. More on this later.
(Photo: Morning Brew on Unsplash)

TensorFlow Implementation
The following code uses TensorFlow 2 and Keras to implement our Neural Network architecture:
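A minimal sketch of such a network follows. The feature names (customer_age, customer_country, product_price, product_category), the vocabularies and the layer sizes are illustrative placeholders rather than a production configuration, and a plain concatenation stands in for the Comparator:

import tensorflow as tf

# Hypothetical feature columns; real names, vocabularies and scaling depend on your data.
customer_columns = [
    tf.feature_column.numeric_column("customer_age"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "customer_country", ["UK", "DE", "FR"]
        )
    ),
]
product_columns = [
    tf.feature_column.numeric_column("product_price"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "product_category", ["books", "toys", "sports"]
        )
    ),
]

customer_inputs = {
    "customer_age": tf.keras.Input(shape=(1,), name="customer_age"),
    "customer_country": tf.keras.Input(shape=(1,), dtype=tf.string, name="customer_country"),
}
product_inputs = {
    "product_price": tf.keras.Input(shape=(1,), name="product_price"),
    "product_category": tf.keras.Input(shape=(1,), dtype=tf.string, name="product_category"),
}

# Encoders: map raw customer/product attributes to feature embeddings.
x = tf.keras.layers.DenseFeatures(customer_columns)(customer_inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
customer_features = tf.keras.layers.Dense(32, activation="relu", name="customer_features")(x)

y = tf.keras.layers.DenseFeatures(product_columns)(product_inputs)
y = tf.keras.layers.Dense(64, activation="relu")(y)
product_features = tf.keras.layers.Dense(32, activation="relu", name="product_features")(y)

# Comparator: combine both embeddings into a preferences tensor.
preferences = tf.keras.layers.Concatenate(name="preferences")([customer_features, product_features])

# Predictor: turn the preferences into a purchase propensity.
z = tf.keras.layers.Dense(16, activation="relu")(preferences)
propensity = tf.keras.layers.Dense(1, activation="sigmoid", name="propensity")(z)

model = tf.keras.Model(inputs={**customer_inputs, **product_inputs}, outputs=propensity)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])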
The code creates TensorFlow feature columns and can use numerical as well as categorical features. We use the Keras functional API to define our customer preference neural network, which is compiled with the Adam optimiser using binary cross-entropy as the loss function.
Training Data with Spark
We will need training data for our customer preference model. As a platform business, your raw data will fall into the Big Data category. To prepare terabytes of raw data from click streams, product searches and transactions, we use Spark. The challenge is to bridge the two technologies and feed the training data from Spark into TensorFlow.
The best format for large amounts of TensorFlow training data is the TFRecord file format, TensorFlow’s own binary storage format based on Protocol Buffers. The binary format greatly improves the performance of loading data and feeding it into model training. If you were to use, for example, CSV files, you would spend significant compute resources on loading and parsing your data rather than on training your neural network. The TFRecord file format makes sure your data pipeline is not bottlenecking your neural network training.
The Spark-TensorFlow connector allows us to save TFRecords with Spark. Simply add it as a JAR to a new Spark session as follows:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName(app_name)
    .config("spark.submit.deployMode", "cluster")
    # The Spark-TensorFlow connector JAR lets Spark write DataFrames as TFRecords
    .config("spark.jars.packages", "org.tensorflow:spark-tensorflow-connector_2.11:1.15.0")
    .getOrCreate()
)
and write a Spark DataFrame to TFRecords as follows:
(
    training_feature_df
    .write.mode("overwrite")
    .format("tfrecords")                  # provided by the Spark-TensorFlow connector
    .option("recordType", "Example")      # serialise rows as tf.train.Example protos
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
    .save(path)
)
To load the TFRecords with TensorFlow, you define the schema of your records and parse the data set into an iterator of Python dictionaries using the TensorFlow Dataset APIs:
import tensorflow as tf

SCHEMA = {
    "col_name1": tf.io.FixedLenFeature([], tf.string, default_value="Null"),
    "col_name2": tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

data = (
    tf.data.TFRecordDataset(list_of_file_paths, compression_type="GZIP")
    .map(
        lambda record: tf.io.parse_single_example(record, SCHEMA),
        num_parallel_calls=num_of_workers
    )
    .batch(num_of_records)
    .prefetch(num_of_batches)
)
Batch Scoring with Spark and PandasUDFs
After training our neural network there are obvious real-time scoring applications, for example scoring search results in a product search to address choice paralysis on platforms with thousands or even millions of products.
But there is also an advanced analytics use case: analysing the product and user features and preferences for insights, and creating a data-driven segmentation to help with product development and more. For this we score our entire customer base and product catalogue to capture the outputs of the Encoders and the Comparator of our model for clustering.
To capture the output of intermediate neural network layers, we can reshape our trained TensorFlow model as follows:
trained_customer_preference_model = tf.keras.models.load_model(path)

# Share the trained weights but expose the intermediate "customer_features"
# embedding layer as the output instead of the final purchase propensity.
customer_feature_model = tf.keras.Model(
    inputs=trained_customer_preference_model.input,
    outputs=trained_customer_preference_model.get_layer("customer_features").output
)
For performance reasons we score our users with Spark, using a PandasUDF that scores a batch of users at a time:
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Wrap the Keras model (see the wrapper class below) and broadcast it once
# so that every Spark worker can deserialise and reuse it.
customerFeatureModelWrapper = CustomerFeatureModelWrapper(path)
CUSTOMER_FEATURE_MODEL = spark.sparkContext.broadcast(customerFeatureModelWrapper)

@F.pandas_udf("array<float>", F.PandasUDFType.SCALAR)
def customer_features_udf(*cols):
    # Reassemble the batch of columns into the named-feature input the model expects.
    model_input = dict(zip(FEATURE_COL_NAMES, cols))
    model_output = CUSTOMER_FEATURE_MODEL.value([model_input])
    return pd.Series([np.array(v) for v in model_output.tolist()])

(
    customer_df
    .withColumn(
        "customer_features",
        customer_features_udf(*model_input_cols)
    )
)
We have to wrap our TensorFlow model in a wrapper class to allow serialisation, broadcasting across the Spark cluster and deserialisation of the model on all workers. I use MLflow to track model artifacts, but you could store them on any cloud storage without MLflow. Implement a download function that fetches the model artifacts from S3 or wherever you store your model.
class CustomerFeatureModelWrapper(object):
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = self._build(model_path)

    def __getstate__(self):
        # Only the artifact path is pickled when Spark broadcasts the wrapper ...
        return self.model_path

    def __setstate__(self, model_path):
        # ... and the Keras model is rebuilt from the artifacts on each worker.
        self.model_path = model_path
        self.model = self._build(model_path)

    def __call__(self, model_input):
        # Assumed entry point for the pandas UDF above (not shown in the
        # original post): delegate to the underlying Keras model.
        return self.model.predict(model_input)

    def _build(self, model_path):
        local_path = download(model_path)   # fetch artifacts from S3/MLflow/your storage
        return tf.keras.models.load_model(local_path)
You can read more about how MLflow can help you with your Data Science projects in my previous article:
Clustering and Segmentation
After scoring our customer base and product inventory with Spark we have a dataframe with feature and preference vectors as follows:
+-----------+---------------------------------------------------+
|product_id |product_features                                   |
+-----------+---------------------------------------------------+
|product_1  |[-0.28878614, 2.026503, 2.352102, -2.010809, ...   |
|product_2  |[0.39889023, -0.06328985, 1.634547, 3.3479023, ... |
+-----------+---------------------------------------------------+
As a first step, we have to create a representative but much smaller sample of customers and products to use in clustering. It is important that you stratify your sample, with equal numbers of customers and products per stratum. Commonly, we have many anonymous customers with few customer attributes such as demographics to stratify on. In such a situation, we can stratify customers by the product attributes of the products they interact with, as a proxy. This follows our general assumption that their preferences and needs drive their purchase decisions. In Spark you create a new column with the strata key, get the total counts of customers and products by stratum, and calculate the fraction per stratum to sample approximately even counts across strata. You can use Spark’s DataFrameStatFunctions.sampleBy(col_with_strata_keys, dict_of_sample_fractions, seed) to create a stratified sample.
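As a minimal sketch of that sampling step, assuming a customer_df with a precomputed strata column and an illustrative target of roughly 1,000 samples per stratum:

# Illustrative target; tune to your data volume.
target_per_stratum = 1000

# Total counts per stratum, then the sampling fraction that hits the target.
counts = customer_df.groupBy("strata").count().collect()
fractions = {row["strata"]: min(1.0, target_per_stratum / row["count"]) for row in counts}

# Stratified sample; sampleBy draws each stratum with its own fraction.
stratified_sample = customer_df.sampleBy("strata", fractions=fractions, seed=42)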
To create our segmentation we use T-SNE to visualise the high-dimensional feature vectors of our stratified data sample. T-SNE is a stochastic ML algorithm that reduces dimensionality for visualisation purposes in such a way that similar customers and products cluster together. This is also called a neighbour embedding. We can colour the T-SNE results by additional product attributes to help interpret our clusters as part of our analysis to generate insights. After we obtain the results from T-SNE, we run DBSCAN on the T-SNE neighbour embeddings to find our clusters.
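A minimal sketch of this step with scikit-learn, assuming the stratified sample has been collected to the driver as a pandas DataFrame products_pdf; the perplexity, eps and min_samples values are illustrative and need tuning:

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# Stack the encoder feature vectors into a matrix, one row per product.
X = np.vstack(products_pdf["product_features"].values)

# Reduce to 2D neighbour embeddings for visualisation.
tsne_embeddings = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# Density-based clustering on the 2D embeddings; noise points get label -1.
products_pdf["cluster"] = DBSCAN(eps=3.0, min_samples=25).fit_predict(tsne_embeddings)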
With the cluster labels from the DBSCAN output we can calculate cluster centroids:
centroids = products[["product_features", "cluster"]].groupby(["cluster"])["product_features"].apply(
    lambda x: np.mean(np.vstack(x), axis=0)   # element-wise mean of all feature vectors in the cluster
)

cluster
0    [0.5143338, 0.56946456, -0.26320028, 0.4439753...
1    [0.42414477, 0.012167327, -0.662183, 1.2258132...
2    [-0.0057945233, 1.2221531, -0.22178105, 1.2349...
...
Name: product_features, dtype: object
After we have obtained our cluster centroids, we assign our entire customer base and product catalogue to their representative clusters, because so far we have only worked with a stratified sample of perhaps 50,000 customers and products.
We use Spark again to assign all our customers and products to their closest cluster centroid. We use the L1 norm (or taxicab distance) to calculate the distance of customers/products to the cluster centroids, to emphasise per-feature alignment.
from pyspark.sql.types import FloatType
from pyspark.sql.window import Window

# L1 (ord=1, taxicab) distance between a feature vector and a centroid.
distance_udf = F.udf(
    lambda x, y, i: float(np.linalg.norm(np.array(x) - np.array(y), axis=0, ord=i)),
    FloatType()
)

customer_centroids = spark.read.parquet(path)

customer_clusters = (
    customer_dataframe
    .crossJoin(
        F.broadcast(customer_centroids)   # centroids are small, broadcast to avoid a shuffle
    )
    .withColumn("distance", distance_udf("customer_centroid", "customer_features", F.lit(1)))
    .withColumn("distance_order", F.row_number().over(Window.partitionBy("customer_id").orderBy("distance")))
    .filter("distance_order = 1")   # keep only the closest centroid per customer
    .select("customer_id", "cluster", "distance")
)

+-----------+-------+---------+
|customer_id|cluster| distance|
+-----------+-------+---------+
| customer_1|      4|13.234212|
| customer_2|      4| 8.194665|
| customer_3|      1|  8.00042|
| customer_4|      3|14.705576|
We can then summarise our customer base to get the cluster prominence:
total_customers = customer_clusters.count()

(
    customer_clusters
    .groupBy("cluster")
    .agg(
        F.count("customer_id").alias("customers"),
        F.avg("distance").alias("avg_distance")
    )
    .withColumn("pct", F.col("customers") / F.lit(total_customers))
)

+-------+---------+------------------+-----+
|cluster|customers|      avg_distance|  pct|
+-------+---------+------------------+-----+
|      0|     xxxx|12.882028355869513| xxxx|
|      5|     xxxx|10.084179072882444| xxxx|
|      1|     xxxx|13.966814632296622| xxxx|
This completes all the steps needed to derive a data-driven segmentation from our Neural Network embeddings:
Read more about segmentation and ways to extract insights from our model in my previous article:

Real-time Scoring
To learn more about how to deploy a model for real-time scoring I recommend my previous article on the topic:
General Notes and Advice
Compared to the collaborative filtering approach in the linked article, the neural network learns to generalise, and a trained model can be used with new customers and new products. The neural network has no cold-start problem.
If you use at least some behavioural data as input for your customers in addition to historic purchases and other customer profile data, your trained model can make purchase propensity predictions even for new customers without any transactional or customer profile data.
The learned product feature embeddings will cluster into a larger number of distinct clusters than your customer feature embeddings. It is not unusual for most customers to fall into one big cluster. This does NOT mean that 90% of your customers are alike. As described in the introduction, most of your customers have complex, intertwined and changing preferences and needs. This means they cannot be separated into distinct groups; it does not mean they are the same. The simplification of a cluster cannot capture this, which only reiterates the need for machine learning to make sense of customers.
While many stakeholders will love the insights and segmentation the model can produce, the real value of the model is in its ability to predict a purchase propensity.
Jan is a successful thought leader and consultant in the data transformation of companies and has a track record of bringing data science into commercial production usage at scale. He has recently been recognised by dataIQ as one of the 100 most influential data and analytics practitioners in the UK.
Connect on LinkedIn: https://www.linkedin.com/in/janteichmann/
Read other articles: https://medium.com/@jan.teichmann
Translated from: https://towardsdatascience.com/customer-preferences-in-the-age-of-the-platform-business-with-the-help-of-ai-98b0eabf42d9