Machine Learning in Practice: Multi-class Logistic Regression (Softmax Regression) with scikit-learn and TensorFlow
All of the code and data for this article are available for download.
Scikit-Learn: Light Version
scikit-learn ships with logistic regression built in. For small-scale applications it is straightforward to use; the following code is usually all you need:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
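In older scikit-learn versions, LogisticRegression defaults to a one-vs-rest scheme for multi-class problems; to fit a true softmax (multinomial) model explicitly, the multi_class argument can be set, as in the sketch below (recent versions default to multinomial and deprecate the argument):
# Explicit softmax (multinomial) logistic regression; 'lbfgs' is one of the
# solvers that supports it. In recent scikit-learn versions this is the
# default behavior and the multi_class argument is deprecated.
classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs')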
To compute the class probabilities manually from the fitted coefficients and intercepts, you can do the following:
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max to guard against overflow in exp()
    return e_x / e_x.sum(axis=0)

pred = [np.argmax(softmax(np.dot(classifier.coef_, X_test[i, :]) + classifier.intercept_)) for i in range(len(X_test))]
print(np.sum(pred != predictions))  # check for any mismatch with sklearn's own predictions
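As a further sanity check, the manual softmax probabilities can be compared against sklearn's own predict_proba(). A minimal sketch, assuming the classifier was fitted in multinomial mode as above; with the one-vs-rest scheme, predict_proba normalizes per-class sigmoids instead, so the values would not match exactly:
# Assumes classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs')
manual_prob = np.array([softmax(np.dot(classifier.coef_, X_test[i, :]) + classifier.intercept_)
                        for i in range(len(X_test))])
print(np.allclose(manual_prob, classifier.predict_proba(X_test)))  # expect True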
The complete code is available for download as LR_sklearn_light.py.
Scikit-Learn: Pro Version
When the data is large, say on the order of 10 million rows with 1,000 feature dimensions, training with LogisticRegression() directly is slow, and it requires all of the data to be loaded into memory up front, which may not fit. The solution is to train in mini-batches: read one batch of data at a time and take a training step on it.
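Note that LogisticRegression itself does not expose a partial_fit() method; the incremental classifier used below is assumed to be something like SGDClassifier with logistic loss, which optimizes the same objective with stochastic gradient descent. A minimal sketch:
from sklearn.linear_model import SGDClassifier

# An incremental learner with logistic loss (renamed 'log_loss' in
# scikit-learn >= 1.1); the hyperparameters here are illustrative, not tuned.
classifier = SGDClassifier(loss='log', alpha=1e-4)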
Data pulled from HDFS usually arrives as multiple shards; take the common CSV format as an example. First, use the glob package to build a list of all the data file names. The training files are named like train_part0.csv, and * matches any sequence of characters:
filenames = sorted(glob.glob("./TrainData/train*"))
Then iterate over each file, reading chunksize rows at a time as one batch. For each batch, call the classifier's partial_fit() method to take a gradient-descent step on that batch, until all of the data has been consumed or the maximum number of training steps is reached.
import glob
import time
import numpy as np
import pandas as pd
from sklearn import metrics

filenames = sorted(glob.glob("./TrainData/train*"))
MaxIterNum = 100
count = 0
for c, filename in enumerate(filenames):
    # Read the file lazily, chunksize rows at a time
    TrainDF = pd.read_csv(filename, header=None, chunksize=10)
    for Batch in TrainDF:
        count += 1
        print(count)
        y_train = np.array(Batch.iloc[:, 0])   # first column is the label
        X_train = np.array(Batch.iloc[:, 1:])  # remaining columns are features
        st1 = time.time()
        # classes must be passed so partial_fit knows the full label set up front
        classifier.partial_fit(X_train, y_train, classes=np.array([0, 1, 2]))
        ed1 = time.time()
        st2 = time.time()
        predictions = classifier.predict(X_train)
        acc = metrics.accuracy_score(y_train, predictions)
        ed2 = time.time()
        print(ed1 - st1, ed2 - st2, acc)
        if count == MaxIterNum:
            break
    if count == MaxIterNum:
        break
Test data can be read in batches the same way, then concatenated for a single evaluation pass: call sklearn's accuracy_score() for the accuracy and confusion_matrix() for the confusion matrix.
from sklearn.metrics import accuracy_score, confusion_matrix

ChunkSize = 10
MaxIterNum = 100
# Preallocate enough rows for the worst case (MaxIterNum batches of ChunkSize
# rows); a fixed 100-row buffer would overflow once more than 10 batches are read
X_test = np.zeros([MaxIterNum * ChunkSize, np.shape(X_train)[1]])
y_test = np.zeros(MaxIterNum * ChunkSize)
TestSampleNum = 0
filenames = sorted(glob.glob("./TestData/test*"))
count = 0
for c, filename in enumerate(filenames):
    TestDF = pd.read_csv(filename, header=None, chunksize=ChunkSize)
    for Batch in TestDF:
        count += 1
        print(count)
        n = np.shape(Batch)[0]
        y_test[TestSampleNum:TestSampleNum + n] = np.array(Batch.iloc[:, 0])
        X_test[TestSampleNum:TestSampleNum + n, :] = np.array(Batch.iloc[:, 1:])
        TestSampleNum = TestSampleNum + n
        if count == MaxIterNum:
            break
    if count == MaxIterNum:
        break
# Trim the buffers down to the rows actually filled
X_test = X_test[0:TestSampleNum, :]
y_test = y_test[0:TestSampleNum]

st1 = time.time()
predictions = classifier.predict(X_test)
acc = accuracy_score(y_test, predictions)
ed1 = time.time()
print(ed1 - st1, acc)
A = confusion_matrix(y_test, predictions)
print(A)
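For per-class precision and recall alongside the overall accuracy, sklearn's classification_report() can be dropped in at the same point; a one-line sketch:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))  # per-class precision / recall / F1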
The complete code is available for download as LR_sklearn_pro.py.
TensorFlow
TensorFlow is mainly useful when the data volume is large. The first step is to build an input pipeline so that reading the data runs in parallel with training. Separate reading pipelines are built for the training and test sets; note that the training set's numEpochs differs from the test set's, which allows the training data to be passed over multiple times:
def readMyFileFormat(fileNameQueue):
    reader = tf.TextLineReader()
    key, value = reader.read(fileNameQueue)
    # One int label column followed by FeatureSize float feature columns
    # (FeatureSize = 4 in this example)
    record_defaults = [[0]] + [[0.0]] * 4
    user = tf.decode_csv(value, record_defaults=record_defaults)
    userlabel = user[0]
    userlabel01 = tf.cast(tf.one_hot(userlabel, ClassNum, 1, 0), tf.float32)
    userfeature = tf.stack(user[1:])  # pack the scalar columns into one feature vector
    return userlabel01, userfeature

def inputPipeLine_batch(fileNames, batchSize, numEpochs=None):
    fileNameQueue = tf.train.string_input_producer(fileNames, num_epochs=numEpochs, shuffle=False)
    example = readMyFileFormat(fileNameQueue)
    min_after_dequeue = 10
    # capacity must be a static int when the queue is built, so the concrete
    # training batch size is used here even though batchSize itself may be a
    # placeholder fed at run time
    capacity = min_after_dequeue + 3 * batch_size_train
    YBatch, XBatch = tf.train.batch(
        example, batch_size=batchSize,
        capacity=capacity)
    return YBatch, XBatch

# batch_size is assumed to be an int32 placeholder defined elsewhere, so the
# same pipeline code can be fed different batch sizes for train and test
filenames = tf.train.match_filenames_once(DataDir)
YBatch, XBatch = inputPipeLine_batch(filenames, batchSize=batch_size, numEpochs=20)
pfilenames = tf.train.match_filenames_once(pDataDir)
pYBatch, pXBatch = inputPipeLine_batch(pfilenames, batchSize=batch_size, numEpochs=1)
Then build the network:
# LR: a single softmax layer
X_LR = tf.placeholder(tf.float32, [None, FeatureSize])
Y_LR = tf.placeholder(tf.float32, [None, ClassNum])
W_LR = tf.Variable(tf.truncated_normal([FeatureSize, ClassNum], stddev=0.1), dtype=tf.float32)
bias_LR = tf.Variable(tf.constant(0.1, shape=[ClassNum]), dtype=tf.float32)
Ypred_LR = tf.matmul(X_LR, W_LR) + bias_LR  # logits
Ypred_prob = tf.nn.softmax(Ypred_LR)        # class probabilities
# Cross-entropy loss; tf.nn.softmax_cross_entropy_with_logits is the more
# numerically stable equivalent of this explicit softmax + log
cost = -tf.reduce_mean(Y_LR * tf.log(Ypred_prob))
optimizer = tf.train.AdamOptimizer(lr).minimize(cost)
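The loops below assume a session has already been set up. A minimal sketch of the boilerplate a TF 1.x queue-based pipeline needs; note in particular that match_filenames_once() and num_epochs both create local variables, so tf.local_variables_initializer() must be run as well:
# Session-setup sketch (TF 1.x); batch_size, batch_size_train, batch_size_test
# and lr are assumed to be defined elsewhere, e.g.:
#   batch_size = tf.placeholder(tf.int32)
#   batch_size_train, batch_size_test, lr = 100, 100, 0.001
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())  # needed by match_filenames_once / num_epochs
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)  # start the reader threads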
Train the network with mini-batch gradient descent, using TrainBatchNum to cap the number of training steps:
# Training
try:
    for i in range(TrainBatchNum):
        print(i)
        # Pull one batch from the input pipeline, then take one optimizer step on it
        y, x = sess.run([YBatch, XBatch], feed_dict={batch_size: batch_size_train})
        _, c = sess.run([optimizer, cost], feed_dict={X_LR: x, Y_LR: y})
        print(c)
except tf.errors.OutOfRangeError:
    print('Done Train')
Read the test set batch by batch, then concatenate it for one evaluation pass:
# Testing
Y = np.array([0, 0, 0])     # dummy first row (ClassNum = 3), stripped off below
Pred = np.array([0, 0, 0])
try:
    i = 0
    while True:
        print(i)
        i = i + 1
        y, x = sess.run([pYBatch, pXBatch], feed_dict={batch_size: batch_size_test})
        pred = sess.run(Ypred_prob, feed_dict={X_LR: x, Y_LR: y})
        Pred = np.vstack([Pred, pred])
        Y = np.vstack([Y, y])
except tf.errors.OutOfRangeError:
    print('Done Test')
Y = Y[1:]        # drop the dummy rows
Pred = Pred[1:]
acc = accuracy_score(np.argmax(Y, axis=1), np.argmax(Pred, axis=1))
print(acc)
A = confusion_matrix(np.argmax(Y, axis=1), np.argmax(Pred, axis=1))
print(A)
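Once training and testing are done, the queue-runner threads started above should be shut down cleanly; a short sketch:
coord.request_stop()   # ask the reader threads to stop
coord.join(threads)    # wait for them to finish
sess.close()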
The complete code is available for download as LR_tf.py.