2022 Ningxia Cup Problem B: Approach and Code (Analysis of College Graduate Employment)

Published: 2023/12/20


Full problem statement for 2022 Ningxia Cup Problem B:

Link: https://pan.baidu.com/s/1aClw5k-Ux-17rckTIRWrdg?pwd=1234

Extraction code: 1234

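Contents

I. Problem
II. Data preprocessing
III. Question 1
- 1. Method 1: Plotting
- 2. Method 2: Multiple linear regression
- 3. Method 3: Multi-factor ANOVA
- 4. Method 4: Decision tree
IV. Question 2
- 1. Regression relationship
- 2. Salary prediction
V. Question 3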

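I. Problem

The employment of college graduates has long been a focus of public attention. According to an earlier Ministry of Education press conference, the class of 2022 numbered 10.76 million graduates, exceeding 10 million for the first time, with both the total and the year-on-year increase reaching record highs. At the same time, the market environment, the pandemic, and other factors have put considerable pressure on employment. What characteristics and trends does graduate employment show? Among the many students who find jobs, what factors determine that they end up in jobs with different salaries? These factors may include university grades, personal skills, the proximity of the university to industrial centers, the degree of specialization, and market conditions in specific industries. Reportedly, India has 6,214 engineering and technology institutions with about 2.9 million students. On average 1.5 million students earn an engineering degree each year, but because they lack the skills needed for technical work, fewer than 20% find a job in their core field. The attachment (https://www.datafountain.cn/datasets/4955) gives the salary levels of Indian engineering graduates together with the associated factors.

Using the attachment data and other material, study the following:

(1) Analyze the main factors affecting the employment of engineering graduates.
(2) Based on Attachment 1, build a model that describes the relationship between engineering graduates' salaries and the various factors.
(3) Based on the analysis above, does it offer any lessons for how engineering students are trained at universities in China? If so, write an advisory proposal for your own university.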

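Attribute descriptions

Attribute	Description
ID	Unique ID identifying the candidate
Salary	Annual CTC offered to the candidate (in INR)
Gender	Candidate's gender
DOB	Candidate's date of birth
10percentage	Overall marks obtained in the grade 10 examinations
10board	School board whose curriculum the candidate followed in grade 10
12graduation	Year of graduation from senior secondary school (grade 12)
12percentage	Overall marks obtained in the grade 12 examinations
12board	School board whose curriculum the candidate followed in grade 12
CollegeID	Unique ID identifying the college or university the candidate attended
CollegeTier	Each college is labeled 1 or 2, based on the average AMCAT score of its students. Colleges whose average is above a threshold are labeled 1, the others 2.
Degree	Degree obtained or pursued by the candidate
Specialization	Specialization pursued by the candidate
collegeGPA	Aggregate GPA at graduation
CollegeCityID	Unique ID identifying the city in which the college is located
CollegeCityTier	Tier of the city in which the college is located, annotated according to city population
CollegeState	Name of the state in which the college is located
GraduationYear	Year of graduation (bachelor's degree)
English	Score in the AMCAT English section
Logical	Score in the AMCAT Logical Ability section
Quant	Score in the AMCAT Quantitative Ability section
Domain	Score in the AMCAT domain module
ComputerProgramming	Score in the AMCAT Computer Programming section
ElectronicsAndSemicon	Score in the AMCAT Electronics and Semiconductor Engineering section
ComputerScience	Score in the AMCAT Computer Science section
MechanicalEngg	Score in the AMCAT Mechanical Engineering section
ElectricalEngg	Score in the AMCAT Electrical Engineering section
TelecomEngg	Score in the AMCAT Telecom Engineering section
CivilEngg	Score in the AMCAT Civil Engineering section
conscientiousness	Score in one section of the AMCAT personality test
agreeableness	Score in one section of the AMCAT personality test
extraversion	Score in one section of the AMCAT personality test
nueroticism	Score in one section of the AMCAT personality test
openess_to_experience	Score in one section of the AMCAT personality test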

II. Data preprocessing

Target variable: Salary.
Independent variables (features): all variables other than Salary.

import pandas as pd
import numpy as np

data = pd.read_csv('B題附件.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2998 entries, 0 to 2997
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID                     2998 non-null   int64
 1   Gender                 2998 non-null   object
 2   DOB                    2998 non-null   object
 3   10percentage           2998 non-null   float64
 4   10board                2998 non-null   object
 5   12graduation           2998 non-null   int64
 6   12percentage           2998 non-null   float64
 7   12board                2998 non-null   object
 8   CollegeID              2998 non-null   int64
 9   CollegeTier            2998 non-null   int64
 10  Degree                 2998 non-null   object
 11  Specialization         2998 non-null   object
 12  collegeGPA             2998 non-null   float64
 13  CollegeCityID          2998 non-null   int64
 14  CollegeCityTier        2998 non-null   int64
 15  CollegeState           2998 non-null   object
 16  GraduationYear         2998 non-null   int64
 17  English                2998 non-null   int64
 18  Logical                2998 non-null   int64
 19  Quant                  2998 non-null   int64
 20  Domain                 2998 non-null   float64
 21  ComputerProgramming    2998 non-null   int64
 22  ElectronicsAndSemicon  2998 non-null   int64
 23  ComputerScience        2998 non-null   int64
 24  MechanicalEngg         2998 non-null   int64
 25  ElectricalEngg         2998 non-null   int64
 26  TelecomEngg            2998 non-null   int64
 27  CivilEngg              2998 non-null   int64
 28  conscientiousness      2998 non-null   float64
 29  agreeableness          2998 non-null   float64
 30  extraversion           2998 non-null   float64
 31  nueroticism            2998 non-null   float64
 32  openess_to_experience  2998 non-null   float64
 33  Salary                 2998 non-null   int64
dtypes: float64(9), int64(18), object(7)
memory usage: 796.5+ KB

Descriptive statistics: for each column you can see the count, mean, minimum, maximum, and so on.
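The call that produces this summary is not shown here; in pandas it is typically `DataFrame.describe()`. A minimal sketch on a small made-up frame (the values are illustrative only and are not taken from the attachment):

```python
import pandas as pd

# Illustrative values only; the real data comes from the attachment CSV.
toy = pd.DataFrame({
    "collegeGPA": [70.0, 82.5, 65.0, 91.0],
    "Salary": [300000, 420000, 250000, 600000],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = toy.describe()
print(summary)
```

On the real data the same call would be `data.describe()`; text columns are skipped unless `include="all"` is passed.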

Check for missing values:

data.isnull().sum()

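Plot a heatmap of the correlation matrix. Note that the code computes Spearman rank correlations (method="spearman"); pass method="pearson" to get Pearson correlations instead.

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.font_manager import FontProperties

# Workaround for garbled Chinese characters in seaborn
myfont = FontProperties(fname=r'C:\Windows\Fonts\simhei.ttf', size=40)
sns.set(font=myfont.get_name(), color_codes=True)

# Correlation coefficients; numeric_only=True skips the text columns (required in pandas 2.x)
data_corr = data.corr(method="spearman", numeric_only=True)
plt.figure(figsize=(20, 15))  # figsize sets the size of the heatmap
fig = sns.heatmap(data_corr, annot=True, fmt='.2g')  # annot prints the values; fmt='.2g' keeps two significant digits
fig.get_figure().savefig('data_corr.png')  # save the figure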

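The resulting heatmap shows how strongly each pair of features is related: the larger the absolute value of a coefficient, the stronger the correlation between the two variables.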
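Reading a large heatmap by eye is tedious, so one option is to pull out the Salary column of the correlation matrix and sort it. The sketch below uses synthetic data whose column names merely mirror the attachment; the coefficients used to generate Salary are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the attachment; only the column names mirror the real data.
toy = pd.DataFrame({
    "Quant": rng.normal(500, 100, n),
    "English": rng.normal(500, 100, n),
    "CollegeTier": rng.integers(1, 3, n),
})
toy["Salary"] = 500 * toy["Quant"] + 100 * toy["English"] + rng.normal(0, 30000, n)

corr = toy.corr(method="spearman")
# Correlation of each feature with Salary, largest absolute value first
ranking = corr["Salary"].drop("Salary").abs().sort_values(ascending=False)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top as well.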

Compute each student's age as of today:

# Age in whole years as of today
data['Age'] = ((pd.to_datetime('today') - pd.to_datetime(list(data['DOB']))).days / 365).astype(int)
data

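Inspecting the data shows that in some AMCAT subject modules many students have no score, which is recorded as -1. To get better predictions, data cleaning replaces each -1 with the mean of the remaining scores for that subject.

columns = ['ComputerProgramming', 'ElectronicsAndSemicon', 'ComputerScience',
           'MechanicalEngg', 'ElectricalEngg', 'TelecomEngg', 'CivilEngg']
for col in columns:
    data[col] = data[col].replace({-1: np.nan})     # first turn -1 into NaN
    data[col] = data[col].fillna(data[col].mean())  # then fill NaN with the column mean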
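The order of the two steps matters: converting -1 to NaN first keeps the placeholder out of the mean. A small sketch with made-up scores:

```python
import numpy as np
import pandas as pd

# Made-up scores; -1 marks "module not taken"
scores = pd.Series([-1, 400, -1, 600, 500])

naive_mean = scores.mean()             # the -1 placeholders drag the mean down
masked = scores.replace({-1: np.nan})  # mark placeholders as missing
true_mean = masked.mean()              # NaN values are skipped by default
filled = masked.fillna(true_mean)

print(naive_mean, true_mean)
print(filled.tolist())
```

Mean imputation is only one choice; whether it suits modules that most students never took is a modelling decision left to the reader.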

Also encode the Gender column numerically:

data['Gender'] = data['Gender'].replace({'m': 0, 'f': 1})
data

Rename the columns whose names start with a digit:

data.rename(columns={
    '10percentage': 'tenth_percentage',
    '12percentage': 'twelveth_percentage',
    '10board': 'tenth_board',
    '12graduation': 'twelveth_graduation',
    '12board': 'twelveth_board',
}, inplace=True)
data
data.to_csv('finish.csv')

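III. Question 1

Analyze the main factors affecting the employment of engineering graduates.

Given the problem description, we use salary as the measure of employment outcome.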

1. Method 1: Plotting

Here we mainly use bar charts.

import matplotlib.pyplot as plt
import seaborn as sns

# x and y are assumed to be the cleaned table and its salary column
x = data
y = data['Salary']

plt.style.use('ggplot')
plt.bar(x.tenth_percentage, y, color="red")
plt.xlabel("10th_percentage")  # overall marks in the grade 10 examinations
plt.ylabel("salary")
plt.title("10th marks vs salary")
plt.show()

plt.bar(x.twelveth_percentage, y, color="blue")
plt.xlabel("12th_percentage")  # overall marks in the grade 12 examinations
plt.ylabel("salary")
plt.title("12th marks vs salary")
plt.show()

plt.scatter(x.CollegeTier, y, color="pink")
plt.xlabel("CollegeTier")  # tier of the college (1 or 2)
plt.ylabel("salary")
plt.title("CollegeTier vs salary")
plt.show()

plt.bar(x.Logical, y, color="red")
plt.xlabel("Logical")  # logical ability score
plt.ylabel("salary")
plt.title("Logical vs salary")
plt.show()

plt.bar(x.TelecomEngg, y, color="black")
plt.xlabel("TelecomEngg")  # telecom engineering score
plt.ylabel("salary")
plt.title("TelecomEngg vs salary")
plt.show()

plt.bar(x.collegeGPA, y, color="purple")
plt.xlabel("collegeGPA")  # aggregate GPA at graduation
plt.ylabel("salary")
plt.title("collegeGPA vs salary")
plt.show()

# Personality test score vs salary
plt.figure(figsize=(15, 8))
sns.scatterplot(x=data.openess_to_experience, y=data.Salary)
plt.show()

2. Method 2: Multiple linear regression

Multiple linear regression requires numeric independent variables, so we convert Degree, Specialization, and CollegeState to numbers.

preprocessing.OrdinalEncoder is intended for features: it maps categorical features to numeric category codes.

from sklearn.preprocessing import OrdinalEncoder

data_ = data.copy()
cat_cols = ['Degree', 'Specialization', 'CollegeState']

# Inspect the categories of the three columns to be converted
OrdinalEncoder().fit(data_[cat_cols]).categories_

# Use OrdinalEncoder to turn the string columns into numeric codes
data_[cat_cols] = OrdinalEncoder().fit_transform(data_[cat_cols])

Next we build the multiple linear regression model:

import statsmodels.api as sm

x = sm.add_constant(data_[[
    'Gender', 'tenth_percentage', 'twelveth_graduation', 'twelveth_percentage',
    'CollegeID', 'CollegeTier', 'Degree', 'Specialization', 'collegeGPA',
    'CollegeCityID', 'CollegeCityTier', 'CollegeState', 'GraduationYear',
    'English', 'Logical', 'Quant', 'Domain', 'ComputerProgramming',
    'ElectronicsAndSemicon', 'ComputerScience', 'MechanicalEngg',
    'ElectricalEngg', 'TelecomEngg', 'CivilEngg', 'conscientiousness',
    'agreeableness', 'extraversion', 'nueroticism', 'openess_to_experience',
]])  # independent variables plus a constant column
y = data['Salary']     # dependent variable
model = sm.OLS(y, x)   # build the model
result = model.fit()   # fit the model
result.summary()       # model summary

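In this output we mainly look at the "coef", "t", and "P>|t|" columns. The coef column holds the regression coefficients and the const row is the intercept, so the fitted model is $y = \text{const} + \sum_i \text{coef}_i \times x_i$, where $x_i$ is the variable in row $i$ and $\text{coef}_i$ is its coefficient.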

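The "t" and "P>|t|" columns carry the same information, so either one can be used; they test whether each independent variable has a significant linear relationship with y. The output also shows Prob (F-statistic) = 1.40e-92. This is the p-value of the overall F-test; being close to zero, it indicates that the regression as a whole is significant, that is, y has a significant linear relationship with the independent variables taken together. R-squared, however, is only 0.161, so the model explains only about 16% of the variance in salary: the relationship is statistically significant but weak.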
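To make the roles of const, coef, and R-squared concrete, the sketch below fits an ordinary least squares model with plain NumPy on synthetic data (`np.linalg.lstsq` stands in for `sm.OLS` here; the data and coefficients are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))  # two synthetic predictors
# Noisy target: the true relationship exists but explains little of the variance
y = 3.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=4.0, size=n)

A = np.column_stack([np.ones(n), X])          # prepend the const column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # [const, coef_1, coef_2]

y_hat = A @ beta                              # const + sum_i coef_i * x_i
ss_res = np.sum((y - y_hat) ** 2)             # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
print(beta, r_squared)
```

With this much noise the coefficients are still recovered reasonably well while R-squared stays low, which is the same pattern as in the salary model: a real but weak linear relationship.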

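In principle the multiple linear regression equation has now been obtained, but the fit is mediocre, so we need to dig further.

As noted above, y has a significant linear relationship with the independent variables. Note that this treats all independent variables as a whole: y is significantly related to that whole, but not necessarily to each individual variable. We therefore want to find the variables whose linear relationship with y is not significant and remove them, keeping only the significant ones.

We can judge this from the "P>|t|" column. Pick a threshold, for example the commonly used 0.05, 0.02, or 0.01; here we use 0.05. Every independent variable whose value in the P>|t| column is greater than 0.05 is removed, because its linear relationship with y is not significant. Note that the const row does not count as an independent variable here.

There is one rule, however: remove only one variable at a time, usually the one with the largest p-value. For example, the variable with the largest p-value in the output is GraduationYear, so remove it, refit the model with the remaining variables, find the variable that now has the largest p-value, remove it, and repeat until every p-value is less than or equal to 0.05. The remaining variables are the ones we need: each has a reasonably significant linear relationship with y, and we build the model with them.

This procedure can be written as a function named looper: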

def looper(limit):
    cols = ['Gender', 'tenth_percentage', 'twelveth_graduation', 'twelveth_percentage',
            'CollegeID', 'CollegeTier', 'Degree', 'Specialization', 'collegeGPA',
            'CollegeCityID', 'CollegeCityTier', 'CollegeState', 'GraduationYear',
            'English', 'Logical', 'Quant', 'Domain', 'ComputerProgramming',
            'ElectronicsAndSemicon', 'ComputerScience', 'MechanicalEngg',
            'ElectricalEngg', 'TelecomEngg', 'CivilEngg', 'conscientiousness',
            'agreeableness', 'extraversion', 'nueroticism', 'openess_to_experience']
    for i in range(len(cols)):
        data1 = data_[cols]
        x = sm.add_constant(data1)              # independent variables
        y = data_['Salary']                     # dependent variable
        model = sm.OLS(y, x)                    # build the model
        result = model.fit()                    # fit the model
        pvalues = result.pvalues.drop('const')  # all p-values except const
        pmax = max(pvalues)                     # largest p-value
        if pmax > limit:
            ind = pvalues.idxmax()              # name of the variable with the largest p-value
            cols.remove(ind)                    # drop that variable from cols
        else:
            return result

result = looper(0.05)
result.summary()

The coefficients in the output above show that Salary is most strongly related to twelveth_graduation, twelveth_percentage, CollegeTier, Degree, English, and ComputerProgramming.

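3. Method 3: Multi-factor ANOVA

Multi-factor analysis of variance (ANOVA) studies whether one dependent variable is affected by several independent variables (also called factors). It tests whether the mean of the dependent variable differs significantly across combinations of the factors' levels. Multi-factor ANOVA can analyze the effect of each single factor (main effects) and the interactions between factors (interaction effects). It can also be extended to analysis of covariance, including interactions between factors and covariates.

Depending on the number of observed (dependent) variables, multi-factor ANOVA is divided into univariate multi-factor ANOVA and multivariate multi-factor ANOVA. This case is a univariate multi-factor ANOVA.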
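For two factors A and B, the model behind a univariate multi-factor ANOVA is usually written as follows. This is standard textbook notation; the symbols are introduced here for illustration and do not appear in the original analysis.

```latex
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)

H_0^{A}: \alpha_i = 0 \ \text{for all } i, \qquad
H_0^{B}: \beta_j = 0 \ \text{for all } j, \qquad
H_0^{AB}: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
```

Here $Y_{ijk}$ would be the salary of the $k$-th graduate at level $i$ of factor A (for example CollegeTier) and level $j$ of factor B (for example Degree), $\alpha_i$ and $\beta_j$ are the main effects, and $(\alpha\beta)_{ij}$ is the interaction effect. Each null hypothesis is tested with its own F statistic.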

Here we demonstrate with SPSS:

1. First, import finish.csv from the File menu:

2. Analyze -> General Linear Model -> Univariate

4. Method 4: Decision tree

A machine learning approach can also be used: fit a decision tree and read off the feature importance ranking:

from sklearn import tree
from sklearn import model_selection  # used to split the data into training and test sets

feature_name = ['Gender', 'tenth_percentage', 'twelveth_graduation', 'twelveth_percentage',
                'CollegeID', 'CollegeTier', 'Degree', 'Specialization', 'collegeGPA',
                'CollegeCityID', 'CollegeCityTier', 'CollegeState', 'GraduationYear',
                'English', 'Logical', 'Quant', 'Domain', 'ComputerProgramming',
                'ElectronicsAndSemicon', 'ComputerScience', 'MechanicalEngg',
                'ElectricalEngg', 'TelecomEngg', 'CivilEngg', 'conscientiousness',
                'agreeableness', 'extraversion', 'nueroticism', 'openess_to_experience', 'Age']
X = data_[feature_name]
Y = data_['Salary']

# Split into training and test sets, 8:2
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=0)

# Build a classification tree (Gini impurity by default)
classification_tree = tree.DecisionTreeClassifier()
classification_tree.fit(x_train, y_train)

# Importance of each feature
[*zip(feature_name, classification_tree.feature_importances_)]

The importances printed above can then be used to judge which features matter most.
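Because salary is a continuous quantity, a regression tree is arguably a more natural fit than a classification tree. The sketch below is an assumption-laden variant, not the method used above: it fits `DecisionTreeRegressor` on synthetic data (column names and coefficients are made up) and sorts the importances so the ranking is easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300
# Synthetic features; "noise" is unrelated to the target by construction.
X = pd.DataFrame({
    "Quant": rng.normal(500, 100, n),
    "English": rng.normal(500, 100, n),
    "noise": rng.normal(0, 1, n),
})
y = 800 * X["Quant"] + 50 * X["English"] + rng.normal(0, 5000, n)

# A shallow tree keeps the example fast and limits overfitting
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Importances sum to 1; sort them from most to least important
importances = pd.Series(reg.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

On the real data the same pattern would apply to `data_[feature_name]` and `data_['Salary']`.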

IV. Question 2

Based on Attachment 1, build a model that describes the relationship between engineering graduates' salaries and the various factors.

Start by plotting the salary distribution:

plt.figure(figsize=(12, 6))

plt.subplot(121)  # salary distribution
plt.title('Salary Distribution')
sns.histplot(data['Salary'], kde=True)  # histplot replaces the deprecated distplot

plt.subplot(122)  # sorted salary curve
plt.scatter(range(data.shape[0]), np.sort(data.Salary.values))
plt.title("Salary Curve Distribution", fontsize=15)
plt.xlabel("")
plt.ylabel("Salary", fontsize=12)

plt.subplots_adjust(wspace=0.3, hspace=0.5, top=0.9)
plt.show()

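1. Regression relationship

The question asks about all the factors, so every factor has to be taken into account. Following Method 2 of Question 1, this probably means converting every object column to int or float and then fitting the model to obtain the explicit regression equation.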
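One way to carry this out is sketched below on a tiny made-up table. It uses one-hot encoding (`pd.get_dummies`) rather than the OrdinalEncoder from Question 1, because dummy variables do not impose an artificial order on categories; the category labels and numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the attachment.
toy = pd.DataFrame({
    "Degree": ["B.Tech", "MCA", "B.Tech", "M.Tech", "MCA", "B.Tech"],
    "collegeGPA": [70.0, 65.0, 82.0, 75.0, 68.0, 90.0],
    "Salary": [300000, 250000, 450000, 400000, 280000, 600000],
})

# Every non-numeric column becomes a set of 0/1 dummy columns
text_cols = list(toy.select_dtypes(exclude="number").columns)
encoded = pd.get_dummies(toy, columns=text_cols, drop_first=True, dtype=float)

X = encoded.drop(columns="Salary")
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])  # add const column
beta, *_ = np.linalg.lstsq(A, encoded["Salary"].to_numpy(dtype=float), rcond=None)

# Map each term of the regression equation to its coefficient
equation = dict(zip(["const", *X.columns], beta))
print(equation)
```

The resulting dictionary reads directly as the regression equation: Salary is const plus each coefficient times its column. `drop_first=True` drops one level per category so the dummies are not collinear with the constant.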

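2. Salary prediction

This part concerns the jobs and salaries of Indian engineering students after graduation. We do not know in advance which factors affect the salaries of Indian engineering graduates. The project predicts an engineer's salary from parameters such as grade 10 and grade 12 percentage marks, college tier, scores in different subjects, overall GPA, logical reasoning, and graduation year. It includes an ML model that tries several algorithms to predict graduate salaries. Here we use some of the main factors to predict salary (you can also try all of them).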

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor

def preprocess_inputs(data_):
    data_ = data_.copy()
    data_['Degree'] = LabelEncoder().fit_transform(data_.Degree)
    data_['Specialization'] = LabelEncoder().fit_transform(data_.Specialization)
    X = data_[['Gender', 'tenth_percentage', 'twelveth_graduation', 'twelveth_percentage',
               'CollegeID', 'CollegeTier', 'Degree', 'Specialization', 'collegeGPA',
               'CollegeCityID', 'CollegeCityTier', 'CollegeState', 'GraduationYear',
               'English', 'Logical', 'Quant', 'Domain', 'ComputerProgramming',
               'ElectronicsAndSemicon', 'ComputerScience', 'MechanicalEngg',
               'ElectricalEngg', 'TelecomEngg', 'CivilEngg', 'conscientiousness',
               'agreeableness', 'extraversion', 'nueroticism', 'openess_to_experience', 'Age']]
    y = data_['Salary']
    # 70/30 train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=43)
    # Standardize using statistics from the training set only
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = preprocess_inputs(data_)
X_train

models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor(random_state=100, bootstrap=True, max_depth=2,
                                           max_features=2, min_samples_leaf=3,
                                           min_samples_split=5, n_estimators=3),
    'Lasso': Lasso(),
    'Elastic Net': ElasticNet(),
    'Neural network': MLPRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'AdaBoost': AdaBoostRegressor(),
    'KNN': KNeighborsRegressor(),
}

for name, model in models.items():
    model = model.fit(X_train, y_train)
    print(name + " trained")

for name, model in models.items():
    print(name, model.score(X_test, y_test))  # R² on the test set

Hmm, even the best model only reaches a score (R² on the test set) of about 0.23, which is not very satisfying.

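V. Question 3

Based on the analysis above, does it offer any lessons for how engineering students are trained at universities in China? Write a proposal.

Do not write arbitrarily; this is not an invitation to make up a short essay. Base the discussion on the first two questions and derive recommendations for China from them (note: addressing your own university is enough). The ultimate goal is for graduates to earn higher salaries.

For example, discuss which factors should not be enforced too strictly and which the university should emphasize. There should be papers on this topic; look for them.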

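References:

1. Suggestions for the university's employment services

2. The employment situation of college graduates

3. Suggestions for the university's talent development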
