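Data Integration: Assignment 3

Published 2024-12-15 · Updated 2025-01-05 · Category: Data Integration

Full project: homework3 · main · 垃圾桶 / Data-Intergration · GitLab

I. Data Collection and Inventory

1. Initial field selection

There are 23 raw data tables in total. The fields initially selected from them are listed below, each with the reason for keeping it: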

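uid	ID number	Reason
credit_level	Credit level
star_level	Customer star level
Credit card account details	dm_v_as_djk_info	Credit card data generally reflects information related to the credit level
is_withdrw	Cash withdrawal enabled	More likely to be enabled at higher credit levels
is_transfer	Transfer enabled	More likely to be enabled at higher credit levels
is_deposit	Deposit enabled	More likely to be enabled at higher credit levels
is_purchse	Purchase enabled	More likely to be enabled at higher credit levels
cred_limit	Credit limit	Directly tied to credit
dlay_amt	Overdue amount	Directly tied to credit
five_class	Five-tier classification	Related to credit
is_mob_bank	Mobile banking linked	More likely to be linked at higher credit levels
is_etc	ETC linked	More likely to be linked at higher credit levels
bal	Balance	Generally positively correlated with credit level
dlay_mths	Number of overdue periods	Directly tied to credit
Credit card installment details	dm_v_as_djkfq_info	Credit card data generally reflects information related to the credit level
mp_type	Installment type	Related to credit level
mp_status	Installment payment status	Directly affects credit
total_amt	Total product amount	May be related to credit level
rem_ppl	Remaining unpaid principal	Less outstanding principal generally means a higher credit level
total_fee	Total fees	May be related to credit level
rem_fee	Remaining unpaid fees	Less outstanding fees generally means a higher credit level
rec_fee	Fees collected	Indirectly reflects repayment ability; related to credit level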
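Deposit account information	pri_cust_asset_acct_info	Deposit accounts reflect the customer's financial capacity, which relates to star level
term	Deposit term	A longer term generally goes with a higher star level
acct_char	Account nature	Account nature can reflect star level
deps_type	Deposit type	Deposit type may be related to star level
is_secu_card	Social security card flag	Social security card accounts tend to matter more; related to star level
acct_sts	Account status	Account status reflects star level
frz_sts	Freeze status	High-star customers are generally not frozen
stp_sts	Stop-payment status	High-star customers generally have no stop-payment
acct_bal	Balance in original currency	A higher balance generally means a higher star level
bal	Balance	A higher balance generally means a higher star level
avg_mth	Monthly daily average	A higher monthly average generally means a higher star level
avg_qur	Quarterly daily average	A higher quarterly average generally means a higher star level
avg_year	Yearly daily average	A higher yearly average generally means a higher star level
Deposit summary information	pri_cust_asset_info	Deposit summaries reflect the customer's financial capacity, which relates to star level
all_bal	Total balance	A higher balance generally means a higher star level
avg_mth	Monthly daily average	A higher monthly average generally means a higher star level
avg_qur	Quarterly daily average	A higher quarterly average generally means a higher star level
avg_year	Yearly daily average	A higher yearly average generally means a higher star level
sa_bal	Demand deposit balance	A higher balance generally means a higher star level
td_bal	Time deposit balance	A higher balance generally means a higher star level
fin_bal	Wealth management balance	A higher balance generally means a higher star level
sa_crd_bal	Card demand deposit balance	A higher balance generally means a higher star level
td_crd_bal	Card time deposit	Higher time deposits generally mean a higher star level
sa_td_bal	Flexible time/demand deposit	A higher flexible deposit balance generally means a higher star level
ntc_bal	Call deposit	A higher deposit balance generally means a higher star level
td_3m_bal	3-month time deposit	Higher time deposits generally mean a higher star level
td_6m_bal	6-month time deposit	Higher time deposits generally mean a higher star level
td_1y_bal	1-year time deposit	Higher time deposits generally mean a higher star level
td_2y_bal	2-year time deposit	Higher time deposits generally mean a higher star level
td_3y_bal	3-year time deposit	Higher time deposits generally mean a higher star level
td_5y_bal	5-year time deposit	Higher time deposits generally mean a higher star level
oth_td_bal	Other time deposit balance	Higher time deposits generally mean a higher star level
cd_bal	Large-denomination certificate of deposit balance	Higher time deposits generally mean a higher star level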
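Personal basic information	pri_cust_base_info	Basic personal information reflects a person's economic level, which relates to star level
sex	Sex	No obvious association
marrige	Marital status	No obvious association
education	Education level	Higher education generally means stronger economic capacity; related to star level
career	Occupation	Occupation reflects economic capacity; related to star level
prof_titl	Professional title	Professional title reflects economic capacity; related to star level
country	Nationality	No obvious association
is_employee	Employee flag	No obvious association
is_shareholder	Shareholder flag	Shareholders have strong economic capacity; related to star level
is_black	Blacklist flag	Blacklisted customers generally have a low star level
is_contact	Related-party flag	No obvious association
is_mgr_dep	Marketing department customer flag	Marketing department customers generally have a high star level
Loan account information	pri_cust_liab_acct_info	Loan account information directly reflects credit level
loan_amt	Loan amount	May be related to credit level
loan_bal	Loan balance	A smaller balance generally goes with a lower credit level
vouch_type	Main guarantee type	The guarantee type reflects credit
is_mortgage	Mortgage flag	Mortgage loans generally indicate better credit than non-mortgage loans
is_online	Online loan flag	May be related to credit
is_extend	Extension flag	May be related to credit
five_class	Five-tier classification	May be related to credit
overdue_class	Overdue sub-category	Directly affects credit
overdue_flag	Overdue flag	Directly affects credit
owed_int_flag	Interest arrears flag	Directly affects credit
credit_amt	Contract amount	May be related to credit
owed_int_in	On-balance-sheet interest arrears	Less interest in arrears means higher credit
owed_int_out	Off-balance-sheet interest arrears	Less interest in arrears means higher credit
delay_bal	Overdue amount	A smaller overdue amount means higher credit
is_book_acct	Credit-ledger customer flag	Customers who have been granted credit generally have high credit
Loan account summary	pri_cust_liab_info	Loan accounts generally reflect credit level
all_bal	Total balance	A larger balance generally means higher credit
bad_bal	Non-performing balance	A smaller non-performing balance means higher credit
due_intr	Total interest arrears	Less interest in arrears means higher credit
norm_bal	Normal balance	A larger normal balance generally means higher credit
delay_bal	Total overdue amount	A smaller overdue amount means higher credit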
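Contract details	dm_v_tr_contract_mx	Contract details reflect credit level
buss_amt	Amount	No obvious association
bal	Balance	No obvious association
norm_bal	Normal balance	A larger balance generally means higher credit
dlay_bal	Overdue balance	A smaller overdue balance means higher credit
dull_bal	Stagnant balance	A smaller stagnant balance means higher credit
owed_int_in	On-balance-sheet interest arrears	Less interest in arrears means higher credit
owed_int_out	Off-balance-sheet interest arrears	Less interest in arrears means higher credit
fine_pr_int	Penalty interest on principal	Less penalty interest means higher credit
fine_intr_int	Penalty interest on interest	Less penalty interest means higher credit
dlay_days	Overdue days	Fewer overdue days means higher credit
five_class	Five-tier classification	Related to credit
is_bad	Bad record flag	A bad record directly affects credit
frz_amt	Frozen amount	The frozen amount directly affects credit
due_intr_days	Days of interest arrears	Fewer days in arrears means higher credit
shift_bal	Transferred balance	May be related to credit
Credit card transactions	dm_v_tr_djk_mx	May be related to credit
tran_amt	Transaction amount	May be related to credit
Third-party transactions	dm_v_tr_dsf_mx	May be related to credit
tran_amt	Transaction amount	May be related to credit
Loan note details	dm_v_tr_duebill_mx	Loan note details directly reflect credit
buss_amt	Amount	A lower amount generally means higher credit
bal	Balance	A higher balance generally means higher credit
norm_bal	Normal balance	A higher balance generally means higher credit
dlay_amt	Overdue amount	A higher overdue amount generally means lower credit
dull_amt	Stagnant amount	A higher stagnant amount generally means lower credit
bad_debt_amt	Bad debt amount	A higher bad debt amount generally means lower credit
owed_int_in	On-balance-sheet interest arrears	More interest in arrears generally means lower credit
owed_int_out	Off-balance-sheet interest arrears	More interest in arrears generally means lower credit
fine_pr_int	Penalty interest on principal	More penalty interest generally means lower credit
fine_intr_int	Penalty interest on interest	More penalty interest generally means lower credit
dlay_days	Overdue days	More days generally means lower credit
due_intr_days	Days of interest arrears	More days generally means lower credit
pay_freq	Repayment frequency	May be related to credit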
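ETC transactions	dm_v_tr_etc_mx	Reflects customer star level
tran_amt_fen	Transaction amount	A higher amount means a higher star level
real_amt	Amount actually received	A higher amount means a higher star level
conces_amt	Discount amount	A higher amount means a higher star level
Personal online banking transactions	dm_v_tr_grwy_mx	Reflects customer star level
tran_amt	Transaction amount	A higher amount means a higher star level
Payroll details	dm_v_tr_gzdf_m	Reflects customer star level
tran_amt	Transaction amount	A higher amount means a higher star level
Loan principal repayment details	dm_v_tr_huanb_mx	Reflects customer credit
tran_amt	Transaction amount	A lower loan amount generally means higher credit
bal	Balance	A lower loan balance generally means lower credit
pay_term	Number of repayment periods	More periods generally means higher credit
pprd_rfn_amt	Repayment amount per period	May be related to credit
pprd_amotz_intr	Interest computed on the per-period amortized amount	May be related to credit
Loan interest repayment details	dm_v_tr_huanx_mx	Reflects customer credit
tran_amt	Interest	May be related to credit
cac_intc_pr	Interest-bearing principal	May be related to credit
pay_term	Number of repayment periods	May be related to credit
intr	Interest rate	May be related to credit
Demand deposit transactions	dm_v_tr_sa_mx	A higher transaction amount means a higher star level
tran_amt	Transaction amount	A higher transaction amount means a higher star level
Social security and medical insurance transactions	dm_v_tr_sbyb_mx	A higher transaction amount means a higher star level
tran_amt_fen	Transaction amount	A higher transaction amount means a higher star level
Utility (water, electricity, gas) transactions	dm_v_tr_sdrq_mx	A higher transaction amount means a higher star level
tran_amt_fen	Transaction amount	A higher transaction amount means a higher star level
Merchant transaction details	dm_v_tr_shop_mx	A higher transaction amount means a higher star level
tran_amt	Transaction amount	A higher transaction amount means a higher star level
score_num	Bonus points	More points means a higher star level
Mobile banking transactions	dm_v_tr_sjyh_mx	A higher transaction amount means a higher star level
tran_amt	Transaction amount	A higher transaction amount means a higher star level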

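2. Merging rows by uid within each table

Within a single table the same uid may appear several times, so rows are merged until every uid is unique in every table. The strategy is:

For numeric columns, additive quantities are merged with sum, and the few that cannot meaningfully be added (such as five_class) are merged with mean.

For non-numeric columns, the mode is used.

import pandas as pd

# Merge rows within one table so that each uid appears once
def inner_merge(df, table_name):
    # Convert every column that can be parsed as numeric; leave the others unchanged
    df = df.apply(pd.to_numeric, errors='ignore')
    # Sort by uid to make merging easier
    df = df.sort_values('uid', ascending=True)

    # Keep only the pre-selected fields (copy so the original frame is not modified)
    df_selected = df[select_cols[table_name]].copy()

    # Fill missing is_bad flags with 'N' so these rows are not filtered out later.
    # Only the is_bad column is filled; other columns keep their missing values.
    if 'is_bad' in df_selected.columns:
        df_selected['is_bad'] = df_selected['is_bad'].fillna('N')

    # Special case: when five_class is stored as text, extract its digits and convert
    # them to numbers. This must happen before the numeric columns are detected.
    if 'five_class' in df_selected.columns and not pd.api.types.is_numeric_dtype(df_selected['five_class']):
        digits = df_selected['five_class'].astype(str).str.extract(r'(\d+)', expand=False)
        df_selected['five_class'] = pd.to_numeric(digits)

    # Split the columns into numeric and non-numeric
    numeric_cols = df_selected.select_dtypes(include='number').columns
    non_numeric_cols = df_selected.select_dtypes(exclude='number').columns

    # Aggregation rule: sum for additive numeric columns, mean for numeric columns
    # that cannot be added, and mode for enumerated string columns
    agg_function = {}
    for col in numeric_cols:
        if col == 'uid':
            continue
        if col in ['five_class']:
            agg_function[col] = 'mean'
        else:
            agg_function[col] = 'sum'

    for col in non_numeric_cols:
        if col == 'uid':
            continue
        agg_function[col] = util.util.compute_mode

    # Merge by uid
    df_merged = df_selected.groupby('uid').agg(agg_function).reset_index()
    df_merged = df_merged.sort_values('uid', ascending=True)

    return df_merged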
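
The aggregation rule can be tried on a toy table. This is a minimal sketch, not the project's code: the column values are made up, and `compute_mode` here is a hypothetical stand-in for `util.util.compute_mode`, whose implementation is not shown in this report.

```python
import pandas as pd


def compute_mode(series):
    """Hypothetical stand-in for util.util.compute_mode: the most frequent value."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else None


# Toy table: uid 1 appears twice, uid 2 once (values are invented).
toy = pd.DataFrame({
    "uid": [1, 1, 2],
    "bal": [100.0, 50.0, 30.0],    # additive, merged with sum
    "five_class": [1, 3, 2],       # not additive, merged with mean
    "acct_sts": ["A", "A", "B"],   # categorical, merged with the mode
})

merged = (
    toy.groupby("uid")
       .agg({"bal": "sum", "five_class": "mean", "acct_sts": compute_mode})
       .reset_index()
)
print(merged)
```

After the merge each uid occupies exactly one row, which is what the later join on uid relies on.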

3. Joining tables on uid

All tables are joined on uid into one wide table with many columns. To tell apart column names that may repeat across tables, every column except uid, star_level, and credit_level is renamed to 'table_name:column_name'.

# Join all tables on uid into one wide table
def table_aggregate(tables):
    df_merged = None

    for table_name, table_data in tables.items():
        print("now aggregate table " + table_name)

        # Rename every column except uid, star_level and credit_level to table_name:column_name
        prefix = f'{table_name}:'
        table_data.columns = [prefix + col if col not in ['uid', 'credit_level', 'star_level'] else col for col in table_data.columns]

        if df_merged is None:
            df_merged = table_data.copy()
        else:
            # Outer join keeps uids that appear in only some of the tables
            df_merged = pd.merge(df_merged, table_data, on='uid', how='outer')

    return df_merged
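
To make the effect of the prefix and the outer join concrete, here is a small sketch under stated assumptions: the table names `acct` and `loan` and the helper `prefix_columns` are invented for illustration and do not appear in the project.

```python
import pandas as pd


def prefix_columns(df, table_name, keep=("uid", "credit_level", "star_level")):
    """Return a copy whose non-key columns are renamed to 'table_name:column'."""
    mapping = {c: f"{table_name}:{c}" for c in df.columns if c not in keep}
    return df.rename(columns=mapping)


# Two toy tables that both have a column called bal.
a = pd.DataFrame({"uid": [1, 2], "bal": [10.0, 20.0]})
b = pd.DataFrame({"uid": [2, 3], "bal": [5.0, 7.0]})

wide = pd.merge(prefix_columns(a, "acct"), prefix_columns(b, "loan"),
                on="uid", how="outer")
print(wide)
```

A uid that is absent from one table gets NaN in that table's columns, which is why a missing-value step is needed after the join.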

4. Separating training and test sets

Rows are assigned to the test set or the training set according to whether the star_level and credit_level columns equal -1. Rows in which the relevant label is empty are excluded.

# Split star-level data into training and test sets, dropping rows whose star level is empty
df_star_train = df_merged.loc[df_merged['star_level'] != -1]
df_star_train = df_star_train.dropna(subset=['star_level']).drop('credit_level', axis=1)
df_star_test = df_merged.loc[df_merged['star_level'] == -1].drop('credit_level', axis=1)

# Split credit-level data into training and test sets, dropping rows whose credit level is empty
df_credit_train = df_merged.loc[df_merged['credit_level'] != -1]
df_credit_train = df_credit_train.dropna(subset=['credit_level']).drop('star_level', axis=1)
df_credit_test = df_merged.loc[df_merged['credit_level'] == -1].drop('star_level', axis=1)

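5. Data inventory

The pandas describe function is used for the inventory. It reports the mean, minimum, maximum, quartiles, and standard deviation of every field in the dataset. The results are saved in resources/inventory/. An example:

[Figure: inventory]

The inventory is visualized with matplotlib as bar charts and box plots of the dataset. These are also saved in resources/inventory/.

Because missing-value handling changes the statistics, the inventory is run once before and once after it. The code is as follows: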

import matplotlib.pyplot as plt
import pandas as pd

import preprocess

# star
df = pd.read_csv("resources/star/star_train.csv")

# Inventory before missing values have been handled
description = df.describe()

# Save the inventory result
print("Description(before handling missing values):\n", description)
description.to_csv('resources/inventory/star_train_description_before_handling_missing.csv', index=True)

# Bar chart
description.plot(kind='bar', figsize=(20, 16))
plt.title('Descriptive Statistics')
plt.xlabel('Statistics')
plt.ylabel('Values')
plt.legend(loc='best')
plt.show()

# Box plot
description.plot(kind='box', figsize=(20, 16))
plt.title('Descriptive Statistics')
plt.xlabel('Statistics')
plt.ylabel('Values')
plt.show()

# Handle missing values, then take the inventory again
df_handled = preprocess.handle_missing_values(df, is_test=False)
description_handled = df_handled.describe()

# Save the inventory result
print("Description(after handling missing values):\n", description_handled)
description_handled.to_csv('resources/inventory/star_train_description_after_handling_missing.csv', index=True)

# Bar chart
description_handled.plot(kind='bar', figsize=(20, 16))
plt.title('Descriptive Statistics')
plt.xlabel('Statistics')
plt.ylabel('Values')
plt.legend(loc='best')
plt.show()

# Box plot
description_handled.plot(kind='box', figsize=(20, 16))
plt.title('Descriptive Statistics')
plt.xlabel('Statistics')
plt.ylabel('Values')
plt.show()

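II. Data Preprocessing

1. Missing-value handling

Columns whose missing rate exceeds 0.7 are dropped first, and the remaining missing values are then filled: numeric variables with the median, categorical variables with the mode.

def handle_missing_values(df, is_test):

    # Only the training set drops high-missing-rate columns; the test set must keep
    # the same columns as the training set
    if is_test is False:
        # Missing rate of each column
        missing_rate = df.isna().mean()

        # Drop columns whose missing rate reaches the threshold, except protected columns
        max_missing_rate = 0.7
        df = df.loc[:, (missing_rate < max_missing_rate) | (df.columns.isin(columns_to_keep))]

    df = df.copy()
    # Numeric variables: fill missing values with the median
    numeric_cols = df.select_dtypes(include='number').columns
    for col in numeric_cols:
        if col in columns_to_keep:
            continue
        median = df[col].median()
        # Assign back instead of inplace=True, which may not update df in recent pandas
        df[col] = df[col].fillna(median)

    # Categorical variables: fill missing values with the mode
    non_numeric_cols = df.select_dtypes(include='object').columns
    for col in non_numeric_cols:
        if col in columns_to_keep:
            continue
        mode = df[col].mode()[0]
        df[col] = df[col].fillna(mode)

    df_handled = df
    return df_handled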
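
The fill rule itself is simple enough to check by hand. The following is a dependency-free sketch with invented values; `fill_missing` and `missing_rate` are illustrative names, not project functions, and `None` stands in for a missing cell.

```python
from statistics import median, mode


def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def fill_missing(values, numeric):
    """Fill None with the median (numeric column) or the mode (categorical column)."""
    present = [v for v in values if v is not None]
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


balances = [100.0, None, 300.0, 200.0]           # numeric column
careers = ["teacher", None, "teacher", "clerk"]  # categorical column

filled_balances = fill_missing(balances, numeric=True)
filled_careers = fill_missing(careers, numeric=False)
print(missing_rate(balances), filled_balances, filled_careers)
```

The median is preferred over the mean for amounts because a handful of very large balances would otherwise pull the fill value upward.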

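2. Outlier handling

For numeric columns, values below the 5th percentile are replaced with the 5th percentile, and values above the 95th percentile are replaced with the 95th percentile.

def handle_outliers(df):

    numeric_cols = df.select_dtypes(include='number').columns

    for col in numeric_cols:
        if col in columns_to_keep:
            continue
        # Thresholds are the 5th and 95th percentiles of the column
        lower_threshold = df[col].quantile(0.05)
        upper_threshold = df[col].quantile(0.95)

        # Pull values outside the thresholds back to the nearest threshold
        df.loc[df[col] < lower_threshold, col] = lower_threshold
        df.loc[df[col] > upper_threshold, col] = upper_threshold

    df_handled = df
    return df_handled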
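
The effect of this percentile clipping can be seen on a short list with one extreme value. This is a plain-Python sketch; the `quantile` helper uses linear interpolation between neighbouring ranks, which is an assumption made here to mimic the usual default, and the data are invented.

```python
def quantile(sorted_values, q):
    """Quantile with linear interpolation between neighbouring ranks."""
    pos = (len(sorted_values) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def clip_to_percentiles(values, lower_q=0.05, upper_q=0.95):
    """Replace values outside [lower_q, upper_q] percentiles with the thresholds."""
    ordered = sorted(values)
    low, high = quantile(ordered, lower_q), quantile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]


data = list(range(1, 21)) + [1000]   # 1..20 plus one extreme value
clipped = clip_to_percentiles(data)
print(max(data), max(clipped))
```

The extreme value no longer dominates the range, so the later min-max scaling is not compressed by a single outlier.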

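3. Data transformation

Label encoding is used to convert non-numeric features into numeric ones. The uid, credit_level, and star_level columns are excluded from the transformation.

from sklearn.preprocessing import LabelEncoder

def transfer_nonnumerical(df):

    label_encoder = LabelEncoder()

    # Select the non-numeric columns
    non_numeric_cols = df.select_dtypes(exclude='number').columns

    for col in non_numeric_cols:
        if col in columns_to_keep:
            continue
        # fit_transform refits the encoder for each column
        df[col] = label_encoder.fit_transform(df[col])

    df_transferred = df

    return df_transferred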

4. Data normalization

Min-Max normalization is used: a linear transformation rescales the data to a given range, usually 0 to 1. The formula is:
$$
x^{*} = \frac{x - \min}{\max - \min}
$$
So that the same scaler object can be reused at prediction time and the scaling stays consistent, the fitted scaler is saved with pickle. The uid, credit_level, and star_level columns are excluded from normalization.

import pickle

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def standardize(df, is_test):

    # Columns that must not be scaled: uid plus the target column.
    # The target also decides where the fitted scaler is stored.
    y_columns = ['uid']
    if 'credit_level' in df.columns:
        y_columns.append('credit_level')
        save_scaler_pickle_path = 'resources/pickle/credit_scaler_pickle.pk1'
    else:
        y_columns.append('star_level')
        save_scaler_pickle_path = 'resources/pickle/star_scaler_pickle.pk1'

    features = df.drop(y_columns, axis=1)

    # Training set: fit a new scaler and save it. Test set: reuse the saved scaler.
    if is_test is False:
        scaler = MinMaxScaler()
        scaler.fit(features)

        # Save the fitted scaler
        with open(save_scaler_pickle_path, 'wb') as file:
            pickle.dump(scaler, file)
    else:
        with open(save_scaler_pickle_path, 'rb') as file:
            scaler = pickle.load(file)

    # Scale the features; keep the original index so the concat below aligns row by row
    df_standardized = pd.DataFrame(scaler.transform(features), columns=features.columns, index=df.index)

    # Add the unscaled columns back
    df_standardized = pd.concat([df[y_columns], df_standardized], axis=1)

    return df_standardized
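
Why the scaler has to be fitted on the training data and then reused can be shown with a one-column toy version. The `MinMax` class below is a hypothetical, simplified stand-in written for this illustration; it is not scikit-learn's `MinMaxScaler`.

```python
class MinMax:
    """Tiny one-column stand-in for a fitted min-max scaler (illustration only)."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = self.hi - self.lo
        # A constant column has zero span; map it to 0.0 to avoid dividing by zero.
        return [0.0 if span == 0 else (v - self.lo) / span for v in values]


train = [10.0, 20.0, 30.0]
test = [15.0, 40.0]

scaler = MinMax().fit(train)            # min and max come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # the same min and max are reused
print(train_scaled, test_scaled)
```

A test value larger than the training maximum maps above 1. That is expected: refitting on the test set would hide it, but would also make the two sets incomparable.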

An example of the data after normalization:

[Figure: credit_std]

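III. Feature Engineering and Feature Selection

1. Feature engineering

Feature engineering is part of data preprocessing. It creates new features or transforms existing ones so that they represent the underlying problem better, with the aim of improving the performance of machine learning models. It is the process of using domain knowledge to extract useful features from raw data. Its benefits are:

Better model performance: features that are more closely related to the prediction target improve predictive accuracy.
Lower computational cost: removing irrelevant or redundant features reduces the resources needed for training and prediction.
Better interpretability: for some model types, such as decision trees and linear models, meaningful features make the model's decisions easier to understand.

Common feature engineering methods include scaling, discretization, interaction terms, encoding, and missing-value handling. Which ones to use depends on the problem at hand: the data types (continuous, categorical, ordinal, and so on), the data quality (missing values, outliers, and so on), the prediction target, and the type of model.

In this experiment we consider the following feature scoring methods: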

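1.1 Pearson correlation coefficient

$$
\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{E((X - \mu_{X})(Y - \mu_{Y}))}{\sigma_{X}\sigma_{Y}} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^{2}) - E^{2}(X)}\sqrt{E(Y^{2}) - E^{2}(Y)}}
$$

$$
\rho_{X,Y} = \frac{N\sum{XY} - \sum{X}\sum{Y}}{\sqrt{N\sum{X^{2}} - (\sum{X})^{2}}\sqrt{N\sum{Y^{2}} - (\sum{Y})^{2}}}
$$

The Pearson correlation coefficient is a statistic that measures the strength and direction of the linear relationship between two variables. Its value lies between -1 and 1, where:

1 means perfect positive correlation (as one variable increases, so does the other).

-1 means perfect negative correlation (as one variable increases, the other decreases).

0 means there is no linear relationship between the two variables.

The main role of the Pearson correlation coefficient is as follows:

Strength and direction of a relationship: the Pearson correlation coefficient is one of the most common ways to quantify the strength and direction of the relationship between two continuous variables. However, it only measures linear relationships. If two variables are related non-linearly, the coefficient may be close to zero even when the relationship is very tight. It is also very sensitive to outliers, which can change the result noticeably. Both points need to be kept in mind when analysing correlations.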

The code is as follows:

from collections import defaultdict

import numpy as np

def pearson_select_features(file_path, k):
    """
    Select the k features most correlated with each label by Pearson correlation.
    file_path is the path of the merged table, k is the number of features to select.
    Returns a list of two lists, each of length k: the k features most correlated
    with credit_level and with star_level respectively.
    Each input line is assumed to look like {name:value,name:value,...}.
    """
    tags = ['credit_level', 'star_level']
    # y[tag] holds the label values; x[name] holds the values of feature `name`
    y = {tag: [] for tag in tags}
    x = defaultdict(list)
    # Fill x and y
    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip().strip('{}')
            if not line:
                continue
            for item in line.split(","):
                # Feature names look like table:column, so split on the last colon only
                key, value = item.rsplit(":", 1)
                key = key.strip().strip('\'"')
                if key in y:
                    y[key].append(float(value))
                else:
                    x[key].append(float(value))
    res = []
    for tag in tags:
        # np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient itself
        tag_scores = [(name, np.corrcoef(values, y[tag])[0, 1]) for name, values in x.items()]
        res.append(select_k_features(tag_scores, k))
    return res
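
The limitation to linear relationships is easy to demonstrate numerically. The sketch below implements the coefficient directly from its definition in plain Python; the data are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear = pearson(xs, [2 * x + 1 for x in xs])   # y is an exact linear function of x
curved = pearson(xs, [x * x for x in xs])       # y depends on x, but not linearly
print(linear, curved)
```

In the second case y is completely determined by x, yet the coefficient is zero because the dependence is symmetric around the mean. This is the situation that motivates the mutual-information score in the next subsection.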

1.2 Mutual information

$$
I(X;Y) = \int_{X} \int_{Y} P(X,Y)\log\frac{P(X,Y)}{P(X)P(Y)}
$$

$$
= \int_{X} \int_{Y}P(X,Y)\log\frac{P(X,Y)}{P(X)} - \int_{X} \int_{Y}P(X,Y)\log P(Y)
$$

$$
= \int_{X}P(X)\int_{Y}P(Y|X)\log P(Y|X) - \int_{Y}P(Y)\log P(Y)
$$

$$
= - \int_{X}P(X)H(Y|X=x) + H(Y) = H(Y) - H(Y|X)
$$

Mutual information (MI) indicates whether two variables X and Y are related and how strong the relationship is. It measures the information shared by two random variables: how much knowing X reduces the uncertainty about Y (or, equivalently, how much knowing Y reduces the uncertainty about X).

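The code is as follows:

import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif


def mutual_select_features(file_path, k):
  """
  The Pearson coefficient only measures linear correlation; mutual information lifts this
  limitation and captures many kinds of dependence, so features can be ranked by it.
  file_path is the path of the merged table. k is the intended number of features to
  select, but it is not used here: every feature is scored against credit_level and the
  ranking is written to resources/feature/credit_scores_mutual.txt for the selection step.
  """
  df = pd.read_csv(file_path)
  df.pop('uid')
  y = df.pop('credit_level')
  X = df
  # k='all' keeps every feature; only the scores are needed here
  best_features = SelectKBest(score_func=mutual_info_classif, k='all')
  fit = best_features.fit(X, y)
  df_scores = pd.DataFrame(fit.scores_)
  df_columns = pd.DataFrame(X.columns)
  df_feature_scores = pd.concat([df_columns, df_scores], axis=1)
  df_feature_scores.columns = ['Feature', 'Score']
  df_feature_scores = df_feature_scores.sort_values(by='Score', ascending=False)
  # The with-block closes the file even if writing fails
  with open('resources/feature/credit_scores_mutual.txt', 'w', encoding='utf-8') as f:
    for line in df_feature_scores.iterrows():
      f.write(str(line) + "\n")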
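
For discrete variables the integrals in the definition become sums over observed value pairs, which makes a small worked example possible. This plain-Python sketch estimates the probabilities from counts; it is an illustration of the definition, not the estimator that `mutual_info_classif` uses internally.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """I(X;Y) in nats for two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


x = [0, 0, 1, 1]
determined = mutual_information(x, [0, 0, 1, 1])    # Y is fully determined by X
independent = mutual_information(x, [0, 1, 0, 1])   # Y is independent of X
print(determined, independent)
```

When Y is determined by X the score equals H(Y), here log 2, and when the two are independent it is zero, matching the last line of the derivation, I(X;Y) = H(Y) - H(Y|X).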

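1.3 Chi-square test

$$
\chi^{2} = \sum{\frac{(O - E)^{2}}{E}}
$$

Here O is the observed frequency and E is the expected frequency. The chi-square test can be used for feature selection, especially in classification problems. The basic idea is to look at how strongly a feature is associated with the target variable: the stronger the association, the more important the feature. The chi-square statistic measures the difference between the observed and the expected distribution. A larger value means a larger difference, and therefore a stronger association between the feature and the target, and vice versa. Each feature can therefore be scored by its chi-square statistic, and the highest-scoring features selected.

This method suits categorical targets and categorical features; it does not suit continuous variables. If a feature or the target is continuous, it may need to be discretized first, or another method may be considered, such as selection based on the Pearson correlation coefficient or on a model.

Note that although the chi-square test helps find features that are strongly associated with the target, it cannot detect interactions between features. If a feature affects the target only in combination with another feature, the chi-square test may fail to recognise its importance.

The code is as follows: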

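import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Score every feature against credit_level with the chi-square statistic and write the
# ranking to resources/feature/credit_scores_chi2.txt. k is not used: all features are scored.
def chi2_select_features(file_path, k):
  df = pd.read_csv(file_path)
  df.pop('uid')
  y = df.pop('credit_level')
  X = df
  # chi2 needs non-negative feature values, which holds after Min-Max normalization
  best_features = SelectKBest(score_func=chi2, k='all')
  fit = best_features.fit(X, y)
  df_scores = pd.DataFrame(fit.scores_)
  df_columns = pd.DataFrame(X.columns)
  df_feature_scores = pd.concat([df_columns, df_scores], axis=1)
  df_feature_scores.columns = ['Feature', 'Score']
  df_feature_scores = df_feature_scores.sort_values(by='Score', ascending=False)
  # The with-block closes the file even if writing fails
  with open('resources/feature/credit_scores_chi2.txt', 'w', encoding='utf-8') as f:
    for line in df_feature_scores.iterrows():
      f.write(str(line) + "\n")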
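
The statistic can be computed by hand for a categorical feature and a categorical target. The sketch below builds the contingency table from counts and applies the textbook formula; the data are invented. It is meant to illustrate the formula only: scikit-learn's `chi2` scorer works on non-negative feature values rather than on an explicit contingency table, so its scores need not equal these numbers.

```python
from collections import Counter


def chi_square(feature, target):
    """Textbook chi-square statistic of the feature-by-target contingency table."""
    n = len(feature)
    row, col = Counter(feature), Counter(target)
    cell = Counter(zip(feature, target))
    stat = 0.0
    for f in row:
        for t in col:
            expected = row[f] * col[t] / n   # count expected under independence
            observed = cell[(f, t)]          # Counter returns 0 for unseen pairs
            stat += (observed - expected) ** 2 / expected
    return stat


feature = ["Y", "Y", "N", "N"]
associated = chi_square(feature, [1, 1, 0, 0])    # feature matches the target
unassociated = chi_square(feature, [1, 0, 1, 0])  # feature carries no information
print(associated, unassociated)
```

The more the observed counts deviate from what independence would predict, the larger the score, which is why the score can be used to rank features.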

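2. Feature selection

Of the methods above, we finally settled on three bases for feature selection: experience, the chi-square test, and mutual information.

[Figure: excerpt of credit_scores_chi2, the chi-square scores]

For each scoring method we first take every feature whose score is at least 5% of that method's maximum score. The union of these sets is the final feature set, and all columns outside it are dropped from the data.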

# Keep every feature whose score is at least `proportion` of the highest score
def select_feature(feature_scores, proportion):
  select_features = []
  max_score = max(feature_scores.values())
  for feature, score in feature_scores.items():
    if score >= max_score * proportion:
      select_features.append(feature)
  return select_features


credit_chisquare_scores_path = 'resources/feature/credit_scores_chi2.txt'
credit_chisquare_feature_scores = read_feature_scores(credit_chisquare_scores_path)
credit_chisquare_select_features = select_feature(credit_chisquare_feature_scores, 0.05)

credit_mutual_scores_path = 'resources/feature/credit_scores_mutual.txt'
credit_mutual_feature_scores = read_feature_scores(credit_mutual_scores_path)
credit_mutual_select_features = select_feature(credit_mutual_feature_scores, 0.05)

# Union of the chi-square selection and the mutual-information selection
credit_select_features = list(set(credit_chisquare_select_features).union(set(credit_mutual_select_features)))
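
A toy run shows how the threshold and the union interact. The feature names and scores below are invented for illustration; `select_feature` is restated so the example runs on its own.

```python
def select_feature(feature_scores, proportion):
    """Keep features scoring at least `proportion` of the best score."""
    max_score = max(feature_scores.values())
    return [f for f, s in feature_scores.items() if s >= max_score * proportion]


# Made-up scores, for illustration only.
chi2_scores = {"acct:bal": 200.0, "loan:delay_bal": 12.0, "base:sex": 3.0}
mutual_scores = {"acct:bal": 0.40, "loan:delay_bal": 0.01, "base:sex": 0.05}

from_chi2 = select_feature(chi2_scores, 0.05)      # threshold is 5% of 200.0
from_mutual = select_feature(mutual_scores, 0.05)  # threshold is 5% of 0.40
selected = sorted(set(from_chi2) | set(from_mutual))
print(selected)
```

Because the threshold is relative to each method's own maximum, scores on very different scales can be combined without rescaling, and a feature survives if either method rates it highly enough.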

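IV. Model Selection and Training

The labeled dataset is split into a training set and a test set at a ratio of 7:3, giving features and labels for each.

import pandas as pd
from sklearn.model_selection import train_test_split

def get_features_and_labels(train_csv, test_csv, level):
    train_data = pd.read_csv(train_csv)
    test_data = pd.read_csv(test_csv)
    # Separate features and labels
    features = train_data.drop(['uid', level], axis=1)
    label = train_data[level]
    label = label.astype(int)
    # 70% for training, 30% for testing; the fixed seed makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.3, random_state=18)
    return X_train, X_test, y_train, y_test, test_data

Logistic regression, decision tree, random forest, XGBoost, and a neural network were chosen as models. Each model is passed to the run_model method for training.

# category is [target, model_name], where target is 'credit' or 'star'
def run_model(model, train_features, train_label, test_features, test_label, test, category):
    print(category[0] + ':')
    model.fit(train_features, train_label)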

1. Logistic regression

The default maximum number of iterations is 100. In practice this produced a convergence warning: the solver stopped before reaching the optimum. After experimenting we set the limit to 1000, which lets the solver converge without slowing training down through excessive iterations.

# Logistic regression
lr = LogisticRegression(max_iter=1000)  # the default of 100 iterations did not converge
print('Logistic Regression:')
run_model(lr, X_credit_train, y_credit_train, X_credit_test, y_credit_test, credit_test,
          ['credit', 'logistic_regression'])
run_model(lr, X_star_train, y_star_train, X_star_test, y_star_test, star_test,
          ['star', 'logistic_regression'])

2. Decision tree

# Decision tree
dt = DecisionTreeClassifier()
run_model(dt, X_credit_train, y_credit_train, X_credit_test, y_credit_test, credit_test,
          ['credit', 'decision_tree'])
run_model(dt, X_star_train, y_star_train, X_star_test, y_star_test, star_test,
          ['star', 'decision_tree'])

3. Random forest

# Random forest
rt = RandomForestClassifier()
run_model(rt, X_credit_train, y_credit_train, X_credit_test, y_credit_test, credit_test,
          ['credit', 'random_forest'])
run_model(rt, X_star_train, y_star_train, X_star_test, y_star_test, star_test,
          ['star', 'random_forest'])

4. XGBoost

XGBoost requires class labels to be consecutive integers starting from 0, so the labels are mapped to the integers 0 to n-1 (n being the number of classes) as follows.

# Create the XGBoost model
xgb_model = xgb.XGBClassifier()
# Define the label mappings
star_map = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, 9: 8}
credit_map = {35: 0, 50: 1, 60: 2, 70: 3, 85: 4}
# Map the label columns
y_star_train_mapped = y_star_train.map(star_map)
y_star_test_mapped = y_star_test.map(star_map)
y_credit_train_mapped = y_credit_train.map(credit_map)
y_credit_test_mapped = y_credit_test.map(credit_map)
print('XGBoost:')
run_model(xgb_model, X_credit_train, y_credit_train_mapped, X_credit_test, y_credit_test_mapped, credit_test,
          ['credit', 'XGBoost'])
run_model(xgb_model, X_star_train, y_star_train_mapped, X_star_test, y_star_test_mapped, star_test,
          ['star', 'XGBoost'])
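
The forward and inverse mappings can also be derived from the class list instead of being typed out twice, which keeps them consistent. This is a small sketch; `build_label_maps` is a hypothetical helper, not part of the project.

```python
def build_label_maps(classes):
    """Map sorted class labels to 0..n-1 and build the inverse mapping."""
    forward = {label: index for index, label in enumerate(sorted(classes))}
    inverse = {index: label for label, index in forward.items()}
    return forward, inverse


credit_forward, credit_inverse = build_label_maps([35, 50, 60, 70, 85])

encoded = [credit_forward[v] for v in [50, 85, 35]]   # labels as XGBoost expects them
decoded = [credit_inverse[i] for i in encoded]        # predictions mapped back
print(encoded, decoded)
```

The inverse mapping is what the model-application step needs in order to turn predicted class indices back into credit levels.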

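5. Neural network

A neural network is built for training and prediction. The network for the star model is structured as follows (the credit network is analogous):

Input layer: accepts the input data; its size is 14, one per input attribute.
Hidden layer 1: 128 neurons; applies a linear transformation followed by the ReLU activation to the input.
Hidden layer 2: 64 neurons; applies a linear transformation followed by ReLU to the output of hidden layer 1.
Output layer: 9 neurons, corresponding to star ratings 1 to 9; a linear transformation produces the final output, which is then normalized with the log softmax function.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # The sizes shown here (39 inputs, 5 outputs) differ from the star network
        # described above (14 inputs, 9 outputs); 5 outputs matches the five credit
        # levels, so this listing appears to be the credit variant.
        self.fc1 = nn.Linear(39, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 5)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)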

The dataset class is as follows:

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, csv_file, train=True):
        self.train = train
        csv = pd.read_csv(csv_file)
        csv = csv.drop(['uid'], axis=1)
        # Split into training and validation sets at 8:2; the same seed gives the same split
        if self.train:
            self.data, _ = train_test_split(csv, test_size=0.2, random_state=42)
        else:
            _, self.data = train_test_split(csv, test_size=0.2, random_state=42)

    def __getitem__(self, index):
        # Convert features and label to PyTorch tensors; the last 9 columns hold the label
        data_tensor = torch.tensor(self.data.iloc[index, 0:-9].values, dtype=torch.float32).to(device)
        label_tensor = torch.tensor(self.data.iloc[index, -9:].values, dtype=torch.float32).to(device)

        return data_tensor, label_tensor

    def __len__(self):
        return len(self.data)

After 30 epochs the loss is essentially stable:

# device is cuda, so training runs on the GPU
model = Net().to(device)

# Loss function: mean squared error
criterion = nn.MSELoss()

# Optimizer: Adam with a learning rate of 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

train_dataset = MyDataset("star_train_std_selected_split.csv", True)
train_dataloader = DataLoader(train_dataset, batch_size=5000, shuffle=True, drop_last=True,
                              num_workers=0) # one batch is 5000 rows

valid_dataset = MyDataset("star_train_std_selected_split.csv", False)
valid_dataloader = DataLoader(valid_dataset, batch_size=5000, shuffle=True, drop_last=True,
                              num_workers=0)


for epoch in range(30):
    # Training
    model.train()
    for i, (data, label) in enumerate(train_dataloader):
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()

    # Validation: no gradients are needed, so they are switched off
    model.eval()
    with torch.no_grad():
        for i, (data, label) in enumerate(valid_dataloader):
            output = model(data)
            loss = criterion(output, label)
            accuracy = (output.argmax(dim=1) == label.argmax(dim=1)).float().mean()
            print("Epoch: {}, Iter: {}, Accuracy: {}, loss: {}".format(epoch, i, accuracy.item(), loss.item()))
            predictions = output.argmax(dim=1).cpu().numpy()
            true_labels = label.argmax(dim=1).cpu().numpy()

Finally, the trained model is used to predict star levels:

# Holds the predicted class index of every row
predictions = []
with torch.no_grad():
    for data in inference_dataloader:
        data = data.to(device)
        output = model(data)
        batch_predictions = output.argmax(dim=1).cpu().numpy()
        predictions.extend(batch_predictions)

# Print the list of predicted class indices
print("Predictions:")
print(predictions)

# Map to uid:star_level (class index 0 corresponds to star level 1)
df_temp = pd.DataFrame(predictions, columns=['star_level'])
df = pd.read_csv('star_test_std_selected.csv')
df['star_level'] = df_temp['star_level'] + 1
df = df[['uid', 'star_level']]

# Write the predictions to a csv file
df.to_csv('star_neural_network_predict.csv', index=False)
print(df)

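V. Model Evaluation

The models are evaluated with accuracy, the confusion matrix, precision, recall, F1 score, and Cohen's Kappa coefficient.

Precision, recall, and F1 score are each computed with both macro and weighted as the value of the average parameter. Macro averaging computes the metric for each class separately and takes the arithmetic mean over all classes. Weighted averaging weights each class's metric by the number of samples of that class in the true labels. Because the class counts in the preprocessed data differ greatly, our group considers weighted the more reasonable choice, but the macro results are reported as well.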
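
The gap between the two averages on imbalanced data can be reproduced with a few lines of plain Python. The sketch below computes recall only, on invented labels in which one class dominates and a rare class is always missed; `recalls` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter


def recalls(y_true, y_pred):
    """Per-class recall, plus its macro and support-weighted averages."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = {c: hits[c] / support[c] for c in support}
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * support[c] for c in support) / len(y_true)
    return per_class, macro, weighted


# Imbalanced toy labels: class 1 dominates, class 9 is rare and never predicted.
y_true = [1] * 8 + [9] * 2
y_pred = [1] * 10

per_class, macro, weighted = recalls(y_true, y_pred)
print(per_class, macro, weighted)
```

The weighted average stays high because the large class is predicted well, while the macro average is pulled down by the rare class. The same pattern would explain why the macro columns in the result tables are much lower than the weighted ones.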


from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

predictions = model.predict(test_features)
# Evaluate the model
evaluate(test_label, predictions, category)

# Body of evaluate(label, predictions, category):
# Compute the confusion matrix
confusion = confusion_matrix(label, predictions)
draw_confusion(confusion, category)
print('Confusion Matrix')

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = accuracy_score(label, predictions)
# Cohen's Kappa coefficient
kappa_score = cohen_kappa_score(label, predictions)
# Macro precision; per class, precision = TP / (TP + FP)
macro_precision = precision_score(label, predictions, average='macro', zero_division=1)
# Macro recall; per class, recall = TP / (TP + FN)
macro_recall = recall_score(label, predictions, average='macro')
# Macro F1; per class, F1 = 2 * (precision * recall) / (precision + recall)
macro_f1 = f1_score(label, predictions, average='macro')
# Weighted precision
weighted_precision = precision_score(label, predictions, average='weighted', zero_division=1)
# Weighted recall
weighted_recall = recall_score(label, predictions, average='weighted')
# Weighted F1
weighted_f1 = f1_score(label, predictions, average='weighted')
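
Cohen's Kappa corrects accuracy for the agreement that would be expected by chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. A minimal plain-Python sketch on invented labels:

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: the fraction of matching labels (this is the accuracy)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2, 2, 1]
kappa = cohen_kappa(y_true, y_pred)
print(kappa)
```

Here the accuracy is 0.75 but half of that agreement would be expected by chance, so kappa is lower than the accuracy. This is why kappa is a useful companion metric when the classes are imbalanced.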

import itertools

import matplotlib.pyplot as plt
import numpy as np

def draw_confusion(confusion, category):
    # Draw the confusion matrix
    plt.imshow(confusion, cmap='Blues')
    if category[0] == 'star':
        class_names = ['1', '2', '3', '4', '5', '6', '7', '8', '9']
    elif category[0] == 'credit':
        class_names = ['35', '50', '60', '70', '85']
    # Add the colour bar
    plt.colorbar()
    # Set the tick labels
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names)
    plt.yticks(tick_marks, class_names)
    # Write the count into each cell, white on dark cells and black on light ones
    thresh = confusion.max() / 2
    for i, j in itertools.product(range(confusion.shape[0]), range(confusion.shape[1])):
        plt.text(j, i, format(confusion[i, j], 'd'),
                 horizontalalignment="center",
                 color="white" if confusion[i, j] > thresh else "black")
    # Add the axis labels
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    # Adjust the layout
    plt.tight_layout()
    # Save as an image
    save_path = 'resources/evaluation results/'+category[0]+'_'+category[1]+'_confusion_matrix.png'
    plt.savefig(save_path)
    # Show the figure
    plt.show()

Selected results:

[Figure: confusion matrix of the decision tree model evaluated on star_level]

Accuracy, precision, recall, F1 score, and Cohen's Kappa coefficient:

star	Accuracy	Precision (weighted)	Recall (weighted)	F1 (weighted)	Precision (macro)	Recall (macro)	F1 (macro)	Cohen's Kappa
Logistic regression	0.8443	0.8388	0.8443	0.8353	0.6835	0.4218	0.4268	0.7419
Decision tree	0.8865	0.8873	0.8865	0.8868	0.6732	0.4530	0.4519	0.8183
Random forest	0.9059	0.9067	0.9059	0.9061	0.6945	0.4726	0.4717	0.8500
XGBoost	0.9114	0.9137	0.9114	0.9120	0.7063	0.4870	0.4833	0.8589
Neural network	0.9039	0.9066	0.9040	0.9003				0.8435

credit	Accuracy	Precision (weighted)	Recall (weighted)	F1 (weighted)	Precision (macro)	Recall (macro)	F1 (macro)	Cohen's Kappa
Logistic regression	0.6628	0.6158	0.6628	0.6139	0.7065	0.2703	0.2550	0.3000
Decision tree	0.6101	0.6142	0.6101	0.6121	0.2885	0.2895	0.2890	0.2857
Random forest	0.6845	0.6523	0.6845	0.6615	0.5260	0.3049	0.3063	0.3773
XGBoost	0.6869	0.6620	0.6869	0.6680	0.5289	0.3109	0.3108	0.3932
Neural network	0.6800	0.6474	0.6800	0.9003	0.6506			0.3694

VI. Model Application

After the test data has been preprocessed (missing-value handling, data transformation, and normalization), the models predict the final test set, that is, the unlabeled rows whose star_level and credit_level are -1. The results are saved to resources/predict/.

    # Apply the model
    label = category[0]+'_level'
    predictions_test = model.predict(test.drop(['uid', label], axis=1))
    # Post-process the predictions
    predictions_test = predictions_test.astype(int)
    # XGBoost predictions are class indices and must be mapped back to the original labels
    if category[1] == 'XGBoost':
        if category[0] == 'star':
            # Define the inverse mapping
            label_map = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9}
        elif category[0] == 'credit':
            label_map = {0: 35, 1: 50, 2: 60, 3: 70, 4: 85}
        # Map the labels with a list comprehension
        predictions_test_mapped = [label_map[l] for l in predictions_test]
        predictions_test_mapped = np.array(predictions_test_mapped)
        test[label] = predictions_test_mapped
    else:
        test[label] = predictions_test

    # Keep only uid and the target column
    test = test[['uid', category[0]+'_level']]
    # Save the predictions to a csv file
    test.to_csv('resources/predict/'+category[0]+'_'+category[1]+'_predict.csv', index=False)

Example of the prediction result (star prediction with logistic regression):

[Figure: predict_result]

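Project structure

├─ pictures	                    Images used by the document
├─ resources
│   ├─ credit 					Credit-level data
│   ├─ evaluation results		Model evaluation results
│   ├─ feature					Feature engineering results
│   ├─ inventory				Data inventory results
│   ├─ pickle
│   ├─ predict					Predicted classification results
│   ├─ raw data					Raw data
│   └─ star						Customer star-level data
├─ util							Utility classes
├─ inventory.py					Data inventory
├─ merge.py						Table merging
├─ model.py						Model training and application
├─ nn_model.py					Neural network model
├─ preprocess.py				Preprocessing
├─ select_feature.py			Feature engineering and feature selection
├─ 数据集成实验报告.md
└─ 数据集成实验报告.pdf

Team responsibilities

Name	Student ID	Responsibilities
xxx	xxxxxxxxx	Model training, model evaluation, model application
xxx	xxxxxxxxx	Data collection, data preprocessing
xxx	xxxxxxxxx	Feature engineering, feature selection
xxx	xxxxxxxxx	Model training, model evaluation, model application
xxx	xxxxxxxxx	Data inventory and visualization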
