数据科学和人工智能技术笔记 十九、数据整理(1)

作者 : 开心源码 本文共33477个字,预计阅读时间需要84分钟 发布时间: 2022-05-12 共260人阅读

十九、数据整理(1)

作者:Chris Albon

译者:飞龙

协议:CC BY-NC-SA 4.0

在 Pandas 中通过分组应用函数

import pandas as pd# 创立示例数据帧data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],       'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}df = pd.DataFrame(data)df
CasualtiesPlatoon
01A
14A
25A
37A
45A
55A
66B
71B
84B
95B
106B
117C
124C
136C
144C
156C
# 按照 df.platoon 对 df 分组# 而后将滚动平均 lambda 函数应用于 df.casualtiesdf.groupby('Platoon')['Casualties'].apply(lambda x:x.rolling(center=False,window=2).mean())'''0     NaN1     2.52     4.53     6.04     6.05     5.06     NaN7     3.58     2.59     4.510    5.511    NaN12    5.513    5.014    5.015    5.0dtype: float64''' 

在 Pandas 中向分组应用操作

# 导入板块import pandas as pd# 创立数据帧raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],         'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],         'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],         'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])df
regimentcompanynamepreTestScorepostTestScore
0Nighthawks1stMiller425
1Nighthawks1stJacobson2494
2Nighthawks2ndAli3157
3Nighthawks2ndMilner262
4Dragoons1stCooze370
5Dragoons1stJacon425
6Dragoons2ndRyaner2494
7Dragoons2ndSone3157
8Scouts1stSloan262
9Scouts1stPiger370
10Scouts2ndRiani262
11Scouts2ndAli370
# 创立一个 groupby 变量,按团队(regiment)对 preTestScores 分组groupby_regiment = df['preTestScore'].groupby(df['regiment'])groupby_regiment# <pandas.core.groupby.SeriesGroupBy object at 0x113ddb550> 

“这个分组变量现在是GroupBy对象。 除了分组的键df ['key1']的少量中间数据之外,它实际上还没有计算任何东西。 我们的想法是,该对象具备将所有操作应用于每个分组所需的所有信息。” — PyDA

使用list()显示分组的样子。

list(df['preTestScore'].groupby(df['regiment']))'''[('Dragoons', 4     3  5     4  6    24  7    31  Name: preTestScore, dtype: int64), ('Nighthawks', 0     4  1    24  2    31  3     2  Name: preTestScore, dtype: int64), ('Scouts', 8     2  9     3  10    2  11    3  Name: preTestScore, dtype: int64)] '''df['preTestScore'].groupby(df['regiment']).describe()
countmeanstdmin25%50%75%max
regiment
Dragoons4.015.5014.1539163.03.7514.025.7531.0
Nighthawks4.015.2514.4539502.03.5014.025.7531.0
Scouts4.02.500.5773502.02.002.53.003.0
# 每个团队的 preTestScore 均值groupby_regiment.mean()'''regimentDragoons      15.50Nighthawks    15.25Scouts         2.50Name: preTestScore, dtype: float64 '''df['preTestScore'].groupby([df['regiment'], df['company']]).mean()'''regiment    companyDragoons    1st         3.5            2nd        27.5Nighthawks  1st        14.0            2nd        16.5Scouts      1st         2.5            2nd         2.5Name: preTestScore, dtype: float64 '''df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
company1st2nd
regiment
Dragoons3.527.5
Nighthawks14.016.5
Scouts2.52.5
# 按团队和公司(company)对整个数据帧分组df.groupby(['regiment', 'company']).mean()
preTestScorepostTestScore
regimentcompany
Dragoons1st3.547.5
2nd27.575.5
Nighthawks1st14.059.5
2nd16.559.5
Scouts1st2.566.0
2nd2.566.0
# 每个团队和公司的观测数量df.groupby(['regiment', 'company']).size()'''regiment    companyDragoons    1st        2            2nd        2Nighthawks  1st        2            2nd        2Scouts      1st        2            2nd        2dtype: int64 '''# 按团队对数据帧分组,对于每个团队,for name, group in df.groupby('regiment'):     # 打印团队名称    print(name)    # 打印它的数据    print(group)'''Dragoons   regiment company    name  preTestScore  postTestScore4  Dragoons     1st   Cooze             3             705  Dragoons     1st   Jacon             4             256  Dragoons     2nd  Ryaner            24             947  Dragoons     2nd    Sone            31             57Nighthawks     regiment company      name  preTestScore  postTestScore0  Nighthawks     1st    Miller             4             251  Nighthawks     1st  Jacobson            24             942  Nighthawks     2nd       Ali            31             573  Nighthawks     2nd    Milner             2             62Scouts   regiment company   name  preTestScore  postTestScore8    Scouts     1st  Sloan             2             629    Scouts     1st  Piger             3             7010   Scouts     2nd  Riani             2             6211   Scouts     2nd    Ali             3             70 '''

按列分组:

特别是在这种情况下:按列对数据类型(即axis = 1)分组,而后使用list()查看该分组的外观。

list(df.groupby(df.dtypes, axis=1))'''[(dtype('int64'),     preTestScore  postTestScore  0              4             25  1             24             94  2             31             57  3              2             62  4              3             70  5              4             25  6             24             94  7             31             57  8              2             62  9              3             70  10             2             62  11             3             70), (dtype('O'),       regiment company      name  0   Nighthawks     1st    Miller  1   Nighthawks     1st  Jacobson  2   Nighthawks     2nd       Ali  3   Nighthawks     2nd    Milner  4     Dragoons     1st     Cooze  5     Dragoons     1st     Jacon  6     Dragoons     2nd    Ryaner  7     Dragoons     2nd      Sone  8       Scouts     1st     Sloan  9       Scouts     1st     Piger  10      Scouts     2nd     Riani  11      Scouts     2nd       Ali)] df.groupby('regiment').mean().add_prefix('mean_')
mean_preTestScoremean_postTestScore
regiment
Dragoons15.5061.5
Nighthawks15.2559.5
Scouts2.5066.0
# 创立获取分组状态的函数def get_stats(group):    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}bins = [0, 25, 50, 75, 100]group_names = ['Low', 'Okay', 'Good', 'Great']df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
countmaxmeanmin
categories
Good8.070.063.7557.0
Great2.094.094.0094.0
Low2.025.025.0025.0
Okay0.0NaNNaNNaN

在 Pandas 数据帧上应用操作

# 导入模型import pandas as pdimport numpy as npdata = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],         'year': [2012, 2012, 2013, 2014, 2014],         'reports': [4, 24, 31, 2, 3],        'coverage': [25, 94, 57, 62, 70]}df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])df
coveragenamereportsyear
Cochice25Jason42012
Pima94Molly242012
Santa Cruz57Tina312013
Maricopa62Jake22014
Yuma70Amy32014
# 创立大写转换的 lambda 函数capitalizer = lambda x: x.upper()

capitalizer函数应用于name列。

apply()可以沿数据帧的任意轴应用函数。

df['name'].apply(capitalizer)'''Cochice       JASONPima          MOLLYSanta Cruz     TINAMaricopa       JAKEYuma            AMYName: name, dtype: object '''

capitalizer lambda 函数映射到序列name中的每个元素。

map()对序列的每个元素应用操作。

df['name'].map(capitalizer)'''Cochice       JASONPima          MOLLYSanta Cruz     TINAMaricopa       JAKEYuma            AMYName: name, dtype: object '''

将平方根函数应用于整个数据帧中的每个单元格。

applymap()将函数应用于整个数据帧中的每个元素。

# 删除字符串变量,以便 applymap() 可以运行df = df.drop('name', axis=1)# 返回数据帧每个单元格的平方根df.applymap(np.sqrt)
coveragereportsyear
Cochice5.0000002.00000044.855323
Pima9.6953604.89897944.855323
Santa Cruz7.5498345.56776444.866469
Maricopa7.8740081.41421444.877611
Yuma8.3666001.73205144.877611

在数据帧上应用函数。

# 创立叫做 times100 的函数def times100(x):    # 假如 x 是字符串,    if type(x) is str:        # 原样返回它        return x    # 假如不是,返回它乘上 100    elif x:        return 100 * x    # 并留下其它东西    else:        returndf.applymap(times100)
coveragereportsyear
Cochice2500400201200
Pima94002400201200
Santa Cruz57003100201300
Maricopa6200200201400
Yuma7000300201400

向 Pandas 数据帧赋予新列

import pandas as pd# 创立空数据帧df = pd.DataFrame()# 创立一列df['name'] = ['John', 'Steve', 'Sarah']# 查看数据帧df
name
0John
1Steve
2Sarah
# 将一个新列赋予名为 age 的 df,它包含年龄列表df.assign(age = [31, 32, 19])
nameage
0John31
1Steve32
2Sarah19

将列表拆分为大小为 N 的分块

在这个片段中,我们接受一个列表并将其分解为大小为 n 的块。 在解决具备最大请求大小的 API 时,这是一种非常常见的做法。

这个漂亮的函数由 Ned Batchelder 贡献,发布于 StackOverflow。

# 创立名称列表first_names = ['Steve', 'Jane', 'Sara', 'Mary','Jack','Bob', 'Bily', 'Boni', 'Chris','Sori', 'Will', 'Won','Li']# 创立叫做 chunks 的函数,有两个参数 l 和 ndef chunks(l, n):    # 对于长度为 l 的范围中的项目 i    for i in range(0, len(l), n):        # 创立索引范围        yield l[i:i+n]# 从函数 chunks 的结果创立一个列表list(chunks(first_names, 5))'''[['Steve', 'Jane', 'Sara', 'Mary', 'Jack'], ['Bob', 'Bily', 'Boni', 'Chris', 'Sori'], ['Will', 'Won', 'Li']] '''

在 Pandas 中使用正则表达式将字符串分解为列

# 导入板块import reimport pandas as pd# 创立带有一列字符串的数据帧data = {'raw': ['Arizona 1 2014-12-23       3242.0',                'Iowa 1 2010-02-23       3453.7',                'Oregon 0 2014-06-20       2123.0',                'Maryland 0 2014-03-14       1123.6',                'Florida 1 2013-01-15       2134.0',                'Georgia 0 2012-07-14       2345.6']}df = pd.DataFrame(data, columns = ['raw'])df
raw
0Arizona 1 2014-12-23 3242.0
1Iowa 1 2010-02-23 3453.7
2Oregon 0 2014-06-20 2123.0
3Maryland 0 2014-03-14 1123.6
4Florida 1 2013-01-15 2134.0
5Georgia 0 2012-07-14 2345.6
# df['raw'] 的哪些行包含 'xxxx-xx-xx'?df['raw'].str.contains('....-..-..', regex=True)'''0    True1    True2    True3    True4    True5    TrueName: raw, dtype: bool '''# 在 raw 列中,提取字符串中的单个数字df['female'] = df['raw'].str.extract('(\d)', expand=True)df['female']'''0    11    12    03    04    15    0Name: female, dtype: object '''# 在 raw 列中,提取字符串中的 xxxx-xx-xxdf['date'] = df['raw'].str.extract('(....-..-..)', expand=True)df['date']'''0    2014-12-231    2010-02-232    2014-06-203    2014-03-144    2013-01-155    2012-07-14Name: date, dtype: object '''# 在 raw 列中,提取字符串中的 ####.##df['score'] = df['raw'].str.extract('(\d\d\d\d\.\d)', expand=True)df['score']'''0    3242.01    3453.72    2123.03    1123.64    2134.05    2345.6Name: score, dtype: object '''# 在 raw 列中,提取字符串中的单词df['state'] = df['raw'].str.extract('([A-Z]\w{0,})', expand=True)df['state']'''0     Arizona1        Iowa2      Oregon3    Maryland4     Florida5     GeorgiaName: state, dtype: object '''df
rawfemaledatescorestate
0Arizona 1 2014-12-23 3242.012014-12-233242.0Arizona
1Iowa 1 2010-02-23 3453.712010-02-233453.7Iowa
2Oregon 0 2014-06-20 2123.002014-06-202123.0Oregon
3Maryland 0 2014-03-14 1123.602014-03-141123.6Maryland
4Florida 1 2013-01-15 2134.012013-01-152134.0Florida
5Georgia 0 2012-07-14 2345.602012-07-142345.6Georgia

由两个数据帧贡献列

# 导入库import pandas as pd# 创立数据帧dataframe_one = pd.DataFrame()dataframe_one['1'] = ['1', '1', '1']dataframe_one['B'] = ['b', 'b', 'b']# 创立第二个数据帧dataframe_two = pd.DataFrame()dataframe_two['2'] = ['2', '2', '2']dataframe_two['B'] = ['b', 'b', 'b']# 将每个数据帧的列转换为集合,# 而后找到这两个集合的交集。# 这将是两个数据帧共享的列的集合。set.intersection(set(dataframe_one), set(dataframe_two))# {'B'} 

从多个列表构建字典

# 创立官员名称的列表officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']# 创立官员军队的列表officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']# 创立字典,它是两个列表的 zipdict(zip(officer_names, officer_armies))'''{'Chris Jefferson': 'Orange Army', 'Jessica Billars': 'Green Army', 'Michael Mulligan': 'Red Army', 'Sodoni Dogla': 'Purple Army', 'Steven Johnson': 'Blue Army'} '''

将 CSV 转换为 Python 代码来重建它

# 导入 pandas 包import pandas as pd# 将 csv 文件加载为数据帧df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')# 打印创立数据帧的代码print('==============================')print('RUN THE CODE BELOW THIS LINE')print('==============================')print('raw_data =', df.to_dict(orient='list'))print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')'''==============================RUN THE CODE BELOW THIS LINE==============================raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]}'''df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']) # 假如你打算检查结果# 1\. 输入此单元格中上面单元格生成的代码raw_data = {'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150], 'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996]}df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])# 查看原始数据帧的前几行df.head()
Unnamed: 0Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
015.13.51.40.2setosa
124.93.01.40.2setosa
234.73.21.30.2setosa
344.63.11.50.2setosa
455.03.61.40.2setosa
# 查看使用我们的代码创立的,数据帧的前几行df_original.head()
Unnamed: 0Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
015.13.51.40.2setosa
124.93.01.40.2setosa
234.73.21.30.2setosa
344.63.11.50.2setosa
455.03.61.40.2setosa

将分类变量转换为虚拟变量

# 导入板块import pandas as pd# 创立数据帧raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],         'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],         'sex': ['male', 'female', 'male', 'female', 'female']}df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'sex'])df
first_namelast_namesex
0JasonMillermale
1MollyJacobsonfemale
2TinaAlimale
3JakeMilnerfemale
4AmyCoozefemale
# 从 sex 变量创立一组虚拟变量df_sex = pd.get_dummies(df['sex'])# 将虚拟变量连接到主数据帧df_new = pd.concat([df, df_sex], axis=1)df_new
first_namelast_namesexfemalemale
0JasonMillermale0.01.0
1MollyJacobsonfemale1.00.0
2TinaAlimale0.01.0
3JakeMilnerfemale1.00.0
4AmyCoozefemale1.00.0
# 连接新列的替代方案df_new = df.join(df_sex)df_new
first_namelast_namesexfemalemale
0JasonMillermale0.01.0
1MollyJacobsonfemale1.00.0
2TinaAlimale0.01.0
3JakeMilnerfemale1.00.0
4AmyCoozefemale1.00.0

将分类变量转换为虚拟变量

# 导入板块import pandas as pdimport patsy# 创立数据帧raw_data = {'countrycode': [1, 2, 3, 2, 1]} df = pd.DataFrame(raw_data, columns = ['countrycode'])df
countrycode
01
12
23
32
41
# 将 countrycode 变量转换为三个二元变量patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')
C(countrycode)[1]C(countrycode)[2]C(countrycode)[3]
01.00.00.0
10.01.00.0
20.00.01.0
30.01.00.0
41.00.00.0

将字符串分类变量转换为数字变量

# 导入板块import pandas as pdraw_data = {'patient': [1, 1, 1, 2, 2],         'obs': [1, 2, 3, 1, 2],         'treatment': [0, 1, 0, 1, 0],        'score': ['strong', 'weak', 'normal', 'weak', 'strong']} df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])df
patientobstreatmentscore
0110strong
1121weak
2130normal
3211weak
4220strong
# 创立一个函数,将 df['score'] 的所有值转换为数字def score_to_numeric(x):    if x=='strong':        return 3    if x=='normal':        return 2    if x=='weak':        return 1df['score_num'] = df['score'].apply(score_to_numeric)df
patientobstreatmentscorescore_num
0110strong3
1121weak1
2130normal2
3211weak1
4220strong3

说明
1. 本站所有资源来源于用户上传和网络,如有侵权请邮件联系站长!
2. 分享目的仅供大家学习和交流,您必须在下载后24小时内删除!
3. 不得使用于非法商业用途,不得违反国家法律。否则后果自负!
4. 本站提供的源码、模板、插件等等其他资源,都不包含技术服务请大家谅解!
5. 如有链接无法下载、失效或广告,请联系管理员处理!
6. 本站资源售价只是摆设,本站源码仅提供给会员学习使用!
7. 如遇到加密压缩包,请使用360解压,如遇到无法解压的请联系管理员
开心源码网 » 数据科学和人工智能技术笔记 十九、数据整理(1)

发表回复