5-Day Data Cleaning Challenge

I stumbled upon Kaggle's 5-Day Data Cleaning Challenge, skimmed through it, and found it quite practical, so after finishing all of it I'm writing up these notes. I don't want to spend too much time on needless polishing, so the section headings keep the original English titles, the connecting prose is kept to a minimum, and the code is not necessarily complete (it mostly lacks the statements that import packages and load data). If you want a complete picture of the five challenges, here are the original kernels on Kaggle:

Handling missing values

Take a first look at the data

Look at a few rows of the data to see whether there are any missing values.

nfl_data.sample(5)
# nfl_data.head() works just as well

See how many missing data points we have

Count the number of missing values in each column:

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Check what percentage of all cells are missing:

# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing / total_cells) * 100
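As a small extension of the kernel's code, a per-column breakdown makes it easy to spot the worst offenders (len(nfl_data) is the number of rows):

# percent of values missing in each column, worst first
(missing_values_count / len(nfl_data) * 100).sort_values(ascending=False)[:10]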

Figure out why the data is missing

Working out why data is missing, and how the missingness affects your analysis, is part of what's called "data intuition"; it determines whether a missing value needs to be filled in at all, and if so, what the filling strategy should be.

When facing missing data, one question you must ask is:

Is this value missing because it wasn't recorded, or because it doesn't exist at all?

For values that simply don't exist, there is no need to do anything; for values that weren't recorded, you should guess what they might be based on other values in the same row and column, and try to fill them in.

For example, a missing Street Number Suffix probably doesn't exist, while a missing Zipcode probably just wasn't recorded.
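As a rough sketch of the second case (sf_permits and its column names are assumptions based on the dataset used in the challenge, and this fill strategy is only one option), a not-recorded Zipcode could be imputed from other permits on the same street:

# a sketch, not the kernel's method: fill a not-recorded Zipcode with
# the most common zipcode seen on the same street (column names assumed)
sf_permits['Zipcode'] = (
    sf_permits.groupby('Street Name')['Zipcode']
    .transform(lambda z: z.fillna(z.mode().iloc[0]) if z.notna().any() else z)
)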

Drop missing values

Dropping rows that contain missing values is an option, but when missing values are numerous and scattered, it can end up removing every single row:

nfl_data.dropna()

You can also drop every column that contains at least one missing value:

columns_with_na_dropped = nfl_data.dropna(axis=1)
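To see how much data this costs, compare the column counts before and after (the original kernel does the same check):

# just how much data did we lose?
print("Columns in original dataset: %d" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])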

Filling in missing values automatically

Fill every missing value in a numeric column with 0:

# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

# replace all NA's with 0
subset_nfl_data.fillna(0)

Even when string columns are present, you can pass method='bfill' to fill each missing value with the value in the next row of the same column, falling back to 0 when there is no next row:

# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
subset_nfl_data.fillna(method='bfill', axis=0).fillna(0)
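Note that fillna(method=...) is deprecated in recent pandas releases; an equivalent spelling there is:

# equivalent in newer pandas, where fillna(method=...) is deprecated
subset_nfl_data.bfill(axis=0).fillna(0)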

Scale and Normalize Data

The main thing is to get two concepts straight:

  • scaling: change the range of the data
  • normalization: change the shape of the data's distribution, i.e. make it as close to a normal distribution as possible. Assumed by t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes

import packages

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

Scale

# select the usd_goal_real column
usd_goal = kickstarters_2017.usd_goal_real

# scale the goals from 0 to 1
scaled_data = minmax_scaling(usd_goal, columns=[0])

# plot the original & scaled data together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(kickstarters_2017.usd_goal_real, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

(Figure: distribution of the original vs. min-max scaled goal amounts)

Normalization

Using the Box-Cox transformation:

# get the index of all positive pledges (Box-Cox only takes positive values)
index_of_positive_pledges = kickstarters_2017.usd_pledged_real > 0

# get only positive pledges (using their indexes)
positive_pledges = kickstarters_2017.usd_pledged_real.loc[index_of_positive_pledges]

# normalize the pledges (w/ Box-Cox)
normalized_pledges = stats.boxcox(positive_pledges)[0]

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(positive_pledges, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_pledges, ax=ax[1])
ax[1].set_title("Normalized data")

(Figure: distribution of the original vs. Box-Cox normalized pledges)

Parsing Dates

Get our environment set up

import datetime

Check the data type of our date column

pandas uses the object dtype for a grab bag of data types; object columns usually contain strings.

landslides['date'].dtype
# dtype('O')
# 'O' stands for 'object'

Convert our date columns to datetime

When converting strings to datetime, you need to state the format the original strings follow when representing dates, for example:

  • 1/17/07 has the format “%m/%d/%y”
  • 17-1-2007 has the format “%d-%m-%Y”

Note that %y stands for a two-digit year, while %Y stands for a four-digit year.

landslides['date_parsed'] = pd.to_datetime(landslides['date'], format="%m/%d/%y")

A side note: a few entries in the data are not in the "month/day/four-digit year" format, so converting with a fixed format string raises an error. pandas provides an optional errors parameter: with errors='coerce', any value that can't be converted is set to NaN, and we can deal with those by hand afterwards. If we don't specify format at all, the error goes away, but we no longer get dates parsed from a known, specific format, which may affect later analysis. For details, see Kaggle 数据清洗挑战 Day 3 - 快速解析日期(date)数据.
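A minimal sketch of that errors='coerce' workflow, keeping the explicit format:

# coerce unparseable dates to NaT instead of raising an error
landslides['date_parsed'] = pd.to_datetime(landslides['date'],
                                           format="%m/%d/%y", errors='coerce')

# inspect the rows that failed to parse, for manual fixing later
landslides[landslides['date_parsed'].isnull()]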

Select just the day of the month from our column

Extract the day of the month from the parsed column:

day_of_month_landslides = landslides['date_parsed'].dt.day

Plot the day of the month to check the date parsing

Plot it to double-check that "month" and "day" weren't mixed up:

# remove na's
day_of_month_landslides = day_of_month_landslides.dropna()

# plot the day of the month
sns.distplot(day_of_month_landslides, kde=False, bins=31)

Further reading

时间序列可视化 (time-series visualization)

Character Encodings

# helpful character encoding module
import chardet

What are encodings?

UTF-8 reigns supreme!

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)
# str

after = before.encode("utf-8", errors="replace")

# check the type
type(after)
# bytes

# take a look at what the bytes look like
after
# b'This is the euro symbol: \xe2\x82\xac'

# convert it back to utf-8
print(after.decode("utf-8"))
# This is the euro symbol: €
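Decoding with the wrong encoding fails outright; for example, trying to decode these UTF-8 bytes as ASCII raises an error:

# this raises UnicodeDecodeError: the euro symbol's bytes aren't valid ASCII
after.decode("ascii")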

Reading in files with encoding problems

Reading a file that isn't UTF-8 encoded will fail. First read just the beginning of the file and see whether the encoding can be guessed:

# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

The detected encoding is Windows-1252.

# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()

Saving your files with UTF-8 encoding

Files are saved with UTF-8 encoding by default.
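For example, pandas' to_csv writes UTF-8 without any extra arguments (the output filename here is just an example):

# save our file (pandas writes UTF-8 by default)
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv")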

Inconsistent Data Entry

Unify values that differ only through extra whitespace, inconsistent capitalization, the occasional typo, and the like.

Import

import fuzzywuzzy
from fuzzywuzzy import process
import chardet

Do some preliminary text pre-processing

Take a look at what values there are:

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

Strip whitespace and convert everything to lower case:

# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()

Use fuzzy matching to correct inconsistent data entry

The fuzzywuzzy package helps identify strings whose spellings are close to each other.

# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10,
                                     scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

'''
[('d. i khan', 100),
 ('d.i khan', 100),
 ('d.g khan', 88),
 ('khanewal', 50),
 ('sudhanoti', 47),
 ('hangu', 46),
 ('kohat', 46),
 ('dara adam khel', 45),
 ('chaman', 43),
 ('mardan', 43)]
'''

Consider writing a function to do the replacement in one go:

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

After that, just call the function once for each string that appears to have near-duplicates:

# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
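To confirm the replacement worked, re-inspect the unique values the same way as before:

# check that "d. i khan" has been folded into "d.i khan"
cities = suicide_attacks['City'].unique()
cities.sort()
cities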