数据清理 5 天挑战

发表于 2018-03-30 分类于数据科学笔记阅读次数：

偶然看到了 Kaggle 的数据清理 5 天挑战，大致看了一下，还是比较实用的。因此全部做完后记录一下。不是很想在无谓的整理上花太多时间，因此各类标题直接使用原文中的英文标题，用于串联内容的文字较少，且代码不一定完整（主要是缺少导入包和数据的语句）。如果你希望能够全面地了解这 5 次挑战的内容，以下是 Kaggle 上原 kernel 的地址：

Handling missing values

Take a first look at the data

看一下数据的前几行有没有缺失值。

1 2	nfl_data.sample(5) # 使用 nfl_data.head() 效果相同

See how many missing data points we have

统计每种属性包含的缺失值的数量：

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

看一看缺失值占总体数量的百分比：

# how many total missing values do we have?
total_cells = np.product(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

Figure out why the data is missing

分析数据缺失的原因以及缺失对数据分析所造成的影响被称为“data intution”，决定了是否需要填充该缺失，以及填充的策略。

面对数据缺失，一个必须要考虑的问题是：

某个数据的缺失是因为它没有被记录还是根本不存在？

对于根本不存在的缺失数据，没有必要对其进行操作；而没有被记录的缺失值，则应该基于同行同列的其他值来猜想其可能的值，尝试进行填充。

例如，Street Number Suffix应该是其本身不存在，而Zipcode则应该是没有记录。

Drop missing values

移除包含缺失值的行是可行的，但是如果缺失值较多且分布较散，可能导致所有的数据全部被移除：

1	nfl_data.dropna()

也可以移除包含至少一个缺失值的列：

1	columns_with_na_dropped = nfl_data.dropna(axis=1)

Filling in missing values automatically

将某个数值型属性的缺失值全部补 0：

1
2
3

# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

1 2	# replace all NA's with 0 subset_nfl_data.fillna(0)

若存在字符串类型的数据，也可以通过method = 'bfill'，选择将空值置为与它相邻的下一行对应的数据，没有下一行数据就置为 0：

1
2
3

# replace all NA's the value that comes directly after it in the same column, 
# then replace all the reamining na's with 0
subset_nfl_data.fillna(method = 'bfill', axis=0).fillna(0)

Scale and Normalize Data

主要搞清两个概念：

scaling（缩放）: change the range of data
normalize（标准化）: change the shape of the distribution of data，即将数据尽可能地描述为正态分布。常用于 t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes

import packages

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

Scale

# select the usd_goal_real column
usd_goal = kickstarters_2017.usd_goal_real

# scale the goals from 0 to 1
scaled_data = minmax_scaling(usd_goal, columns = [0])

# plot the original & scaled data together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(kickstarters_2017.usd_goal_real, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

scale

Normalization

使用 Box-Cox 转换法：

# get the index of all positive pledges (Box-Cox only takes postive values)
index_of_positive_pledges = kickstarters_2017.usd_pledged_real > 0

# get only positive pledges (using their indexes)
positive_pledges = kickstarters_2017.usd_pledged_real.loc[index_of_positive_pledges]

# normalize the pledges (w/ Box-Cox)
normalized_pledges = stats.boxcox(positive_pledges)[0]

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(positive_pledges, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_pledges, ax=ax[1])
ax[1].set_title("Normalized data")

normalize

Parsing Dates

Get our environment set up

1	import datetime

Check the data type of our date column

pandas 用object表示各类数据类型，一般含有字符串。

1
2
3

landslides['date'].dtype
# dtype('O')
# 'O' 表示 'object'

Convert our date columns to datetime

将字符串转化为 datetime 时，要注明原字符串在表示日期时所遵循的格式，例如：

1/17/07 has the format “%m/%d/%y”
17-1-2007 has the format “%d-%m-%Y”

可以看到，%y代表 2 个数字表示的年份，%Y代表 4 个数字表示的年份。

1	landslides['date_parsed'] = pd.to_datetime(landslides['date'], format = "%m/%d/%y")

补充：由于数据中还有个别不是 “month/day/four-digit year” 格式，直接转换的话会报错，Pandas 提供了一个可选的参数 errors，传入 errors=’coerce’，当遇到不能转换的数据就会将其置为 NaN，我们之后再对其进行手工处理。但如果我们不设定 format，这个错误就会被忽略，不过就无法得到特定格式的日期数据了，可能影响之后分析工作。详见Kaggle 数据清洗挑战 Day 3 - 快速解析日期（date）数据

Select just the day of the month from our column

在处理后的数据中按 month 选取 day：

1	day_of_month_landslides = landslides['date_parsed'].dt.day

Plot the day of the month to check the date parsing

可视化，用于确认没有搞混“月“和”日“：

# remove na's
day_of_month_landslides = day_of_month_landslides.dropna()

# plot the day of the month
sns.distplot(day_of_month_landslides, kde=False, bins=31)

补充

时间序列可视化

Character Encodings

1 2	# helpful character encoding module import chardet

What are encodings?

UTF-8 天下第一！

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)
#str

after = before.encode("utf-8", errors = "replace")

# check the type
type(after)
# bytes

# take a look at what the bytes look like
after
# b'This is the euro symbol: \xe2\x82\xac'

1
2
3

# convert it back to utf-8
print(after.decode("utf-8"))
# This is the euro symbol: €

Reading in files with encoding problems

读取非 UTF-8 编码的文件会出错。先试着读取前几行看一下能不能判断编码：

# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

判断为 Windows-1252 编码。

# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()

Saving your files with UTF-8 encoding

默认以 UTF-8 编码保存。

Inconsistent Data Entry

将包括加了空格、大小写不一致、个别拼写错误等类型的不同值整理为一致。

Import

1
2
3

import fuzzywuzzy
from fuzzywuzzy import process
import chardet

Do some preliminary text pre-processing

看一看有哪些值：

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

去空格、全部调小写：

# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()

Use fuzzy matching to correct inconsistent data entry

fuzzywuzzy包可以帮助辨认那些拼写接近的字符串。

# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

'''
[('d. i khan', 100),
 ('d.i khan', 100),
 ('d.g khan', 88),
 ('khanewal', 50),
 ('sudhanoti', 47),
 ('hangu', 46),
 ('kohat', 46),
 ('dara adam khel', 45),
 ('chaman', 43),
 ('mardan', 43)]
'''

可以考虑写一个函数做总体的转化：

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only get matches with a ratio > 90
    close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input matches 
    df.loc[rows_with_matches, column] = string_to_match
    
    # let us know the function's done
    print("All done!")

之后只需要对那些疑似相同的字符串逐个调用该函数即可：

1 2	# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan" replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")