Introduction to Python: Data Analysis with pandas

TL;DR

Tackling Kaggle's Titanic nicely with pandas and scikit-learn. Almost the same as the one we did in spring ^^;

What is pandas?

The name pandas comes from "PANel DAta"
A module (library) that brings data.frame, the table-like data format
originally used in the statistics language R, to Python
Combined with Jupyter you can transform a table while inspecting its contents, which is unbeatable
Internally it wraps NumPy, a module for numerical computation with vectors and
matrices (think arrays that are more efficient than lists), in data-analysis features
It struggles with really large data (on the order of several GB), though
See Dask or the next-generation numerical library Blaze for alternatives
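To see the NumPy connection concretely, here is a minimal sketch (the toy DataFrame is made up for illustration):

import numpy as np
import pandas as pd

# A DataFrame is backed by NumPy arrays; .values exposes the underlying ndarray
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(type(df.values))   # <class 'numpy.ndarray'>
print(df["a"].values)    # [1 2 3]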

What is Kaggle?

Kaggle
A web service that hosts worldwide data analysis competitions
Users aim to build the best-performing model for a given task
Some tasks come with prize money

This time we aim for an introduction to data analysis by following a Kaggle tutorial

The Data Analysis Process

[Figure: the data analysis process]

From 死にゆくアンチウイルスへの祈り

Data analysis is never a one-shot affair, so think of it in stages
A rough programming-level workflow is
  1. Data preparation
  2. Building and training a prediction model
  3. Evaluating the prediction model

Why not think of it in these three stages? (?)

Main Part

1. Data Preparation

Loading the Data

Our target data is the passenger and crew manifest of the Titanic, which sank on April 15, 1912.
Women, children, and first-class passengers are known to have had higher survival rates.
Can we predict Dead or Alive from the data more accurately than that?
This is a classic binary classification problem.

There is an interactive tutorial at A free interactive Machine Learning tutorial in Python

In [1]:
import pandas as pd
In [2]:
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train_df = pd.read_csv(train_url)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
We appear to have data for 891 passengers (40% of the 2,224 people aboard the Titanic),
but some fields (such as Age) have missing values
  • PassengerId: passenger ID
  • Survived:   survival outcome (1: survived, 0: died)
  • Pclass:    cabin class
  • Name:     passenger's name
  • Sex:     sex
  • Age:     age
  • SibSp:    number of siblings and spouses aboard
  • Parch:    number of parents and children aboard
  • Ticket:    ticket number
  • Fare:     passenger fare
  • Cabin:    cabin number
  • Embarked:   port of embarkation — one of Cherbourg, Queenstown, Southampton
In [3]:
train_df
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Each row represents one passenger, and each column one attribute of a passenger
Entries shown as "NaN" denote missing data (missing values)
These 891 passengers' records are given as the training data
The bold column is the index, a number pandas assigns automatically (starting from 0 here)
It is normally a set of unique integers, but you can also designate any column you like as the index, with many uses
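For example, a minimal sketch of putting an existing column on the index (non-destructive; set_index returns a new DataFrame):

df_by_id = train_df.set_index("PassengerId")  # use the PassengerId column as the index
df_by_id.loc[1]                               # look a row up by passenger ID rather than by position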

Quantitative Data

In [4]:
# train_df.describe()
# train_df.describe(percentiles=[.61, .62])
# train_df.describe(percentiles=[.75, .8])
# train_df.describe(percentiles=[.68, .69])
train_df.describe(percentiles=[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99])
Out[4]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
10% 90.000000 0.000000 1.000000 14.000000 0.000000 0.000000 7.550000
20% 179.000000 0.000000 1.000000 19.000000 0.000000 0.000000 7.854200
30% 268.000000 0.000000 2.000000 22.000000 0.000000 0.000000 8.050000
40% 357.000000 0.000000 2.000000 25.000000 0.000000 0.000000 10.500000
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
60% 535.000000 0.000000 3.000000 31.800000 0.000000 0.000000 21.679200
70% 624.000000 1.000000 3.000000 36.000000 1.000000 0.000000 27.000000
80% 713.000000 1.000000 3.000000 41.000000 1.000000 1.000000 39.687500
90% 802.000000 1.000000 3.000000 50.000000 1.000000 2.000000 77.958300
99% 882.100000 1.000000 3.000000 65.870000 5.000000 4.000000 249.006220
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

describe() gives an overview of simple statistics (basic/summary/descriptive statistics)

  • Roughly 38% of the 891 passengers survived
  • About 75% were aboard without parents or children
  • About 70% were aboard without siblings or spouses
  • Fewer than 1% paid as much as $512
  • Fewer than 1% were 65 or older

Visualization

Let's look at the percentiles as a line plot

matplotlib is Python's best-known plotting library
seaborn gives matplotlib a more modern look
The magic command % matplotlib inline displays plots inside the cell
% matplotlib notebook enables interactive plots instead
In [5]:
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
In [6]:
train_df.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
Out[6]:
PassengerId Survived Pclass Age SibSp Parch Fare
0.1 90.0 0.0 1.0 14.0 0.0 0.0 7.5500
0.2 179.0 0.0 1.0 19.0 0.0 0.0 7.8542
0.3 268.0 0.0 2.0 22.0 0.0 0.0 8.0500
0.4 357.0 0.0 2.0 25.0 0.0 0.0 10.5000
0.5 446.0 0.0 3.0 28.0 0.0 0.0 14.4542
0.6 535.0 0.0 3.0 31.8 0.0 0.0 21.6792
0.7 624.0 1.0 3.0 36.0 1.0 0.0 27.0000
0.8 713.0 1.0 3.0 41.0 1.0 1.0 39.6875
0.9 802.0 1.0 3.0 50.0 1.0 2.0 77.9583
In [7]:
import numpy as np
np.arange(0, 1, .1)
Out[7]:
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
In [8]:
train_df.quantile(np.arange(0, 1, .001)).plot(y="Age")
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10aa3b358>
_images/Python02_21_1.png

Qualitative Data

In [9]:
train_df.describe(include=[np.object])
Out[9]:
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Blackwell, Mr. Stephen Weart male CA. 2343 G6 S
freq 1 577 7 4 644
  • Name is unique
  • 65% are male
  • Several people shared a cabin, and some individuals used several cabins
  • Three ports of embarkation, but the great majority are S
  • As many as 22% of Ticket values are duplicates

Hypotheses from Observing the Data

Reference: https://www.kaggle.com/startupsci/titanic-data-science-solutions

Correlation
 Find out which indicators correlate with survival, as a starting point for a model

Imputation

  1. Age correlates with survival, so we want to impute it
  2. We may want to impute Embarked as well

Correction

  1. Ticket, with 22% duplicates, is probably unsuitable for analysis
  2. Cabin has too many missing values to be usable
  3. PassengerId is just a sequence number
  4. Names are presumably unrelated to survival

Creation

  1. Parch and SibSp could be combined into a family size
  2. Titles can be extracted from the names
  3. Age may work better as age bands than as a continuous value
  4. Likewise for Fare

Classification

  1. Are women more likely to survive?
  2. Are children more likely to survive?
  3. Are upper-class passengers more likely to survive?

Looking at Correlations

※ pandas offers many ways to manipulate data.
Consult references such as Pandasを使ったデータ操作の基本, and search for "(what you want to do) pandas" as needed.
APIs can change between versions, so prefer recent articles.
In [10]:
train_df[0:5]
Out[10]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [11]:
print(type(train_df[["Sex"]]))
train_df[["Sex"]]
<class 'pandas.core.frame.DataFrame'>
Out[11]:
Sex
0 male
1 female
2 female
3 female
4 male
5 male
6 male
7 male
8 female
9 female
10 female
11 female
12 male
13 male
14 female
15 female
16 male
17 male
18 female
19 female
20 male
21 male
22 female
23 male
24 female
25 female
26 male
27 male
28 female
29 male
... ...
861 male
862 female
863 female
864 male
865 female
866 female
867 male
868 male
869 male
870 male
871 female
872 male
873 male
874 female
875 female
876 male
877 male
878 male
879 female
880 female
881 male
882 female
883 male
884 male
885 female
886 male
887 female
888 female
889 male
890 male

891 rows × 1 columns

In [12]:
print(type(train_df["Sex"]))
train_df["Sex"]
<class 'pandas.core.series.Series'>
Out[12]:
0        male
1      female
2      female
3      female
4        male
5        male
6        male
7        male
8      female
9      female
10     female
11     female
12       male
13       male
14     female
15     female
16       male
17       male
18     female
19     female
20       male
21       male
22     female
23       male
24     female
25     female
26       male
27       male
28     female
29       male
        ...
861      male
862    female
863    female
864      male
865    female
866    female
867      male
868      male
869      male
870      male
871    female
872      male
873      male
874    female
875    female
876      male
877      male
878      male
879    female
880    female
881      male
882    female
883      male
884      male
885    female
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object
In [13]:
train_df["Sex"][train_df["Sex"] == "male"]
Out[13]:
0      male
4      male
5      male
6      male
7      male
12     male
13     male
16     male
17     male
20     male
21     male
23     male
26     male
27     male
29     male
30     male
33     male
34     male
35     male
36     male
37     male
42     male
45     male
46     male
48     male
50     male
51     male
54     male
55     male
57     male
       ...
840    male
841    male
843    male
844    male
845    male
846    male
847    male
848    male
850    male
851    male
857    male
859    male
860    male
861    male
864    male
867    male
868    male
869    male
870    male
872    male
873    male
876    male
877    male
878    male
881    male
883    male
884    male
886    male
889    male
890    male
Name: Sex, Length: 577, dtype: object
In [14]:
train_df.loc[train_df["Sex"] == "male", "Sex"]
Out[14]:
0      male
4      male
5      male
6      male
7      male
12     male
13     male
16     male
17     male
20     male
21     male
23     male
26     male
27     male
29     male
30     male
33     male
34     male
35     male
36     male
37     male
42     male
45     male
46     male
48     male
50     male
51     male
54     male
55     male
57     male
       ...
840    male
841    male
843    male
844    male
845    male
846    male
847    male
848    male
850    male
851    male
857    male
859    male
860    male
861    male
864    male
867    male
868    male
869    male
870    male
872    male
873    male
876    male
877    male
878    male
881    male
883    male
884    male
886    male
889    male
890    male
Name: Sex, Length: 577, dtype: object
To compute correlations, first convert the qualitative variables into quantitative ones
※ pandas functions are basically non-destructive (= the data itself is not modified).
 Reassign the return value to a variable to keep the change.
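A quick sketch of what non-destructive means here:

dropped = train_df.dropna()          # returns a new DataFrame; train_df is untouched
print(len(train_df), len(dropped))   # 891 183
train_df_clean = train_df.dropna()   # reassign to keep the result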
In [15]:
train_df_select = train_df.copy()
train_df_select.loc[train_df_select["Sex"] == "male", "Sex"] = 0
train_df_select.loc[train_df_select["Sex"] == "female", "Sex"] = 1
train_df_select = train_df_select.astype({"Sex": int})
train_df_select[["Sex"]]
Out[15]:
Sex
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 1
9 1
10 1
11 1
12 0
13 0
14 1
15 1
16 0
17 0
18 1
19 1
20 0
21 0
22 1
23 0
24 1
25 1
26 0
27 0
28 1
29 0
... ...
861 0
862 1
863 1
864 0
865 1
866 1
867 0
868 0
869 0
870 0
871 1
872 0
873 0
874 1
875 1
876 0
877 0
878 0
879 1
880 1
881 0
882 1
883 0
884 0
885 1
886 0
887 1
888 1
889 0
890 0

891 rows × 1 columns
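As an aside, Series.map does the same value replacement in one call and can read more cleanly; a minimal sketch:

# map replaces values via a dict and returns a new Series (unmapped values become NaN)
sex_numeric = train_df["Sex"].map({"male": 0, "female": 1})
sex_numeric.head()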

In [16]:
train_df_select.loc[train_df_select["Embarked"] == "S", "Embarked"] = 0
train_df_select.loc[train_df_select["Embarked"] == "C", "Embarked"] = 1
train_df_select.loc[train_df_select["Embarked"] == "Q", "Embarked"] = 2
train_df_select = train_df_select.dropna(subset=["Embarked"])
train_df_select = train_df_select.astype({"Embarked": int})
train_df_select[["Embarked"]]
Out[16]:
Embarked
0 0
1 1
2 0
3 0
4 0
5 2
6 0
7 0
8 0
9 1
10 0
11 0
12 0
13 0
14 0
15 0
16 2
17 0
18 0
19 1
20 0
21 0
22 2
23 0
24 0
25 0
26 1
27 0
28 2
29 0
... ...
861 0
862 0
863 0
864 0
865 0
866 1
867 0
868 0
869 0
870 0
871 0
872 0
873 0
874 1
875 1
876 0
877 0
878 0
879 1
880 0
881 0
882 0
883 0
884 0
885 2
886 0
887 0
888 0
889 1
890 2

889 rows × 1 columns

For now, drop the records containing NaN and keep only the data for which correlations can be computed

In [17]:
train_df_select_drop = train_df_select.dropna()
train_df_select_drop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
PassengerId    183 non-null int64
Survived       183 non-null int64
Pclass         183 non-null int64
Name           183 non-null object
Sex            183 non-null int64
Age            183 non-null float64
SibSp          183 non-null int64
Parch          183 non-null int64
Ticket         183 non-null object
Fare           183 non-null float64
Cabin          183 non-null object
Embarked       183 non-null int64
dtypes: float64(2), int64(7), object(3)
memory usage: 18.6+ KB
In [18]:
train_df_select_drop.corr()
Out[18]:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
PassengerId 1.000000 0.148495 -0.089136 0.025205 0.030933 -0.083488 -0.051454 0.029740 -0.054246
Survived 0.148495 1.000000 -0.034542 0.532418 -0.254085 0.106346 0.023582 0.134241 0.083231
Pclass -0.089136 -0.034542 1.000000 0.046181 -0.306514 -0.103592 0.047496 -0.315235 -0.235027
Sex 0.025205 0.532418 0.046181 1.000000 -0.184969 0.104291 0.089581 0.130433 0.060862
Age 0.030933 -0.254085 -0.306514 -0.184969 1.000000 -0.156162 -0.271271 -0.092424 0.088112
SibSp -0.083488 0.106346 -0.103592 0.104291 -0.156162 1.000000 0.255346 0.286433 0.015962
Parch -0.051454 0.023582 0.047496 0.089581 -0.271271 0.255346 1.000000 0.389740 -0.097495
Fare 0.029740 0.134241 -0.315235 0.130433 -0.092424 0.286433 0.389740 1.000000 0.233452
Embarked -0.054246 0.083231 -0.235027 0.060862 0.088112 0.015962 -0.097495 0.233452 1.000000
In [19]:
%config InlineBackend.figure_formats = {'png', 'retina'}
sns.heatmap(train_df_select_drop.corr(), annot=True, cmap='RdBu_r')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x10aa3bc88>
_images/Python02_40_1.png
Correlation (in absolute value) with Survived is strongest for Sex, then Age > Fare > SibSp > Embarked > Pclass > Parch
Note, however, that Embarked and Pclass are nominal scales

Scale      Example             Ordering  Differences  Ratios
Nominal    phone numbers       ×         ×            ×
Ordinal    seismic intensity   ○         ×            ×
Interval   temperature (°C)    ○         ○            ×
Ratio      length              ○         ○            ○

Looking at Distributions

In [20]:
sns.pairplot(train_df_select[["Sex", 'Survived']], hue='Survived')
Out[20]:
<seaborn.axisgrid.PairGrid at 0x10ab174a8>
_images/Python02_43_1.png

Females have a higher survival rate

In [21]:
sns.pairplot(train_df_select[["Age", 'Survived']].dropna(), hue='Survived')
Out[21]:
<seaborn.axisgrid.PairGrid at 0x10b8e34e0>
_images/Python02_45_1.png
Most of the elderly died
Infants have a high survival rate
In [22]:
sns.pairplot(train_df_select[["Pclass", 'Survived']], hue='Survived')
Out[22]:
<seaborn.axisgrid.PairGrid at 0x10c3ab2b0>
_images/Python02_47_1.png

Survival rate is highest in 1st class, then 2nd, then 3rd

In [23]:
sns.pairplot(train_df_select[["Fare", 'Survived']], hue='Survived')
Out[23]:
<seaborn.axisgrid.PairGrid at 0x10bcd70f0>
_images/Python02_49_1.png
In [24]:
sns.pairplot(train_df_select[["Embarked", 'Survived']], hue='Survived')
Out[24]:
<seaborn.axisgrid.PairGrid at 0x10c70a400>
_images/Python02_50_1.png
In [25]:
sns.pairplot(train_df_select[["SibSp", 'Survived']], hue='Survived')
Out[25]:
<seaborn.axisgrid.PairGrid at 0x10de8f898>
_images/Python02_51_1.png
In [26]:
sns.pairplot(train_df_select[["Parch", 'Survived']], hue='Survived')
Out[26]:
<seaborn.axisgrid.PairGrid at 0x10e523ef0>
_images/Python02_52_1.png

2. Building and Training a Prediction Model

In [27]:
train_df = pd.read_csv("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv")
test_df = pd.read_csv("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv")
train_df.shape, test_df.shape
Out[27]:
((891, 12), (418, 11))

Organizing the Data

We will predict from Pclass, Sex, Age, and Fare only
Nominal scales should be turned into dummy variables

[Figure: axis directions in a DataFrame]

In [28]:
all_df = pd.concat([train_df.drop('Survived', axis=1), test_df], axis=0)
all_df = pd.get_dummies(all_df[["Pclass", "Sex", "Age", "Fare"]])
all_df
Out[28]:
Pclass Age Fare Sex_female Sex_male
0 3 22.0 7.2500 0 1
1 1 38.0 71.2833 1 0
2 3 26.0 7.9250 1 0
3 1 35.0 53.1000 1 0
4 3 35.0 8.0500 0 1
5 3 NaN 8.4583 0 1
6 1 54.0 51.8625 0 1
7 3 2.0 21.0750 0 1
8 3 27.0 11.1333 1 0
9 2 14.0 30.0708 1 0
10 3 4.0 16.7000 1 0
11 1 58.0 26.5500 1 0
12 3 20.0 8.0500 0 1
13 3 39.0 31.2750 0 1
14 3 14.0 7.8542 1 0
15 2 55.0 16.0000 1 0
16 3 2.0 29.1250 0 1
17 2 NaN 13.0000 0 1
18 3 31.0 18.0000 1 0
19 3 NaN 7.2250 1 0
20 2 35.0 26.0000 0 1
21 2 34.0 13.0000 0 1
22 3 15.0 8.0292 1 0
23 1 28.0 35.5000 0 1
24 3 8.0 21.0750 1 0
25 3 38.0 31.3875 1 0
26 3 NaN 7.2250 0 1
27 1 19.0 263.0000 0 1
28 3 NaN 7.8792 1 0
29 3 NaN 7.8958 0 1
... ... ... ... ... ...
388 3 21.0 7.7500 0 1
389 3 6.0 21.0750 0 1
390 1 23.0 93.5000 0 1
391 1 51.0 39.4000 1 0
392 3 13.0 20.2500 0 1
393 2 47.0 10.5000 0 1
394 3 29.0 22.0250 0 1
395 1 18.0 60.0000 1 0
396 3 24.0 7.2500 0 1
397 1 48.0 79.2000 1 0
398 3 22.0 7.7750 0 1
399 3 31.0 7.7333 0 1
400 1 30.0 164.8667 1 0
401 2 38.0 21.0000 0 1
402 1 22.0 59.4000 1 0
403 1 17.0 47.1000 0 1
404 1 43.0 27.7208 0 1
405 2 20.0 13.8625 0 1
406 2 23.0 10.5000 0 1
407 1 50.0 211.5000 0 1
408 3 NaN 7.7208 1 0
409 3 3.0 13.7750 1 0
410 3 NaN 7.7500 1 0
411 1 37.0 90.0000 1 0
412 3 28.0 7.7750 1 0
413 3 NaN 8.0500 0 1
414 1 39.0 108.9000 1 0
415 3 38.5 7.2500 0 1
416 3 NaN 8.0500 0 1
417 3 NaN 22.3583 0 1

1309 rows × 5 columns

A handy feature: groupby (with aggregation functions)

In [29]:
all_df.groupby(["Pclass"]).mean()
Out[29]:
Age Fare Sex_female Sex_male
Pclass
1 39.159930 87.508992 0.445820 0.554180
2 29.506705 21.179196 0.382671 0.617329
3 24.816367 13.302889 0.304654 0.695346
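groupby can also apply several aggregation functions at once through agg; a minimal sketch:

# mean, median, and non-null count of Age per passenger class
all_df.groupby("Pclass")["Age"].agg(["mean", "median", "count"])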

Checking the tendencies of missing values with groupby

In [30]:
missing = all_df.copy()
missing = missing.isnull()
pd.DataFrame(missing.groupby(missing.columns.tolist()).size())
Out[30]:
                                               0
Pclass  Age    Fare   Sex_female  Sex_male
False   False  False  False       False     1045
               True   False       False        1
        True   False  False       False      263
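If you only need a per-column count rather than the pattern of co-occurring gaps, a one-liner does it:

all_df.isnull().sum()   # number of NaN entries per column (here Age: 263, Fare: 1)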
There are many ways to deal with missing values (drop them, replace them, impute them)
Here we will simply fill in the median.

Python pandas 欠損値/外れ値/離散化の処理

Listwise deletion: remove any record that has a missing value
Pairwise deletion: when computing two-variable statistics such as correlations, exclude a record only when one of the variables involved is missing
Mean imputation: fill missing values with the variable's mean
Regression imputation: fill missing values using a regression equation

There are also stochastic regression imputation, full-information maximum likelihood, multiple imputation, and more
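Before committing to the median, here is a sketch of two of the simple methods listed above: mean imputation, and a per-class median that uses the Pclass trend visible in the groupby table (neither is applied below; they just illustrate the options):

# Mean imputation: fill Age's NaNs with the overall mean
age_mean_filled = all_df["Age"].fillna(all_df["Age"].mean())

# Group-wise median: fill each passenger's missing Age with the median of their Pclass
age_class_filled = all_df.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.median()))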

In [31]:
all_df = all_df.fillna(all_df.median())
all_df
Out[31]:
Pclass Age Fare Sex_female Sex_male
0 3 22.0 7.2500 0 1
1 1 38.0 71.2833 1 0
2 3 26.0 7.9250 1 0
3 1 35.0 53.1000 1 0
4 3 35.0 8.0500 0 1
5 3 28.0 8.4583 0 1
6 1 54.0 51.8625 0 1
7 3 2.0 21.0750 0 1
8 3 27.0 11.1333 1 0
9 2 14.0 30.0708 1 0
10 3 4.0 16.7000 1 0
11 1 58.0 26.5500 1 0
12 3 20.0 8.0500 0 1
13 3 39.0 31.2750 0 1
14 3 14.0 7.8542 1 0
15 2 55.0 16.0000 1 0
16 3 2.0 29.1250 0 1
17 2 28.0 13.0000 0 1
18 3 31.0 18.0000 1 0
19 3 28.0 7.2250 1 0
20 2 35.0 26.0000 0 1
21 2 34.0 13.0000 0 1
22 3 15.0 8.0292 1 0
23 1 28.0 35.5000 0 1
24 3 8.0 21.0750 1 0
25 3 38.0 31.3875 1 0
26 3 28.0 7.2250 0 1
27 1 19.0 263.0000 0 1
28 3 28.0 7.8792 1 0
29 3 28.0 7.8958 0 1
... ... ... ... ... ...
388 3 21.0 7.7500 0 1
389 3 6.0 21.0750 0 1
390 1 23.0 93.5000 0 1
391 1 51.0 39.4000 1 0
392 3 13.0 20.2500 0 1
393 2 47.0 10.5000 0 1
394 3 29.0 22.0250 0 1
395 1 18.0 60.0000 1 0
396 3 24.0 7.2500 0 1
397 1 48.0 79.2000 1 0
398 3 22.0 7.7750 0 1
399 3 31.0 7.7333 0 1
400 1 30.0 164.8667 1 0
401 2 38.0 21.0000 0 1
402 1 22.0 59.4000 1 0
403 1 17.0 47.1000 0 1
404 1 43.0 27.7208 0 1
405 2 20.0 13.8625 0 1
406 2 23.0 10.5000 0 1
407 1 50.0 211.5000 0 1
408 3 28.0 7.7208 1 0
409 3 3.0 13.7750 1 0
410 3 28.0 7.7500 1 0
411 1 37.0 90.0000 1 0
412 3 28.0 7.7750 1 0
413 3 28.0 8.0500 0 1
414 1 39.0 108.9000 1 0
415 3 38.5 7.2500 0 1
416 3 28.0 8.0500 0 1
417 3 28.0 22.3583 0 1

1309 rows × 5 columns

In [32]:
train, test = all_df[:train_df.shape[0]], all_df[train_df.shape[0]:]
train.shape, test.shape
Out[32]:
((891, 5), (418, 5))
In [33]:
t_train = train_df["Survived"].values
x_train = train.values
x_test = test.values

scikit-learn

The most actively maintained machine learning library in Python.
Its selling points are a clear API and a wealth of functionality.

[Figure: scikit-learn algorithm cheat sheet]
With such a huge number of methods available, this cheat sheet is a handy guide for choosing one,
though it shows only a small fraction of what exists.
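"Clear API" means every estimator follows the same construct/fit/predict pattern; a sketch with logistic regression standing in for any classifier:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()       # 1. construct with hyperparameters
clf.fit(x_train, t_train)        # 2. train on features and labels
y_pred = clf.predict(x_train)    # 3. predict labels for (new) data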

Prediction with a Decision Tree

In [34]:
import sklearn.tree
clf_decision_tree = sklearn.tree.DecisionTreeClassifier(max_depth=2, random_state=0)
In [35]:
clf_decision_tree.fit(x_train, t_train)
Out[35]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')
In [36]:
y_train = clf_decision_tree.predict(x_train)
y_train
Out[36]:
array([0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

accuracy (number correct ÷ number of samples) hardly needs a library, but

In [37]:
sum(t_train == y_train) / len(t_train)
Out[37]:
0.79573512906846244
In [38]:
import sklearn.metrics  # explicit import; relying on `import sklearn.tree` pulling it in is fragile
sklearn.metrics.accuracy_score(t_train, y_train)
Out[38]:
0.79573512906846244

Long imports can be shortened like this (a double-edged sword: it pollutes the namespace)

In [39]:
from sklearn.metrics import accuracy_score
accuracy_score(t_train, y_train)
Out[39]:
0.79573512906846244

For classification with many classes, the confusion matrix makes the tendencies easier to grasp

In [40]:
from sklearn.metrics import confusion_matrix
confusion_matrix(t_train, y_train)
Out[40]:
array([[532,  17],
       [165, 177]])
Rows are the true labels, columns are the predictions.
532 passengers (top left) were correctly predicted dead, 165 (bottom left) were wrongly predicted dead,
17 (top right) were wrongly predicted survived, and 177 (bottom right) were correctly predicted survived.
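To unpack the four cells by name, a sketch:

# scikit-learn's layout: rows = truth, columns = prediction
tn, fp, fn, tp = confusion_matrix(t_train, y_train).ravel()
print(tn, fp, fn, tp)   # 532 17 165 177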
In [41]:
from sklearn.metrics import classification_report
print(classification_report(t_train, y_train))
             precision    recall  f1-score   support

          0       0.76      0.97      0.85       549
          1       0.91      0.52      0.66       342

avg / total       0.82      0.80      0.78       891

Visualizing the Decision Tree

This won't run unless graphviz is installed separately ^^;

In [42]:
import pydotplus
from sklearn.externals.six import StringIO
dot_data = StringIO()
sklearn.tree.export_graphviz(clf_decision_tree, out_file=dot_data,
                             feature_names=train.columns, filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

# Write out as a PDF file
# graph.write_pdf("graph.pdf")
In [43]:
from IPython.display import Image
Image(graph.create_png())
Out[43]:
_images/Python02_85_0.png
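If installing graphviz is not an option, newer scikit-learn versions (0.21+) can dump the tree as plain text; a sketch assuming such a version:

from sklearn.tree import export_text  # requires scikit-learn >= 0.21
print(export_text(clf_decision_tree, feature_names=list(train.columns)))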

Prediction with Random Forest

Random Forest is robust to missing values and to high dimensionality
With enough trees it (reportedly) does not overfit (?)
Evaluation via the OOB error means no cross validation is needed (?) — see the sketch after the results below
In [44]:
from sklearn.ensemble import RandomForestClassifier
# n_estimators is the number of trees
clf_random_forest = RandomForestClassifier(n_estimators=100, random_state=0)
In [45]:
clf_random_forest.fit(x_train, t_train)
Out[45]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
In [46]:
y_train = clf_random_forest.predict(x_train)
print(accuracy_score(t_train, y_train))
0.977553310887
In [47]:
print(classification_report(t_train, y_train))
             precision    recall  f1-score   support

          0       0.98      0.99      0.98       549
          1       0.98      0.96      0.97       342

avg / total       0.98      0.98      0.98       891

In [48]:
print(confusion_matrix(t_train, y_train))
[[541   8]
 [ 12 330]]
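Note that these near-perfect numbers are measured on the training data itself. The OOB score mentioned earlier gives a fairer estimate without a separate validation set; a minimal sketch:

# oob_score=True scores each sample using only the trees that did not train on it
clf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf_oob.fit(x_train, t_train)
print(clf_oob.oob_score_)   # out-of-bag accuracy; expect it well below the training accuracy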

3. Evaluating the Prediction Model

What makes a model "good" depends on the problem setting
accuracy? precision? recall? f1-score?
AUC of the PR curve? AUC of the ROC curve?

Improving the Model

  • Might there be better parameters?
    We need to try a range of parameter settings. → Grid Search
  • Is there any guarantee it works on other data?
    We want to prevent overfitting = improve generalization. → Cross Validation

[Figure: cross validation] From 機械学習によるデータ分析まわりのお話

Both can be done with scikit-learn!
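If you want cross validation alone, without a parameter grid, cross_val_score is enough; a sketch:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross validation of a single model; returns one f1 score per fold
scores = cross_val_score(SVC(random_state=0), x_train, t_train, cv=5, scoring="f1")
print(scores.mean(), scores.std())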

In [49]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
tuned_parameters = [
    {'C': [1, 1e1, 1e2, 1e3, 1e4],
     'kernel': ['rbf'],
     'gamma': [1e-2, 1e-3, 1e-4]},
]
clf_SVC_cv = GridSearchCV(
    SVC(random_state=0),
    tuned_parameters,
    cv=5,
    scoring="f1",
    n_jobs=-1
)
In [50]:
clf_SVC_cv.fit(x_train, t_train)
Out[50]:
GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'C': [1, 10.0, 100.0, 1000.0, 10000.0], 'kernel': ['rbf'], 'gamma': [0.01, 0.001, 0.0001]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)
In [51]:
pd.DataFrame(clf_SVC_cv.cv_results_)
Out[51]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_C param_gamma param_kernel params rank_test_score split0_test_score ... split2_test_score split2_train_score split3_test_score split3_train_score split4_test_score split4_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.022302 0.004612 0.500432 0.599439 1 0.01 rbf {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'} 12 0.348624 ... 0.537037 0.581395 0.469388 0.593103 0.596491 0.565321 0.000882 0.000682 0.086323 0.025639
1 0.017771 0.003363 0.478282 0.498602 1 0.001 rbf {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'} 13 0.336283 ... 0.509434 0.513636 0.437500 0.432836 0.563636 0.494172 0.000585 0.000074 0.083251 0.041543
2 0.015337 0.003225 0.373277 0.401322 1 0.0001 rbf {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'} 15 0.222222 ... 0.333333 0.368000 0.391304 0.354839 0.391304 0.381963 0.000719 0.000057 0.099299 0.047146
3 0.030071 0.002918 0.680114 0.808045 10 0.01 rbf {'C': 10.0, 'gamma': 0.01, 'kernel': 'rbf'} 9 0.666667 ... 0.676471 0.802239 0.638655 0.797794 0.733813 0.791128 0.003710 0.000044 0.030994 0.014921
4 0.023025 0.003189 0.696823 0.736891 10 0.001 rbf {'C': 10.0, 'gamma': 0.001, 'kernel': 'rbf'} 7 0.671053 ... 0.705036 0.739526 0.651163 0.758123 0.706767 0.719858 0.002123 0.000111 0.033928 0.012797
5 0.016498 0.003338 0.442728 0.442583 10 0.0001 rbf {'C': 10.0, 'gamma': 0.0001, 'kernel': 'rbf'} 14 0.330275 ... 0.421053 0.422680 0.458333 0.390501 0.453608 0.416452 0.001508 0.000120 0.070980 0.047659
6 0.115308 0.002867 0.685099 0.851002 100 0.01 rbf {'C': 100.0, 'gamma': 0.01, 'kernel': 'rbf'} 8 0.652778 ... 0.724638 0.851449 0.656000 0.843066 0.725926 0.845173 0.029387 0.000320 0.033026 0.006237
7 0.049189 0.002460 0.710631 0.771008 100 0.001 rbf {'C': 100.0, 'gamma': 0.001, 'kernel': 'rbf'} 5 0.701987 ... 0.724638 0.763158 0.645161 0.772313 0.755556 0.753731 0.009958 0.000300 0.036865 0.012765
8 0.041163 0.003029 0.704471 0.713795 100 0.0001 rbf {'C': 100.0, 'gamma': 0.0001, 'kernel': 'rbf'} 6 0.733813 ... 0.695652 0.719101 0.625000 0.724584 0.723077 0.711111 0.012930 0.000138 0.042910 0.007052
9 0.572957 0.002542 0.669756 0.870491 1000 0.01 rbf {'C': 1000.0, 'gamma': 0.01, 'kernel': 'rbf'} 10 0.652778 ... 0.704225 0.862816 0.638655 0.866667 0.696296 0.871324 0.242091 0.000094 0.025698 0.005719
10 0.311519 0.002486 0.715629 0.794218 1000 0.001 rbf {'C': 1000.0, 'gamma': 0.001, 'kernel': 'rbf'} 2 0.716216 ... 0.716418 0.785441 0.672131 0.788104 0.779412 0.780037 0.082325 0.000140 0.035728 0.013975
11 0.174307 0.003853 0.712710 0.722426 1000 0.0001 rbf {'C': 1000.0, 'gamma': 0.0001, 'kernel': 'rbf'} 4 0.739130 ... 0.705882 0.722117 0.645161 0.738806 0.723077 0.716981 0.075424 0.002672 0.036903 0.008729
12 4.124018 0.001604 0.649124 0.888971 10000 0.01 rbf {'C': 10000.0, 'gamma': 0.01, 'kernel': 'rbf'} 11 0.638298 ... 0.686567 0.881481 0.611570 0.890130 0.656489 0.887218 1.239600 0.000048 0.024471 0.004438
13 1.460751 0.001553 0.713517 0.814629 10000 0.001 rbf {'C': 10000.0, 'gamma': 0.001, 'kernel': 'rbf'} 3 0.722222 ... 0.729927 0.798479 0.650407 0.811111 0.770370 0.812386 0.218736 0.000039 0.039692 0.010188
14 0.505528 0.002130 0.724973 0.755406 10000 0.0001 rbf {'C': 10000.0, 'gamma': 0.0001, 'kernel': 'rbf'} 1 0.753425 ... 0.720588 0.756554 0.650407 0.769797 0.753846 0.726930 0.216026 0.000431 0.039189 0.015732

15 rows × 23 columns

In [52]:
clf_SVC_cv.best_params_
Out[52]:
{'C': 10000.0, 'gamma': 0.0001, 'kernel': 'rbf'}
In [53]:
clf_SVC_cv.best_score_
Out[53]:
0.72497257558382566
In [54]:
y_train = clf_SVC_cv.predict(x_train)
print(classification_report(t_train, y_train))
             precision    recall  f1-score   support

          0       0.84      0.87      0.85       549
          1       0.77      0.73      0.75       342

avg / total       0.81      0.82      0.81       891

In [55]:
print(accuracy_score(t_train, y_train))
0.81593714927
In [56]:
print(confusion_matrix(t_train, y_train))
[[476  73]
 [ 91 251]]

Saving the Model

Saving with pickle is fine, but here we will use joblib.
With compress, a model that would otherwise span multiple files is bundled into one.
In [57]:
from sklearn.externals import joblib
joblib.dump(clf_SVC_cv.best_estimator_, 'svc.pkl.cmp', compress=True)
Out[57]:
['svc.pkl.cmp']
In [58]:
!ls
Python01.ipynb     _templates         predict_result.csv
Python02.ipynb     conf.py            siritori.pkl
_static            index.rst          svc.pkl.cmp
In [59]:
classifier = joblib.load('svc.pkl.cmp')
In [60]:
y_train = clf_SVC_cv.predict(x_train)
print(accuracy_score(t_train, y_train))
0.81593714927
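For reference, the plain-pickle route mentioned above would look like this (a sketch; the filename is arbitrary):

import pickle

with open("svc.pkl", "wb") as f:
    pickle.dump(clf_SVC_cv.best_estimator_, f)   # serialize the fitted model
with open("svc.pkl", "rb") as f:
    classifier = pickle.load(f)                  # restore it later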

Submitting the Test Results to Kaggle

Read the problem statement and submit in the format the task requires
(note: a real Kaggle submission expects PassengerId taken from test_df["PassengerId"], which runs 892-1309, rather than the 0-based index used below)

In [61]:
y_test = clf_SVC_cv.predict(x_test)
In [62]:
result_df = pd.DataFrame({
    "PassengerId": test.index,
    "Survived": y_test
})
result_df
Out[62]:
PassengerId Survived
0 0 0
1 1 1
2 2 0
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
10 10 0
11 11 0
12 12 1
13 13 0
14 14 1
15 15 1
16 16 0
17 17 0
18 18 1
19 19 1
20 20 0
21 21 0
22 22 1
23 23 0
24 24 1
25 25 0
26 26 1
27 27 0
28 28 0
29 29 0
... ... ...
388 388 0
389 389 0
390 390 0
391 391 1
392 392 0
393 393 0
394 394 0
395 395 1
396 396 0
397 397 1
398 398 0
399 399 0
400 400 1
401 401 0
402 402 1
403 403 0
404 404 0
405 405 0
406 406 0
407 407 0
408 408 1
409 409 1
410 410 1
411 411 1
412 412 1
413 413 0
414 414 1
415 415 0
416 416 0
417 417 0

418 rows × 2 columns

In [63]:
result_df.to_csv('predict_result.csv', index=False)
In [64]:
!head predict_result.csv
PassengerId,Survived
0,0
1,1
2,0
3,0
4,1
5,0
6,1
7,0
8,1

Closing Remarks

Competition-wise this is where things start; trial and error is the norm.
To finish, here are some maxims(?) of data analysis.

No Free Lunch Theorem

"There is no free (= without prior knowledge) lunch (= improvement in prediction or search)" (no free lunch)

Ugly Duckling Theorem

If every feature is weighted equally, a duckling and a gosling are just as similar as two ducklings are to each other.
Rather than using every feature available, choose features suited to your goal.

Spurious Regression

[Figure: a regression between mutually independent random walks generated by simulation]
Despite the independence, the t-statistic comes out extremely high.

The Texas Sharpshooter Fallacy (Clustering Illusion)

From the joke about a Texan who fires many shots at a barn wall, then paints a target
around the tightest cluster of bullet holes and claims to be a sharpshooter (a crack shot).
Never build and test a hypothesis on the same data.

The Law of the Hammer

If all you have is a hammer, everything looks like a nail.