- 14.5 2012联邦选举委员会数据库
- 根据职业和雇主统计赞助信息
- 对出资额分组
- 根据州统计赞助信息
14.5 2012联邦选举委员会数据库
美国联邦选举委员会发布了有关政治竞选赞助方面的数据。其中包括赞助者的姓名、职业、雇主、地址以及出资额等信息。我们对2012年美国总统大选的数据集比较感兴趣(http://www.fec.gov/disclosurep/PDownload.do)。我在2012年6月下载的数据集是一个150MB的CSV文件(P00000001-ALL.csv),我们先用pandas.read_csv将其加载进来:
In [184]: fec = pd.read_csv('datasets/fec/P00000001-ALL.csv')In [185]: fec.info()<class 'pandas.core.frame.DataFrame'>RangeIndex: 1001731 entries, 0 to 1001730Data columns (total 16 columns):cmte_id 1001731 non-null objectcand_id 1001731 non-null objectcand_nm 1001731 non-null objectcontbr_nm 1001731 non-null objectcontbr_city 1001712 non-null objectcontbr_st 1001727 non-null objectcontbr_zip 1001620 non-null objectcontbr_employer 988002 non-null objectcontbr_occupation 993301 non-null objectcontb_receipt_amt 1001731 non-null float64contb_receipt_dt 1001731 non-null objectreceipt_desc 14166 non-null objectmemo_cd 92482 non-null objectmemo_text 97770 non-null objectform_tp 1001731 non-null objectfile_num 1001731 non-null int64dtypes: float64(1), int64(1), object(14)memory usage: 122.3+ MB
该DataFrame中的记录如下所示:
In [186]: fec.iloc[123456]Out[186]:cmte_id C00431445cand_id P80003338cand_nm Obama, Barackcontbr_nm ELLMAN, IRAcontbr_city TEMPE...receipt_desc NaNmemo_cd NaNmemo_text NaNform_tp SA17Afile_num 772372Name: 123456, Length: 16, dtype: object
你可能已经想出了许多办法从这些竞选赞助数据中抽取有关赞助人和赞助模式的统计信息。我将在接下来的内容中介绍几种不同的分析工作(运用到目前为止已经学到的方法)。
不难看出,该数据中没有党派信息,因此最好把它加进去。通过unique,你可以获取全部的候选人名单:
In [187]: unique_cands = fec.cand_nm.unique()In [188]: unique_candsOut[188]:array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy','Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman','Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon','Perry, Rick'], dtype=object)In [189]: unique_cands[2]Out[189]: 'Obama, Barack'
指明党派信息的方法之一是使用字典:
parties = {'Bachmann, Michelle': 'Republican','Cain, Herman': 'Republican','Gingrich, Newt': 'Republican','Huntsman, Jon': 'Republican','Johnson, Gary Earl': 'Republican','McCotter, Thaddeus G': 'Republican','Obama, Barack': 'Democrat','Paul, Ron': 'Republican','Pawlenty, Timothy': 'Republican','Perry, Rick': 'Republican',"Roemer, Charles E. 'Buddy' III": 'Republican','Romney, Mitt': 'Republican','Santorum, Rick': 'Republican'}
现在,通过这个映射以及Series对象的map方法,你可以根据候选人姓名得到一组党派信息:
In [191]: fec.cand_nm[123456:123461]Out[191]:123456 Obama, Barack123457 Obama, Barack123458 Obama, Barack123459 Obama, Barack123460 Obama, BarackName: cand_nm, dtype: objectIn [192]: fec.cand_nm[123456:123461].map(parties)Out[192]:123456 Democrat123457 Democrat123458 Democrat123459 Democrat123460 DemocratName: cand_nm, dtype: object# Add it as a columnIn [193]: fec['party'] = fec.cand_nm.map(parties)In [194]: fec['party'].value_counts()Out[194]:Democrat 593746Republican 407985Name: party, dtype: int64
这里有两个需要注意的地方。第一,该数据既包括赞助也包括退款(负的出资额):
In [195]: (fec.contb_receipt_amt > 0).value_counts()Out[195]:True 991475False 10256Name: contb_receipt_amt, dtype: int64
为了简化分析过程,我限定该数据集只能有正的出资额:
In [196]: fec = fec[fec.contb_receipt_amt > 0]
由于Barack Obama和Mitt Romney是最主要的两名候选人,所以我还专门准备了一个子集,只包含针对他们两人的竞选活动的赞助信息:
In [197]: fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]
根据职业和雇主统计赞助信息
基于职业的赞助信息统计是另一种经常被研究的统计任务。例如,律师们更倾向于资助民主党,而企业主则更倾向于资助共和党。你可以不相信我,自己看那些数据就知道了。首先,根据职业计算出资总额,这很简单:
In [198]: fec.contbr_occupation.value_counts()[:10]Out[198]:RETIRED 233990INFORMATION REQUESTED 35107ATTORNEY 34286HOMEMAKER 29931PHYSICIAN 23432INFORMATION REQUESTED PER BEST EFFORTS 21138ENGINEER 14334TEACHER 13990CONSULTANT 13273PROFESSOR 12555Name: contbr_occupation, dtype: int64
不难看出,许多职业都涉及相同的基本工作类型,或者同一样东西有多种变体。下面的代码片段可以清理一些这样的数据(将一个职业信息映射到另一个)。注意,这里巧妙地利用了dict.get,它允许没有映射关系的职业也能“通过”:
occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED','INFORMATION REQUESTED' : 'NOT PROVIDED','INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED','C.E.O.': 'CEO'}# If no mapping provided, return xf = lambda x: occ_mapping.get(x, x)fec.contbr_occupation = fec.contbr_occupation.map(f)
我对雇主信息也进行了同样的处理:
emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED','INFORMATION REQUESTED' : 'NOT PROVIDED','SELF' : 'SELF-EMPLOYED','SELF EMPLOYED' : 'SELF-EMPLOYED',}# If no mapping provided, return xf = lambda x: emp_mapping.get(x, x)fec.contbr_employer = fec.contbr_employer.map(f)
现在,你可以通过pivot_table根据党派和职业对数据进行聚合,然后过滤掉总出资额不足200万美元的数据:
In [201]: by_occupation = fec.pivot_table('contb_receipt_amt',.....: index='contbr_occupation',.....: columns='party', aggfunc='sum')In [202]: over_2mm = by_occupation[by_occupation.sum(1) > 2000000]In [203]: over_2mmOut[203]:party Democrat Republicancontbr_occupationATTORNEY 11141982.97 7.477194e+06CEO 2074974.79 4.211041e+06CONSULTANT 2459912.71 2.544725e+06ENGINEER 951525.55 1.818374e+06EXECUTIVE 1355161.05 4.138850e+06... ... ...PRESIDENT 1878509.95 4.720924e+06PROFESSOR 2165071.08 2.967027e+05REAL ESTATE 528902.09 1.625902e+06RETIRED 25305116.38 2.356124e+07SELF-EMPLOYED 672393.40 1.640253e+06[17 rows x 2 columns]
把这些数据做成柱状图看起来会更加清楚(’barh’表示水平柱状图,如图14-12所示):
In [205]: over_2mm.plot(kind='barh')

你可能还想了解一下对Obama和Romney总出资额最高的职业和企业。为此,我们先对候选人进行分组,然后使用本章前面介绍的类似top的方法:
def get_top_amounts(group, key, n=5):totals = group.groupby(key)['contb_receipt_amt'].sum()return totals.nlargest(n)
然后根据职业和雇主进行聚合:
In [207]: grouped = fec_mrbo.groupby('cand_nm')In [208]: grouped.apply(get_top_amounts, 'contbr_occupation', n=7)Out[208]:cand_nm contbr_occupationObama, Barack RETIRED 25305116.38ATTORNEY 11141982.97INFORMATION REQUESTED 4866973.96HOMEMAKER 4248875.80PHYSICIAN 3735124.94...Romney, Mitt HOMEMAKER 8147446.22ATTORNEY 5364718.82PRESIDENT 2491244.89EXECUTIVE 2300947.03C.E.O. 1968386.11Name: contb_receipt_amt, Length: 14, dtype: float64In [209]: grouped.apply(get_top_amounts, 'contbr_employer', n=10)Out[209]:cand_nm contbr_employerObama, Barack RETIRED 22694358.85SELF-EMPLOYED 17080985.96NOT EMPLOYED 8586308.70INFORMATION REQUESTED 5053480.37HOMEMAKER 2605408.54...Romney, Mitt CREDIT SUISSE 281150.00MORGAN STANLEY 267266.00GOLDMAN SACH & CO. 238250.00BARCLAYS CAPITAL 162750.00H.I.G. CAPITAL 139500.00Name: contb_receipt_amt, Length: 20, dtype: float64
对出资额分组
还可以对该数据做另一种非常实用的分析:利用cut函数根据出资额的大小将数据离散化到多个面元中:
In [210]: bins = np.array([0, 1, 10, 100, 1000, 10000,.....: 100000, 1000000, 10000000])In [211]: labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)In [212]: labelsOut[212]:411 (10, 100]412 (100, 1000]413 (100, 1000]414 (10, 100]415 (10, 100]...701381 (10, 100]701382 (100, 1000]701383 (1, 10]701384 (10, 100]701385 (100, 1000]Name: contb_receipt_amt, Length: 694282, dtype: categoryCategories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] <(10000, 100000] < (100000, 1000000] < (1000000,10000000]]
现在可以根据候选人姓名以及面元标签对奥巴马和罗姆尼数据进行分组,以得到一个柱状图:
In [213]: grouped = fec_mrbo.groupby(['cand_nm', labels])In [214]: grouped.size().unstack(0)Out[214]:cand_nm Obama, Barack Romney, Mittcontb_receipt_amt(0, 1] 493.0 77.0(1, 10] 40070.0 3681.0(10, 100] 372280.0 31853.0(100, 1000] 153991.0 43357.0(1000, 10000] 22284.0 26186.0(10000, 100000] 2.0 1.0(100000, 1000000] 3.0 NaN(1000000, 10000000] 4.0 NaN
从这个数据中可以看出,在小额赞助方面,Obama获得的数量比Romney多得多。你还可以对出资额求和并在面元内规格化,以便图形化显示两位候选人各种赞助额度的比例(见图14-13):
In [216]: bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)In [217]: normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)In [218]: normed_sumsOut[218]:cand_nm Obama, Barack Romney, Mittcontb_receipt_amt(0, 1] 0.805182 0.194818(1, 10] 0.918767 0.081233(10, 100] 0.910769 0.089231(100, 1000] 0.710176 0.289824(1000, 10000] 0.447326 0.552674(10000, 100000] 0.823120 0.176880(100000, 1000000] 1.000000 NaN(1000000, 10000000] 1.000000 NaNIn [219]: normed_sums[:-2].plot(kind='barh')

我排除了两个最大的面元,因为这些不是由个人捐赠的。
还可以对该分析过程做许多的提炼和改进。比如说,可以根据赞助人的姓名和邮编对数据进行聚合,以便找出哪些人进行了多次小额捐款,哪些人又进行了一次或多次大额捐款。我强烈建议你下载这些数据并自己摸索一下。
根据州统计赞助信息
根据候选人和州对数据进行聚合是常规操作:
In [220]: grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])In [221]: totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)In [222]: totals = totals[totals.sum(1) > 100000]In [223]: totals[:10]Out[223]:cand_nm Obama, Barack Romney, Mittcontbr_stAK 281840.15 86204.24AL 543123.48 527303.51AR 359247.28 105556.00AZ 1506476.98 1888436.23CA 23824984.24 11237636.60CO 2132429.49 1506714.12CT 2068291.26 3499475.45DC 4373538.80 1025137.50DE 336669.14 82712.00FL 7318178.58 8338458.81
如果对各行除以总赞助额,就会得到各候选人在各州的总赞助额比例:
In [224]: percent = totals.div(totals.sum(1), axis=0)In [225]: percent[:10]Out[225]:cand_nm Obama, Barack Romney, Mittcontbr_stAK 0.765778 0.234222AL 0.507390 0.492610AR 0.772902 0.227098AZ 0.443745 0.556255CA 0.679498 0.320502CO 0.585970 0.414030CT 0.371476 0.628524DC 0.810113 0.189887DE 0.802776 0.197224FL 0.467417 0.532583
