All comments (1)
- oz9jo6 2016-09-18 00:00:00
- A Python tool for efficient data processing. Data processing generally involves the following steps: interacting with the outside world; preparation (cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data); transformation; modeling and computation; and presentation.

Introductory examples

1.usa.gov data from bit.ly
[code]
%pwd
%cd ../book_scripts
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
open(path).readline()

import json
# Each line of the file is one JSON record
records = [json.loads(line) for line in open(path)]
records[0]
records[0]['tz']
print(records[0]['tz'])
[/code]

Counting time zones in pure Python
[code]
# Raises KeyError: not every record has a 'tz' field
time_zones = [rec['tz'] for rec in records]
# Keep only the records that have a 'tz' field
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

counts = get_counts(time_zones)
counts['America/New_York']
len(time_zones)

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

top_counts(counts)

from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
[/code]

Counting time zones with pandas
[code]
%matplotlib inline
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
lines = open(path).readlines()
records = [json.loads(line) for line in lines]

from pandas import DataFrame, Series
import pandas as pd
frame = DataFrame(records)
frame
frame['tz'][:10]
tz_counts = frame['tz'].value_counts()
tz_counts[:10]

# Label missing time zones and empty strings before counting
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
plt.figure(figsize=(10, 4))
tz_counts[:10].plot(kind='barh', rot=0)

# The 'a' field holds the user agent string
frame['a'][1]
frame['a'][50]
frame['a'][51]
results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
results.value_counts()[:8]

# Split records into Windows and non-Windows users
cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
operating_system[:5]
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]

# Use to sort in ascending order
indexer = agg_counts.sum(1).argsort()
indexer[:10]
count_subset = agg_counts.take(indexer)[-10:]
count_subset
plt.figure()
count_subset.plot(kind='barh', stacked=True)

# Normalize each row to sum to 1 to compare relative shares
plt.figure()
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
[/code]

MovieLens 1M data set
[code]
import pandas as pd
import os
encoding = 'latin1'

upath = os.path.expanduser('ch02/movielens/users.dat')
rpath = os.path.expanduser('ch02/movielens/ratings.dat')
mpath = os.path.expanduser('ch02/movielens/movies.dat')

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

# A multi-character separator such as '::' needs the Python parsing engine
users = pd.read_csv(upath, sep='::', header=None, names=unames, encoding=encoding, engine='python')
ratings = pd.read_csv(rpath, sep='::', header=None, names=rnames, encoding=encoding, engine='python')
movies = pd.read_csv(mpath, sep='::', header=None, names=mnames, encoding=encoding, engine='python')

users[:5]
ratings[:5]
movies[:5]
ratings

# Merge on the overlapping column names
data = pd.merge(pd.merge(ratings, users), movies)
data
data.iloc[0]  # first row, selected by position

mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
mean_ratings[:5]
ratings_by_title = data.groupby('title').size()
ratings_by_title[:5]

# Keep only titles with at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles[:10]
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
mean_ratings = mean_ratings.rename(index={'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)': 'Seven Samurai (Shichinin no samurai) (1954)'})
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]
[/code]

Measuring rating disagreement
[code]
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:15]
# Reverse order of rows, take first 15 rows
sorted_by_diff[::-1][:15]
# Standard deviation of rating grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]
# Order Series by value in descending order
rating_std_by_title.sort_values(ascending=False)[:10]
[/code]

US Baby Names 1880-2010
[code]
from __future__ import division
from numpy.random import randn
import numpy as np
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 5))
np.set_printoptions(precision=4)
%pwd
[/code]

http://www.ssa.gov/oact/babynames/limits.html
[code]
!head -n 10 ch02/names/yob1880.txt

import pandas as pd
names1880 = pd.read_csv('ch02/names/yob1880.txt', names=['name', 'sex', 'births'])
names1880
names1880.groupby('sex').births.sum()

# 2010 is the last available year right now
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)
total_births.tail()
total_births.plot(title='Total births by sex and year')

def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
names
# Sanity check: proportions within each year/sex group sum to 1
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)

def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)

# The same result built by hand, one group at a time
pieces = []
for year, group in names.groupby(['year', 'sex']):
    pieces.append(group.sort_values(by='births', ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)
top1000.index = np.arange(len(top1000))
top1000
[/code]

Analyzing naming trends
[code]
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc=sum)
total_births
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")
[/code]

Measuring the increase in naming diversity
[code]
plt.figure()
table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex', yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

df = boys[boys.year == 2010]
df
# Cumulative share of births, most popular names first
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:10]
prop_cumsum.values.searchsorted(0.5)

df = boys[boys.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5) + 1

def get_quantile_count(group, q=0.5):
    # Number of names needed to reach a share q of births
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
diversity.head()
diversity.plot(title="Number of popular names in top 50%")
[/code]

The "Last letter" Revolution
[code]
# extract last letter from name column
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', index=last_letters, columns=['sex', 'year'], aggfunc=sum)
subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
subtable.head()
subtable.sum()
letter_prop = subtable / subtable.sum().astype(float)

import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
plt.subplots_adjust(hspace=0.25)

# Share of boys' names ending in d, n and y, as a time series
letter_prop = table / table.sum().astype(float)
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
dny_ts.head()
plt.close('all')
dny_ts.plot()
[/code]

Boy names that became girl names (and vice versa)
[code]
all_names = top1000.name.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesley_like = all_names[mask]
lesley_like
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
table = filtered.pivot_table('births', index='year', columns='sex', aggfunc='sum')
# Normalize within each year
table = table.div(table.sum(1), axis=0)
table.tail()
plt.close('all')
table.plot(style={'M': 'k-', 'F': 'k--'})
[/code]
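The naming-diversity measure above (sort the proportions in descending order, take the cumulative sum, find where it first reaches q, then add 1 because positions start at 0) can be checked by hand on a tiny input. The following is only a minimal sketch using the standard library in place of pandas and NumPy: `quantile_count` and the sample proportions are hypothetical names and values made up for illustration, and `bisect_left` stands in for the default left-sided `searchsorted`.

```python
from bisect import bisect_left
from itertools import accumulate


def quantile_count(props, q=0.5):
    """Return how many of the largest proportions are needed to reach q."""
    # Most popular first, mirroring the descending sort on 'prop'.
    ordered = sorted(props, reverse=True)
    # Running total, mirroring prop.cumsum().
    running = list(accumulate(ordered))
    # First position whose running total is >= q; +1 turns the
    # zero-based position into a count of names.
    return bisect_left(running, q) + 1


# Hypothetical proportions: a concentrated year and a more diverse year.
concentrated = [0.4, 0.3, 0.2, 0.1]
diverse = [0.12] * 8 + [0.04]

print(quantile_count(concentrated))
print(quantile_count(diverse))
```

With the concentrated list two names already cover half of the births, while the flatter list needs five, which is the kind of increase the diversity plot is meant to show.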