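Python ships with many excellent modules (libraries) that make it easy to solve practical problems. Choosing the right function and making good use of these modules can save us many lines of code.
The collections module provides specialized container data types that extend the general-purpose built-in containers dict, list, set, and tuple.
The sections below introduce three commonly used data types from the collections module.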
To begin, put 14 object attributes into a list variable named features:

```python
features = ['id', 'age', 'height', 'name', 'address', 'province', 'city',
            'town', 'country', 'birth_address', 'father_name', 'monther_name',
            'telephone', 'emergency_telephone']
```
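Suppose I maintain the resident records of a village, more than a thousand rows in all. A new task arrives: a fresh household survey of the villagers has produced a new dataset, and I need to compare it with the old one to find the residents whose home address (field address), telephone number (field telephone), or birthplace (field birth_address) has changed.
The approach: look up the indices of the three fields in features, load the old data and the newly surveyed resident data, then compare the three fields. If any one of them differs, the record counts as changed and the resident is added to the list of residents whose information changed.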
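The first step of this approach, looking up the field positions, can be sketched as follows. This is only a minimal illustration that reuses the features list from above; the names `watched` and `indices` are introduced here for the example.

```python
features = ['id', 'age', 'height', 'name', 'address', 'province', 'city',
            'town', 'country', 'birth_address', 'father_name', 'monther_name',
            'telephone', 'emergency_telephone']

# Hypothetical name: the three fields we want to compare.
watched = ['address', 'telephone', 'birth_address']

# list.index returns the position of the first matching element.
indices = {name: features.index(name) for name in watched}
print(indices)  # {'address': 4, 'telephone': 12, 'birth_address': 9}
```

These positions are only valid for one column order, which is exactly the weakness the rest of this section deals with.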
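NamedTuple
In data analysis and machine learning, good use of namedtuple leads to code that is more readable and easier to maintain.
A common scenario in development: all the attributes of an object are put into a list, which is then fed into a machine learning model. Soon you realize that hundreds of attributes live in this list, and that is where things start to go wrong.
For the example above, the code might look like this: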
```python
def update_person_info(old_data, new_data):
    changed_list = []
    for line in new_data:
        new_props = line.split()
        for old in old_data:
            old_props = old.split()
            # Only compare records of the same resident (id is column 0 in both files).
            if old_props[0] != new_props[0]:
                continue
            # telephone: column 12 in the old file, column 10 in the new file
            if old_props[12] != new_props[10]:
                changed_list.append(old_props[0])
            # address: column 4 in the old file, column 7 in the new file
            elif old_props[4] != new_props[7]:
                changed_list.append(old_props[0])
            # birth_address: column 9 in the old file, column 3 in the new file
            elif old_props[9] != new_props[3]:
                changed_list.append(old_props[0])
    return changed_list


# For illustration, each row simply holds the field names in that file's column order.
old_data = [
    'id age height name address province city town country birth_address father_name monther_name telephone emergency_telephone'
]
new_data = [
    'id age height birth_address name province city address town country telephone father_name monther_name emergency_telephone'
]
print(update_person_info(old_data, new_data))
```
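In this code, the bare integer indices (3, 4, 9, 10, and so on) hurt readability. Without comments, I might not remember later what these indices stand for, and it is easy to pick the wrong column by mistake.
With namedtuple, this tangle quickly becomes orderly.
No more worrying about a string of integer indices!
First, here is the basic usage of namedtuple: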
```python
from collections import namedtuple

person = namedtuple('Person',
                    ['id', 'age', 'height', 'name', 'address', 'province', 'city',
                     'town', 'country', 'birth_address', 'father_name',
                     'monther_name', 'telephone', 'emergency_telephone'])
a = [''] * 11
# Positional arguments fill the fields in declaration order,
# so 'xiaoming' ends up in the height field here.
print(person(3, 19, 'xiaoming', *a))
```

Output:

```
Person(id=3, age=19, height='xiaoming', name='', address='', province='', city='', town='', country='', birth_address='', father_name='', monther_name='', telephone='', emergency_telephone='')
```
Having seen what namedtuple offers, let's rewrite the task above with it:
```python
from collections import namedtuple

# Column order of the old file.
person = namedtuple('Person',
                    ['id', 'age', 'height', 'name', 'address', 'province', 'city',
                     'town', 'country', 'birth_address', 'father_name',
                     'monther_name', 'telephone', 'emergency_telephone'])
# Column order of the newly surveyed file. If the columns move again,
# only this field list changes; the comparisons below stay the same.
surveyed_person = namedtuple('Person',
                             ['id', 'age', 'height', 'birth_address', 'name',
                              'province', 'city', 'address', 'town', 'country',
                              'telephone', 'father_name', 'monther_name',
                              'emergency_telephone'])


def update_person_info(old_data, new_data):
    changed_list = []
    for line in new_data:
        new_person = surveyed_person(*line.split())
        for old in old_data:
            old_person = person(*old.split())
            # Only compare records of the same resident.
            if old_person.id != new_person.id:
                continue
            if old_person.telephone != new_person.telephone:
                changed_list.append(old_person.id)
            elif old_person.address != new_person.address:
                changed_list.append(old_person.id)
            elif old_person.birth_address != new_person.birth_address:
                changed_list.append(old_person.id)
    return changed_list


# For illustration, each row simply holds the field names in that file's column order.
old_data = [
    'id age height name address province city town country birth_address father_name monther_name telephone emergency_telephone'
]
new_data = [
    'id age height birth_address name province city address town country telephone father_name monther_name emergency_telephone'
]
print(update_person_info(old_data, new_data))
```
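The contrast is clear: in the rewritten code, none of the three field comparisons uses an integer index, which makes the code more readable.
It also makes the code easier to maintain. When a newly imported file has its feature columns in a different order, the three comparisons need no changes, because fields are looked up by name. The index-based version has to be edited in every comparison, which is more tedious and harder to maintain.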
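One way to make column order a non-issue is to build the namedtuple type from each file's own header row. The sketch below assumes that every file starts with a whitespace-separated header line; the function `parse_rows` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import namedtuple


def parse_rows(lines):
    """Parse whitespace-separated rows, using the first line as field names."""
    header, *rows = lines
    Row = namedtuple('Row', header.split())
    # _make builds a namedtuple instance from any iterable of values.
    return [Row._make(row.split()) for row in rows]


# Made-up sample data: same fields, different column order.
old_rows = parse_rows(['id name address', '1 xiaoming east_street'])
new_rows = parse_rows(['address id name', 'west_street 1 xiaoming'])

# Access by field name works regardless of the column position.
print(old_rows[0].address, new_rows[0].address)  # east_street west_street
```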
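namedtuple has clear advantages, but it also has an obvious limitation: once an instance is created, its attribute values cannot be modified. In other words, the attributes are read-only.
In the example below, once xiaoming has been created, none of its attributes can be reassigned.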
```python
from collections import namedtuple

person = namedtuple('Person',
                    ['id', 'age', 'height', 'name', 'address', 'province', 'city',
                     'town', 'country', 'birth_address', 'father_name',
                     'monther_name', 'telephone', 'emergency_telephone'])
a = [''] * 11
xiaoming = person(3, 19, 'xiaoming', *a)
xiaoming.age = 20  # raises AttributeError: namedtuple fields are read-only
```

Output:

```
Traceback (most recent call last):
  File "/Users/lvjing/PycharmProjects/python_base_project/demo05.py", line 10, in <module>
    xiaoming.age = 20
AttributeError: can't set attribute
```
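When an updated value is needed, namedtuple offers the `_replace` method, which returns a new instance rather than modifying the original. A minimal sketch, using a shortened three-field `Person` that is assumed here only to keep the example small:

```python
from collections import namedtuple

# Shortened, hypothetical version of the Person type.
Person = namedtuple('Person', ['id', 'age', 'name'])
xiaoming = Person(3, 19, 'xiaoming')

# _replace does not mutate; it returns a new Person with the given fields changed.
older = xiaoming._replace(age=20)
print(xiaoming)  # Person(id=3, age=19, name='xiaoming')
print(older)     # Person(id=3, age=20, name='xiaoming')
```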
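Counter
As its name suggests, Counter's main job is counting, something that comes up all the time when analyzing data.
With a plain list and dict, we would typically count by hand. The following example counts how many times each element occurs in a list: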
```python
lst = [3, 8, 3, 10, 3, 3, 1, 3, 7, 6, 1, 2, 7, 0, 7, 9, 1, 5, 1, 0]

d = {}
for i in lst:
    if d.get(i) is None:
        d[i] = 1
    else:
        d[i] += 1

# Sort by count, largest first.
d_most = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))
print(d_most)
```

Output:

```
{3: 5, 1: 4, 7: 3, 0: 2, 8: 1, 10: 1, 6: 1, 2: 1, 9: 1, 5: 1}
```
With Counter, the same job takes far less code.
```python
from collections import Counter

lst = [3, 8, 3, 10, 3, 3, 1, 3, 7, 6, 1, 2, 7, 0, 7, 9, 1, 5, 1, 0]
result_lst = Counter(lst).most_common()
print(result_lst)
```

Output:

```
[(3, 5), (1, 4), (7, 3), (0, 2), (8, 1), (10, 1), (6, 1), (2, 1), (9, 1), (5, 1)]
```
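A single line of code produces the counts. The result is a list of (element, count) pairs sorted by count in descending order; for example, element 3 occurs 5 times.
Counter can also quickly count how often each word occurs in a sentence, or how often each character occurs in a string. The following example counts characters: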
```python
from collections import Counter

s = 'I love python so much'
# Iterating over a string yields characters, so this counts characters (spaces included).
result = Counter(s).most_common()
print(result)
```

Output:

```
[(' ', 4), ('o', 3), ('h', 2), ('I', 1), ('l', 1), ('v', 1), ('e', 1), ('p', 1), ('y', 1), ('t', 1), ('n', 1), ('s', 1), ('m', 1), ('u', 1), ('c', 1)]
```
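Counting words instead of characters only requires splitting the sentence first. A small sketch, with a made-up sentence chosen so that some words repeat:

```python
from collections import Counter

# Made-up sentence for illustration.
sentence = 'I love python and I love code'

# str.split() breaks the sentence on whitespace, so Counter sees whole words.
word_counts = Counter(sentence.split()).most_common()
print(word_counts)
# [('I', 2), ('love', 2), ('python', 1), ('and', 1), ('code', 1)]
```

Words with equal counts appear in the order they were first encountered.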
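DefaultDict
defaultdict behaves like a regular dict, except that it takes a factory function (such as int or list) when it is created. When a missing key is accessed, it calls the factory to create that key's initial value, so every key behaves as if it had already been initialized.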
```python
from collections import defaultdict

i = defaultdict(int)
print(i)

l = defaultdict(list)
print(l)
```

Output:

```
defaultdict(<class 'int'>, {})
defaultdict(<class 'list'>, {})
```
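Both dictionaries start out empty; the default values only appear when a missing key is actually accessed. The following sketch (variable names are chosen here just for illustration) makes that behavior visible:

```python
from collections import defaultdict

counts = defaultdict(int)
groups = defaultdict(list)

# Reading a missing key calls the factory (int() -> 0, list() -> [])
# and stores the result under that key.
print(counts['a'])  # 0
groups['x'].append(1)

print(dict(counts))  # {'a': 0}
print(dict(groups))  # {'x': [1]}

# Membership tests and .get() do not create keys.
print('b' in counts, counts.get('b'))  # False None
print(len(counts))  # 1
```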
The next example records, for each character in a string, the indices at which it appears:
```python
from collections import defaultdict

l = defaultdict(list)
s = 'from collections import defaultdict'
for index, i in enumerate(s):
    # A missing key starts as an empty list, so append always works.
    l[i].append(index)

print(l)
```

Output:

```
defaultdict(<class 'list'>, {'f': [0, 26], 'r': [1, 21], 'o': [2, 6, 13, 20], 'm': [3, 18], ' ': [4, 16, 23], 'c': [5, 10, 33], 'l': [7, 8, 29], 'e': [9, 25], 't': [11, 22, 30, 34], 'i': [12, 17, 32], 'n': [14], 's': [15], 'p': [19], 'd': [24, 31], 'a': [27], 'u': [28]})
```
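Accessing a key that is not in a regular dict raises an exception (KeyError). A defaultdict handles the initialization for us instead.
Without defaultdict, we need to write the if-else logic ourselves.
If the key is not yet in the dict, we manually create a list and put the first element into it, namely the character's index. Like this: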
```python
d = {}
s = 'from collections import defaultdict'
for index, i in enumerate(s):
    if i in d:
        d[i].append(index)
    else:
        d[i] = [index]
print(d)
```

Output:

```
{'f': [0, 26], 'r': [1, 21], 'o': [2, 6, 13, 20], 'm': [3, 18], ' ': [4, 16, 23], 'c': [5, 10, 33], 'l': [7, 8, 29], 'e': [9, 25], 't': [11, 22, 30, 34], 'i': [12, 17, 32], 'n': [14], 's': [15], 'p': [19], 'd': [24, 31], 'a': [27], 'u': [28]}
```
The result is the same, but the defaultdict version is clearly more concise.
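For comparison, a plain dict can also avoid the if-else branch with `dict.setdefault`. This is an alternative technique rather than the approach used in this article, shown only to put the defaultdict version in context:

```python
d = {}
s = 'from collections import defaultdict'
for index, ch in enumerate(s):
    # setdefault returns d[ch] if the key exists; otherwise it stores [] and returns it.
    d.setdefault(ch, []).append(index)

print(d['f'])  # [0, 26]
```

Unlike defaultdict, the default has to be spelled out at every call site, which is one reason defaultdict tends to read more cleanly when the same default is used everywhere.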
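Examples
Permutations
Two strings are permutations of each other when they contain the same characters, each the same number of times, but possibly in a different order.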
```python
from collections import defaultdict


def is_permutation(str1, str2):
    if str1 is None or str2 is None:
        return False
    if len(str1) != len(str2):
        return False
    # Character -> number of occurrences, for each string.
    unq_str1 = defaultdict(int)
    unq_str2 = defaultdict(int)
    for c1 in str1:
        unq_str1[c1] += 1
    for c2 in str2:
        unq_str2[c2] += 1

    return unq_str1 == unq_str2
```
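Here the defaultdict is created with int as the factory, so every character's count starts at 0.
If the two resulting defaultdicts, unq_str1 and unq_str2, are equal, then str1 and str2 are permutations of each other.
Now let's run a few tests: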
```python
result = is_permutation('nice', 'cine')
print(result)

result = is_permutation('', '')
print(result)

result = is_permutation('', None)
print(result)

result = is_permutation('work', 'woo')
print(result)
```
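Since the check boils down to comparing character counts, the same idea can be expressed with Counter. The function name `is_permutation_counter` is hypothetical, and this is a sketch of an equivalent formulation rather than a replacement for the version above:

```python
from collections import Counter


def is_permutation_counter(str1, str2):
    """Return True if both strings contain the same characters with the same counts."""
    if str1 is None or str2 is None:
        return False
    # Two Counters are equal when every character has the same count in both.
    return Counter(str1) == Counter(str2)


print(is_permutation_counter('nice', 'cine'))  # True
print(is_permutation_counter('', ''))          # True
print(is_permutation_counter('', None))        # False
print(is_permutation_counter('work', 'woo'))   # False
```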
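Word frequency
Use yield to decouple reading the data (python_read) from processing it (process).
python_read: reads the file line by line.
process: uses a regular expression to replace runs of non-word characters with a space, splits the line on whitespace, and accumulates the word counts in a defaultdict.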
```python
import re
from collections import defaultdict


def python_read(filename):
    # Yield one line at a time so the whole file never has to sit in memory.
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            yield line


d = defaultdict(int)


def process(line):
    # A raw string keeps \W from being treated as an invalid string escape.
    for word in re.sub(r'\W+', ' ', line).split():
        d[word] += 1
```
Call the two functions, then use the Counter class to sort the words by frequency.
Contents of a.txt:
```
hello world!!!! nice to meet you la la la yes no1 jack yes yes no no you you check
```
```python
from collections import Counter

# python_read, process, and d are defined in the previous block.
for line in python_read('./data/a.txt'):
    process(line)

frequency = Counter(d).most_common()
print(frequency)
```

Output:

```
[('you', 3), ('la', 3), ('yes', 3), ('no', 2), ('hello', 1), ('world', 1), ('nice', 1), ('to', 1), ('meet', 1), ('no1', 1), ('jack', 1), ('check', 1)]
```
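The same reader/processor split can be written without a module-level dictionary by letting Counter do the accumulation. The sketch below is self-contained: `read_lines` and `count_words` are hypothetical names, and `io.StringIO` with made-up text stands in for a real file.

```python
import io
import re
from collections import Counter


def read_lines(stream):
    """Yield lines one at a time, keeping reading separate from processing."""
    for line in stream:
        yield line


def count_words(lines):
    """Count words across all lines; runs of non-word characters act as separators."""
    counts = Counter()
    for line in lines:
        # Counter.update adds one to the count of each word in the iterable.
        counts.update(re.sub(r'\W+', ' ', line).split())
    return counts


# io.StringIO stands in for an open file here; the text is made up.
sample = io.StringIO('hello world!!!\nyes no yes\n')
frequency = count_words(read_lines(sample)).most_common()
print(frequency)
# [('yes', 2), ('hello', 1), ('world', 1), ('no', 1)]
```

Returning the Counter from the function, instead of updating a global, makes it possible to count several inputs independently.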