2.2 Counting Words by Genre
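In 1, we saw a conditional frequency distribution in which the condition was the section of the Brown Corpus, and words were counted separately for each condition. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs, each of the form (condition, event).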
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
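To make the difference in input shape concrete, here is a minimal sketch that uses only the standard library rather than NLTK itself. The word list and pairs are invented for illustration, and `flat_counts` and `conditional_counts` are hypothetical names; the dictionary of counters is only a rough stand-in for what ConditionalFreqDist provides, not its actual implementation.

```python
from collections import Counter, defaultdict

# Toy data, invented for illustration (not taken from the Brown Corpus).
words = ["the", "jury", "said", "the", "end"]
pairs = [("news", "the"), ("news", "jury"), ("news", "the"),
         ("romance", "the"), ("romance", "she")]

# A flat list of words yields a single table of counts
# (the role that FreqDist plays).
flat_counts = Counter(words)

# A list of (condition, word) pairs yields one table of counts per
# condition (the role that ConditionalFreqDist plays).
conditional_counts = defaultdict(Counter)
for condition, word in pairs:
    conditional_counts[condition][word] += 1

print(flat_counts["the"])                 # 2
print(sorted(conditional_counts))         # ['news', 'romance']
print(conditional_counts["news"]["the"])  # 2
```

The point to notice is that the first element of each pair decides which table the second element is counted in.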
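Let's break this down and look at just two genres, news and romance. For each genre, we loop over every word in that genre, producing pairs that consist of the genre and the word: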
>>> genre_word = [(genre, word)
...               for genre in ['news', 'romance']
...               for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
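Because the loop over genres is the outer one, the pairs at the beginning of the list genre_word have the form ('news', word), while those at the end have the form ('romance', word), as the following code shows.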
>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
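The ordering of the pairs follows from how a comprehension with two `for` clauses is evaluated. The sketch below illustrates this with a small invented dictionary, `toy_corpus`, standing in for brown.words(categories=genre); it is an assumption made so the example runs without the corpus installed.

```python
# Toy stand-in for brown.words(categories=genre); the words are invented.
toy_corpus = {
    "news": ["The", "jury", "said", "."],
    "romance": ["She", "was", "afraid", "."],
}

# Comprehension form, mirroring the construction of genre_word.
pairs = [(genre, word)
         for genre in ["news", "romance"]
         for word in toy_corpus[genre]]

# Equivalent explicit loops: the leftmost `for` is the outer loop.
pairs_loop = []
for genre in ["news", "romance"]:
    for word in toy_corpus[genre]:
        pairs_loop.append((genre, word))

print(pairs == pairs_loop)  # True
print(len(pairs))           # 8
print(pairs[:2])            # [('news', 'The'), ('news', 'jury')]
print(pairs[-2:])           # [('romance', 'afraid'), ('romance', '.')]
```

All pairs for the first genre are emitted before any pair for the second, which is why slicing the start and end of the list shows different genres.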
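We can now use this list of pairs to create a ConditionalFreqDist and save it in a variable cfd. As usual, we can type the name of the variable to inspect it, and verify that it has two conditions: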
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
Let's access the two conditions and confirm that each one is just a frequency distribution:
>>> print(cfd['news'])
<FreqDist with 14394 samples and 100554 outcomes>
>>> print(cfd['romance'])
<FreqDist with 8452 samples and 70022 outcomes>
>>> cfd['romance'].most_common(20)
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
>>> cfd['romance']['could']
193
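In the output above, "samples" counts distinct items and "outcomes" counts every item tallied. The following sketch shows the same two quantities, plus the most-common and single-word lookups, on an invented token list. It uses collections.Counter from the standard library as a stand-in for a FreqDist, so it should be read as an analogy, not as a description of NLTK's internals.

```python
from collections import Counter

# Invented token list standing in for the words of one genre.
tokens = ["she", "could", "not", ",", "she", "said", ",", "she", "could"]
counts = Counter(tokens)

samples = len(counts)            # number of distinct items
outcomes = sum(counts.values())  # total number of items counted

print(samples, outcomes)         # 5 9
print(counts.most_common(1))     # [('she', 3)]
print(counts["could"])           # 2
print(counts["never"])           # 0 (unseen items count as zero)
```

Looking up a word that never occurred returns zero instead of raising an error, which is convenient when comparing counts for the same word across conditions.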
