1.3 Brown Corpus

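The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, categorized by genre, such as news, editorial, and so on. Table 1.1 gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html).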

Table 1.1:

Example Document for Each Section of the Brown Corpus

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

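The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before trying the following: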

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

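Note

We need to include end=' ' so that the print function puts its output on a single line, separated by spaces, rather than starting a new line after each item.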

Note

Your Turn: Choose a different section of the Brown Corpus, and adapt the previous example to count a selection of wh words, such as what, when, where, who, and why.
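One way to approach this exercise is sketched below. To keep it runnable without the corpus installed, it uses collections.Counter from the standard library as a stand-in for nltk.FreqDist, and a made-up word list in place of brown.words(...). The helper name count_targets and the sample data are assumptions for illustration, not part of NLTK.

```python
from collections import Counter

def count_targets(words, targets):
    """Return {target: count} for each target word, ignoring case."""
    counts = Counter(w.lower() for w in words)
    # A Counter reports 0 for words it has never seen.
    return {t: counts[t] for t in targets}

# Toy word list, invented for illustration (not taken from the corpus).
sample_words = ['What', 'did', 'he', 'say', 'when', 'he', 'left', '?',
                'Who', 'knows', 'what', 'he', 'meant', '.']
wh_words = ['what', 'when', 'where', 'who', 'why']

wh_counts = count_targets(sample_words, wh_words)
for w in wh_words:
    print(w + ':', wh_counts[w], end=' ')
print()
```

With NLTK and the corpus available, passing something like brown.words(categories='romance') as the words argument should work the same way, since it also yields word strings.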

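Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in Section 2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output.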

>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                  can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13

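Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in chap-data-intensive.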
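The idea behind a conditional frequency distribution can be sketched with the standard library alone: one counter per condition (here, per genre), filled from (genre, word) pairs. The sketch below is not how nltk.ConditionalFreqDist is implemented; the helper name conditional_counts and the toy pairs are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Build {condition: Counter of samples} from (condition, sample) pairs."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Toy (genre, word) pairs, invented for illustration only.
pairs = [
    ('news', 'will'), ('news', 'will'), ('news', 'could'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'will'),
    ('romance', 'might'),
]
modals = ['can', 'could', 'may', 'might', 'must', 'will']

table = conditional_counts(pairs)

# For each genre, pick the modal with the highest count.
top_modal = {
    genre: max(modals, key=lambda m: counts[m])
    for genre, counts in table.items()
}
print(top_modal)
```

Keeping a separate counter per genre is what makes the per-genre comparison cheap: finding the most frequent modal for a genre only needs that genre's counter.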