3.3 Training Classifier-Based Chunkers
Both the regular-expression-based chunkers and the n-gram chunkers decide what chunks to create entirely on the basis of part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

```
a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.
```

These two sentences have identical part-of-speech tag sequences, yet they are chunked differently: in the first, the farmer and rice are separate chunks, while the corresponding material in the second, the computer monitor, is a single chunk.
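A quick check (a minimal sketch; the variable names `sent_a` and `sent_b` are just illustrative) confirms that the tag sequences really are identical, so no tag-only chunker can treat the two sentences differently:

```python
>>> sent_a = [("Joey", "NN"), ("sold", "VBD"), ("the", "DT"),
...           ("farmer", "NN"), ("rice", "NN"), (".", ".")]
>>> sent_b = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
...           ("computer", "NN"), ("monitor", "NN"), (".", ".")]
>>> [pos for (word, pos) in sent_a] == [pos for (word, pos) in sent_b]
True
```

To chunk such sentences correctly, we need to make use of information about the content of the words, in addition to their part-of-speech tags. One way to do this is to use a classifier-based tagger to assign an IOB chunk tag to each word, as in the two classes below.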
```python
import nltk

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                # Extract features for this token, given the tags assigned so far
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        # Train a maximum entropy classifier on the (featureset, IOB-tag) pairs
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        # Convert each chunk tree into ((word, tag), chunk-tag) training pairs
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        # Tag the sentence, then convert the IOB tags back into a chunk tree
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
```
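The `train_sents` and `test_sents` used in the evaluations below are the NP-chunked sentences of the CoNLL-2000 corpus, loaded as earlier in the chapter:

```python
>>> from nltk.corpus import conll2000
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
```

Note that `algorithm='megam'` requires the external `megam` binary to be installed. If it is not available, one workable substitute (an assumption about your setup, not part of the original example) is to omit the `algorithm` argument and fall back to NLTK's slower built-in maxent trainers, or to use `nltk.NaiveBayesClassifier.train` for a quick experiment.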
The only piece left to fill in is the feature extractor. We begin by defining a simple feature extractor, which just provides the part-of-speech tag of the current token. Using this feature extractor, our classifier-based chunker performs very similarly to the unigram chunker:
>>> def npchunk_features(sentence, i, history):... word, pos = sentence[i]... return {"pos": pos}>>> chunker = ConsecutiveNPChunker(train_sents)>>> print(chunker.evaluate(test_sents))ChunkParse score:IOB Accuracy: 92.9%Precision: 79.9%Recall: 86.7%F-Measure: 83.2%
We can also add a feature for the previous part-of-speech tag. Adding this feature allows the chunker to model interactions between adjacent tags, and results in a chunker that is closely related to the bigram chunker.
```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.2%
    F-Measure:     84.5%
```
Next, we'll try adding a feature for the current word, since we hypothesized that word content should be useful for chunking. We find that this feature does indeed improve the chunker's performance, by about 1.5 percentage points (which corresponds to roughly a 10% reduction in the error rate).
```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.5%
    Precision:     84.2%
    Recall:        89.4%
    F-Measure:     86.7%
```
Finally, we can try extending the feature extractor with a variety of additional features, such as lookahead features, paired features, and complex contextual features. This last feature, called tags-since-dt, creates a string describing the set of all part-of-speech tags that have been encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index i.
```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,                           # lookahead feature
...             "prevpos+pos": "%s+%s" % (prevpos, pos),      # paired features
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}  # contextual feature
```
```python
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()   # reset at each determiner
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
```
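For instance, on a couple of made-up tagged sentences (illustrative examples, not from the corpus), tags_since_dt behaves as follows:

```python
>>> tags_since_dt([("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN")], 3)
'JJ'
>>> tags_since_dt([("he", "PRP"), ("saw", "VBD"), ("dogs", "NNS")], 2)
'PRP+VBD'
```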
```python
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.6%
    Recall:        91.0%
    F-Measure:     89.8%
```
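Once trained, the chunker can be applied to any POS-tagged sentence via its parse() method, which returns a chunk tree. A brief sketch with a made-up sentence (the exact tree produced depends on the trained model):

```python
>>> tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
...           ("dog", "NN"), ("barked", "VBD")]
>>> print(chunker.parse(tagged))
(S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD)
```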
Note

Your Turn: Try adding different features to the feature extractor function npchunk_features, and see whether you can further improve the performance of the NP chunker.
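One possible starting point (a sketch, not from the original text): the history argument holds the IOB chunk tags already assigned earlier in the sentence, and none of the extractors above actually use it. Exposing the previous chunk tag, or a crude word-shape cue, as features might help:

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     prevtag = history[-1] if history else "<START>"   # previous IOB chunk tag
...     shape = ("num" if word[0].isdigit() else          # crude word-shape feature
...              "cap" if word[0].isupper() else "low")
...     return {"pos": pos, "word": word,
...             "prevtag": prevtag, "shape": shape}
```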
