Finding collocations
Collocations can be found by using KoNLPy together with NLTK.
To find trigram collocations (three consecutive tokens) instead of bigrams, replace BigramAssocMeasures with TrigramAssocMeasures and BigramCollocationFinder with TrigramCollocationFinder.
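The swap described above might look like the following. This is a minimal sketch rather than part of the original script: it uses a short, made-up token list (`tokens`) in place of tagger output so that it runs with NLTK alone, and the frequency threshold of 2 is chosen only to suit the toy data.

```python
from nltk import collocations

# Hypothetical, hand-written tokens standing in for the words a tagger would produce.
tokens = ['지방', '자치', '단체', '는', '법률', '에', '따라',
          '지방', '자치', '단체', '의', '장', '을', '선임', '한다']

# Same pattern as the bigram case, with the Trigram* classes swapped in.
measures = collocations.TrigramAssocMeasures()
finder = collocations.TrigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)          # keep only trigrams seen at least twice
top = finder.nbest(measures.pmi, 5)  # up to 5 trigrams ranked by PMI
print(top)
```

With this toy input only one trigram, (지방, 자치, 단체), occurs twice, so it is the only one that survives the filter. On real text, `tokens` would be replaced by the words or (word, tag) pairs from a tagger such as Kkma.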
#! /usr/bin/python2.7
# -*- coding: utf-8 -*-
from konlpy.tag import Kkma
from konlpy.corpus import kolaw
from konlpy.utils import pprint
from nltk import collocations
measures = collocations.BigramAssocMeasures()
doc = kolaw.open('constitution.txt').read()
print('\nCollocations among tagged words:')
tagged_words = Kkma().pos(doc)
finder = collocations.BigramCollocationFinder.from_words(tagged_words)
pprint(finder.nbest(measures.pmi, 10)) # top 10 n-grams with highest PMI
print('\nCollocations among words:')
words = [w for w, t in tagged_words]
ignored_words = [u'안녕']
finder = collocations.BigramCollocationFinder.from_words(words)
finder.apply_word_filter(lambda w: len(w) < 2 or w in ignored_words)
finder.apply_freq_filter(3) # only bigrams that appear 3+ times
pprint(finder.nbest(measures.pmi, 10))
print('\nCollocations among tags:')
tags = [t for w, t in tagged_words]
finder = collocations.BigramCollocationFinder.from_words(tags)
pprint(finder.nbest(measures.pmi, 5))
Output:
Collocations among tagged words:
[((가부, NNG), (동수, NNG)),
 ((강제, NNG), (노역, NNG)),
 ((경자, NNG), (유전, NNG)),
 ((고, ECS), (채취, NNG)),
 ((공무, NNG), (담임, NNG)),
 ((공중, NNG), (도덕, NNG)),
 ((과반, NNG), (수가, NNG)),
 ((교전, NNG), (상태, NNG)),
 ((그러, VV), (나, ECE)),
 ((기본적, NNG), (인권, NNG))]

Collocations among words:
[(현행, 범인),
 (형의, 선고),
 (내부, 규율),
 (정치적, 중립성),
 (누구, 든지),
 (회계, 연도),
 (지체, 없이),
 (평화적, 통일),
 (형사, 피고인),
 (지방, 자치)]

Collocations among tags:
[(XR, XSA),
 (JKC, VCN),
 (VCN, ECD),
 (ECD, VX),
 (ECD, VXV)]
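The rankings above all rely on PMI (pointwise mutual information). As a rough illustration of what that score rewards, the sketch below computes a bigram PMI by hand using only the standard library. It assumes the usual definition, log2(count(a, b) * N / (count(a) * count(b))), which is believed to match what NLTK computes for bigrams but is not NLTK's own code. The function name `bigram_pmi`, the `min_freq` parameter, and the token list are all hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_freq=1):
    """Return {(a, b): PMI} for adjacent pairs seen at least min_freq times."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_freq:
            continue  # plays the role of apply_freq_filter in this sketch
        scores[(a, b)] = math.log2(count * n / (unigrams[a] * unigrams[b]))
    return scores

# Made-up tokens: one pair repeated three times, one pair seen only once.
tokens = ['회계', '연도', '의', '회계', '연도', '는',
          '회계', '연도', '를', '가부', '동수']

unfiltered = bigram_pmi(tokens)
filtered = bigram_pmi(tokens, min_freq=3)
print(unfiltered[('가부', '동수')] > unfiltered[('회계', '연도')])
print(sorted(filtered))
```

In this toy data the pair that occurs once outscores the pair that occurs three times, because PMI is highest when two words appear only with each other. That tendency to favor rare pairs is a common reason to apply a frequency filter, as the script does with `apply_freq_filter(3)` before ranking words.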