Finding collocations

We can find collocations with the help of NLTK.

To find trigram (three-word) collocations, replace BigramAssocMeasures with TrigramAssocMeasures and BigramCollocationFinder with TrigramCollocationFinder.
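As a rough illustration of that swap, the sketch below runs the trigram classes on a small, hand-tokenized word list. The word list is hypothetical sample data, not KoNLPy output, so the snippet needs only NLTK; in practice the tokens would come from a tagger, as in the full bigram script that follows.

```python
from nltk import collocations

# Hypothetical, hand-tokenized sample (an assumption for illustration only).
# With KoNLPy, the tokens would come from a tagger such as Kkma instead.
words = [
    '지방', '자치', '단체', '는', '법률', '로', '정한다',
    '지방', '자치', '단체', '의', '장',
    '지방', '자치', '단체', '에', '의회', '를', '둔다',
]

# Same pattern as the bigram case, with the Trigram* classes swapped in.
measures = collocations.TrigramAssocMeasures()
finder = collocations.TrigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # keep only trigrams that appear 2+ times
top_trigrams = finder.nbest(measures.pmi, 3)  # up to 3 trigrams, ranked by PMI
print(top_trigrams)
```

Only one trigram in this sample occurs more than once, so it is the only one that survives the frequency filter.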

#! /usr/bin/python2.7
# -*- coding: utf-8 -*-

from konlpy.tag import Kkma
from konlpy.corpus import kolaw
from konlpy.utils import pprint
from nltk import collocations


measures = collocations.BigramAssocMeasures()
doc = kolaw.open('constitution.txt').read()

print('\nCollocations among tagged words:')
tagged_words = Kkma().pos(doc)
finder = collocations.BigramCollocationFinder.from_words(tagged_words)
pprint(finder.nbest(measures.pmi, 10)) # top 10 n-grams with highest PMI

print('\nCollocations among words:')
words = [w for w, t in tagged_words]
ignored_words = [u'안녕']
finder = collocations.BigramCollocationFinder.from_words(words)
finder.apply_word_filter(lambda w: len(w) < 2 or w in ignored_words)
finder.apply_freq_filter(3) # only bigrams that appear 3+ times
pprint(finder.nbest(measures.pmi, 10))

print('\nCollocations among tags:')
tags = [t for w, t in tagged_words]
finder = collocations.BigramCollocationFinder.from_words(tags)
pprint(finder.nbest(measures.pmi, 5))
  • Output:

    Collocations among tagged words:
    [((가부, NNG), (동수, NNG)),
     ((강제, NNG), (노역, NNG)),
     ((경자, NNG), (유전, NNG)),
     ((고, ECS), (채취, NNG)),
     ((공무, NNG), (담임, NNG)),
     ((공중, NNG), (도덕, NNG)),
     ((과반, NNG), (수가, NNG)),
     ((교전, NNG), (상태, NNG)),
     ((그러, VV), (나, ECE)),
     ((기본적, NNG), (인권, NNG))]
    
    Collocations among words:
    [(현행, 범인),
     (형의, 선고),
     (내부, 규율),
     (정치적, 중립성),
     (누구, 든지),
     (회계, 연도),
     (지체, 없이),
     (평화적, 통일),
     (형사, 피고인),
     (지방, 자치)]
    
    Collocations among tags:
    [(XR, XSA),
     (JKC, VCN),
     (VCN, ECD),
     (ECD, VX),
     (ECD, VXV)]
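
The rankings above are driven by pointwise mutual information (PMI). To make the score less of a black box, here is a minimal pure-Python sketch of the textbook bigram PMI, log2(count(w1, w2) * N / (count(w1) * count(w2))), where N is the number of tokens. The function name `bigram_pmi`, its `min_freq` parameter, and the sample tokens are assumptions made for this illustration; this is not NLTK's implementation, although it is intended to follow the same definition.

```python
import math
from collections import Counter


def bigram_pmi(words, min_freq=1):
    """Score adjacent word pairs by PMI (log base 2).

    Hypothetical helper for illustration; not part of NLTK or KoNLPy.
    """
    unigrams = Counter(words)                  # count(w)
    bigrams = Counter(zip(words, words[1:]))   # count(w1, w2) for adjacent pairs
    total = len(words)                         # N, the number of tokens
    scores = {}
    for (w1, w2), n in bigrams.items():
        if n < min_freq:
            continue  # same role as finder.apply_freq_filter(min_freq)
        scores[(w1, w2)] = math.log2(n * total / (unigrams[w1] * unigrams[w2]))
    return scores


# Hypothetical sample tokens.
words = ['지방', '자치', '지방', '자치', '회계', '연도']
scores = bigram_pmi(words)
filtered = bigram_pmi(words, min_freq=2)

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this sample the pair that occurs only once (회계, 연도) scores higher than the pair that occurs twice (지방, 자치), because PMI rewards words that rarely appear apart. That is one reason the script applies a frequency filter before ranking the word bigrams.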