Chunking

After tagging each token of a sentence with its part of speech, we can group the tokens into higher-level multi-token sequences, or “chunks”.

Here we demonstrate an easy way to chunk a sentence and find noun, verb, and adjective phrases in Korean text using nltk.chunk.regexp.RegexpParser.

#! /usr/bin/python2.7
# -*- coding: utf-8 -*-

import konlpy
import nltk

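# POS tag a sentence; pos() returns a list of (word, tag) pairs
# (note: KoNLPy 0.5.0 and later rename the Twitter class to Okt)
sentence = u'만 6세 이하의 초등학교 취학 전 자녀를 양육하기 위해서는'
words = konlpy.tag.Twitter().pos(sentence)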

# Define a chunk grammar, or chunking rules, then chunk
grammar = """
NP: {<N.*>*<Suffix>?}   # Noun phrase
VP: {<V.*>*}            # Verb phrase
AP: {<A.*>*}            # Adjective phrase
"""
parser = nltk.RegexpParser(grammar)
chunks = parser.parse(words)
print("# Print whole tree")
print(chunks.pformat())  # pformat() returns the string; pprint() prints it itself and returns None

print("\n# Print noun phrases only")
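for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        # Join the words of the (word, tag) pairs in this noun phrase
        print(' '.join(word for word, tag in subtree))
        # Use pformat() here: pprint() prints by itself and returns None
        print(subtree.pformat())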

# Display the chunk tree
chunks.draw()

The chunk grammar defined above has three rules for extracting phrases from our sentence. The first rule extracts noun phrases (NP): the chunker finds a series of tokens whose tags begin with N (which covers Number as well as Noun), optionally followed by a Suffix. (Note that these rules can be modified for your purpose, and that they should differ for each morphological analyzer, since each analyzer uses its own tag set.) The two remaining rules define verb phrases (VP) and adjective phrases (AP) in the same way.
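To see how the rules behave without running a morphological analyzer, the sketch below applies a slightly modified grammar to hand-written (word, tag) pairs. It assumes Python 3 and NLTK 3. The token list is copied from the tags shown in this section rather than produced by a tagger, the `+` quantifier (at least one matching tag) is a variation on the `*` used above, and `extract_phrases` is a hypothetical helper, not part of NLTK or KoNLPy.

```python
import nltk

# Hand-written (word, tag) pairs standing in for a tagger's output.
tagged = [
    ('만', 'Noun'), ('6', 'Number'), ('세', 'Noun'), ('이하', 'Noun'),
    ('의', 'Josa'),
    ('초등학교', 'Noun'), ('취학', 'Noun'), ('전', 'Noun'), ('자녀', 'Noun'),
    ('를', 'Josa'),
    ('양육', 'Noun'), ('하기', 'Verb'), ('위해서', 'Verb'),
    ('는', 'Eomi'),
]

# Variant of the grammar above: '+' requires at least one matching tag.
grammar = """
NP: {<N.*>+<Suffix>?}   # Noun phrase
VP: {<V.*>+}            # Verb phrase
AP: {<A.*>+}            # Adjective phrase
"""
chunks = nltk.RegexpParser(grammar).parse(tagged)


def extract_phrases(tree, label):
    """Return the space-joined words of every subtree with the given label."""
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == label)]


print(extract_phrases(chunks, 'NP'))
print(extract_phrases(chunks, 'VP'))
```

Collecting phrases into plain lists like this can be convenient when the chunks feed a later processing step instead of being printed.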

The result is a tree, which we can print on the console, or display graphically as follows.

  • Console:

    # Print whole tree
    (S
      (NP 만/Noun 6/Number 세/Noun 이하/Noun)
      의/Josa
      (NP 초등학교/Noun 취학/Noun 전/Noun 자녀/Noun)
      를/Josa
      (NP 양육/Noun)
      (VP 하기/Verb 위해서/Verb)
      는/Eomi)
    
    # Print noun phrases only
    만 6 세 이하
    (NP 만/Noun 6/Number 세/Noun 이하/Noun)
    초등학교 취학 전 자녀
    (NP 초등학교/Noun 취학/Noun 전/Noun 자녀/Noun)
    양육
    (NP 양육/Noun)
    
  • Graphical display (chunking.png):
    ../../_images/chunking.png
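Since chunks.draw() opens a GUI window, a text-only view of the same tree can be useful on machines without a display. One option, sketched here under the assumption of Python 3 and NLTK 3, is to flatten the tree into IOB chunk tags with nltk.chunk.tree2conlltags. The (word, tag) pairs are hand-written to match the tags shown above, not produced by a tagger.

```python
import nltk
from nltk.chunk import tree2conlltags

# Hand-written (word, tag) pairs for the last part of the example sentence.
tagged = [
    ('자녀', 'Noun'), ('를', 'Josa'), ('양육', 'Noun'),
    ('하기', 'Verb'), ('위해서', 'Verb'), ('는', 'Eomi'),
]

grammar = """
NP: {<N.*>+<Suffix>?}   # Noun phrase
VP: {<V.*>+}            # Verb phrase
"""
chunks = nltk.RegexpParser(grammar).parse(tagged)

# Each token becomes (word, POS tag, IOB chunk tag):
# B-XX opens a chunk, I-XX continues it, O marks tokens outside any chunk.
iob = tree2conlltags(chunks)
for word, tag, chunk_tag in iob:
    print(word, tag, chunk_tag)
```

The flat triples are also easy to write to a file or compare across different grammars.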