The following corpora are currently available:
kolaw: Korean law corpus.
kobill: Korean National Assembly bill corpus. The file ID corresponds to the bill number.
- 1809890.txt through 1809899.txt
For more detailed usage of the corpora, see the corpus Package.
>>> from konlpy.corpus import kolaw
>>> c = kolaw.open('constitution.txt').read()
>>> print(c[:10])
대한민국 헌법 유구한 역사와
>>> from konlpy.corpus import kobill
>>> d = kobill.open('1809890.txt').read()
>>> print(d[:15])
지방공무원법 일부개정법률안
Hannanum system dictionary: a dictionary created with the KAIST corpus (4.7MB).
Part of this file is shown below:
...
나라경제 ncn
나라기획 nqq
나라기획회장 ncn
나라꽃 ncn
나라님 ncn
나라도둑 ncn
나라따르 pvg
나라링링프로덕션 ncn
나라말 ncn
나라망신 ncn
나라박물관 ncn
나라발전 ncpa
나라별 ncn
나라부동산 nqq
나라사랑 ncn
나라살림 ncpa
나라시 nqq
나라시마 ncn
...
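Each entry in the excerpt above pairs a word with a part-of-speech tag. The following is a minimal sketch of loading such entries into a Python dict. It assumes one whitespace-separated `word tag` pair per line, which may not match the real file exactly. The function name `parse_hannanum_entries` and the sample text are illustrative and not part of KoNLPy.

```python
def parse_hannanum_entries(text):
    """Parse 'word tag' lines into a {word: tag} dict (assumed layout)."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank, elided ("..."), or malformed lines
        word, tag = parts
        entries[word] = tag
    return entries


# A few entries copied from the excerpt above.
sample = """\
...
나라경제 ncn
나라기획 nqq
나라따르 pvg
나라발전 ncpa
...
"""

entries = parse_hannanum_entries(sample)
print(entries["나라따르"])
```

A dict keeps lookups simple. If the real file lists a word more than once with different tags, a mapping from word to a list of tags would be the safer choice.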
You can add your own terms, modify
Kkma system dictionary: a dictionary created with the Sejong corpus (32MB).
It is included within the Kkma
so to see the dictionary files, check out the Kkma mirror.
kcc.dic is shown below:
아니/IC
후우/IC
그래서/MAC
그러나/MAC
그러니까/MAC
그러면/MAC
그러므로/MAC
그런데/MAC
그리고/MAC
따라서/MAC
하지만/MAC
...
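The `word/TAG` layout in the excerpt above can be split on the last slash. The following is a rough sketch that groups such entries by tag, assuming one entry per line as in the excerpt. The helper name `parse_kkma_entries` is hypothetical and does not come from Kkma or KoNLPy.

```python
from collections import defaultdict


def parse_kkma_entries(text):
    """Return (word, tag) pairs from 'word/TAG' lines (assumed layout)."""
    pairs = []
    for line in text.splitlines():
        # rpartition splits on the last slash, so a word that itself
        # contains a slash would stay intact.
        word, sep, tag = line.strip().rpartition("/")
        if not sep or not word or not tag:
            continue  # skip blank or elided ("...") lines
        pairs.append((word, tag))
    return pairs


# Entries copied from the kcc.dic excerpt above.
sample = """\
아니/IC
후우/IC
그래서/MAC
그러나/MAC
...
"""

by_tag = defaultdict(list)
for word, tag in parse_kkma_entries(sample):
    by_tag[tag].append(word)

print(dict(by_tag))
```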
Mecab system dictionary: a CSV-formatted dictionary created with the Sejong corpus (346MB).
The compiled version is located at
/usr/local/lib/mecab/dic/mecab-ko-dic (or the path you assigned during installation),
and you can see the original files in the source code.
CoinedWord.csv is shown below:
가오티,0,0,0,NNG,*,F,가오티,*,*,*,*,*
갑툭튀,0,0,0,NNG,*,F,갑툭튀,*,*,*,*,*
강퇴,0,0,0,NNG,*,F,강퇴,*,*,*,*,*
개드립,0,0,0,NNG,*,T,개드립,*,*,*,*,*
갠소,0,0,0,NNG,*,F,갠소,*,*,*,*,*
고퀄,0,0,0,NNG,*,T,고퀄,*,*,*,*,*
광삭,0,0,0,NNG,*,T,광삭,*,*,*,*,*
광탈,0,0,0,NNG,*,T,광탈,*,*,*,*,*
굉천,0,0,0,NNG,*,T,굉천,*,*,*,*,*
국을,0,0,0,NNG,*,T,국을,*,*,*,*,*
귀요미,0,0,0,NNG,*,F,귀요미,*,*,*,*,*
...
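The rows above can be read with Python's standard `csv` module. The sketch below makes two assumptions inferred from the excerpt rather than taken from a format specification: the fifth column is the part-of-speech tag, and the seventh column (`T`/`F`) marks whether the word ends in a final consonant. The helper `has_final_consonant` is illustrative. It recomputes that flag from the Unicode layout of precomposed Hangul syllables so it can be compared with the stored value.

```python
import csv
import io


def has_final_consonant(syllable):
    """True if a precomposed Hangul syllable ends in a final consonant."""
    code = ord(syllable) - 0xAC00  # offset within the Hangul Syllables block
    if not 0 <= code < 11172:
        raise ValueError("not a Hangul syllable: %r" % syllable)
    # Each initial/medial pair spans 28 code points; offset 0 means no final.
    return code % 28 != 0


# Rows copied from the CoinedWord.csv excerpt above.
sample = """\
가오티,0,0,0,NNG,*,F,가오티,*,*,*,*,*
개드립,0,0,0,NNG,*,T,개드립,*,*,*,*,*
고퀄,0,0,0,NNG,*,T,고퀄,*,*,*,*,*
"""

rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    surface, pos_tag, flag = row[0], row[4], row[6]
    computed = "T" if has_final_consonant(surface[-1]) else "F"
    print(surface, pos_tag, flag, computed)
```

A check like this can help catch a wrong flag when writing new entries by hand, provided the column interpretation assumed here holds for the dictionary version in use.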
To add your own terms, see here.
You can add new words to either the system dictionary or a user dictionary. The two choices differ slightly:
- Adding to the system dictionary: When dictionary updates are infrequent, or when you do not want to slow down analysis.
- Adding to the user dictionary: When dictionary updates are frequent, when you do not have