While reading the official documentation for NLTK (Natural Language Toolkit), I tried extracting the words that appear most frequently in a sample text. In this post, I display the three most frequent words.
Development
- Python
- NLTK
Install NLTK
$ pip install nltk
Extract High-frequency words
Let the coding begin. First, download punkt and averaged_perceptron_tagger, which are required for word tokenization and part-of-speech tagging. Next, read a sample text and tokenize it into words. Then remove everything that is not a noun from the result. Finally, get the most frequent words.
Download
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
Import nltk, and then download punkt and averaged_perceptron_tagger. Once they are downloaded in the environment, you don't have to do it again.
Tokenize the text into words
raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
tokens_l = [w.lower() for w in tokens]
Prepare an essay or other long text as sample.txt. After reading it, tokenize it into words. Then convert uppercase to lowercase so that words differing only in case are recognized as the same.
Extract only nouns
pos = nltk.pos_tag(tokens_l)                   # tag each token with its part of speech
only_nn = [x for (x, y) in pos if y == 'NN']   # keep singular common nouns only
freq = nltk.FreqDist(only_nn)                  # count occurrences of each noun
nltk.pos_tag tags each token with its part of speech, and the list comprehension keeps only the tokens tagged NN (singular common nouns). FreqDist then counts how often each remaining word occurs.
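To see what FreqDist does on its own, here is a minimal demo on a made-up word list (no corpus downloads needed, since FreqDist behaves like collections.Counter):

```python
import nltk

# Hypothetical word list standing in for the noun-only output above
words = ['cat', 'dog', 'cat', 'bird', 'cat', 'dog']

freq = nltk.FreqDist(words)   # counts each distinct word
print(freq['cat'])            # 3
print(freq.most_common(2))    # [('cat', 3), ('dog', 2)]
```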
Get the three most frequent words
print(freq.most_common(3))
After counting word frequencies, you can get the top three with most_common(3).