I've been using Python 3.8.2.
Dependencies: pandas and matplotlib (for matplotlib.pyplot).
This program reads a large text file, counts the unique words and how often each one occurs, and creates a pie chart of the most frequent words. The pie chart is saved to a file.
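The counting step described above can be sketched in a few lines with only the standard library. This is a minimal illustration, not the program itself: the helper name `count_words` and the sample lines are assumptions made up for the example, and `collections.Counter` stands in for the plain dict used in the full listing.

```python
from collections import Counter


def count_words(lines):
    """Count cleaned, lower-cased words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for raw in line.split():
            # keep alphabetic characters only, so "earth." becomes "earth"
            word = "".join(c for c in raw if c.isalpha()).lower()
            if word:
                counts[word] += 1
    return counts


# hypothetical sample input, standing in for the lines of a text file
sample = ["In the beginning God created the heaven", "and the earth."]
counts = count_words(sample)

total_words = sum(counts.values())   # every occurrence
unique_words = len(counts)           # distinct words
top_words = counts.most_common(3)    # most frequent words first
```

Passing an open file object instead of `sample` should work the same way, since a file iterates line by line.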
Text files used here are downloaded from,
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-

# text file word count, simple stats and graphs
# crysanthus@gmail.com
# 15/5/2020

import pandas as pd
import matplotlib.pyplot as plt

# some global vars
text_file_path = '../db/Bible-KJV.txt'  # https://www.gutenberg.org/ebooks/10.txt.utf-8
title = text_file_path[(text_file_path.rfind('/')+1):-4]  # only the file name without ext
graph_file_path = f'../graphs/{title}.png'
top_words = 30


# function to read text file and create graph
def text_count_graph_by_word():

    word_list = {}
    nos_words = 0
    nos_unique_words = 0

    # open text file; the with block closes it even if an error occurs
    with open(text_file_path, 'r', encoding='utf-8') as fp:
        for line in fp:
            # split on any whitespace
            for w in line.split():
                # remove all the unwanted chars. Keep the words
                w = "".join(c for c in w if c.isalpha()).lower()
                if not w:
                    continue

                # count every word, not only the repeated ones
                nos_words += 1

                # create word list
                if w in word_list:
                    word_list[w] += 1
                    continue

                word_list[w] = 1
                nos_unique_words += 1

    # create a Pandas df using dict
    df = pd.DataFrame.from_dict(word_list, orient='index', columns=['word count'])

    # sort df and get only the top words
    df = df.sort_values('word count', ascending=False)[:top_words]

    # create the pie chart/graph
    ax = df.plot.pie(y='word count', figsize=(8, 8), legend=None)
    ax.set_ylabel('')  # remove left side label

    plt.suptitle(f'{title} - Top {top_words} word occurrence')
    plt.title(f'{nos_words} Total words - {nos_unique_words} Unique words', fontsize=8)

    plt.savefig(graph_file_path)
    plt.clf()


if __name__ == '__main__':

    text_count_graph_by_word()
```
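The title in the listing is cut out of the path by slicing, which assumes a '/' separator and a three-letter extension. As an alternative sketch, `pathlib` can derive the same name without those assumptions. The helper `graph_path_for` and its `graph_dir` parameter are hypothetical names introduced only for this example.

```python
from pathlib import Path


def graph_path_for(text_file_path, graph_dir="../graphs"):
    """Return (title, graph file path) for a given text file path."""
    # stem is the final path component without its last extension
    title = Path(text_file_path).stem
    return title, f"{graph_dir}/{title}.png"


title, graph_file_path = graph_path_for("../db/Bible-KJV.txt")
```

With this, a file such as `notes.text` would still give the title `notes`, where the slice `[:-4]` would leave a trailing dot.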
The results are:
Bible, King James Version
Pride and Prejudice
War and Peace