Thursday, May 14, 2020

Text File Word Count and Simple Statistics

This Python program processes a given text file and displays some simple, interesting statistics.

I'm using Python 3.8.2.
Dependencies: pandas and matplotlib (matplotlib.pyplot).

The program reads a large text file line by line, counts the unique words and their occurrences, and creates a pie chart of the most frequent words. The pie chart is saved to a PNG file.
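To make the counting step concrete, here is a minimal sketch on two sample lines instead of a whole book. It uses collections.Counter from the standard library rather than the plain dict in the program, and the names count_words and SAMPLE_LINES are made up for this illustration only.

```python
from collections import Counter

# Two sample lines standing in for a real text file (illustrative only).
SAMPLE_LINES = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void;",
]


def count_words(lines):
    """Count cleaned, lower-cased words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for raw in line.split():
            # keep letters only, so "earth." and "Earth" both count as "earth"
            word = "".join(c for c in raw if c.isalpha()).lower()
            if word:
                counts[word] += 1
    return counts


counts = count_words(SAMPLE_LINES)
total_words = sum(counts.values())   # every occurrence
unique_words = len(counts)           # distinct words
top_three = counts.most_common(3)    # the "top word list"

print(total_words, unique_words, top_three)
# 18 12 [('the', 4), ('and', 3), ('earth', 2)]
```

Because count_words accepts any iterable of lines, an open file object could be passed in the same way, so the whole file never has to be held in memory.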

The text files used here were downloaded from Project Gutenberg:
https://www.gutenberg.org/wiki/Main_Page

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# text file word count, simple stats and graphs
# crysanthus@gmail.com
# 15/5/2020

import pandas as pd
import matplotlib.pyplot as plt


# some global vars
text_file_path = '../db/Bible-KJV.txt'  # https://www.gutenberg.org/ebooks/10.txt.utf-8
title = text_file_path[(text_file_path.rfind('/')+1):-4]  # only the file name without ext 
graph_file_path = f'../graphs/{title}.png'
top_words = 30


# function to read text file and create graph
def text_count_graph_by_word():

    word_list = {}
    nos_words = 0
    nos_unique_words = 0

    # open text file; 'with' closes it even if an error occurs
    with open(text_file_path, 'r', encoding='utf-8') as fp:
        for line in fp:

            # split on any whitespace (spaces, tabs, newlines)
            for w in line.split():

                # remove all the unwanted chars. Keep the words
                w = "".join(c for c in w if c.isalpha()).lower()
                if not w:
                    continue

                # count every occurrence, repeated or not
                nos_words += 1

                # create word list
                if w in word_list:
                    word_list[w] += 1
                    continue

                word_list[w] = 1
                nos_unique_words += 1

    # create a Pandas df using dict
    df = pd.DataFrame.from_dict(word_list, orient='index', columns=['word count'])

    # sort df and get only the top words
    df = df.sort_values('word count', ascending=False)[:top_words]

    # create the pie chart/graph (plot.pie returns a matplotlib Axes)
    ax = df.plot.pie(y='word count', figsize=(8, 8), legend=False)
    ax.set_ylabel('')  # remove left side label

    plt.suptitle(f'{title} - Top {top_words} word occurrence')
    plt.title(f'{nos_words} Total words - {nos_unique_words} Unique words', fontsize=8)

    plt.savefig(graph_file_path)

    plt.clf()


if __name__ == '__main__':
    text_count_graph_by_word()
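One thing worth knowing about the cleaning rule in the program above is that it keeps only alphabetic characters, so apostrophes and hyphens are dropped and the parts of a word are joined together. The short sketch below applies the same expression to a few sample tokens; the helper name clean_word and the sample tokens are made up for this illustration.

```python
def clean_word(raw):
    """Same rule as the program: keep letters only, then lower-case."""
    return "".join(c for c in raw if c.isalpha()).lower()


# Sample tokens chosen to show the edge cases (illustrative only).
examples = ["LORD,", "don't", "well-known", "1:1"]
cleaned = {raw: clean_word(raw) for raw in examples}

for raw, word in cleaned.items():
    print(f"{raw!r} -> {word!r}")
# 'LORD,' -> 'lord'
# "don't" -> 'dont'
# 'well-known' -> 'wellknown'
# '1:1' -> ''
```

Tokens that clean down to an empty string, such as verse numbers, are skipped by the program and never counted. If contractions or hyphenated words matter for a particular text, the rule would need to be adjusted.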

The results are:

Bible, King James Version

Pride and Prejudice

War and Peace
