Crysanthus: Text File Word Count and Simple Statistics

This Python program processes a given text file and displays a few simple, interesting statistics about it.

I've been using Python 3.8.2.
Dependencies: pandas and matplotlib (matplotlib.pyplot).

The program reads a text file line by line, so it can handle very large files. It counts the unique words and their occurrences, then draws a pie chart of the most frequent words and saves the chart to a file.
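
The counting step can be tried on its own, without a file or a chart. This is only a minimal sketch, not the program itself: it uses `collections.Counter` from the standard library instead of a plain dict, and the helper name `count_words` and the sample lines are made up for this example.

```python
from collections import Counter


def count_words(lines):
    """Count cleaned, lower-cased words from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # split on any whitespace (spaces, tabs, newlines)
        for raw in line.split():
            # keep only alphabetic characters, then lower-case the word
            word = "".join(c for c in raw if c.isalpha()).lower()
            if word:
                counts[word] += 1
    return counts


# sample lines invented for this illustration
sample = ["In the beginning God created", "the heaven and the earth."]
counts = count_words(sample)

total_words = sum(counts.values())   # every occurrence of every word
unique_words = len(counts)           # number of distinct words
top_words = counts.most_common(3)    # highest counts first
```

Because `count_words` accepts any iterable of lines, an open file object can be passed in directly and the file is still read one line at a time.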

The text files used here were downloaded from Project Gutenberg:

https://www.gutenberg.org/wiki/Main_Page

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# text file word count, simple stats and graphs
# crysanthus@gmail.com
# 15/5/2020

import pandas as pd
import matplotlib.pyplot as plt


# some global vars
text_file_path = '../db/Bible-KJV.txt'  # https://www.gutenberg.org/ebooks/10.txt.utf-8
title = text_file_path[(text_file_path.rfind('/')+1):-4]  # only the file name without ext 
graph_file_path = f'../graphs/{title}.png'
top_words = 30


# function to read text file and create graph
def text_count_graph_by_word():

    word_list = {}
    nos_words = 0
    nos_unique_words = 0

    # open text file (the Project Gutenberg download is UTF-8)
    with open(text_file_path, 'r', encoding='utf-8') as fp:
        for line in fp:

            # split on any whitespace (spaces, tabs, newlines)
            for w in line.split():

                # remove all the unwanted chars. Keep the words
                w = "".join(c for c in w if c.isalpha()).lower()
                if not w:
                    continue

                # count every word, including its first occurrence
                nos_words += 1

                # create word list
                if w in word_list:
                    word_list[w] += 1
                    continue

                word_list[w] = 1
                nos_unique_words += 1

    # create a Pandas df using dict
    df = pd.DataFrame.from_dict(word_list, orient='index', columns=['word count'])

    # sort df and get only the top words
    df = df.sort_values('word count', ascending=False)[:top_words]

    # create the pie chart/graph
    ax = df.plot.pie(y='word count', figsize=(8, 8), legend=False)
    ax.set_ylabel('')  # remove left side label

    plt.suptitle(f'{title} - Top {top_words} word occurrence')
    plt.title(f'{nos_words} Total words - {nos_unique_words} Unique words', fontsize=8)

    plt.savefig(graph_file_path)

    plt.clf()


if __name__ == '__main__':
    text_count_graph_by_word()
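
The script handles one file per run, set in `text_file_path`. To chart several books, one option is to derive the title and the output path from each input path. This is a rough sketch, assuming the same `../db` and `../graphs` layout as the script; the helper name `paths_for` and the second file name are hypothetical.

```python
from pathlib import Path


def paths_for(text_file_path, graph_dir="../graphs"):
    """Return (title, graph_file_path) for one input text file."""
    # file name without its extension, whatever the extension length
    title = Path(text_file_path).stem
    graph_file_path = f"{graph_dir}/{title}.png"
    return title, graph_file_path


# hypothetical file names, for illustration only
for path in ["../db/Bible-KJV.txt", "../db/Pride-and-Prejudice.txt"]:
    title, graph_file_path = paths_for(path)
    print(title, "->", graph_file_path)
```

The counting function would then need to take the input path, title and graph path as parameters instead of reading the global variables.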

The results are:

Bible, King James Version

Pride and Prejudice

War and Peace

Crysanthus

Thursday, May 14, 2020
