<a href="https://colab.research.google.com/github/icculp/Learning-Bitcoin-from-the-Command-Line/blob/master/Chapter_word_counts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook counts the translatable words in each chapter, including the words in chapter links. It ignores code blocks, Markdown syntax characters, and other non-translatable tokens.

Run it in Colab by clicking the badge above. The results appear near the bottom of the notebook as a paginated table of word counts per chapter, with the total word count at the very bottom. A Markdown-friendly table is also written to Chapter_word_counts.md.

In [1]:
import os
import pandas as pd

In [2]:
!git clone https://github.com/BlockchainCommons/Learning-Bitcoin-from-the-Command-Line.git

Cloning into 'Learning-Bitcoin-from-the-Command-Line'...
remote: Enumerating objects: 6705, done.
remote: Counting objects: 100% (309/309), done.
remote: Compressing objects: 100% (192/192), done.
remote: Total 6705 (delta 147), reused 267 (delta 112), pack-reused 6396
Receiving objects: 100% (6705/6705), 7.55 MiB | 25.61 MiB/s, done.
Resolving deltas: 100% (4106/4106), done.


In [3]:
def count_words():
    """ Counts words ignoring code blocks and digits
        To test for quality:
            lines 16-18 to test a single chapter
            uncomment line 31 to view skipped code block sections
            uncomment line 55 to view rejected word tokens (not including code blocks)
            uncomment line 57 to view accepted word tokens
            but not at the same time, and best one chapter at a time
    """
    counts = []
    repo_path = '/content/Learning-Bitcoin-from-the-Command-Line/'
    for chapter in os.listdir(repo_path):
        ''' uncomment lines 16-18 to test a single chapter, replacing
            ch_name with the name you want to test
        '''
        #ch_name = '04_2__Interlude_Using_JQ.md'
        #if chapter != ch_name:
        #    continue
        ignore_list = ['bitcoin.conf-annotated.txt', 'TODO.md', 'TODO-30.md']
        if chapter in ignore_list or\
                not chapter.endswith('.md'):
            continue
        count = 0
        flag = 0  # ignores words between code markdown ```
        with open(repo_path + chapter) as ch:
            for line in ch.readlines():
                if flag:
                    if line.startswith('```'):  # a closing fence must start the line
                        flag = 0
                        continue
                    # print(line)  # view uncounted code blocks
                    continue
                if '```' in line:
                    flag = 1
                    continue
                for word in line.split():
                    if '.md' in word:  # indicates trailing link with chapter name;
                        ch_link_tokens = word.split('_')
                        if ']' in word:  # counts last word of trailing link before chapter name
                            count += 1
                        link_tokens_count = len(ch_link_tokens[2:])  # ignoring chapter numbers
                        count += link_tokens_count
                        # print(word, '[TOK]', link_tokens_count, end='[SEPTOK]')
                        continue
                    ignore = ['*', '**', '#', '##', '###', '####',
                              '-', '—', '>', '`', '/', '&', '|', '~']
                    if any(char.isdigit() for char in word) or\
                            word in ignore or\
                            '`' in word or\
                            '~/' in word or\
                            '/.' in word or\
                            '|-' in word or\
                            (word[0] == ':' and word[-1] == ':') or\
                            (word[0] == '"' and word[-1] == '"'):
                        # print(word)  # , end='[SEP]')  # view rejected tokens
                        continue
                    # print(word, count)  # , end='[SEP]')  # view accepted tokens
                    count += 1
        counts.append((chapter, count))
        # print(chapter, count)
    return pd.DataFrame(counts, columns=['Chapter', 'Word Count'])
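
The code-block skipping inside `count_words()` can be looked at in isolation. The sketch below applies the same flag logic to an in-memory list of lines; `strip_fenced_blocks` and the sample text are hypothetical and only illustrate the idea, they are not part of the notebook.

```python
def strip_fenced_blocks(lines):
    """Return only the lines that sit outside ``` fenced code blocks."""
    kept = []
    in_code = False  # plays the role of `flag` in count_words()
    for line in lines:
        if in_code:
            # A closing fence only counts when it starts the line.
            if line.startswith('```'):
                in_code = False
            continue  # skip code lines and the closing fence itself
        if '```' in line:
            in_code = True  # opening fence: skip this line too
            continue
        kept.append(line)
    return kept


# Hypothetical sample resembling a chapter excerpt.
sample = [
    "Run the following command:",
    "```",
    "$ bitcoin-cli getblockcount",
    "```",
    "Then check the result.",
]
prose = strip_fenced_blocks(sample)
print(prose)
```

As in the original loop, an opening fence is detected anywhere in a line, so a prose line that happens to contain three backticks would also start a skipped region.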

In [4]:
chapter_word_counts = count_words()
# accepted or rejected tokens print below if line 57 or line 55 of count_words() is uncommented, respectively
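
To preview which tokens would be rejected without running the whole repository, the per-word test can be pulled into a small predicate. This is a minimal sketch that mirrors the conditions in `count_words()` but leaves out the `.md` chapter-link branch; `is_rejected`, `MARKDOWN_ONLY`, and the sample line are hypothetical names and data, and the comments on what each rule targets are my reading of the rules rather than something the notebook states.

```python
# Standalone Markdown symbols; '\u2014' is the em dash from the original list.
MARKDOWN_ONLY = {'*', '**', '#', '##', '###', '####',
                 '-', '\u2014', '>', '`', '/', '&', '|', '~'}


def is_rejected(word):
    """True when a whitespace-delimited, non-empty token would not be counted."""
    return (
        any(char.isdigit() for char in word)      # anything containing a digit
        or word in MARKDOWN_ONLY
        or '`' in word                            # presumably inline code
        or '~/' in word                           # presumably home-relative paths
        or '/.' in word                           # presumably dot paths
        or '|-' in word                           # presumably table separator rows
        or (word[0] == ':' and word[-1] == ':')   # tokens wrapped in colons
        or (word[0] == '"' and word[-1] == '"')   # tokens wrapped in double quotes
    )


line = "## Use `bitcoin-cli` to check 3 blocks in ~/.bitcoin :warning:"
accepted = [w for w in line.split() if not is_rejected(w)]
print(accepted)
```

Tokens come from `str.split()`, so they are never empty and the `word[0]` and `word[-1]` lookups are safe.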

In [5]:
chapter_word_counts.sort_values(by=['Chapter'], inplace=True)
total_count_translatable = chapter_word_counts['Word Count'].sum()
chapter_word_counts.loc[len(chapter_word_counts.index)] = ['TOTAL', total_count_translatable] 

View in Colab for a paginated table.

In [6]:
from google.colab import data_table
data_table.DataTable(chapter_word_counts, include_index=False)

| Chapter | Word Count |
| --- | --- |
| 01_0_Introduction.md | 1144 |
| 01_1_Introducing_Bitcoin.md | 2735 |
| 02_0_Setting_Up_a_Bitcoin-Core_VPS.md | 226 |
| 02_1_Setting_Up_a_Bitcoin-Core_VPS_with_StackS... | 2746 |
| 02_2_Setting_Up_Bitcoin_Core_Other.md | 254 |
| ... | ... |
| CONTRIBUTING.md | 529 |
| LICENSE-CC-BY-4.0.md | 2716 |
| README.md | 1705 |
| TRANSLATING.md | 686 |


Convert the table to Markdown format and save it as 'Chapter_word_counts.md'.

In [7]:
from IPython.display import Markdown, display
from tabulate import tabulate


# borrowed from https://stackoverflow.com/questions/33181846/programmatically-convert-pandas-dataframe-to-markdown-table

def pandas_df_to_markdown_table(df):
    """Return df as a pipe-delimited Markdown table (header, --- row, data)."""
    fmt = ['---' for i in range(len(df.columns))]
    df_fmt = pd.DataFrame([fmt], columns=df.columns)
    df_formatted = pd.concat([df_fmt, df])
    return Markdown(df_formatted.to_csv(sep="|", index=False))

def df_to_markdown(df, y_index=False):
    """Alternative using tabulate; drops the index column unless y_index is True."""
    blob = tabulate(df, headers='keys', tablefmt='pipe')
    if not y_index:
        return '\n'.join(['| {}'.format(row.split('|', 2)[-1]) for row in blob.split('\n')])
    return blob

In [8]:
mkdt = pandas_df_to_markdown_table(chapter_word_counts)

with open('Chapter_word_counts.md', 'w') as m:
    m.write(str(mkdt.data))
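
The file written here is a pipe-delimited table: a header line, a `---` separator row, then one line per chapter. The sketch below reproduces that layout with the standard library only, which may help when checking the expected file contents without pandas. `rows_to_pipe_table` is a hypothetical helper, not part of the notebook, and the two rows are taken from the table output shown earlier.

```python
def rows_to_pipe_table(headers, rows):
    """Render headers and rows as pipe-delimited lines with a --- separator row."""
    lines = ['|'.join(headers), '|'.join('---' for _ in headers)]
    for row in rows:
        lines.append('|'.join(str(cell) for cell in row))
    return '\n'.join(lines) + '\n'


table = rows_to_pipe_table(
    ['Chapter', 'Word Count'],
    [('01_0_Introduction.md', 1144), ('README.md', 1705)],
)
print(table)
```

This version does no escaping, so it assumes no cell contains a `|` character.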

In [9]:
total_count_translatable

89069