Word counter for chat (Chinese)

https://editor.p5js.org/sh7361/sketches/qC0ks0-Vj

Inspired by the Annual Report 2013 by Nicholas Felton, I wanted to count how many times each word is said in my chat with a friend on WeChat (a Chinese messaging app).

“哈” is similar to “lol” in Chinese

Process

I came across a tool, memotrace.cn, that helps export chats as txt files. Here’s what the txt file looks like:

The exported file contains date, time, username and message.

Because Chinese, unlike English, does not separate words with spaces, I modified the Coding Train word counter example to count individual characters instead of space-separated words, and got a first result.

// Word Counting
// The Coding Train / Daniel Shiffman
// <https://thecodingtrain.com/CodingChallenges/040.1-wordcounts-p5.html>

let txt;
let counts = {};
let keys = [];

function preload() {
  txt = loadStrings('Freyaff.txt'); // chat file
}

function setup() {
  let allwords = txt.join("\n");

  // Split text into individual characters instead of words
  let tokens = allwords.split(''); 

  for (let i = 0; i < tokens.length; i++) {
    let char = tokens[i];
    // Keep only Chinese characters (CJK Unified Ideographs range; regex suggested by ChatGPT)
    if (char.match(/[\u4e00-\u9fff]/)) {
      if (counts[char] === undefined) {
        counts[char] = 1;
        keys.push(char);
      } else {
        counts[char] = counts[char] + 1;
      }
    }
  }

  keys.sort(compare);

  function compare(a, b) {
    var countA = counts[a];
    var countB = counts[b];
    return countB - countA;
  }

  for (let i = 0; i < keys.length; i++) {
    let key = keys[i];
    createDiv(key + " " + counts[key]);
  }

  noCanvas();
}
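As a side note on the character test above, here is a minimal sketch of an alternative check that uses a Unicode property escape instead of a hard-coded range. The function name `countHanCharacters` and the sample string are made up for illustration and are not part of the original sketch. It runs outside p5.js, for example in Node or the browser console.

```javascript
// Hypothetical helper (not from the original sketch): counts Han characters
// in a string and returns a Map from character to count.
function countHanCharacters(text) {
  // \p{Script=Han} matches any character in the Han script; it needs the "u" flag.
  const hanPattern = /\p{Script=Han}/u;
  const counts = new Map();

  // for...of iterates by code point, so rare characters stored as
  // surrogate pairs are not split in half (split('') would split them).
  for (const char of text) {
    if (hanPattern.test(char)) {
      counts.set(char, (counts.get(char) ?? 0) + 1);
    }
  }
  return counts;
}

// Made-up sample text, only to show the call.
const hanCounts = countHanCharacters("哈哈哈 ok 好的");
console.log([...hanCounts.entries()]);
```

Compared with the `\u4e00-\u9fff` range, the property escape also covers Han characters that sit in other Unicode blocks, at the cost of requiring a reasonably recent JavaScript engine.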

Upon reviewing the results, I realized that counting individual Chinese characters doesn't make sense, as their meanings often change when combined into multi-character words. So instead of counting characters, I used Intl.Segmenter to count words.

let txt;
let counts = {};
let keys = [];

function preload() {
  txt = loadStrings('Freyaff.txt');
}

function setup() {
  let allwords = txt.join("\n");

  // Use Intl.Segmenter to segment text into words
  const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
  const segments = segmenter.segment(allwords);
  
  //console.table(Array.from(segments));

  // Loop over the segments and count each word
  for (const segment of segments) {
    let word = segment.segment;

    // Keep segments that contain a Chinese character, or that have no digits and no non-word characters
    if (word.match(/[\u4e00-\u9fff]/) || (!word.match(/\d/) && !word.match(/\W/))) {
      if (counts[word] === undefined) {
        counts[word] = 1;
        keys.push(word);
      } else {
        counts[word] = counts[word] + 1;
      }
    }
  }

  keys.sort(compare);

  function compare(a, b) {
    var countA = counts[a];
    var countB = counts[b];
    return countB - countA;
  }

  for (let i = 0; i < keys.length; i++) {
    let key = keys[i];
    createDiv(key + " " + counts[key]);
  }

  noCanvas();
}

Output of console.table(Array.from(segments));
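The filtering in the word counter could also lean on the `isWordLike` flag that `Intl.Segmenter` attaches to each segment when the granularity is `'word'`. The following is a minimal sketch of that idea, not the code used in the sketch above. `countWords` and the sample sentence are hypothetical names and data chosen for illustration. Numbers are also reported as word-like, so digits are still filtered out separately.

```javascript
// Hypothetical helper (not from the original sketch): returns [word, count]
// pairs sorted by count, highest first.
function countWords(text, locale = "zh") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const counts = new Map();

  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for spaces and punctuation.
    if (!isWordLike) continue;
    // Numbers count as word-like, so drop anything containing a digit
    // (dates and times in the export would otherwise dominate the list).
    if (/\d/.test(segment)) continue;
    counts.set(segment, (counts.get(segment) ?? 0) + 1);
  }

  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// English sample so the result does not depend on the Chinese dictionary
// shipped with the JavaScript engine.
const ranked = countWords("hello world, hello again 2024");
console.log(ranked);
```

How Chinese text is split depends on the dictionary bundled with the browser or runtime, so the same chat may produce slightly different counts in different environments.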

Takeaways

There's potential to use the chat for creating more complex visualizations as both date and user information are included.
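As a rough sketch of where that could start, the snippet below counts messages per user per day. It assumes that every message begins with a header line of the form `YYYY-MM-DD HH:MM:SS username`, followed by the message text on the next lines. That layout is an assumption for illustration and may not match the real export, so `headerPattern` would need adjusting. The names `countMessagesPerDay`, `sampleLines`, "Alice" and "Bob" are all made up.

```javascript
// ASSUMPTION: header lines look like "2024-01-01 09:00:00 username".
// Adjust this pattern to the actual layout of the exported txt file.
const headerPattern = /^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.+)$/;

// Hypothetical helper: returns { date: { username: messageCount } }.
function countMessagesPerDay(lines) {
  const perDay = {};
  for (const line of lines) {
    const match = headerPattern.exec(line);
    if (!match) continue; // message body or blank line
    const date = match[1];
    const user = match[3];
    perDay[date] ??= {};
    perDay[date][user] = (perDay[date][user] ?? 0) + 1;
  }
  return perDay;
}

// Made-up sample data in the assumed layout.
const sampleLines = [
  "2024-01-01 09:00:00 Alice",
  "哈哈哈",
  "",
  "2024-01-01 09:01:12 Bob",
  "好的",
  "2024-01-02 10:30:00 Alice",
  "ok",
];
const perDay = countMessagesPerDay(sampleLines);
console.log(perDay);
```

In the p5.js sketch, the array returned by `loadStrings` could be passed in directly, and the per-day totals could then drive a timeline or a bar chart per user.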

Text segmentation accuracy might be improved by utilizing specialized libraries for Chinese word segmentation like Jieba (commonly used in Python).

I quite like the foreword by the Memotrace developer (translated into English):

I firmly believe that what is meaningful is not WeChat itself, but the profound stories hidden behind each chat box. In the future, everyone will have the companionship of AI, and your data will give it precious memories of your past. I hope that everyone has the right to preserve the traces of their life 👨‍👩‍👦👚🥗🏠️🚴🧋⛹️🛌🛀, rather than being forgotten 💀.
