
Brandon Skerritt

A tech expert who invents open source projects, writes, makes videos, and worked as a Monzo security engineer.

Rss preview of Blog of Brandon Skerritt

JL vs Yomitan for Japanese Learning

2025-11-13 19:10:45


Yomitan is a popular pop-up dictionary browser extension that people use to learn Japanese.

JL is an alternative desktop dictionary app for Windows.

Let's jump right into it.

Why Switch?

Yomitan is very slow

Yomitan takes around 6 seconds for me to check duplicates in my mining deck.


My mining deck is only 3500 cards, so if you have a larger deck I imagine it's terribly slow.

Secondly, if you take your Yomitan cursor and drag it across some Japanese text it tends to lag.

0ms scan delay btw 😄

That's 6 whole seconds to load Yomitan and run a duplication check.

🤔
"But my Yomitan isn't slow! Have you tried X?"

Yes. I've tried everything. Believe me. Yomitan is this slow on every browser I try, regardless of settings.

There is also an API difference: JL uses AnkiConnect's canAddNotes, which is deck specific, while Yomitan uses notesInfo, which is not deck specific.
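To make the canAddNotes side a bit more concrete, here is a rough sketch of what such a request body looks like. AnkiConnect takes JSON over HTTP (by default on port 8765), and each candidate note carries its own deckName, which is presumably what makes the check deck specific. The deck name and field name below are made up for illustration, and this is not JL's actual code.

```python
import json

ANKI_CONNECT_URL = "http://127.0.0.1:8765"  # AnkiConnect's default address


def build_can_add_notes(deck, model, fields):
    """Build a canAddNotes request body for one candidate note."""
    return {
        "action": "canAddNotes",
        "version": 6,
        "params": {
            "notes": [
                {"deckName": deck, "modelName": model, "fields": fields}
            ]
        },
    }


payload = build_can_add_notes(
    deck="Mining",                    # hypothetical deck name
    model="Lapis",                    # the card type used later in this post
    fields={"Expression": "飲み会"},  # hypothetical field name
)

# This is the JSON you would POST to ANKI_CONNECT_URL
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

AnkiConnect answers with one boolean per note, so a dictionary app can mark each word as addable or not without pulling full note data.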

Yomitan is browser only, and it corrupts - often.

You can only use Yomitan with a browser.

Yomitan relies on local storage in the browser.

This corrupts – often.

Small subset of people complaining about corruption

When Yomitan corrupts you have to reinstall all of your dictionaries and settings.

😔
This happens to me a lot, and every time it happens I spend 30+ minutes reinstalling all my dictionaries and settings.

It's good it's a browser extension as it can work on any platform, but it also comes with downsides.

JL

Firstly, let's talk about speed.

JL is really, really fast.

It's actually one of the main reasons you should use it.


Look at this gif. Absolutely no lag. Not to mention that it is technically showing more dictionaries and using a longer scan length (32) than my Yomitan does.

Also, duplicate detection is blazingly fast.

Do you see that small X next to the word? That indicates I can add it to Anki.

If it's red, it means it's a duplicate.

Rewatch the gif. Look at how fast that small red X appears again. Seriously.

This takes a minimum of 6 seconds in Yomitan for me btw.

It's so unbelievably fast I genuinely have no idea how they are doing this.

I was hesitant to believe they're using AnkiConnect at all. Maybe they're talking directly to the DB, which would explain why Yomitan is so slow in comparison?

But no, they're using AnkiConnect just like Yomitan does.

🙇
Thank you to Beangate, developer of GSM, for adding duplicate detection to JL.

In my newbie stages where I read super slowly and I only have 3.5k Anki cards, I have to look up multiple words a sentence.

With Yomitan taking 10+ seconds for every lookup and mine, this slows me down a lot.

Look at this:


Top is my average chars / hour using Yomitan.

Bottom is with JL.

Unironically I have a +700 chars / hour buff by just using another program, simply because it's faster.

Kanji Dicts

I really like this kanji dict:

The Best Beginner Kanji Dictionary
Over in TheMoeWay user road_to_redemption shared their Kanji dictionary: It has everything you need from a Kanji dict: * Meaning * Frequency * Top 3 common words with readings + translations * Readings + distributions * Components with keywords I really like this a lot, in particular the frequency + meaning + readings of the Kanji. Current

But in Yomitan it kinda sucks.

It can't be an official Kanji dict, so it has to be a word dict. This is because of the format of Kanji dicts in Yomitan.

But because it's a word dict, it has to compete with all the other Yomitan dicts.

If your friend is inviting you to 飲み会 and you highlight this with Yomitan, you will have to scroll really far to find the Kanji dict for 飲.

To fix this, you have to use profiles and switch profiles (per my blog post) every time you want to look up a kanji with this dict.

But in JL, you don't have to do that!


In mining mode, simply highlight the kanji you want to look at and click the Kanji dict and you get to see it instantly.

You can even look up words with jmdict, and then highlight the kanji in that definition to get specific Kanji definitions.

Being able to easily click what dict you want to see is so powerful. I believe this will make it easier for monolingual transitions too.

Have JP -> JP dicts show up first, then just use child windows and switch to JP -> EN if you need to using the tabs.

Overlay

You can kinda make an overlay in JL.


With this mode, the text looks like it is overlaid on the game, even though it isn't a true overlay. You can then "click through" the text itself and it will click the window behind it, allowing you to progress in the story.

You have to enable this with a hotkey:


This kinda breaks for me on some full screen visual novels.

Works great with GSM

JL works great with GSM.


Set the addresses to use port 55002 and set it to auto reconnect to WebSocket.

Done 🥳

If you have a JL window over your visual novel, GSM uses OBS Window Capture so it won't show up in your screenshots.

Does this replace GSM overlay?

Nah, not really. The GSM overlay places the text perfectly on top of the game. It's 100% natural. This is a black box over your visual novel.

You can also use GSM to OCR a game or visual novel, and display it in JL.

Highlight longest match

In the settings you can enable highlight longest match.


This just makes it easier to go through the text.


Custom search

There is a custom search feature, allowing you to easily Google sentences or words.


You can change it from Google to whatever you want.


I have it set to an AI prompt that just breaks down words for me. Sometimes I really struggle with accents / slang that aren't in dictionaries, so this helps a lot.

Sadly right now you can only search a specific word and not a whole sentence.

🤓
actually if you turn off "highlight longest match" you can highlight the whole sentence and search

Custom Dictionaries

Names, places, spells, and more are very custom to the media you are consuming.

There are things like VNDB name dictionaries, but they're not perfect.

JL has custom dictionaries.

If the word you want to look up does not have a definition, right click it and add it:


Now every time you look up that word you'll see your custom definition:


This is super easy to edit later on. For example, I made a mistake here.

It's not the island name, but the name of a girl on the island.

JL stores these custom dicts in plaintext format. No JSON. Just open it and edit it!


You can even make profiles in JL and have custom dicts per profile!

Stats

JL has stats.

Not as good as GSM if I may say so ;)


But what's cool is that you can see how many times you have looked up a word.


I wish this showed up in the popup window so I knew if I should mine a word that appears often or not.

Keyboard only


You can also control a caret in the JL window and go keyboard only.


If you enable this setting, you can also just click enter on your keyboard to advance in a visual novel.

Only show black box on hover

If you hate seeing the black box in your game, you can change these settings:


Now you'll only see it when you hover over it!

Downsides

  • Windows only

JL is a Windows only program. This is where Yomitan is still great.

  • Requires a text input event. Easy with GSM.

You need some sort of text input event, like textractor / Lunahook / GSM OCR.

This is where GSM still works well, as it acts as a middleman between getting the text and using dictionary software.

In my opinion JL is perfect for video games / visual novels, but for other things Yomitan still reigns supreme.

Setup

Downloading

Download JL from here:

Releases · rampaa/JL
JL is a program for looking up Japanese words and expressions. - rampaa/JL

Extract it and run the .exe every time you want to start JL.

I pinned it to my toolbar.

Right-click to open the settings menu etc.

Downloading Dictionaries

When you first start JL it'll ask to download dictionaries.

Say yes, they're pretty good.

You can use Yomitan formatted dictionaries with JL.

I used some from Marv's starter pack:

GitHub - MarvNC/yomitan-dictionaries: 📚 Japanese and Chinese dictionaries for Yomitan.
📚 Japanese and Chinese dictionaries for Yomitan. Contribute to MarvNC/yomitan-dictionaries development by creating an account on GitHub.

Adding audio sources

Right click and click "manage audio sources"

If you are using Local Audio Server for Yomitan, enter this:

http://127.0.0.1:5050/?sources=jpod,jpod_alternate,nhk16,forvo&term={Term}&reading={Reading}

Otherwise it's the same as Yomitan.
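If you're wondering what the {Term} and {Reading} bits do, they are placeholders that get swapped for the word you looked up and its reading. Here's a small sketch of how a template like this could be filled in. The function name is made up and this is only my guess at the substitution, not JL's real implementation.

```python
from urllib.parse import quote, unquote

TEMPLATE = (
    "http://127.0.0.1:5050/?sources=jpod,jpod_alternate,nhk16,forvo"
    "&term={Term}&reading={Reading}"
)


def fill_audio_url(template, term, reading):
    """Replace the {Term} and {Reading} placeholders with URL-encoded values."""
    return template.replace("{Term}", quote(term)).replace("{Reading}", quote(reading))


url = fill_audio_url(TEMPLATE, "食べる", "たべる")
print(url)           # percent-encoded URL that would be sent to the local audio server
print(unquote(url))  # the same URL, decoded so it is readable
```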

Anki setup

Go to the Anki tab and enable it.


Here's what I've got for Lapis card type.


That's it! Enjoy playing with JL!

Why GSM Stats is different from other Japanese stats apps

2025-11-12 06:38:39


TLDR - It has the raw data

Other stat apps simply collect data such as how many characters you read and when.

They normally collect data like:

  • When you started reading
  • When you stopped
  • How much you read

This allows them to calculate stats like:

  • Characters / hour
  • Characters read
  • Hours / day

etc

GSM collects the actual sentences you read. You don't tell GSM anything; it stores the sentences itself.

Specifically this data is stored:

  • The time the sentence came in
  • What the sentence was
  • Whether it was mined to Anki or not
  • What game the sentence is from

This allows GSM to calculate all the same stats as before, but we can get some extra data like:

  • The words or Kanji in that sentence
  • Characters / hour per game

etc

More importantly, many stats apps use an AFK timer to work out when you're AFK.

If you don't read for say 30 seconds, it considers you AFK.

Because they do not have the raw data, they calculate it once and that's it for life.

In GSM because we have the raw data, you can change your opinion about this anytime.

  • Want a shorter AFK time?
  • Want to remove English words from the sentences you read?
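Here's a tiny sketch of why having the raw data matters. The rows, function names, and the rule that gaps longer than the AFK limit are simply dropped are all my own simplifications for illustration, not GSM's real schema or code. The point is that the same stored lines can be re-counted with any setting you like.

```python
import re

# Hypothetical stored rows: (unix_timestamp, sentence). GSM stores more columns than this.
rows = [
    (0.0, "こんにちは"),
    (20.0, "OK、行こう"),
    (65.0, "ありがとう"),
    (70.0, "またね"),
]


def active_seconds(rows, afk_limit):
    """Sum the gaps between consecutive lines, ignoring gaps longer than afk_limit."""
    stamps = sorted(ts for ts, _ in rows)
    return sum(b - a for a, b in zip(stamps, stamps[1:]) if b - a <= afk_limit)


def char_count(rows, strip_english=False):
    """Count characters, optionally dropping ASCII letters first."""
    total = 0
    for _, text in rows:
        if strip_english:
            text = re.sub(r"[A-Za-z]", "", text)
        total += len(text)
    return total


print(active_seconds(rows, afk_limit=60))    # 70.0
print(active_seconds(rows, afk_limit=30))    # 25.0
print(char_count(rows))                      # 19
print(char_count(rows, strip_english=True))  # 17
```

An app that only saved "70 seconds, 19 characters" at the time could never give you the second or fourth number.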

Conclusion

Other stat apps -> One time statistics, usually without the raw data, which cannot be changed and are inflexible

GSM Stats -> Has raw data, allows you to change your opinion whenever you want about your data

Poe - Yomitan for Android

2025-11-11 14:46:36


I've been playing with Poe recently:

Poe: Language Lens - Apps on Google Play
Pop-up dictionary and language learning tool for Japanese, Chinese, and more.

This is like Yomitan for Android, but it works anywhere on the screen.


It's really easy to install. You just have to:

  1. Install via Google Play
  2. Open the app
  3. Click "Enable"
  4. It'll take you to the Android accessibility settings, add Poe as an allowed app.
  5. Done 🥳

Good

  • It lets you look up words anywhere on your Android screen using OCR
  • It works exactly like Yomitan etc. does, just hover and you can see a word's definition
  • It has pitch accent
  • It's the only thing that works this simply on Android

Bad

  • There is a $10 a month subscription for better OCR / Anki support
  • The OCR is kinda slow if you don't pay $10 a month
  • No custom dictionaries

Actually I have no clue what dictionary is used lol. For sure there is no proper noun / name dictionary though.

I got this to help me understand complicated place names as I live in Japan and Google Maps is hard here if you don't know Kanji

I don't know why they only have a singular dictionary

  • Instead of promising to add more dictionaries, they plan to add AI definitions of words.

Conclusion

May as well get it. It's free, it's the only app that does this with such an easy install (no termux etc needed).

if you are a developer please build something that works just as easily but is more like yomitan thanks :)

Priority Reorder Anki Addon

2025-11-11 07:24:56


I use this Anki Addon to reorder my Japanese cards

GitHub - tomahtoes/priority-reorder
Contribute to tomahtoes/priority-reorder development by creating an account on GitHub.

Specifically I want to reorder them based on 2 things:

  1. Cards I recently mined, as they are still fresh in my memory.
  2. Cards that appear a lot in media I am currently consuming, so I can reinforce them in that media.

This is my current config file:

{
    "normal_prioritization": null,
    "normal_search": "deck:例文マイニング",
    "priority_cutoff": null,
    "priority_limit": null,
    "priority_search": [
        "deck:例文マイニング added:2",
        "deck:例文マイニング occurrences:reflectionblue>5"
    ],
    "priority_search_mode": "sequential",
    "reorder_before_sync": true,
    "search_fields": {
        "expression_field": "Expression",
        "expression_reading_field": "ExpressionReading"
    },
    "shift_existing": true,
    "sort_field": "Frequency",
    "sort_reverse": false
}
  1. Prioritise cards added in last 2 days
  2. Then prioritise cards that appear more than 5 times in Reflection Blue

Reflection Blue is a Jiten Yomitan freq dict

To get a freq dict go here:

Summer Pockets REFLECTION BLUE - Detail - Jiten
Anki deck, vocabulary list and statistics for Summer Pockets REFLECTION BLUE (Summer Pockets REFLECTION BLUE).

Click "Download deck", then choose "Yomitan Occurrences".


Add this to the user_files folder in the addon (see GitHub readme)

The name reflectionblue in the config comes from the folder name.


Capitalisation doesn't matter here.

The folder should contain the unzipped Yomitan occurrence dictionary.

Inside the folder it should look like:


Since publishing I changed my config, but it's pretty much exactly the same.

{
    "normal_prioritization": null,
    "normal_search": "deck:例文マイニング",
    "priority_cutoff": null,
    "priority_limit": null,
    "priority_search": [
        "deck:例文マイニング added:7 occurrences:reflectionblue>30",
        "deck:例文マイニング added:2",
        "deck:例文マイニング occurrences:reflectionblue>30",
        "deck:例文マイニング occurrences:steinsgate>30",
        "deck:例文マイニング occurrences:limelight>30"
    ],
    "priority_search_mode": "sequential",
    "reorder_before_sync": true,
    "search_fields": {
        "expression_field": "Expression",
        "expression_reading_field": "ExpressionReading"
    },
    "shift_existing": true,
    "sort_field": "Frequency",
    "sort_reverse": false
}

If I have added a card in the last 7 days and the word is highly frequent (more than 30 occurrences in Reflection Blue), prioritise that.

Else prioritise words added in the last 2 days.

Else prioritise my visual novels I want to play / am playing.
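If the "sequential" bit is confusing, this is roughly how I think about it: go through the searches in order, and each card gets placed by the first search it matches. The sketch below is only my mental model with made-up card data and lambdas standing in for the Anki searches. It is not the addon's actual code, and it ignores the sort_field ordering inside each group.

```python
def sequential_priority(cards, searches):
    """Return card ids ordered by the first search each card matches."""
    ordered, seen = [], set()
    for matches in searches:
        for card in cards:
            if card["id"] not in seen and matches(card):
                ordered.append(card["id"])
                seen.add(card["id"])
    # Cards that match no priority search keep their normal order at the end
    ordered += [c["id"] for c in cards if c["id"] not in seen]
    return ordered


# Hypothetical cards; "occurrences" mimics counts from an occurrence dictionary
cards = [
    {"id": 1, "added_days_ago": 10, "occurrences": {"reflectionblue": 50}},
    {"id": 2, "added_days_ago": 1, "occurrences": {"reflectionblue": 3}},
    {"id": 3, "added_days_ago": 5, "occurrences": {"reflectionblue": 40}},
    {"id": 4, "added_days_ago": 30, "occurrences": {}},
]

searches = [
    # stands in for: added:7 occurrences:reflectionblue>30
    lambda c: c["added_days_ago"] <= 7 and c["occurrences"].get("reflectionblue", 0) > 30,
    # stands in for: added:2
    lambda c: c["added_days_ago"] <= 2,
    # stands in for: occurrences:reflectionblue>30
    lambda c: c["occurrences"].get("reflectionblue", 0) > 30,
]

order = sequential_priority(cards, searches)
print(order)  # [3, 2, 1, 4]
```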

I do 20 new cards a day currently, but sometimes I mine say 40 cards a day. I try to only mine things I know already.

Japanese Progress November 2025

2025-11-10 20:06:34


It's nearly been 2 years since I started Japanese, and 3 months since my last Japanese update:

How I’m Learning Japanese August 2025
My morning routine looks like: Anki I do 20 new cards a day. The cards are ones I have created from reading, like this: I only make cards if: * It’s within 5k frequency, so it’s common to me * OR I know all the Kanji in the word * OR it’s a

Anki

Firstly, let's look at Anki!

In my August 2025 update I had 928 mature cards. I was doing 20 new cards a day.

3 months later and I have:


Almost 3000 more mature cards! That's 34 cards matured a day.

This comes down to reading heavily.

Due to reading way more, my retention is a lot better.

Here's my retention almost 1 year ago:

December 2024

And now...


Generally speaking my Young cards sit around 77% and my mature cards around 93%

You can see my retention is decreasing, likely because I am encountering harder words now and 20 words / day is a bit of an insane pace.


My average difficulty is 39%

Bunpro

Last update in August I speedran N5 and was doing N4:


And since then:


I've completed N4, and am halfway through N3 grammar!

GSM

Since August I started using GSM heavily:

GitHub - bpwhelan/GameSentenceMiner: An All-in-One immersion toolkit for learning Languages through games and other visual media.
An All-in-One immersion toolkit for learning Languages through games and other visual media. - bpwhelan/GameSentenceMiner

I've played 8 visual novels in total and I am working on my first visual novel with a length of 1 million characters.

Here are my stat pages:

I also ended up contributing to GSM heavily, such as the entire stats page.


I also added this goals page to track my daily reading and show me how much I need to read to achieve my arbitrary goals.


What I do daily

My daily routine is this:

  1. Anki first thing, 20 new cards.
  2. Bunpro grammar review, 3 new items.
  3. Read a visual novel for 2 hours 20 minutes (or whatever GSM says)

Everything after this is optional.

Sometimes I watch anime or YouTube, sometimes I don't do any more, sometimes I read more of a visual novel.

On Sundays I go through all my leech cards in Anki and if I see interesting Kanji I add them using the Kanji dict I talked about:

The Best Beginner Kanji Dictionary
Over in TheMoeWay user road_to_redemption shared their Kanji dictionary: It has everything you need from a Kanji dict: * Meaning * Frequency * Top 3 common words with readings + translations * Readings + distributions * Components with keywords I really like this a lot, in particular the frequency + meaning + readings of the Kanji. Current

Speeding up Game Sentence Miner (GSM) Statistics by 200%

2025-10-19 10:35:33


Over the course of a weekend in-between job interviews I decided to speed up the loading of statistics in one of my favourite apps, GSM.

GSM is an application designed to make it easy to turn games into flashcards. It records the screen with OBS and uses OCR / Whisper to get text from it. You then hover over a word with a dictionary, click "add to Anki" and GSM sends the full spoken line from the game + a gif of the game to your Anki card.

GSM does a lot more, but this is as succinct as I can make it
GitHub - bpwhelan/GameSentenceMiner: An All-in-One immersion toolkit for learning Languages through games and other visual media.
An All-in-One immersion toolkit for learning Languages through games and other visual media. - bpwhelan/GameSentenceMiner

GSM has a statistics page contributed by me. Every time you read something in-game, GSM adds it to a database, which I then generate statistics from.

These stats take a while to load.

  • /stats takes 6 seconds
  • /anki takes around 40 seconds
  • /overview takes around 4 seconds

And I added /overview because /stats was too slow!

👾
Note: this whole app is a Windows .exe that serves a Flask app entirely locally. These times are absurd for a local app!

This blog post talks about how I spent my weekend improving the loading speed of the website by around 200%

Left == new, right == old

Why do statistics take so long to load?

The entire database is one very long table called game_lines.

Every single time a game produces a line of text, that is recorded in game_lines with some statistics.

Each line looks like this:

e726c5f5-7d59-11f0-b39e-645d86fdbc49 NEKOPARA vol.3 「もう1回、同じように……」 C:\Users\XXX\AppData\Roaming\Anki2\User 1\collection.media\GSM 2025-08-20 10-35-15.avif C:\Users\XXX\AppData\Roaming\Anki2\User 1\collection.media\NEKOPARAvol.3_2025-08-20-10-35-28-515.opus 1755648553.21247 ebd4b051-27aa-4957-9b50-3495d1586ec1

Or in a more readable version:

🗂 Entry ID: e726c5f5-7d59-11f0-b39e-645d86fdbc49
🕒 Timestamp: 2025-08-20 10:35:15
🔊 Audio: NEKOPARAvol.3_2025-08-20-10-35-28-515.opus
🖼 Screenshot: GSM 2025-08-20 10-35-15.avif

🦜 Game Line: "「もう1回、同じように……」"

📁 File Paths:
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\GSM 2025-08-20 10-35-15.avif
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\NEKOPARAvol.3_2025-08-20-10-35-28-515.opus

🧩 Original Game Name: NEKOPARA vol.3
🧠 Translation Source: NULL
🪶 Internal Ref ID: ebd4b051-27aa-4957-9b50-3495d1586ec1
📆 Epoch Timestamp: 1755648553.21247

Then to calculate statistics, we query every game line.

For me this takes around 10 seconds.

If you play a lot of games it can take around 1 minute....

All the statistics you have seen so far are calculated from this data alone. There are some easy things like:

  • How many characters of this game have I read?
  • How long have I spent playing it?
  • What's the most I've read in a day?

But in the Japanese learning community there is 1 important bit of data everyone wants.

How many characters do I read per hour on average? What is my reading speed?

This is important because we know how many characters are in a game. If we also know our reading speed, we can work out how much of a slog it will be.

On a site like Jiten.moe we can insert our reading speed into the settings and see how long it'll take to read something.

At my very nooby reading speed of 2900 characters / hour, it'll take me 550 hours of non-stop reading to play Fate/Stay Night.


Although this is one of the most famous visual novels of all time and has been made into numerous anime, spending 550 hours slogging through it does not seem good.

Knowing my reading speed allows me to pick games / visual novels that I can do in a few weeks rather than a year or more.

🤔
Most people choose to read smaller visual novels / games to keep their interest high, higher interest means you will read more, which means you will improve more and your reading speed will go up 📈

Now looking at our data there is no easy way to calculate this, right? Games do not tell you "Oh yeah in this Call of Duty dialogue you read at this pace".

Other similar sites like ExStatic calculate this:

ExStatic is like GSM, except you need to use another program to hook into a game's memory to get the game lines sent to ExStatic. This is called a "texthooker". You can also do this with GSM, but explaining how texthookers and OCR work is a blog post for another day.

But interestingly they have sorta the same data as us.

  • Game: Reverb
  • Line: 「ホタカさーん!ホタカさーん!」
  • Timestamp: 1755612879.053

But let's say 4 game lines come in. Each one is 10 characters long.

They come in every 15 minutes.

So our average reading speed is 40 characters per hour.

But then the next day, 24 hours later, we read another line of 10 characters.

Now our average reading speed is skewed to be much lower, because in our code it looks like it took us 24 hours to read 10 characters.

The absence of data is data itself here, but how is everyone in the Japanese learning community handling this?


Everyone sets an AFK Timer.

If you do not get a new line within the timer, it assumes you are AFK and stops counting towards your stats.
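To see the skew in numbers, here's a small sketch of the example above. I'm assuming the session starts at time 0 and each line arrives 15 minutes after the previous one, and I'm using a 20 minute AFK limit only so that the 15 minute gaps still count (real AFK timers are far shorter). The function is illustrative, not what GSM or ExStatic actually run.

```python
def chars_per_hour(session_start, lines, afk_limit=None):
    """lines is a list of (arrival_timestamp_seconds, text) tuples."""
    total_chars = sum(len(text) for _, text in lines)
    active = 0.0
    previous = session_start
    for ts, _ in sorted(lines):
        gap = ts - previous
        # With an AFK limit, gaps longer than the limit are treated as time away
        if afk_limit is None or gap <= afk_limit:
            active += gap
        previous = ts
    return total_chars / (active / 3600) if active else 0.0


MIN = 60
line = "あ" * 10  # a 10 character game line

lines = [(15 * MIN, line), (30 * MIN, line), (45 * MIN, line), (60 * MIN, line)]
first_day = chars_per_hour(0, lines)

# One more line arrives 24 hours later
lines.append((60 * MIN + 24 * 3600, line))
naive = chars_per_hour(0, lines)
with_afk = chars_per_hour(0, lines, afk_limit=20 * MIN)

print(first_day)  # 40.0
print(naive)      # 2.0
print(with_afk)   # 50.0
```

Without the timer, one line the next day drags the average from 40 down to 2 characters per hour. With it, the 24 hour gap is ignored and the number stays sensible.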

This may seem uninteresting now, but this powers many of our design choices later on.

What should we do?

We have a couple of things we can do to speed up the loading of the stats site.

  • Batch all API calls into one

Currently we fetch all game lines multiple times and calculate the stats that way. It's not as clear cut as 1 bar graph == 1 DB call. It's more like one section grabs all of game_lines and alters it to work for that section.

This makes a lot of sense, but sadly it doesn't work so well.

I've already tried this:

unify api calls to one by bee-san · Pull Request #192 · bpwhelan/GameSentenceMiner

Ignore my bad PR etiquette. We talked more about this in the Discord. I don't want to write conventional commits + nice PRs for a very niche tool 😅

Firstly, it only saves around a second of time. We still have to pull all the game lines no matter what.

Secondly, this makes it much harder and more rigid to calculate statistics. We had one API call, and then we calculated every possible statistic out of that one call and put it into a dictionary.

It's a bit... hardcore...

We basically had one 1200 line function which calculated every stat and then fed it to each statistic.

We could have broken it up, but to save 1 second of time only? For all that work? Surely there's a faster way.

  • Move statistics out of the page

We've already done this as a little hack. We moved many important statistics from the /stats page to the /overview page.

This improves loading because instead of loading every stat, we now only load important ones.


Obviously a hack... but it worked.... Load speed went from 7 seconds to 4... Still bad... 🤢

  • Pre-calculate stats

Do we really need to calculate stats on the fly?

What if we were to pre-calculate all of our statistics and then present them to the user?

The final option, pre-calculating stats, is what we will be doing.

🥞 Rolling up stats

Every time GSM runs, let's pre-calculate all previous days' stats for the user and then calculate just today's.

This will save us a lot of time.

Specifically our algorithm will now look like:

  • When GSM runs, roll up all previous days' stats to a table
  • When we query statistics, use the pre-rolled up table for all previous data
  • And then we calculate today's stats and add it to the rolled up stats
"Why calculate today's stats on the fly at all? Why not turn each game_line into a rolled up stats and add it to today's rollup?"

By Jove, a great question!

When GSM receives a line of text from a game it does a lot of processing to make it appear on screen etc, so why not precalculate stats there and then?

This makes a lot of sense!

BUTTTT.....

The absence of data is data!

Each game line looks exactly like this:

🗂 Entry ID: e726c5f5-7d59-11f0-b39e-645d86fdbc49
🕒 Timestamp: 2025-08-20 10:35:15
🔊 Audio: NEKOPARAvol.3_2025-08-20-10-35-28-515.opus
🖼 Screenshot: GSM 2025-08-20 10-35-15.avif
🦜 Game Line: "「もう1回、同じように……」"
📁 File Paths:
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\GSM 2025-08-20 10-35-15.avif
C:\XXX\AppData\Roaming\Anki2\User 1\collection.media\NEKOPARAvol.3_2025-08-20-10-35-28-515.opus
🧩 Original Game Name: NEKOPARA vol.3
🧠 Translation Source: NULL
🪶 Internal Ref ID: ebd4b051-27aa-4957-9b50-3495d1586ec1
📆 Epoch Timestamp: 1755648553.21247

In the moment this imaginary rollup function only has this data.

When we calculate stats, we are looking at the past. We can see where the absences are to calculate the AFK time.

But in the moment, we don't know if the next game line will be 120 seconds or more later.

So we cannot roll up today's stats as lines arrive, because we cannot yet tell whether the user is about to take an extended break from the text.
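As a rough sketch of the split described above: everything before today gets rolled up per day, and today's lines are left alone to be calculated live. The function name and the two stats it collects are just for illustration (the rows here are simple (timestamp, text) pairs), so treat this as the idea rather than GSM's real rollup code.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def rollup_previous_days(lines, today):
    """Aggregate (timestamp, text) rows into per-day totals, skipping today."""
    daily = defaultdict(lambda: {"total_lines": 0, "total_characters": 0})
    for ts, text in lines:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        if day >= today:
            continue  # today's lines are calculated live, not rolled up
        daily[day]["total_lines"] += 1
        daily[day]["total_characters"] += len(text)
    return dict(daily)


def at(year, month, day, hour):
    """Helper: build a UTC unix timestamp for the example rows."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()


lines = [
    (at(2025, 10, 17, 12), "もう1回"),
    (at(2025, 10, 17, 13), "同じように"),
    (at(2025, 10, 18, 9), "ありがとう"),
    (at(2025, 10, 19, 8), "おはよう"),  # today, so it is not rolled up
]

rollups = rollup_previous_days(lines, today=date(2025, 10, 19))
print(rollups[date(2025, 10, 17)])  # {'total_lines': 2, 'total_characters': 9}
```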

What stats do we pre-calculate?

The next big question is "okay, what do we actually calculate?"

There are 2 types of stats:

  • Raw stats like characters read
  • Calculated stats that require more than just a single bit of data, like average characters per hour for a specific game.

I made an original list, booted up Claude and asked it to confirm my list and see if it thinks anything else is important.

Together we made this list:

_fields = [
    'date',                           # str — date
    'total_lines',                    # int — total number of lines read
    'total_characters',               # int — total number of characters read
    'total_sessions',                 # int — number of reading sessions
    'unique_games_played',            # int — distinct games played
    'total_reading_time_seconds',     # float — total reading time (seconds)
    'total_active_time_seconds',      # float — total active reading time (seconds)
    'longest_session_seconds',        # float — longest session duration
    'shortest_session_seconds',       # float — shortest session duration
    'average_session_seconds',        # float — average session duration
    'average_reading_speed_chars_per_hour',  # float — average reading speed (chars/hour)
    'peak_reading_speed_chars_per_hour',     # float — fastest reading speed (chars/hour)
    'games_completed',                # int — number of games completed
    'games_started',                  # int — number of games started
    'anki_cards_created',             # int — Anki cards generated
    'lines_with_screenshots',         # int — lines that include screenshots
    'lines_with_audio',               # int — lines that include audio
    'lines_with_translations',        # int — lines that include translations
    'unique_kanji_seen',              # int — unique kanji encountered
    'kanji_frequency_data',           # str — kanji frequency JSON
    'hourly_activity_data',           # str — hourly activity (JSON)
    'hourly_reading_speed_data',      # str — hourly reading speed (JSON)
    'game_activity_data',             # str — per-game activity (JSON)
    'games_played_ids',               # str — list of game IDs (JSON)
    'max_chars_in_session',           # int — most characters read in one session
    'max_time_in_session_seconds',    # float — longest single session (seconds)
    'created_at',                     # float — record creation timestamp
    'updated_at'                      # float — last update timestamp
]

Then using this list we can calculate stats like:

  • Average session length
  • Reading time per game
  • etc etc...

We don't need to calculate every single thing, just have enough data to calculate it all in the moment.

If we calculate things like total_active_time_seconds / total_sessions the abstraction becomes kinda too much.

Like come on, we don't need a whole database column just to divide two numbers 😂

In GSM you can also see your stats data in a date range:


So we have all these columns, and each row is 1 day of stats. That way we can easily calculate stats for any date range.

And we just need a special case for today to calculate today's stats.
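Here's a small sketch of what that looks like in practice: add up the daily rows that fall inside the range, bolt on today's live stats if today is in the range, and do the simple divisions at the end. The column names come from the list above, but the function, the dictionary layout, and the derived key name are my own for illustration, not GSM's code.

```python
from datetime import date

SUMMED_KEYS = ("total_characters", "total_active_time_seconds", "total_sessions")


def stats_for_range(rollups, start, end, today=None, today_stats=None):
    """Sum daily rollup rows between start and end (inclusive), plus today's live stats."""
    rows = [row for day, row in rollups.items() if start <= day <= end]
    if today is not None and today_stats is not None and start <= today <= end:
        rows.append(today_stats)  # the special case for today

    totals = {key: 0 for key in SUMMED_KEYS}
    for row in rows:
        for key in SUMMED_KEYS:
            totals[key] += row[key]

    # Derived on the fly instead of being stored as its own column
    sessions = totals["total_sessions"]
    totals["avg_active_seconds_per_session"] = (
        totals["total_active_time_seconds"] / sessions if sessions else 0.0
    )
    return totals


rollups = {
    date(2025, 10, 17): {"total_characters": 1000, "total_active_time_seconds": 1800.0, "total_sessions": 2},
    date(2025, 10, 18): {"total_characters": 2000, "total_active_time_seconds": 3600.0, "total_sessions": 1},
}
live_today = {"total_characters": 500, "total_active_time_seconds": 600.0, "total_sessions": 1}

result = stats_for_range(
    rollups, date(2025, 10, 17), date(2025, 10, 19),
    today=date(2025, 10, 19), today_stats=live_today,
)
print(result["total_characters"])                # 3500
print(result["avg_active_seconds_per_session"])  # 1500.0
```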

How do we run this?

GSM is a Windows executable. Not a fully fledged server.

It could be run every couple of minutes, or once every couple of months.

We need this code to successfully roll up stats regardless of when it runs, and we need it to be conservative in when it runs.

What we need is some kind of Cron system...

I added a new database table called cron.

This table just stores information about tasks that GSM wants to run regularly.


We store some simple data:

  • ID
  • Name
  • Description
  • The last time it ran
  • The next time it runs
  • If it's enabled or not
  • When the cron was created
  • And the schedule it runs on
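
A table like this could look roughly as follows. This is only a sketch: using SQLite, the column types, and the created_at column name are assumptions, and the example row's description is made up. The other column names match the ones used in the snippets in this post.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # in-memory database, just for the example
conn.execute(
    """
    CREATE TABLE cron (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        name        TEXT NOT NULL UNIQUE,
        description TEXT,
        last_run    REAL,            -- Unix timestamp, NULL if never run
        next_run    REAL NOT NULL,   -- Unix timestamp of the next due time
        enabled     INTEGER NOT NULL DEFAULT 1,
        created_at  REAL NOT NULL,
        schedule    TEXT NOT NULL    -- e.g. 'once', 'daily', 'weekly'
    )
    """
)

now = time.time()
conn.execute(
    "INSERT INTO cron (name, description, next_run, created_at, schedule) "
    "VALUES (?, ?, ?, ?, ?)",
    ("jiten_sync", "Example job", now, now, "weekly"),
)
row = conn.execute("SELECT name, enabled, schedule FROM cron").fetchone()
print(row)
```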

Then when we start GSM, it:

  • Runs a query to get all cron jobs that need to run now:
SELECT * FROM {cls._table} WHERE enabled=1 AND next_run <= ? ORDER BY next_run ASC

  • Loops through the results and uses a basic if statement to see which cron needs to run:

    for cron in due_crons:
        detail = {
            'name': cron.name,
            'description': cron.description,
            'success': False,
            'error': None
        }
        try:
            if cron.name == 'jiten_sync':
                from GameSentenceMiner.util.cron.jiten_update import update_all_jiten_games
                result = update_all_jiten_games()

                # Mark as successfully run
                CronTable.just_ran(cron.id)
                executed_count += 1
                detail['success'] = True
                detail['result'] = result

                logger.info(f"✅ Successfully executed {cron.name}")
                logger.info(f"   Updated: {result['updated_games']}/{result['linked_games']} games")
        # (excerpt: the matching except block is not shown here)

If it needs to run, we import that file (cron files are just Python files we import and run, it's really simple).

We then run the command just_ran.

This command:

  1. Sets last_run to the current time
  2. Calculates next_run based on the schedule type (weekly, monthly etc)
        if cron.schedule == 'once':
            # For one-time jobs, disable after running
            cron.enabled = False
            cron.next_run = now  # Set to now since it won't run again
            logger.debug(f"Cron job '{cron.name}' completed (one-time job) and has been disabled")
        elif cron.schedule == 'daily':
            next_run_dt = now_dt + timedelta(days=1)
            cron.next_run = next_run_dt.timestamp()
            logger.debug(f"Cron job '{cron.name}' completed, next run scheduled for {next_run_dt}")
        elif cron.schedule == 'weekly':
            next_run_dt = now_dt + timedelta(weeks=1)
            cron.next_run = next_run_dt.timestamp()
            logger.debug(f"Cron job '{cron.name}' completed, next run scheduled for {next_run_dt}")
  3. Updates the cron entry

This is just a super simple way to make GSM run tasks on a schedule without running every single time the app starts.
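
The scheduling rule can be boiled down to a small standalone function. This is a sketch, not GSM's code: the function name schedule_after_run and the INTERVALS table are made up for illustration, and it only covers the three schedule types shown in the snippet above.

```python
from datetime import datetime, timedelta, timezone

# Interval per recurring schedule type (only the ones shown above).
INTERVALS = {"daily": timedelta(days=1), "weekly": timedelta(weeks=1)}


def schedule_after_run(schedule, now_dt):
    """Return (enabled, next_run_timestamp) for a job that just ran."""
    if schedule == "once":
        # One-time jobs are disabled; next_run is set to now.
        return False, now_dt.timestamp()
    if schedule not in INTERVALS:
        raise ValueError(f"Unknown schedule: {schedule!r}")
    # The next run is measured from now, not from the previous due time.
    return True, (now_dt + INTERVALS[schedule]).timestamp()


now_dt = datetime(2025, 11, 13, 19, 0, tzinfo=timezone.utc)
print(schedule_after_run("weekly", now_dt))
```

Because the next run is counted from the moment the job finishes, opening the app after months away triggers one catch-up run rather than a backlog of missed ones.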

With all of these changes, our API speed is now....

  • 6 seconds -> 0.5 seconds!

But the webpage itself still loads in 3.5 seconds.

Google Lighthouse

Google Lighthouse rates our website as a 37.

It complains about some simple things like:

  • Preloading CSS / HTML
  • No compression
  • No caching

So what I did was:

  • Set rel=preload for important css
  • Added the flask-compress dependency to compress the Flask payload using Brotli. I read this HN comment thread on Brotli vs zstd and I believe Brotli makes the most sense for now.
😅
Brotli in a local open source program? Isn't that overkill?

🤓 achtkually no! The /api/stats endpoint returns a massive JSON payload containing all the stats (rolled up and today's), which the frontend parses into pretty charts. Compressing it makes total sense.

Also, GSM works on a network level too. You may wish to host it on a beefy computer and use something like Moonlight to play the game on your phone, and then look up stats on your phone too.
  • Cached the CSS, since that changes very infrequently. Since GSM is not a server, users have to manually click "update" to update the app. At most this happens once every 3 days, so we use a 3-day cache here.
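
For reference, the compression and caching changes could be wired up along these lines. This is a sketch under assumptions: the exact settings GSM uses aren't shown in this post, and it needs the flask and flask-compress packages installed (Brotli support comes with flask-compress).

```python
from datetime import timedelta

from flask import Flask
from flask_compress import Compress

app = Flask(__name__)

# Prefer Brotli, falling back to gzip for clients that don't advertise "br".
app.config["COMPRESS_ALGORITHM"] = ["br", "gzip"]

# Let browsers cache static files (CSS and friends) for 3 days.
app.config["SEND_FILE_MAX_AGE_DEFAULT"] = timedelta(days=3)

# Enable response compression for the whole app.
Compress(app)
```

With this in place, JSON responses such as /api/stats are compressed when the client sends a matching Accept-Encoding header.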

This raised our Lighthouse score to 89, with the load time going from 3.5 seconds to 1.4 seconds.


Very speedy!

Conclusion

We successfully doubled the loading speed of the statistics page. More importantly, here are some key takeaways:

  • The use of data is so vast that one person's "meh" data is another person's core product. We need to look at our data flow and our application to decide the best approach. For example, not rolling up today's stats to keep AFK metrics.
  • There are a lot of arguments on Brotli vs zstd. For a local open source program, either works.
  • Lighthouse has become a lot more useful since I last used it back in 2018.