Imagine the history contained in a thousand years of books. What stories are in those books? What knowledge can we learn from the world before our time?
Senior Research Engineer
Oleg has 5+ years of experience in machine learning, deep learning and data science, with a background in biomedical signal processing. He took the 5th place with his team in the Epileptic Seizure Prediction competition organised by Melbourne University.
What was the weather like 500 years ago? What happened when Mt. Fuji erupted? How can one fold 100 cranes using only one piece of paper? The answers to these questions are in those books.
Japan has millions of books and over a billion historical documents such as personal letters or diaries preserved nationwide. Most of them cannot be read by the majority of Japanese people living today because they were written in “Kuzushiji”.
Even though Kuzushiji, a cursive writing style, had been used in Japan for over a thousand years, there are very few fluent readers of Kuzushiji today (only 0.01% of modern Japanese natives). Due to the lack of available human resources, there has been a great deal of interest in using Machine Learning to automatically recognise these historical texts and transcribe them into modern Japanese characters. Nevertheless, several challenges in Kuzushiji recognition have made the performance of existing systems extremely poor. (More information About Kuzushiji)
Kuzushiji, a cursive writing style, was used in Japan for over a thousand years, beginning in the 8th century. Over 3 million books, on a diverse array of topics such as literature, science, mathematics and cooking are preserved today. However, the standardisation of Japanese textbooks known as the “Elementary School Order” in 1900, removed Kuzushiji from the regular school curriculum, as modern Japanese print became popular. As a result, most Japanese natives today cannot read books written or printed just 120 years ago.
Since Chinese characters entered Japan in the 8th century, the Japanese language has been written using Kanji (Chinese characters in the Japanese language) in official records. However, from the late 9th century, the Japanese began to add their own character sets: Hiragana and Katakana, which derive from different ways of simplifying Kanji. Individual Hiragana and Katakana characters don’t contain independent semantic meaning, but instead, carry phonetic information (like letters in the English alphabet).
In Kuzushiji documents, Kanji, Hiragana and Katakana are all used. However, the number of character types in each document varies by genre. For example, storybooks are mostly written in Hiragana while formal records are written mainly in Kanji.
The central and foremost challenge that the experts had was to transcribe Kuzushiji into contemporary Japanese characters. That would help the Center for Open Data in the Humanities (CODH) be able to develop better algorithms for Kuzushiji recognition. The model is not only a great contribution to the machine learning community but also a great help for making millions of documents more accessible and leading to new discoveries in Japanese history and culture. The task was also complicated due to occasional visibility through especially thin paper and the characters from the opposite side of the page. Those characters should have also been ignored.
Duration: 2 months
The total number of unique characters in the Kuzushiji dataset is over 4300. However, the frequency distribution is very long-tailed and a large fraction of the characters (Kanji with very specific meaning) may only appear once or twice in a book. Therefore, the dataset is highly unbalanced.
One characteristic of Classical Hiragana or Hentaigana (“character variations”) is that many characters which can only be written a single way in modern Japanese can be written in many different ways in Kuzushiji.
A few characters in Kuzushiji look very similar and it is hard to tell what character it is without considering the above character as context.
Kuzushiji was written in a cursive script, and hence in many cases, characters are connected or overlap which can make the recognition task difficult.
As the task of character recognition is quite complex and there are more than 4300 different characters, Ciklum team developed two models for this problem – the neural network for character detection and a separate model for character classification.
The final result was evaluated using the F1-score. A perfect model would have a performance of 1 – that would mean that all characters were detected and classified correctly. The Ciklum team developed a model that detected and classified Kuzushiji characters with an F1-score of 0.873 and the 24th place among 293 teams.