Function Repository Resource:

JapaneseTextTokenizer

Separate a piece of Japanese text into grammatical parts

Contributed by: Richard Hennigan (Wolfram Research)

ResourceFunction["JapaneseTextTokenizer"]["string"]

returns a list of associations containing data about the grammatical parts of the Japanese text "string".

ResourceFunction["JapaneseTextTokenizer"]["string","property"]

returns the specified property for each of the parts found in "string".

Details and Options

In ResourceFunction["JapaneseTextTokenizer"]["text"], an Association is returned for each word found in "text" with the following properties:
"SurfaceForm"    the string corresponding to the word as it appears in the text
"BaseForm"       the dictionary form of the word
"Reading"        the pronunciation of the word
"PartsOfSpeech"  a list of possible parts of speech for the word
In ResourceFunction["JapaneseTextTokenizer"]["text","property"], the value for "property" can be any of the property names listed above, a list of such names, or "Dataset", which returns a Dataset of all properties.
ResourceFunction["JapaneseTextTokenizer"] has the following options:
"EnglishPartsOfSpeech"  True        whether to translate the list of parts of speech to English
Language                $Language   which language to use when specifying "TranslateWords" -> True
MissingString           None        a string to be used in place of Missing values
"Reading"               "Katakana"  which writing system to use to represent readings
"TranslateWords"        False       whether to include word translations in the output
Some possible values for "Reading" are "Hiragana", "Katakana" and "Romaji".
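
As a minimal sketch of working with the default output, assuming it is a list of associations keyed by the property names listed above, standard functions such as Lookup and Dataset can be applied to it directly. The variable name tokens is only illustrative:

```wolfram
(* Assumption: the default output is a list of associations, one per word *)
tokens = ResourceFunction["JapaneseTextTokenizer"]["こんにちは世界"];

(* Extract a single property from every association *)
Lookup[tokens, "SurfaceForm"]

(* Display all properties as a table *)
Dataset[tokens]
```

This should give results comparable to passing "SurfaceForm" or "Dataset" as the second argument, though that equivalence is an assumption rather than documented behavior.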

Examples

Basic Examples

Get information about the structure of a piece of Japanese text:

In[1]:=
ResourceFunction["JapaneseTextTokenizer"]["こんにちは世界"]
Out[1]=

Get a specific property:

In[2]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["私の猫は私に日本語を教えています", "Reading"]
Out[2]=

Return a Dataset:

In[3]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["私の日本語能力はじゃがいもですか?", "Dataset"]
Out[3]=

Specify a list of properties:

In[4]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["はい。でも、頑張っています!", {"BaseForm", "Reading"}]
Out[4]=

Properties and Relations

Japanese text is typically written without spaces, so segmentation that relies on whitespace does not work:

In[5]:=
TextWords["私が書いているのか分からない"]
Out[5]=

JapaneseTextTokenizer can identify individual words even when they are not separated by spaces:

In[6]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["私が書いているのか分からない", "BaseForm"]
Out[6]=

Readings for kanji and particles are context-sensitive:

In[7]:=
ResourceFunction["JapaneseTextTokenizer"]["私は昼ご飯を食べたい", "Reading", "Reading" -> "Romaji"]
Out[7]=

Compare to Transliterate on individual particles:

In[8]:=
Transliterate["は", "English"]
Out[8]=
In[9]:=
Transliterate["を", "English"]
Out[9]=

The readings of “wa” and “o” correspond to the pronunciation (this example requires a Japanese voice to be installed):

In[10]:=
AudioPlay[
 SpeechSynthesize["昼ご飯を食べたい", First[VoiceStyleData[#Language === "Japanese" &]]]]
Out[10]=

See how the reading for a character changes depending on context:

In[11]:=
ResourceFunction["JapaneseTextTokenizer"]["食べ", "Reading", "Reading" -> "Romaji"]
Out[11]=
In[12]:=
ResourceFunction["JapaneseTextTokenizer"]["食堂", "Reading", "Reading" -> "Romaji"]
Out[12]=

Listen to the difference (this example requires a Japanese voice to be installed):

In[13]:=
AudioPlay[
 SpeechSynthesize["食べ", First[VoiceStyleData[#Language === "Japanese" &]]]]
Out[13]=
In[14]:=
AudioPlay[
 SpeechSynthesize["食堂", First[VoiceStyleData[#Language === "Japanese" &]]]]
Out[14]=

JapaneseTextTokenizer can include translations of individual words in the output:

In[15]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["私の日本語能力はじゃがいもですか?", "WordTranslation", MissingString -> Nothing]
Out[15]=

Compare to using WordTranslation:

In[16]:=
words = ResourceFunction["JapaneseTextTokenizer"]["私の日本語能力はじゃがいもですか?",
   "BaseForm"]
Out[16]=
In[17]:=
WordTranslation[words, "Japanese" -> "English"] // DeleteMissing
Out[17]=
In[18]:=
First /@ %
Out[18]=

TextTranslation provides better results for text since it takes context into consideration:

In[19]:=
TextTranslation["私の日本語能力はじゃがいもですか?"]
Out[19]=
In[20]:=
TextTranslation["はい。でも、頑張っています!"]
Out[20]=

FuriganaForm uses JapaneseTextTokenizer to parse text:

In[21]:=
ResourceFunction["FuriganaForm"]["私の猫は私に日本語を教えています"]
Out[21]=

Get similar results using JapaneseTextTokenizer:

In[22]:=
data = ResourceFunction["JapaneseTextTokenizer"][
  "私の猫は私に日本語を教えています", {"SurfaceForm", "Reading"}, "Reading" -> "Hiragana"]
Out[22]=
In[23]:=
Grid[Transpose[{If[
      StringContainsQ[#SurfaceForm, _?ResourceFunction[
        "KanjiQ"]], #Reading, ""], #SurfaceForm} & /@ data]]
Out[23]=

Options

EnglishPartsOfSpeech

By default, parts of speech are given as English string tokens:

In[24]:=
ResourceFunction["JapaneseTextTokenizer"]["私の猫は魔法使いです。", "Dataset"]
Out[24]=

Get the parts of speech in Japanese instead of English:

In[25]:=
ResourceFunction["JapaneseTextTokenizer"]["私の猫は魔法使いです。", "Dataset", "EnglishPartsOfSpeech" -> False]
Out[25]=

Language

By default, when specifying "TranslateWords" -> True, individual words will be translated to the language defined by $Language:

In[26]:=
ResourceFunction["JapaneseTextTokenizer"]["猫の魔法使い", "Dataset", "TranslateWords" -> True]
Out[26]=

Get translations in another language:

In[27]:=
ResourceFunction["JapaneseTextTokenizer"]["猫の魔法使い", "Dataset", "TranslateWords" -> True, Language -> "Spanish"]
Out[27]=

Entity objects are also supported:

In[28]:=
ResourceFunction["JapaneseTextTokenizer"]["猫の魔法使い", "Dataset", "TranslateWords" -> True, Language -> Entity["Language", "English::385w8"]]
Out[28]=

MissingString

By default, a Missing object is returned when values are not found:

In[29]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["'バード\[CenterDot]セイ' は名作です!", "Reading"]
Out[29]=

Specify a string to use instead:

In[30]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["'バード\[CenterDot]セイ' は名作です!", "Reading", MissingString -> " "]
Out[30]=
In[31]:=
StringJoin[%]
Out[31]=

Missing values can also typically be avoided by using full-width Japanese characters:

In[32]:=
ResourceFunction["JapaneseTextTokenizer"]["「バード・セイ」は名作です!", "Reading"]
Out[32]=
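
Missing values can also be handled after the fact. The following is a sketch, assuming missing readings appear as Missing[…] expressions at the top level of the returned list; fillMissing is a hypothetical helper name, not part of the resource function:

```wolfram
(* Hypothetical helper: replace top-level Missing[...] elements with a given value *)
fillMissing[list_List, fill_] := Replace[list, _Missing -> fill, {1}]

readings = ResourceFunction["JapaneseTextTokenizer"][
   "'バード\[CenterDot]セイ' は名作です!", "Reading"];

(* Substitute a space for each missing reading *)
fillMissing[readings, " "]

(* Or drop the missing readings entirely *)
DeleteMissing[readings]
```

Using the MissingString option is more direct when a string placeholder is all that is needed; post-processing like this is only useful when the Missing objects are needed for other steps first.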

Reading

Specify which writing system to show the readings in:

In[33]:=
ResourceFunction["JapaneseTextTokenizer"]["私は日本語を話せません", "Reading", "Reading" -> "Romaji"]
Out[33]=
In[34]:=
ResourceFunction["JapaneseTextTokenizer"]["私は日本語を話せません", "Reading", "Reading" -> "Hiragana"]
Out[34]=
In[35]:=
ResourceFunction["JapaneseTextTokenizer"]["私は日本語を話せません", "Reading", "Reading" -> "Katakana"]
Out[35]=

Generate phonetic romaji text from Japanese:

In[36]:=
StringRiffle[
 ResourceFunction["JapaneseTextTokenizer"]["鳥が言います", "Reading", "Reading" -> "Romaji"]]
Out[36]=

Entity objects are also supported:

In[37]:=
ResourceFunction["JapaneseTextTokenizer"]["私は日本語を話せません", "Reading", "Reading" -> Entity["WritingScript", "Hiragana::jx343"]]
Out[37]=

TranslateWords

Include word translations when available:

In[38]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["私の日本語能力はじゃがいもですか?", "Dataset", "TranslateWords" -> True]
Out[38]=

Applications

Create a function to compute word counts in Japanese:

In[39]:=
japaneseWordCounts[text_String] := Counts[DeleteMissing[
   ResourceFunction["JapaneseTextTokenizer"][text, "BaseForm"]]]
In[40]:=
japaneseWordCounts[
  ExampleData[{"Text", "UNHumanRightsJapanese"}]] // Short
Out[40]=

Compare the total number of words for the same document in English using WordCounts:

In[41]:=
WordCounts[ExampleData[{"Text", "UNHumanRightsEnglish"}]] // Total
Out[41]=
In[42]:=
japaneseWordCounts[
  ExampleData[{"Text", "UNHumanRightsJapanese"}]] // Total
Out[42]=

Create a variant of TextCases that can find parts of speech in Japanese text:

In[43]:=
japaneseTextCases[text_String, form_String] := Cases[ResourceFunction["JapaneseTextTokenizer"][
    text, {"BaseForm", "PartsOfSpeech"}], KeyValuePattern[{"BaseForm" -> word_, "PartsOfSpeech" -> {___, form, ___}}] :> word];
In[44]:=
text = ExampleData[{"Text", "UNHumanRightsJapanese"}];
Snippet[text]
Out[44]=

Get a list of the verbs used in the text:

In[45]:=
Union[japaneseTextCases[text, "Verb"]]
Out[45]=

Sort grammatical particles by their usage:

In[46]:=
ReverseSort[Counts[japaneseTextCases[text, "Particle"]]]
Out[46]=

Possible Issues

Results can differ for the same word depending on whether it is written in kana or kanji:

In[47]:=
ResourceFunction["JapaneseTextTokenizer"]["ごめんなさい"]
Out[47]=
In[48]:=
ResourceFunction["JapaneseTextTokenizer"]["御免なさい"]
Out[48]=

Non-Japanese text does not produce meaningful results:

In[49]:=
ResourceFunction[
 "JapaneseTextTokenizer"]["I wonder what this does?", "Dataset"]
Out[49]=

Punctuation should be given as full-width characters for more accurate part-of-speech tagging:

In[50]:=
Last[ResourceFunction["JapaneseTextTokenizer"]["やったー!"]]
Out[50]=
In[51]:=
Last[ResourceFunction["JapaneseTextTokenizer"]["やったー!"]]
Out[51]=
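
One way to apply this advice automatically is to convert half-width punctuation to its full-width form before tokenizing. This is a minimal sketch; toFullWidthPunctuation is a hypothetical helper, and it covers only "!" and "?" (written with \: escapes for U+FF01 and U+FF1F):

```wolfram
(* Hypothetical helper: map half-width ! and ? to full-width U+FF01 and U+FF1F *)
toFullWidthPunctuation[s_String] :=
 StringReplace[s, {"!" -> "\:ff01", "?" -> "\:ff1f"}]

(* Normalize the punctuation, then inspect the last token *)
Last[ResourceFunction["JapaneseTextTokenizer"][
  toFullWidthPunctuation["やったー!"]]]
```

Further rules can be added to the list for other punctuation marks as needed.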

JapaneseTextTokenizer only provides translations for individual words (and does not account for context), so it should not be used to translate text:

In[52]:=
ResourceFunction["JapaneseTextTokenizer"]["私の猫は魔法使いです。", "WordTranslation", "TranslateWords" -> True, MissingString -> " "] // StringJoin
Out[52]=

Use TextTranslation for better results:

In[53]:=
TextTranslation["私の猫は魔法使いです。"]
Out[53]=

Requirements

Wolfram Language 11.3 (March 2018) or above
