Wolfram Research

Function Repository Resource:

JapaneseTextTokenizer

Separate a piece of Japanese text into grammatical parts

Contributed by: Richard Hennigan (Wolfram Research)

ResourceFunction["JapaneseTextTokenizer"]["string"]

returns a list of associations containing data about the grammatical parts of the Japanese text "string".

ResourceFunction["JapaneseTextTokenizer"]["string","property"]

returns the specified property for each of the parts found in "string".

Details and Options

In ResourceFunction["JapaneseTextTokenizer"]["text"], an Association is returned for each word found in "text" with the following properties:
"SurfaceForm"      the string corresponding to the word as it appears in the text
"BaseForm"         the dictionary form of the word
"Reading"          the pronunciation of the word
"PartsOfSpeech"    a list of possible parts of speech for the word
In ResourceFunction["JapaneseTextTokenizer"]["text","property"], the value for "property" can be any item from the first column of the table above or "Dataset", which returns a Dataset of all properties.
ResourceFunction["JapaneseTextTokenizer"] has the following options:
"EnglishPartsOfSpeech"    True          whether to translate the list of parts of speech to English
"Reading"                 "Katakana"    which writing system to use to represent readings
MissingString             None          a string to be used in place of Missing values
Possible values for "Reading" are "Hiragana", "Katakana" and "Romaji".
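
Because the default result is a list of associations keyed by the property names above, standard list and association functions can be used to pull out several properties at once. The following is a minimal sketch rather than part of the original notebook; the variable name tokens is an arbitrary choice, and no output is shown because it depends on the tokenizer's dictionary.

```wolfram
(* Tokenize once and keep the full list of associations; "tokens" is an illustrative name *)
tokens = ResourceFunction["JapaneseTextTokenizer"]["私の猫は私に日本語を教えています"];

(* Pair each surface form with its reading using the built-in Lookup *)
Lookup[tokens, {"SurfaceForm", "Reading"}]

(* Count how often each dictionary form occurs *)
Counts[Lookup[tokens, "BaseForm"]]
```

Keeping the full result in a variable avoids tokenizing the same text again for each property.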

Examples

Basic Examples

Get information about the structure of a piece of Japanese text:

In[1]:=
ResourceFunction["JapaneseTextTokenizer"]["こんにちは世界"]
Out[1]=

Get a specific property:

In[2]:=
ResourceFunction["JapaneseTextTokenizer"]["私の猫は私に日本語を教えています", "Reading"]
Out[2]=

Return a Dataset:

In[3]:=
ResourceFunction["JapaneseTextTokenizer"]["私の日本語能力はじゃがいもですか?", "Dataset"]
Out[3]=
In[4]:=
ResourceFunction["JapaneseTextTokenizer"]["はい。でも、頑張っています!", "Dataset"]
Out[4]=

Properties and Relations

Japanese text is typically written without spaces, so segmentation that relies on spaces between words does not work:

In[5]:=
TextWords["私が書いているのか分からない"]
Out[5]=

JapaneseTextTokenizer can identify individual words even when they are not separated by spaces:

In[6]:=
ResourceFunction["JapaneseTextTokenizer"]["私が書いているのか分からない", "BaseForm"]
Out[6]=

Readings for kanji and particles are context-sensitive:

In[7]:=
ResourceFunction["JapaneseTextTokenizer"]["私は昼ご飯を食べたい", "Reading", "Reading" -> "Romaji"]
Out[7]=

Compare to Transliterate on individual particles:

In[8]:=
Transliterate["は", "English"]
Out[8]=
In[9]:=
Transliterate["を", "English"]
Out[9]=

The readings of “wa” and “o” correspond to the pronunciation:

In[10]:=
AudioPlay[
 SpeechSynthesize["昼ご飯を食べたい", First[VoiceStyleData[#Language === "Japanese" &]]]]
Out[10]=

See how the reading for a character changes depending on context:

In[11]:=
ResourceFunction["JapaneseTextTokenizer"]["食べ", "Reading", "Reading" -> "Romaji"]
Out[11]=
In[12]:=
ResourceFunction["JapaneseTextTokenizer"]["食堂", "Reading", "Reading" -> "Romaji"]
Out[12]=

Listen to the difference:

In[13]:=
AudioPlay[
 SpeechSynthesize["食べ", First[VoiceStyleData[#Language === "Japanese" &]]]]
Out[13]=
In[14]:=
AudioPlay[
 SpeechSynthesize["食堂", First[VoiceStyleData[#Language === "Japanese" &]]]]
Out[14]=

Options

EnglishPartsOfSpeech

Get the parts of speech in Japanese instead of English:

In[15]:=
ResourceFunction["JapaneseTextTokenizer"]["私の猫は魔法使いです。", "Dataset", "EnglishPartsOfSpeech" -> False]
Out[15]=

Reading

Specify which writing system to show the readings in:

In[16]:=
ResourceFunction["JapaneseTextTokenizer"]["私は日本語を話せません", "Reading", "Reading" -> "Romaji"]
Out[16]=
In[17]:=
ResourceFunction["JapaneseTextTokenizer"]["私は日本語を話せません", "Reading", "Reading" -> "Hiragana"]
Out[17]=
In[18]:=
ResourceFunction["JapaneseTextTokenizer"]["私は日本語を話せません", "Reading", "Reading" -> "Katakana"]
Out[18]=

Generate phonetic romaji text from Japanese:

In[19]:=
StringRiffle[
 ResourceFunction["JapaneseTextTokenizer"]["鳥が言います", "Reading", "Reading" -> "Romaji"]]
Out[19]=

MissingString

By default, a Missing object is returned when values are not found:

In[20]:=
ResourceFunction["JapaneseTextTokenizer"]["'バード\[CenterDot]セイ' は名作です!", "Reading"]
Out[20]=

Specify a string to use instead:

In[21]:=
ResourceFunction["JapaneseTextTokenizer"]["'バード\[CenterDot]セイ' は名作です!", "Reading", MissingString -> " "]
Out[21]=
In[22]:=
StringJoin[%]
Out[22]=
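
If missing readings should be dropped rather than replaced, the built-in DeleteMissing can be applied to the default result. This is a sketch that assumes the missing entries are ordinary Missing objects, as stated above; it is not taken from the original notebook, and the variable name readings is illustrative.

```wolfram
(* Request readings with the default MissingString -> None, so gaps appear as Missing objects *)
readings = ResourceFunction["JapaneseTextTokenizer"]["'バード\[CenterDot]セイ' は名作です!", "Reading"];

(* Remove the Missing entries and keep only the readings that were found *)
DeleteMissing[readings]
```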

Missing values can also typically be avoided by using full-width Japanese characters:

In[23]:=
ResourceFunction["JapaneseTextTokenizer"]["「バード・セイ」は名作です！", "Reading"]
Out[23]=

Possible Issues

Results can differ for the same word depending on whether it is written in kana or kanji:

In[24]:=
ResourceFunction["JapaneseTextTokenizer"]["ごめんなさい"]
Out[24]=
In[25]:=
ResourceFunction["JapaneseTextTokenizer"]["御免なさい"]
Out[25]=

Non-Japanese text does not produce meaningful results:

In[26]:=
ResourceFunction["JapaneseTextTokenizer"]["I wonder what this does?", "Dataset"]
Out[26]=

Punctuation should be given as full-width characters for more accurate part-of-speech tagging:

In[27]:=
Last[ResourceFunction["JapaneseTextTokenizer"]["やったー!"]]
Out[27]=
In[28]:=
Last[ResourceFunction["JapaneseTextTokenizer"]["やったー！"]]
Out[28]=
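
For text that arrives with half-width punctuation, one way to follow this advice is to convert the punctuation before tokenizing. The sketch below uses the built-in StringReplace with a small, hand-picked rule list; toFullWidthPunctuation is a hypothetical helper name, and the rule list is an assumption that covers only a few common marks.

```wolfram
(* Hypothetical helper: map a few half-width marks to their full-width counterparts *)
toFullWidthPunctuation[s_String] :=
 StringReplace[s, {"!" -> "！", "?" -> "？", "(" -> "（", ")" -> "）"}]

(* Tokenize the converted text and inspect the final token *)
Last[ResourceFunction["JapaneseTextTokenizer"][toFullWidthPunctuation["やったー!"]]]
```

The rule list can be extended as needed; marks that have no direct full-width counterpart in the text being processed are left unchanged.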

Requirements

Wolfram Language 11.3 (March 2018) or above
