Hugging Face has released a Japanese model for BERT.
The Japanese model is included in transformers.
However, I stumbled over a few things before I could actually get it to work in a Mac environment, so I'm leaving a note here.
Preliminaries: Installing MeCab
The morphological analysis engine MeCab is required to use BERT's Japanese model.
The BertJapaneseTokenizer uses MeCab for word segmentation.
This time, we will use Homebrew to install MeCab and the IPA dictionary (ipadic).
```sh
brew install mecab mecab-ipadic
```
To update the dictionary with newer vocabulary (mecab-ipadic-NEologd), create a suitable working directory and run the following commands.
```sh
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n -a
```
Install the MeCab wrapper for Python.
```sh
pip install mecab-python3
```
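To check that MeCab is visible from Python, a quick parse like the following should work. This assumes the Homebrew-installed ipadic is picked up as the default dictionary (depending on your mecab-python3 version you may need to point it at the Homebrew mecabrc); the sample sentence is arbitrary.

```python
import MeCab  # provided by mecab-python3

# Parse a sample sentence to confirm MeCab and its dictionary are set up.
tagger = MeCab.Tagger()
print(tagger.parse('青葉山で自然言語処理の研究をしています。'))
```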
Install transformers.
Follow the official documentation to install transformers.
```sh
pip install transformers
```
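The example below also needs PyTorch. Since import paths and model-name handling have changed between transformers releases, it is worth checking which versions you ended up with:

```python
import torch
import transformers

# Print the installed versions; useful when an import or model name fails.
print(transformers.__version__)
print(torch.__version__)
```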
The following example shows how to use BERT.
```python
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese-whole-word-masking')
```
The name bert-base-japanese-whole-word-masking selects the pretrained model and, with it, the tokenization method (word-level or character-level) that was used during training.
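For a rough feel of the difference, the Tohoku University group also publishes a character-level variant on the Hugging Face Hub, and comparing the two tokenizers on the same sentence makes the word-level vs. character-level split visible. The hub identifiers below (cl-tohoku/...) are what I believe the current names to be, so treat them as assumptions:

```python
from transformers import BertJapaneseTokenizer

# Word-level (MeCab + WordPiece) vs. character-level tokenization.
word_tok = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
char_tok = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')

text = '青葉山で研究をしています。'
print(word_tok.tokenize(text))   # split into words/subwords
print(char_tok.tokenize(text))   # split into single characters
```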
```python
model = BertForMaskedLM.from_pretrained('bert-base-japanese-whole-word-masking')
```
Load the BERT head that matches your task.
Here it is masked language modeling (MLM): for the position of the mask token, the model returns a score for each word in the vocabulary indicating how likely it is to fill that position.
```python
# "I am doing research on [MASK] at Aobayama."
input_ids = tokenizer.encode(f'''
青葉山で{tokenizer.mask_token}の研究をしています。
''', return_tensors='pt')
```
The tokenizer splits the text into tokens and converts each token to its ID in the vocabulary.
```python
print(input_ids)
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
```
convert_ids_to_tokens maps the IDs back to the corresponding tokens in the vocabulary, which is handy for checking the tokenization.
```python
# Find the position of the mask token.
masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1].tolist()[0]
print(masked_index)

result = model(input_ids)
pred_ids = result[0][:, masked_index].topk(5).indices.tolist()[0]
for pred_id in pred_ids:
    output_ids = input_ids.tolist()[0]
    output_ids[masked_index] = pred_id
    print(tokenizer.decode(output_ids))
```
This prints the top k=5 tokens most likely to fill the masked position.
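The values above are raw logits, so the ranking is meaningful but the numbers are not probabilities. If you want actual probabilities, you can apply a softmax over the vocabulary dimension, roughly like this:

```python
import torch.nn.functional as F

# Turn the logits at the masked position into a probability distribution
# over the vocabulary, then show the top five candidates with their scores.
probs = F.softmax(result[0][0, masked_index], dim=-1)
top = probs.topk(5)
for prob, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(tokenizer.convert_ids_to_tokens([token_id])[0], round(prob, 3))
```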
By configuring the model call to output its hidden states, we can obtain not only the word probabilities but also distributed representations (embeddings) of the input.
```python
result = model(input_ids, output_hidden_states=True)  # also return the hidden states
hidden_states = result.hidden_states
```
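hidden_states is a tuple with one tensor per layer (plus the embedding layer), each of shape (batch, sequence length, hidden size). One common way to turn them into a fixed-length sentence vector is to mean-pool the last layer over the tokens; the pooling choice below is my own illustration, not something the library prescribes.

```python
# The last element of hidden_states is the output of the final layer,
# with shape (batch, seq_len, hidden_size).
last_layer = hidden_states[-1]

# Mean-pool over the token dimension to get one vector per sentence.
sentence_vector = last_layer.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768]) for bert-base
```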
Applications once you have distributed representations of natural language
The following links show some applications of distributed representations of words (a minimal classification sketch follows the list):
- [What are some applications of the document classification problem?](https://www.subcul-science.com/post/20200618blog-post_54/)
- How to use distributed representations: classification and generalization performance with bagging
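As a very rough sketch of the document-classification use case, the sentence vectors obtained as above can be fed to an ordinary classifier. Everything below (the embed helper, the use of scikit-learn, the toy sentences and labels) is illustrative and not part of the original article:

```python
import torch
from sklearn.linear_model import LogisticRegression

def embed(texts):
    """Mean-pooled last-layer BERT vectors for a list of texts (illustrative helper)."""
    vectors = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer.encode(text, return_tensors='pt')
            out = model(ids, output_hidden_states=True)
            vectors.append(out.hidden_states[-1].mean(dim=1).squeeze(0).numpy())
    return vectors

# Toy example: two-class classification on a handful of sentences.
texts = ['野球の試合を見た。', '新しいスマートフォンが発売された。',
         'サッカーの練習をした。', 'パソコンの性能が向上した。']
labels = [0, 1, 0, 1]  # 0 = sports, 1 = technology

clf = LogisticRegression().fit(embed(texts), labels)
print(clf.predict(embed(['テニスの大会に出た。'])))
```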
Learning how distributed representations work
Distributed representations learn the meanings of words. Why not get a feel for how they work by actually running a program?
Once you have an idea of how distributed representations work, you will better understand what BERT learns and how it learns it.
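As a small, concrete illustration of "learning the meaning of words", you can compare sentence vectors with cosine similarity, reusing the illustrative embed() helper from the classification sketch above; the example sentences are arbitrary:

```python
import torch
import torch.nn.functional as F

# Sentences a and b are about a dog running; sentence c is about stock prices.
vec_a, vec_b, vec_c = (torch.tensor(v) for v in embed(
    ['犬が公園を走っている。', '子犬が広場を駆け回っている。', '株価が大きく下落した。']))

# Semantically close sentences should score higher than unrelated ones.
print(F.cosine_similarity(vec_a, vec_b, dim=0).item())
print(F.cosine_similarity(vec_a, vec_c, dim=0).item())
```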
For more information, please refer to the following link:
First Introduction to Natural Language Processing with Google Colaboratory and Python
It will help you understand how distributed representations work in natural language processing and help you develop new natural language processing services.