A summary of what I have done with fastText on the document classification problem.
- Facebook Research has published fastText, a library that can be used for document classification.
- fastText is easy to install in a Python environment.
- Its run time is fast.
## Preliminaries
I decided to tackle the task of document classification, and initially tried NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit. However, it was not very accurate.
My boss then pointed me to [2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification, so I gave fastText a try.
## Features of the fastText library
fastText is a library that solves the document classification problem in an end-to-end manner.
It is therefore designed to optimize document vectors for the classification task.
It is very fast, training in just a few seconds, which makes it a good starting point.
The performance is also not bad, and [hyperparameter tuning](https://fasttext.cc/docs/en/autotune.html) is available.
The basic usage is as described in the linked documentation.
Just specify any parameters you want to fix (e.g., the number of dimensions of the distributed representation) and run the following:

```py
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')
```

Here, 'cooking.train' and 'cooking.valid' are text files in the specified format.
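If you want to keep the tuned model around, something like the following works (a minimal sketch; `cooking_model.bin` is just a placeholder name, and 'cooking.valid' is the validation file from the official tutorial):

```py
# Continuing from the autotuned `model` above.
# test() returns (number of examples, precision@1, recall@1)
n_examples, precision_at_1, recall_at_1 = model.test('cooking.valid')
print(n_examples, precision_at_1, recall_at_1)

# Save the tuned model; it can be reloaded later with fasttext.load_model()
model.save_model('cooking_model.bin')
```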
## How to use the fastText library
You can install the fastText library by running:

```
pip install fasttext
```

As of June 2020, this command is all you need to set up a fastText environment for Python. It is very easy.
## How to use it for document classification problems
We need to create a 'train.txt' file as the training data.
The data format is a text file in which each line consists of a label written as `__label__` followed by a tokenized document.
train.txt:

```
__label__1 Love is heavy
__label__2 I love you
```
Test data and so on should be created in the same format as the training data.
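As a minimal sketch of preparing such a file (the sample texts and labels below are made up; real documents must already be tokenized, e.g. with MeCab for Japanese text):

```py
# Hypothetical labeled samples: (label, tokenized text)
samples = [
    ("1", "Love is heavy"),
    ("2", "I love you"),
]

# Write one "__label__<label> <tokenized text>" line per sample
with open("train.txt", "w", encoding="utf-8") as f:
    for label, text in samples:
        f.write(f"__label__{label} {text}\n")
```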
To train the model, run the following code.
```py
import fasttext

model = fasttext.train_supervised('train.txt')
```
Training time depends on the amount of training data, but it runs on a CPU; with the data at hand (about 1,000 documents), training finished in a few seconds.
Predictions from the trained model can be obtained as follows.
model.predict("Do you believe in love?")
``` model.predict("Do you believe in love?")
This returns an array of predicted labels together with their predicted probabilities.
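As a small illustrative sketch (the exact numbers depend on your trained model), the return value can be unpacked like this, and `k` controls how many labels are returned:

```py
# predict() returns a tuple: (labels, probabilities as a NumPy array)
labels, probs = model.predict("Do you believe in love?")
print(labels, probs)  # e.g. (('__label__2',), array([0.9...])); values vary by model

# Ask for the top-2 labels with their probabilities
labels, probs = model.predict("Do you believe in love?", k=2)
```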
This solves the document classification problem using fastText.
## Evaluation
You can use scikit-learn to compute confusion matrices and compare accuracy.
The following examples are quoted from the official scikit-learn documentation.
```py
from sklearn.metrics import confusion_matrix
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
# array([[2, 0, 0],
#        [0, 0, 1],
#        [1, 0, 2]])
```
```py
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
#               precision    recall  f1-score   support
#
#      class 0       0.50      1.00      0.67         1
#      class 1       0.00      0.00      0.00         1
#      class 2       1.00      0.67      0.80         3
#
#     accuracy                           0.60         5
#    macro avg       0.50      0.56      0.49         5
# weighted avg       0.70      0.60      0.61         5
```
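To connect this to fastText, one possible sketch (assuming `model` is the classifier trained above and `test_samples` is a hypothetical list of (label, tokenized text) pairs in the same format as train.txt) is:

```py
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical held-out samples: (label, tokenized text)
test_samples = [("1", "Love is heavy"), ("2", "I love you")]

# Ground-truth labels in fastText's __label__ format
y_true = [f"__label__{label}" for label, _ in test_samples]

# predict() returns (labels, probabilities); take the top-1 label for each text
y_pred = [model.predict(text)[0][0] for _, text in test_samples]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```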
## Summary
By using fastText and scikit-learn, we can easily tackle document classification problems in a Python environment.
Since Python is also convenient for creating datasets, this is a good option if you want to try document classification for the first time.
For practical applications, please refer to the following link.
[What are some applications of the document classification problem?](https://www.subcul-science.com/post/20200618blog-post_54/)
## Learn how distributed representations work
Distributed representations are used to learn the meanings of words. Why not try to understand how they work by actually running a program?
If you get a sense of how distributed representations are learned, you will have a better understanding of what models such as BERT learn and how.
For more information, please refer to the following link:
First Introduction to Natural Language Processing with Google Colaboratory and Python
This article explains how distributed representations work in natural language processing and should help readers develop new natural language processing services.
## See also
- Summary of how to compute distributed representations for Japanese
- Why is fasttext so fast?
- Creating data in Natural Language Inference (NLI) format for Sentence transformer
- On the use of distributed representations bagging for class classification and generalization performance
- How to train a Japanese model with Sentence transformer to get a distributed representation of a sentence