Applying the Document Classification Problem
You’ve studied machine learning, but you don’t know how to use it. Sound familiar?
This is easy to overlook while you are studying: unless you keep your antennae up, you won’t learn where your new skills actually apply. A tool is only useful when it is used, so it is worth making a note of how to put a newly acquired tool to work.
Scope of the Document Classification Problem
If you have studied document classification in natural language processing, you should be able to tackle tasks such as the following.
- Spam mail detection
- News topic classification
- Extraction of important parts
- Document summarization
- Recommendations to users
- Clustering
- Sentiment analysis, etc.
It seems to have a surprisingly wide range of applications, doesn’t it?
What is the document classification problem?
It is the process of assigning one or more labels to a single document. In machine learning, we build a model to predict the labels.
Here, a document can be as short as a single word or as long as a news article; its length is not particularly important.
Labels can be binary (e.g., important or not), or cover topics and sentiments (multiclass or multi-label), and so on.
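As a concrete illustration, here is a minimal sketch of what each labeling scheme might look like (the documents and labels below are made-up examples):

```python
# Hypothetical toy data illustrating the three labeling schemes.

# Binary: each document gets one of two labels (e.g., spam or not).
binary_data = [
    ("You won a free prize! Click here now!", 1),   # spam
    ("Meeting rescheduled to 3pm tomorrow.", 0),    # not spam
]

# Multiclass: each document gets exactly one of several labels.
multiclass_data = [
    ("The central bank raised interest rates.", "economy"),
    ("The team won the championship final.", "sports"),
]

# Multi-label: each document may carry several labels at once.
multilabel_data = [
    ("Great camera, but the battery drains fast.", ["positive", "negative"]),
]
```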
There are supervised and unsupervised methods.
Supervised
In the supervised case, the labels need to be prepared by humans.
This can be done by crowdsourcing annotations, or by collecting tags from social networking sites or reviews from e-commerce sites.
Labeling every document by hand is rarely feasible, which is precisely why machine learning models that generalize from a limited amount of labeled data are useful.
Unsupervised
Unsupervised data is easy to prepare because you only need the documents themselves.
For example, data from Wikipedia can be used.
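As a rough sketch of the unsupervised case (scikit-learn, k-means, and the toy documents are my assumptions here, not prescribed by the article), you could cluster unlabeled documents like this:

```python
# Unsupervised sketch: cluster unlabeled documents with k-means
# over TF-IDF features. The documents are made-up placeholders.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The stock market rallied after the earnings report.",
    "Investors reacted to the interest rate decision.",
    "The striker scored twice in the cup final.",
    "The team clinched the title with a late goal.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)   # sparse TF-IDF matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # cluster index per document, e.g. [0 0 1 1]
```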
Features
We need to extract features from documents.
TF-IDF and distributed representations are commonly used as features.
In recent research (as of 2020), deep learning is the dominant approach.
In the supervised case, classification can then be performed on the extracted features using machine learning methods such as an SVM.
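As a minimal supervised sketch (scikit-learn and the invented spam data are assumptions for illustration, not the article’s own code), TF-IDF features can be fed into an SVM like this:

```python
# Supervised sketch: TF-IDF features fed into a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up spam-detection training data.
train_texts = [
    "Win a free iPhone, click this link now",
    "Limited offer, claim your prize today",
    "Lunch at noon? Let me know if that works",
    "Here are the minutes from yesterday's meeting",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Vectorize with TF-IDF, then classify with a linear SVM.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["Claim your free prize now"]))  # expected: ['spam']
```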
Actual usage examples
Actual usage is described in the following links.
- Memo on how to use Universal Sentence Encoder
- A note on how to use the BERT model pretrained on Japanese Wikipedia, now that it has been released
- From the Word2Vec mechanism to model training using Google Colaboratory
- How to solve document classification problems with fastText
- A note on how to use NeuralClassifier, which provides a model for solving document classification problems
- Notes on training the Japanese Sentence-BERT model
- A note on using distributed representations with a bagging classifier and its generalization performance
- Using BART (a text summarization model) with Hugging Face
Learn how distributed representations work
We will learn about distributed representations, which capture the meanings of words. Why not try to understand how distributed representations work by actually running a program?
If you get a feel for how distributed representations are learned, you will have a better understanding of what models like BERT learn and how.
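For a hands-on feel, here is a small sketch (gensim and the toy corpus are my assumptions, not necessarily what the linked article uses) that trains word vectors and inspects them:

```python
# Train a tiny Word2Vec model with gensim to watch distributed
# representations being learned. The corpus is a toy example;
# real training needs far more text (e.g., a Wikipedia dump).
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100, seed=1)

print(model.wv["king"][:5])                    # first 5 dimensions of the word vector
print(model.wv.most_similar("king", topn=3))   # nearest words in the vector space
```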
For more information, please refer to the following link:
First Introduction to Natural Language Processing with Google Colaboratory and Python
This document will help readers understand how distributed representations work in natural language processing and develop new natural language processing services.
See also
- Why is fastText so fast?
- Creating data in Natural Language Inference (NLI) format for Sentence Transformers
- On using distributed representations with bagging for class classification and generalization performance
- How to train a Japanese model with Sentence Transformers to obtain a distributed representation of a sentence
- Using BART (a text summarization model) with Hugging Face