
Text Classification

Artificial Intelligence

  • The intelligence exhibited by machines


  • How to create computers and computer software that are capable of intelligent behavior


Machine Learning

  • Subfield of artificial intelligence
  • Study of pattern recognition and computational learning theory
  • Creating programs that can automatically learn rules from data

“Field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)

  • Traditional: Write programs using hard-coded (fixed) rules


  • Machine Learning: Learn rules by looking at some training data

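A rough sketch of the contrast, assuming scikit-learn and a made-up spam-filtering task (neither the library choice nor the data comes from the original notes):

```python
# Traditional: the rule is hard-coded by the programmer.
def is_spam_rule(text):
    return "free money" in text.lower()

# Machine Learning: the rule is learned from labeled training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train = ["free money now!!!", "team meeting at 10",
         "claim your free money", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (invented examples)

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train), labels)

print(is_spam_rule("free money inside"))                  # fixed rule fires
print(clf.predict(vec.transform(["free money inside"])))  # learned rule
```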


  • Supervised Learning
    • Predictive approach
    • To learn a mapping from inputs to outputs
    • Examples: classification, regression
  • Unsupervised Learning
    • Descriptive approach
    • To find interesting patterns in the data
    • Examples: clustering, dimensionality reduction (both settings are sketched below)
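
A minimal sketch of both settings on toy 2-D data, assuming scikit-learn (the points and labels are invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.2], [1.1, 0.9], [3.0, 3.1], [2.9, 3.3]])

# Supervised (predictive): inputs are paired with labels; learn x -> y.
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.05, 1.0]]))  # -> [0]

# Unsupervised (descriptive): no labels; find structure in the data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # two discovered clusters
```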

Supervised Learning

  • Given: Training data as labeled instances {(x^(1), y^(1)), …, (x^(N), y^(N))}
  • Goal: Learn a rule (f:x → y) to predict outputs y for new inputs x
  • Example:
    • Data: ((Blue, Square, 10), yes), …, ((Red, Ellipse, 20.7), yes)
    • Task: For new inputs (Blue, Crescent, 10) and (Yellow, Circle, 12), predict yes or no (see the sketch below)
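
A possible sketch of this toy problem with scikit-learn; the extra training instances are made up here purely to have enough data to fit on:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

train_x = [
    {"color": "Blue",   "shape": "Square",   "size": 10.0},
    {"color": "Red",    "shape": "Ellipse",  "size": 20.7},
    {"color": "Yellow", "shape": "Square",   "size": 5.0},   # invented
    {"color": "Red",    "shape": "Crescent", "size": 12.0},  # invented
]
train_y = ["yes", "yes", "no", "no"]

vec = DictVectorizer(sparse=False)            # one-hot encodes color/shape
X = vec.fit_transform(train_x)
f = DecisionTreeClassifier().fit(X, train_y)  # the learned rule f: x -> y

new_x = [
    {"color": "Blue",   "shape": "Crescent", "size": 10.0},
    {"color": "Yellow", "shape": "Circle",   "size": 12.0},
]
print(f.predict(vec.transform(new_x)))        # predicted yes/no for new inputs
```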


  • Classification: discrete-valued outputs
  • Example:
    • Data: size measurements and labels {((Height, Weight), Cat/Dog)}
    • Task: Predict whether an animal is a cat or a dog given new size information
    • Method: Find a linear or nonlinear separator, as sketched below
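
One way to find such a linear separator, sketched with logistic regression on invented (Height, Weight) measurements:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# (height in cm, weight in kg); the numbers are made up for illustration
X = np.array([[25, 4.0], [23, 3.5], [27, 4.5],      # cats
              [60, 25.0], [55, 22.0], [70, 30.0]])  # dogs
y = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[26, 4.2], [58, 24.0]]))  # -> ['cat' 'dog']

# The learned separator is the line w·x + b = 0:
print(clf.coef_, clf.intercept_)
```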


Classification

  • A mapping h from input data x to a label y
    • x ∈ X, where X is the instance space (e.g., all documents)
    • y ∈ Y, where Y is an enumerable output space (e.g., categories)
    • x: a single document
    • y: politics
  • Image ⇒ Digit


  • Mail ⇒ spam or not

  • Text ⇒ Gender of the author


  • Movie review ⇒ Rating


  • Document ⇒ Category


Text Classification Problem: Given a text w = (w_1, w_2, …, w_T), with each word w_t drawn from a vocabulary V, predict a label y ∈ Y

Classifier

  • Naive Bayes
  • Perceptron
  • Logistic regression
  • Support Vector Machine
  • Random Forests
  • Deep learning models
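
A sketch that fits several of the classifiers listed above on the same toy text data, using their scikit-learn implementations (the movie-review snippets are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

texts = ["the film was great", "terrible acting", "loved the plot",
         "boring and slow", "a fantastic movie", "worst film ever"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

X = CountVectorizer().fit_transform(texts)  # bag-of-words features
for model in [MultinomialNB(), Perceptron(), LogisticRegression(),
              LinearSVC(), RandomForestClassifier(n_estimators=50)]:
    print(type(model).__name__, model.fit(X, labels).predict(X[:2]))
```

All five share the same interface (fit, then predict), which is what makes it easy to swap classifiers over a fixed feature representation.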

Natural Language Processing


  • Word Representations
    • Count-based
      • created by a simple function of the counts of nearby words
      • tf-idf, PPMI
    • Class-based
      • created through hierarchical clustering
      • Brown clusters
    • Distributed prediction-based embeddings
      • created by training a classifier to distinguish nearby and far-away words
      • Word2vec, fastText
    • Distributed contextual embeddings from language models
      • created by pretraining a language model and taking its contextual embeddings
      • ELMo, BERT, GPT
  • Document Representations
    • Count-based
      • Bag-of-words
    • Neural network based
      • RNN
      • Neural language model
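
As a concrete instance of the count-based entries above, a tf-idf sketch with scikit-learn (the documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # one row per document
print(vec.get_feature_names_out())   # the learned vocabulary
print(X.toarray().round(2))          # tf-idf weight per (document, word)
```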

Bag-of-Words


  • One challenge is that the sequential representation (w_1, w_2, …, w_T) may have a different length T for every document
  • The bag-of-words is a fixed-length representation, which consists of a vector of word counts:

x = (x_1, x_2, …, x_|V|), where x_j is the number of times word j appears in the document

  • The length of x is equal to the size of the vocabulary V
  • For each x, there may be many possible w, depending on word order
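
A from-scratch sketch of this fixed-length representation (the token lists are made up):

```python
from collections import Counter

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "ate"]]

vocab = sorted({w for doc in docs for w in doc})  # the vocabulary V
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    x = [0] * len(vocab)              # len(x) == |V|, fixed for all docs
    for w, count in Counter(tokens).items():
        x[index[w]] = count
    return x

for doc in docs:
    print(bag_of_words(doc))
# Word order is discarded, so many different w map to the same x.
```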

Neural Networks
