Text Classification

Artificial Intelligence

  • The intelligence exhibited by machines


  • How to create computers and computer software that are capable of intelligent behavior


Machine Learning

  • Subfield of artificial intelligence
  • Study of pattern recognition and computational learning theory
  • Creating programs that can automatically learn rules from data

“Field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)

  • Traditional: Write programs using hard-coded (fixed) rules


  • Machine Learning: Learn rules by looking at some training data
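The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the toy weights, the hand-picked threshold of 10.0, and the function names are all assumptions made up for the example.

```python
# Hypothetical toy data: (weight in kg, label). Values are invented for illustration.
TRAIN = [(3.5, "cat"), (4.2, "cat"), (5.0, "cat"),
         (9.0, "dog"), (12.5, "dog"), (20.0, "dog")]

def hard_coded_rule(weight):
    """Traditional approach: the programmer fixes the threshold by hand."""
    return "dog" if weight > 10.0 else "cat"

def learn_threshold(data):
    """Machine learning approach: pick the threshold that makes the fewest
    mistakes on the training data."""
    weights = sorted(w for w, _ in data)
    # Candidate thresholds: midpoints between consecutive sorted weights.
    candidates = [(a + b) / 2 for a, b in zip(weights, weights[1:])]

    def errors(t):
        return sum(("dog" if w > t else "cat") != label for w, label in data)

    return min(candidates, key=errors)

threshold = learn_threshold(TRAIN)

def learned_rule(weight):
    """Same form of rule as above, but the threshold came from the data."""
    return "dog" if weight > threshold else "cat"
```

On this toy data the hand-written rule misclassifies the 9.0 kg dog, while the learned threshold separates all six training examples.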


  • Supervised Learning
    • Predictive approach
    • To learn a mapping from inputs to outputs
    • Example) classification, regression
  • Unsupervised Learning
    • Descriptive approach
    • To find interesting patterns in the data
    • Example) clustering, dimensionality reduction

Supervised Learning

  • Given: Training data as labeled instances {(x^(1), y^(1)), …, (x^(N), y^(N))}
  • Goal: Learn a rule (f:x → y) to predict outputs y for new inputs x
  • Example)
    • Data: ((Blue, Square, 10), yes), …,((Red, Ellipse, 20.7), yes)
    • Task: For new inputs (Blue, Crescent, 10), (Yellow, Circle, 12), are they yes/no?


  • Classification: discrete-valued outputs
  • Examples)
    • Data: Size and label {(Height, Weight), Cat/Dog}
    • Task: Predict whether an animal is a cat or dog given new size information
    • Method: Finding a linear or nonlinear separator
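As one possible way to find a linear separator for the cat/dog example, the sketch below uses the perceptron update rule. The measurements are hypothetical, and the perceptron is only one of several methods that could be used here; max_epochs is an illustrative safety cap.

```python
# Hypothetical (height cm, weight kg) measurements, invented for illustration.
# Labels: +1 = dog, -1 = cat.
TRAIN = [((25.0, 4.0), -1), ((23.0, 3.5), -1), ((28.0, 5.0), -1),
         ((50.0, 20.0), +1), ((60.0, 30.0), +1), ((45.0, 15.0), +1)]

def train_perceptron(data, max_epochs=5000):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (height, weight), y in data:
            score = w[0] * height + w[1] * weight + b
            if y * score <= 0:  # misclassified, or exactly on the boundary
                w[0] += y * height
                w[1] += y * weight
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b

def predict(point, w, b):
    """Classify a new (height, weight) pair with the learned separator."""
    height, weight = point
    return "dog" if w[0] * height + w[1] * weight + b > 0 else "cat"

w, b = train_perceptron(TRAIN)
print(predict((55.0, 25.0), w, b))
```

The loop stops as soon as one full pass makes no mistakes, which is guaranteed to happen eventually only when the training data is linearly separable, as this toy set is.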



  • A mapping h from input data x to a label y
    • x ∈ X, X is the instance space (i.e. all documents)
    • y ∈ Y, Y is the enumerable output space (i.e. categories)
    • x: a single document
    • y: politics
  • Image ⇒ Digit


  • Mail ⇒ spam or not

  • Text ⇒ Gender of the author


  • Movie review ⇒ Rating


  • Document ⇒ Category


Text Classification Problem: Given a text w = (w1, w2, …, wT) ∈ V*, predict a label y ∈ Y


  • Naive Bayes
  • Perceptron
  • Logistic regression
  • Support Vector Machine
  • Random Forests
  • Deep learning models
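To make the first method in the list concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier for the spam example. The training messages are hypothetical, whitespace tokenization and add-one smoothing are assumed choices, and words never seen in training are simply ignored.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled messages, invented for illustration.
EXAMPLES = [("win cash prize now", "spam"),
            ("cheap prize click now", "spam"),
            ("meeting agenda for monday", "ham"),
            ("lunch on monday", "ham")]

def train_naive_bayes(examples):
    """Count documents per label and words per label."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # assumption: skip words unseen in training
            # Add-one (Laplace) smoothed word likelihood.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(EXAMPLES)
print(predict("win a cheap prize", model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.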

Natural Language Processing


  • Word Representations
    • Count-based
      • created by a simple function of the counts of nearby words
      • tf-idf, PPMI
    • Class-based
      • created through hierarchical clustering
      • Brown clusters
    • Distributed prediction-based embeddings
      • created by training a classifier to distinguish nearby and far-away words
      • Word2vec, Fasttext
    • Distributed contextual embeddings from language models
      • Embeddings from language model
      • ELMo, BERT, GPT
  • Document Representations
    • Count-based
      • Bag-of-words
    • Neural network based
      • RNN
      • Neural language model
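The count-based tf-idf weighting mentioned above can be sketched as follows. This assumes one common variant (raw term frequency multiplied by log(N / document frequency)) over a term-document setting; other variants use different smoothing or context windows, and the three documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
DOCS = ["the dog chased the cat", "the cat sat", "the dog barked"]

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    Assumed variant: weight = tf * log(N / df), where tf is the raw count of
    the word in the document and df is the number of documents containing it.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: count each word once per document.
    df = Counter(word for doc in tokenized for word in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({word: tf[word] * math.log(n_docs / df[word])
                        for word in tf})
    return weights

weights = tf_idf(DOCS)
print(weights[0])
```

A word that occurs in every document, such as "the" here, gets weight 0, while a word confined to one document, such as "chased", gets the largest idf factor.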



  • One challenge is that the sequential representation (w1, w2, …, wT) may have a different length T for every document
  • The bag-of-words is a fixed-length representation, which consists of a vector of word counts x, where xj is the count of word j in the document


  • The length of x is equal to the size of the vocabulary V
  • For each x, there may be many possible w, depending on word order
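The points above can be illustrated with a short sketch that builds bag-of-words vectors. Whitespace tokenization, the sorted vocabulary order, and the example sentences are assumptions made for the illustration.

```python
from collections import Counter

# Hypothetical documents, invented for illustration.
DOCS = ["the dog chased the cat", "the cat chased the dog", "a dog barked"]

def build_vocabulary(documents):
    """Map each distinct word to a fixed index (sorted for reproducibility)."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: j for j, word in enumerate(words)}

def bag_of_words(document, vocab):
    """Return the count vector x, where x[j] is the count of word j."""
    counts = Counter(document.split())
    x = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # words outside the vocabulary are dropped
            x[vocab[word]] = count
    return x

vocab = build_vocabulary(DOCS)
vectors = [bag_of_words(doc, vocab) for doc in DOCS]
for doc, x in zip(DOCS, vectors):
    print(x, doc)
```

Every vector has the same length as the vocabulary regardless of document length, and the first two documents map to the same vector even though their word order differs.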

Neural Networks
