Introduction: Natural Language Processing (NLP)
Language is humanity’s most complex invention. Unlike images, which have inherent structure, text is symbolic—meaning comes from arbitrary associations learned through exposure.
Teaching computers to understand language has proven extraordinarily difficult. Yet recent breakthroughs—particularly transformers and large language models—have made NLP one of AI’s most impactful areas.
Today, NLP systems can translate between 100+ languages, write coherent essays, answer questions, and engage in conversations. This guide covers the landscape of NLP: how text processing works, major techniques, state-of-the-art approaches, and how to build NLP systems.
NLP Fundamentals
What Is NLP?
Natural Language Processing is the intersection of:
- Linguistics: Understanding language structure and meaning
- Computer Science: Processing data efficiently
- Machine Learning: Learning patterns from examples
Goal: Enable computers to process, understand, and generate human language meaningfully.
NLP vs Computational Linguistics
NLP (Applied Focus):
- Practical applications (chatbots, translation, summarization)
- Statistical and neural approaches
- Industry focus
Computational Linguistics (Theoretical Focus):
- Language structure and grammar
- Formal linguistic theory
- Academic research
NLP Tasks Hierarchy
Shallow Tasks (Tokenization → Parsing):
- Split text into tokens
- Identify parts of speech
- Build parse trees
- Extract relationships
Mid-Level Tasks (Semantic Analysis):
- Named entity recognition
- Sentiment analysis
- Semantic role labeling
- Coreference resolution
High-Level Tasks (Understanding → Generation):
- Question answering
- Machine translation
- Text summarization
- Dialogue systems
Text Preprocessing
Tokenization
Split text into meaningful units (tokens).
Word Tokenization:
Input: "Hello, world! How are you?"
Output: ["Hello", ",", "world", "!", "How", "are", "you", "?"]
Subword Tokenization (Modern):
Input: "unbelievable"
Output: ["un", "believ", "able"] # BPE tokens
Why Subword?
- Handles unknown words
- More efficient for varied vocabulary
- Modern models use this
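As a rough illustration, here is how a pre-trained subword tokenizer behaves. This assumes the Hugging Face transformers library is installed; "bert-base-uncased" is just one example checkpoint.

```python
# Word-like vs. subword tokenization with a pre-trained WordPiece tokenizer.
# Assumes: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Hello, world! How are you?"))
# e.g. ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']

print(tokenizer.tokenize("unbelievability"))
# a rare word is split into subword pieces; the exact pieces depend on the learned vocabulary
```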
Lowercasing and Normalization
Lowercasing:
Input: "The CAT sat"
Output: "the cat sat"
Reason: Reduce vocabulary, easier matching
Character Normalization:
- Remove accents: “café” → “cafe”
- Expand contractions: “don’t” → “do not”
- Normalize whitespace
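A small, dependency-free sketch of these steps; the contraction table is a toy example, not a complete list.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Lowercase
    text = text.lower()
    # Strip accents: decompose characters, then drop combining marks ("café" -> "cafe")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Expand a few contractions (toy mapping, not exhaustive)
    contractions = {"don't": "do not", "can't": "cannot", "it's": "it is"}
    for short, full in contractions.items():
        text = text.replace(short, full)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize("The CAT  sat at the café, don't you know?"))
# -> "the cat sat at the cafe, do not you know?"
```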
Stopword Removal
Remove common words: “the”, “a”, “is”, “and”
Purpose:
- Reduce noise
- Focus on meaningful words
- Speed up processing
Caution:
- Can remove important information
- Often not needed with modern models
- Task-dependent
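A quick sketch using NLTK's English stopword list (assumes nltk is installed and its stopwords corpus has been downloaded once).

```python
# Assumes: pip install nltk, then nltk.download("stopwords") has been run once.
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "cat", "sat", "on", "the", "mat"]
content = [t for t in tokens if t not in stop_words]
print(content)  # ['cat', 'sat', 'mat']
```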
Stemming and Lemmatization
Reduce words to base form.
Stemming (Rule-based):
Input: running, runs, ran
Output: run, run, ran # Imperfect
Lemmatization (Dictionary-based):
Input: running, runs, ran
Output: run, run, run # Correct
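A brief sketch of both with NLTK; assumes nltk is installed and the WordNet data has been downloaded for the lemmatizer.

```python
# Assumes: pip install nltk, then nltk.download("wordnet") for the lemmatizer.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran"]
print([stemmer.stem(w) for w in words])                    # ['run', 'run', 'ran']  (rule-based, misses "ran")
print([lemmatizer.lemmatize(w, pos="v") for w in words])   # ['run', 'run', 'run']  (dictionary-based)
```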
Modern Approach: Often unnecessary with pre-trained models that understand morphology.
Feature Extraction
Bag of Words (BoW)
Represent text as word counts, ignoring order.
Example:
Doc 1: "cat sat on mat"
Doc 2: "dog sat on log"
Features:
         cat  sat  on  mat  dog  log
Doc 1:    1    1   1    1    0    0
Doc 2:    0    1   1    0    1    1
Pros: Simple, interpretable, fast
Cons: Loses word order, loses context
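A minimal bag-of-words sketch with scikit-learn's CountVectorizer. Note that it orders the vocabulary alphabetically, so the columns differ from the table above.

```python
# Bag-of-words with scikit-learn (assumes: pip install scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat sat on mat", "dog sat on log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'log' 'mat' 'on' 'sat'] (alphabetical)
print(X.toarray())
# [[1 0 0 1 1 1]
#  [0 1 1 0 1 1]]
```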
TF-IDF (Term Frequency-Inverse Document Frequency)
Weight words by importance.
Idea:
- Common words in corpus → low importance
- Rare words → high importance
- Words unique to document → high weight
Formula:
TF-IDF = (frequency in document) × log(total docs / docs with word)
Example:
- “the” appears in 90% of documents → low TF-IDF
- “quantum” appears in 2% of documents → high TF-IDF
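A small TF-IDF sketch with scikit-learn; the three toy documents are illustrative only.

```python
# TF-IDF weighting with scikit-learn (assumes: pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "quantum computing is the future",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# "the" appears in every document -> low weight; "quantum" in only one -> high weight
vocab = vectorizer.vocabulary_
print(X[2, vocab["quantum"]] > X[2, vocab["the"]])  # True
```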
One-Hot Encoding
Create binary vector for each word.
Example:
Vocabulary: ["cat", "dog", "sat"]
"cat" → [1, 0, 0]
"dog" → [0, 1, 0]
"sat" → [0, 0, 1]
Problem: High dimensionality for large vocabularies
Word Embeddings
Represent words as dense vectors capturing meaning.
Word2Vec (Word to Vector)
One of the most important NLP advances.
Key Idea: Train a shallow neural network on a simple prediction task (a word from its context, or its context from the word), then keep the learned weights as word vectors.
Two Approaches:
Skip-gram:
Input: "the quick brown fox"
Training pair: ("quick", "the")
("quick", "brown")
("quick", "fox")
Model learns: words that appear in similar contexts get similar representations
Continuous Bag of Words (CBOW):
Input: surrounding words ["the", "brown", "fox"]
Target: "quick"
Predict middle word from context
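A toy training sketch with gensim's Word2Vec implementation; the corpus here is far too small to learn meaningful vectors and only illustrates the API (assumes gensim is installed).

```python
# Assumes: pip install gensim
from gensim.models import Word2Vec

# Each sentence is a list of tokens; a real corpus would contain millions of them.
corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=1 -> skip-gram (predict context from word); sg=0 -> CBOW (predict word from context)
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["fox"].shape)                  # (50,): dense vector for "fox"
print(model.wv.most_similar("fox", topn=3))   # nearest neighbours (noisy on a toy corpus)
```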
Word2Vec Properties
Analogy Reasoning:
King - Man + Woman ≈ Queen
Paris - France + Italy ≈ Rome
Embeddings capture semantic relationships!
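A sketch of such an analogy query using pre-trained vectors loaded through gensim's downloader; "glove-wiki-gigaword-100" is one of the bundled vector sets and requires a one-time download.

```python
# Assumes: pip install gensim, plus internet access for the first download.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.7...)]
```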
GloVe (Global Vectors)
Alternative to Word2Vec.
Approach:
- Count co-occurrence statistics
- Matrix factorization on co-occurrence matrix
- Combines count-based and prediction-based methods
Advantage: More stable on small datasets than Word2Vec
FastText
Extension of Word2Vec using subword information.
Innovation:
- Represents words as sum of character n-grams
- Handles misspellings and rare words
- Better for morphologically rich languages
Example (character 3-grams; "<" and ">" mark word boundaries):
"hello" → <he, hel, ell, llo, lo>, plus the whole word <hello>
Contextual Embeddings (Modern)
Earlier embeddings give same representation regardless of context.
Problem:
"bat" in "baseball bat" ≠ "bat" in "flying bat"
But Word2Vec gives same embedding
Solution: Contextual embeddings
- Representation depends on surrounding context
- ELMo, BERT, and others do this
- Much more powerful
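A rough sketch showing that the same word receives different vectors in different sentences. It assumes transformers and torch are installed, that "bert-base-uncased" is used as an example checkpoint, and that the word of interest survives as a single token in that vocabulary.

```python
# Sketch: the same surface word gets different contextual vectors.
# Assumes: pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]                          # vector at the word's position

v1 = word_vector("he swung the baseball bat", "bat")
v2 = word_vector("a bat flew out of the cave", "bat")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: context shifts the embedding
```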
Sequence Models
Recurrent Neural Networks (RNNs)
Process sequences one element at a time, maintaining hidden state.
Process:
Input: [w1, w2, w3, w4, w5]
State: h0 → h1 → h2 → h3 → h4 → h5
Output: [o1, o2, o3, o4, o5]
Strength: Captures sequential dependencies
Weakness:
- Vanishing gradient problem (hard to learn long dependencies)
- Sequential processing (can’t parallelize)
LSTMs (Long Short-Term Memory)
RNN variant addressing vanishing gradient.
Innovation: Memory cells with gates
- Forget gate: what to forget
- Input gate: what to remember
- Output gate: what to output
Effect: Can capture longer-range dependencies
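A minimal PyTorch sketch of running an LSTM over a batch of embedded tokens; the sizes are arbitrary illustrations.

```python
# Assumes: pip install torch
import torch
import torch.nn as nn

batch, seq_len, embed_dim, hidden_dim = 2, 5, 32, 64

lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
x = torch.randn(batch, seq_len, embed_dim)   # e.g. word embeddings for 5 tokens

outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)  # (2, 5, 64): one hidden state per time step
print(h_n.shape)      # (1, 2, 64): final hidden state (a "summary" of the sequence)
print(c_n.shape)      # (1, 2, 64): final memory cell state
```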
GRUs (Gated Recurrent Units)
Simplified LSTM with fewer parameters.
Same advantages: Longer-range dependencies
Different: Fewer gates, slightly faster
Bidirectional Models
Process sequences in both directions.
Unidirectional:
- Only context before current position
- Useful for generation
Bidirectional:
- Context before and after
- Better for understanding tasks
- Not directly usable for left-to-right generation (the model already sees the future tokens)
Transformer Models
The Transformer Breakthrough
2017 paper “Attention Is All You Need” revolutionized NLP.
Key Innovation: Self-attention allows parallel processing while capturing long-range dependencies.
Advantages Over RNNs:
- Parallel processing (fast training)
- Better long-range dependencies
- Scales to larger models
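At the heart of self-attention is a scaled dot-product between token vectors. Below is a stripped-down, single-head sketch that omits the learned query/key/value projections a real Transformer layer would apply.

```python
# Assumes: pip install torch
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention, without learned projections."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5   # (seq_len, seq_len) pairwise similarities
    weights = F.softmax(scores, dim=-1)           # each position attends to every position
    return weights @ x                            # weighted mix of all positions

tokens = torch.randn(6, 32)           # 6 token vectors of dimension 32
print(self_attention(tokens).shape)   # torch.Size([6, 32])
```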
BERT (Bidirectional Encoder Representations)
Pre-trained transformer encoder for understanding.
Training:
- Randomly mask 15% of input tokens
- Predict masked words
- Predict if next sentence follows
Strengths:
- Excellent for classification, tagging, understanding
- Works well with limited fine-tuning data
- Strong transfer learning performance
Limitations:
- Bidirectional encoder, so not suited to autoregressive text generation
- Knowledge is fixed at pre-training time (can’t adapt to new information without further training)
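One quick way to see the masked-token objective in action is the fill-mask pipeline from transformers; the checkpoint name is just one example.

```python
# Assumes: pip install transformers
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK].", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
# e.g. "paris" with by far the highest score, followed by other city names
```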
GPT (Generative Pre-trained Transformer)
Pre-trained transformer decoder for generation.
Training:
- Simple objective: predict next token
- No masked-token objective; purely causal (left-to-right) prediction
Strengths:
- Excellent at text generation
- Few-shot learning ability
- Scaling improves ability
Versions:
- GPT-2: 1.5B parameters
- GPT-3: 175B parameters
- GPT-4: parameter count not disclosed (outside estimates around 1T)
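A minimal generation sketch with the text-generation pipeline and the small GPT-2 checkpoint; the output varies from run to run.

```python
# Assumes: pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Natural language processing is", max_new_tokens=20, num_return_sequences=1)
print(out[0]["generated_text"])   # prompt plus a sampled continuation
```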
T5 (Text-to-Text Transfer Transformer)
Unified framework treating all tasks as text-to-text.
Philosophy:
Classification: "classify sentiment: great movie!" → "positive"
Translation: "translate English to French: hello" → "bonjour"
Summarization: "summarize: [text]" → "[summary]"
Q&A: "question: what is 2+2?" → "4"
All tasks share the same architecture and the same text-in, text-out format
Advantage: Single model for many tasks
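A sketch of the text-to-text idea with a small T5 checkpoint via the text2text-generation pipeline (assumes transformers is installed).

```python
# Assumes: pip install transformers torch
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to French: Hello, how are you?")[0]["generated_text"])
# e.g. "Bonjour, comment êtes-vous?"
```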
NLP Tasks and Applications
Sentiment Analysis
Determine emotional tone of text.
Levels:
- Binary: Positive/Negative
- Multi-class: Very Negative, Negative, Neutral, Positive, Very Positive
- Aspect-based: “Great food but terrible service”
Applications:
- Brand monitoring
- Customer feedback analysis
- Content moderation
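A minimal baseline is a pre-trained classifier behind the sentiment-analysis pipeline; transformers downloads a small default English model if none is specified.

```python
# Assumes: pip install transformers torch
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default English model
print(classifier(["Great movie!", "Terrible experience"]))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}, {'label': 'NEGATIVE', 'score': 0.99...}]
```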
Named Entity Recognition (NER)
Identify and classify named entities.
Input: "Apple CEO Tim Cook announced..."
Output: [Apple: Company] [Tim Cook: Person]
Entity Types: Person, Organization, Location, Date, etc.
Applications:
- Information extraction
- Knowledge graphs
- Resume parsing
Machine Translation
Translate between languages.
Modern Approach (Seq2Seq with Attention):
- Encode source language
- Decode into target language
- Attention mechanism helps alignment
Challenges:
- Preserving meaning
- Idioms and cultural context
- Low-resource languages
Question Answering
Answer questions based on context.
Types:
- Extractive: Answer is span from context
- Generative: Generate answer from scratch
- Conversational: Multi-turn dialogue
Machine Reading Comprehension
Understand text and answer questions about it.
Example:
Context: "The Eiffel Tower is located in Paris, France.
It was built in 1889."
Question: "Where is the Eiffel Tower?"
Answer: "Paris, France"
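The same example can be run through an extractive question-answering pipeline; a default model is downloaded if none is specified.

```python
# Assumes: pip install transformers torch
from transformers import pipeline

qa = pipeline("question-answering")   # default extractive QA model
result = qa(
    question="Where is the Eiffel Tower?",
    context="The Eiffel Tower is located in Paris, France. It was built in 1889.",
)
print(result["answer"])   # e.g. "Paris, France"
```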
Summarization
Condense text while preserving key information.
Abstractive: Generate summary, not copying text
Extractive: Select important sentences
Modern Language Models
Scale Changes Everything
Remarkable pattern: Larger models develop unexpected abilities.
Scaling Curves:
- Error rate drops predictably with scale
- Specific capabilities emerge at thresholds
- In-context learning, reasoning, apparent knowledge
Implications:
- Bigger models have, so far, consistently performed better
- Emergent abilities unpredictable
- Current understanding incomplete
Instruction Following
Fine-tuning with instruction-response pairs.
Effect: Models follow human instructions better
Process:
- Train on diverse instructions
- Optimize against human preference ratings (RLHF)
- Helpful, harmless, honest responses
In-Context Learning
Learning from examples in prompt, without fine-tuning.
Few-shot Learning:
Prompt:
"Classify sentiment:
Example 1: 'Great movie!' → Positive
Example 2: 'Terrible experience' → Negative
New: 'Amazing service' →"
Response: "Positive"
Emergent Ability: Small models can’t do this; large models can.
Building NLP Systems
Pipeline Approach
1. Text Input → 2. Preprocessing → 3. Feature Extraction → 4. Model → 5. Post-Processing → 6. Output
Modern Approach (Pre-trained + Fine-tune)
- Start with Pre-trained Model (BERT, GPT, T5)
- Fine-tune on Your Task (small dataset sufficient)
- Evaluate and Iterate
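A condensed version of that recipe, sketched with transformers and datasets; the dataset and checkpoint names are illustrative choices, not requirements.

```python
# Condensed fine-tuning sketch with Hugging Face transformers + datasets.
# Assumes: pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")                                        # binary sentiment dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=16)

# Small subsets keep the demo fast; use the full splits for real results.
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=dataset["test"].select(range(500)))
trainer.train()
print(trainer.evaluate())   # evaluation loss on the held-out subset
```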
Production Considerations
Latency:
- Sub-second response needed for real-time
- Use smaller models or caching
- Batch processing for offline tasks
Scalability:
- Load balancing
- Model serving infrastructure
- Cost management
Monitoring:
- Track accuracy in production
- Detect distribution shift
- Monitor data quality
Challenges and Future
Remaining Challenges
Robust Understanding:
- Models still make silly mistakes
- Adversarial examples confuse models
- Out-of-distribution generalization poor
Interpretability:
- Why did model predict this?
- Hard to explain transformer decisions
Bias and Fairness:
- Training data reflects historical biases
- Models amplify existing biases
- Fair representation in data needed
Efficiency:
- Large models expensive to run
- Compression and distillation help
- Trade-off between power and efficiency
Future Directions
Multimodal Understanding:
- Text + images + audio together
- CLIP, DALL-E, and GPT-4o lead the way
- More comprehensive understanding
Knowledge Integration:
- Combine neural with symbolic approaches
- Integrate with knowledge bases
- Reduce hallucinations
Interactive Learning:
- Learn from user feedback
- Humans in the loop
- Continuous improvement
Reasoning:
- Multi-step reasoning
- Mathematical problem-solving
- Causal inference
Key Takeaways
✓ Text preprocessing – Tokenization, normalization, handling special cases
✓ Feature extraction – BoW, TF-IDF, embeddings capture meaning
✓ Word embeddings – Word2Vec, GloVe, contextual embeddings
✓ RNNs/LSTMs – Process sequences, capture dependencies
✓ Transformers – Self-attention, parallel processing, powerful
✓ Pre-trained models – BERT, GPT, T5 foundation for modern NLP
✓ Transfer learning – Fine-tune on your task, minimal data needed
✓ Many tasks – Classification, translation, summarization, Q&A
✓ Emerging abilities – Scale reveals unexpected capabilities
✓ Challenges remain – Robustness, interpretability, efficiency, bias
Frequently Asked Questions
Q: Do I need linguistics background for NLP?
A: Helpful but not required. Modern approaches are largely empirical; linguistic knowledge is useful but not necessary.
Q: Should I use BERT or GPT?
A: BERT for understanding tasks (classification, tagging). GPT for generation. Choose based on your task.
Q: How do I handle domain-specific language?
A: Fine-tune on domain data. If you have a large domain corpus, consider domain-specific pre-training. Transfer learning is usually sufficient.
Q: Why do language models hallucinate?
A: They are trained to predict the next token, not to verify correctness. When uncertain, they continue with plausible-sounding text, which is often false.
Q: Can models truly understand language?
A: Debated. They demonstrate understanding on many tasks but fail on others. Likely different from human understanding.

