PyTorch nn.Module, Loss Functions and Optimizers

PyTorch models are usually built with nn.Module. A module owns parameters, defines a forward pass, and can contain other modules. Once you understand modules, loss functions, and optimizers, most training scripts become readable.

This lesson fills the gap between tensors/autograd and full training loops. You will learn what model.parameters() returns, why losses must be scalar values, how optimizers update weights, and how parameter groups let you train different parts of a model differently.

Read the concept first, then trace the example line by line. The important habit is to connect the rule to visible behavior instead of memorizing only the name.

Mental Model

A training step computes predictions, turns error into a scalar loss, backpropagates gradients into parameters, and lets the optimizer update those parameters.

How nn.Module Organizes Models

Every layer assigned as an attribute of an nn.Module is registered automatically. That registration is why model.parameters(), model.to(device), model.train(), model.eval(), and state_dict() work across nested modules.

Define layers in __init__.
Define tensor flow in forward.
Avoid creating trainable layers inside forward.
Use model.children(), named_modules(), and named_parameters() for inspection.
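
The registration and inspection points above can be sketched with a tiny model. This is a minimal illustration, not part of the original lesson: TinyNet and its layer sizes are hypothetical names chosen for the demo.

```python
import torch
from torch import nn

class TinyNet(nn.Module):  # hypothetical example model
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 8)   # registered under the name "hidden"
        self.output = nn.Linear(8, 2)   # registered under the name "output"

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x)))

model = TinyNet()

# Direct child modules, in registration order.
child_names = [name for name, _ in model.named_children()]
print(child_names)

# Every trainable tensor with its dotted name and shape.
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)

# state_dict() uses the same dotted names as keys.
state_keys = list(model.state_dict().keys())

# Total number of trainable values the optimizer would update.
total_params = sum(p.numel() for p in model.parameters())
print(total_params)
```

Because the names come from the attribute names in __init__, renaming an attribute also changes the keys in state_dict(), which matters when loading saved checkpoints.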

Choosing Loss and Optimizer

Classification, regression, ranking, segmentation, and language modeling use different loss functions. The optimizer then uses gradients from that loss to update model weights. A wrong loss function can make a correct architecture fail.

Use CrossEntropyLoss for multi-class classification with raw logits.
Use BCEWithLogitsLoss for binary or multi-label classification.
Use MSELoss or L1Loss for regression.
Start with AdamW for many deep learning projects, then tune from there.
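
The loss choices above differ mainly in the target shape and dtype they expect. The sketch below uses random tensors with made-up sizes (6 samples, 4 classes, 3 labels) purely to show those shapes, and it also shows why backward() needs a scalar loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch = 6

# Multi-class: logits [batch, classes], targets are int64 class indices [batch].
ce_loss = nn.CrossEntropyLoss()(torch.randn(batch, 4), torch.randint(0, 4, (batch,)))

# Multi-label: logits and float targets share the shape [batch, labels].
bce_targets = torch.randint(0, 2, (batch, 3)).float()
bce_loss = nn.BCEWithLogitsLoss()(torch.randn(batch, 3), bce_targets)

# Regression: predictions and targets share the shape [batch, 1].
mse_loss = nn.MSELoss()(torch.randn(batch, 1), torch.randn(batch, 1))

# With the default reduction="mean", each loss is a 0-dimensional tensor.
print(ce_loss.shape, bce_loss.shape, mse_loss.shape)

# reduction="none" keeps one value per element. backward() without arguments
# only works on a scalar, so the per-element loss must be reduced first.
pred = torch.randn(batch, 1, requires_grad=True)
per_sample = nn.MSELoss(reduction="none")(pred, torch.randn(batch, 1))
try:
    per_sample.backward()
    raised = False
except RuntimeError:
    raised = True
print("non-scalar backward raised:", raised)

per_sample.mean().backward()   # reducing to a scalar makes backward() valid
print(pred.grad.shape)
```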

Detailed Explanation of Modules, Losses, and Optimizers

Modules, losses, and optimizers become much easier when you separate the concept from the syntax. First identify the problem being solved (what the model should predict), then the state being changed (the parameters), and finally the evidence that the change worked (a falling loss or an improved metric).

In PyTorch, this topic should be studied through tensor shape, dtype, device, gradient flow, loss movement, and reproducibility. Those points explain not only how to use the feature, but also why it fails when the wrong assumption is made.

A good way to learn this page is to read the normal path once, run or trace the example, then intentionally change one input to observe the different result. That one change teaches more than memorizing several definitions.

Write the goal of the model or training step before touching code.
Identify the normal case, edge case, and failure case.
Trace what changes before and after the operation, for example parameter values before and after optimizer.step().
Use printed shapes, loss values, logs, or metrics to verify the result.
Record the mistake that would confuse a beginner and the exact fix.

Beginner-Friendly Walkthrough

Start with a tiny scenario. For example, imagine one batch of data passing through one small model for a single training step. Keep the scenario small enough that every step can be explained without skipping details.

Next, describe the movement of information. Where does the input start? Which rule or component handles it? What result should appear? If the result is wrong, where would you inspect first?

Finally, compare two outcomes. The correct outcome proves that you understand the main rule. The incorrect outcome teaches the symptom, which is what you will recognize later during debugging or interviews.

Normal path: valid input produces the expected result.
Boundary path: the smallest, largest, empty, or unusual input still behaves predictably.
Error path: a realistic mistake creates a visible symptom.
Fix path: one focused correction removes the symptom without changing unrelated code.

Custom Module with Shape Comments

A readable forward pass makes debugging much easier.

import torch
from torch import nn

class TabularClassifier(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        # x: [batch_size, input_dim]
        return self.network(x)  # logits: [batch_size, num_classes]

model = TabularClassifier(input_dim=20, num_classes=4)
logits = model(torch.randn(8, 20))
print(logits.shape)  # torch.Size([8, 4])

Return logits from the model; apply softmax only for display or inference probabilities.
Dropout behaves differently in train and eval mode.
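
Both notes can be checked directly. The sketch below uses a small stand-in nn.Sequential model (not the TabularClassifier above) with a deliberately high dropout rate of 0.5 so the difference between modes is easy to see.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 4))
x = torch.randn(8, 20)

model.train()                 # dropout active: a new random mask on every call
train_a = model(x)
train_b = model(x)

model.eval()                  # dropout disabled: the same input gives the same output
with torch.no_grad():         # no gradients needed for inference
    eval_a = model(x)
    eval_b = model(x)
    probs = torch.softmax(eval_a, dim=1)   # probabilities for display only

print("train outputs equal:", torch.equal(train_a, train_b))
print("eval outputs equal:", torch.equal(eval_a, eval_b))
print("row sums:", probs.sum(dim=1))
```

The model itself still returns logits in both modes; softmax is applied outside the model, after the forward pass.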

Optimizer Parameter Groups

Parameter groups let you apply different learning rates or weight decay to different parts of a model.

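# Continues the TabularClassifier example: reuses `model`, `torch`, and `nn`.
optimizer = torch.optim.AdamW([
    # First linear layer: smaller learning rate.
    {"params": model.network[0].parameters(), "lr": 1e-4},
    # Remaining linear layers (slicing nn.Sequential returns another nn.Sequential).
    {"params": model.network[3:].parameters(), "lr": 3e-4},
], weight_decay=1e-4)  # default applied to every group that does not override it

loss_fn = nn.CrossEntropyLoss()

features = torch.randn(16, 20)          # [batch_size, input_dim]
labels = torch.randint(0, 4, (16,))     # class indices in [0, 4)

optimizer.zero_grad(set_to_none=True)   # clear gradients from the previous step
logits = model(features)                # forward pass
loss = loss_fn(logits, labels)          # scalar loss
loss.backward()                         # compute gradients
optimizer.step()                        # update parameters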

Parameter groups are common in transfer learning and fine-tuning.
Always clear gradients before the backward pass for a new batch.
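
The reason for clearing gradients is that backward() adds to any gradient already stored on a parameter. The sketch below shows this with a single nn.Linear layer and random data; the layer size, the SGD optimizer, and the helper backward_once are illustrative choices, not part of the example above.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(5, 3), torch.randn(5, 1)

def backward_once():
    # Hypothetical helper: one forward and backward pass on the same batch.
    loss = nn.functional.mse_loss(layer(x), y)
    loss.backward()

backward_once()
first_grad = layer.weight.grad.clone()

# No zero_grad() here: the second backward adds to the stored gradient.
backward_once()
accumulated_grad = layer.weight.grad.clone()
print("accumulated is double:", torch.allclose(accumulated_grad, 2 * first_grad))

# Clearing first gives the gradient of the current batch only.
optimizer.zero_grad(set_to_none=True)
cleared = layer.weight.grad is None
backward_once()
fresh_grad = layer.weight.grad.clone()
print("fresh matches first:", torch.allclose(fresh_grad, first_grad))
```

No optimizer.step() is called in this sketch, so the weights stay fixed and the three gradients are directly comparable.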

PyTorch shape-first example

import torch

x = torch.randn(4, 3)
print('shape:', x.shape)    # torch.Size([4, 3])
print('dtype:', x.dtype)    # torch.float32 with default settings
print('device:', x.device)  # cpu with default settings

# Shape, dtype, and device checks catch many PyTorch mistakes early.

PyTorch train-step example

import torch
from torch import nn

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

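x = torch.randn(8, 3)           # inputs: [batch_size, 3]
y = torch.randn(8, 1)           # regression targets: [batch_size, 1]

optimizer.zero_grad()           # clear old gradients
loss = loss_fn(model(x), y)     # forward pass and scalar loss
loss.backward()                 # compute gradients
optimizer.step()                # update parameters
print(loss.item())              # loss value as a Python float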

Key Takeaways

nn.Module registers nested layers and parameters automatically.
Loss functions must match the task and target shape.
Optimizers update parameters using gradients from backward().
Explain the purpose of nn.Module, the loss function, and the optimizer in your own words.
Run or trace a small training-step example.
Test a normal case, a boundary case, and a broken case.
Verify the result with visible output: shapes, loss values, logs, or metrics.
Summarize the common mistake and the correction.

Common Mistakes to Avoid

WRONG Create nn.Linear inside forward.

RIGHT Create trainable layers in __init__.

Layers created in forward get fresh random weights on every call and are never registered, so model.parameters() and the optimizer never see them.
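
A short sketch makes the symptom visible. The two classes below are hypothetical names written for this comparison only.

```python
import torch
from torch import nn

class LayerInForward(nn.Module):   # hypothetical anti-pattern
    def forward(self, x):
        layer = nn.Linear(4, 2)    # new random weights on every call
        return layer(x)

class LayerInInit(nn.Module):      # hypothetical corrected version
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 2)   # created once and registered

    def forward(self, x):
        return self.layer(x)

x = torch.randn(3, 4)
bad, good = LayerInForward(), LayerInInit()

# Number of parameter tensors each model exposes to an optimizer.
bad_param_count = len(list(bad.parameters()))
good_param_count = len(list(good.parameters()))
print(bad_param_count, good_param_count)   # 0 2

# The same input twice: only the registered layer gives a stable answer.
same_output_bad = torch.equal(bad(x), bad(x))
same_output_good = torch.equal(good(x), good(x))
print(same_output_bad, same_output_good)
```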

WRONG Apply softmax before CrossEntropyLoss.

RIGHT Pass raw logits to CrossEntropyLoss.

CrossEntropyLoss already applies log-softmax internally, so applying softmax first normalizes the scores twice and distorts the loss.
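
This can be checked numerically. The sketch below uses random logits with made-up sizes (5 samples, 3 classes) and compares CrossEntropyLoss against the explicit log-softmax plus negative log-likelihood computation.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)
labels = torch.randint(0, 3, (5,))

# Correct usage: raw logits go straight into the loss.
loss_from_logits = nn.CrossEntropyLoss()(logits, labels)

# The same value computed in two explicit steps.
manual_loss = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print("matches manual computation:", torch.allclose(loss_from_logits, manual_loss))

# Mistake: softmax is applied here and then again inside the loss.
double_softmax_loss = nn.CrossEntropyLoss()(torch.softmax(logits, dim=1), labels)
print(loss_from_logits.item(), double_softmax_loss.item())
```

The two printed losses differ because the second call treats probabilities in the range 0 to 1 as if they were logits.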

WRONG Learning nn.Module, losses, and optimizers only as terms.

RIGHT Learn them through a working example, a boundary case, and a failure case.

Concept plus behavior is easier to remember than definition alone.

WRONG Skipping verification.

RIGHT Always check output shapes, loss values, logs, or metrics.

Verification turns confidence into evidence.

WRONG Changing many things at once while debugging.

RIGHT Change one setting, input, or line, then inspect the result.

Small changes reveal the real cause.

Practice Tasks

Print all named parameters in a custom model.
Swap CrossEntropyLoss for BCEWithLogitsLoss and adjust labels for multi-label data.
Create two optimizer parameter groups with different learning rates.
Create a small demo that shows one full training step clearly.
Add one edge case and write the expected result before running it.
Break the demo intentionally and document the error symptom.
Fix the broken version and explain why the fix works.

Frequently Asked Questions

Why does model.to(device) move all layers?

Because nn.Module tracks registered parameters and buffers recursively.
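
A minimal sketch of that recursion, assuming a hypothetical module named Scaler with one buffer and a nested nn.Sequential. It uses a dtype change first because that works on any machine, then moves to a GPU only if one is available.

```python
import torch
from torch import nn

class Scaler(nn.Module):  # hypothetical module with one buffer and nested layers
    def __init__(self):
        super().__init__()
        # A buffer is saved and moved with the model but is not trained.
        self.register_buffer("running_scale", torch.ones(4))
        self.block = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

    def forward(self, x):
        return self.block(x * self.running_scale)

model = Scaler()
model.to(torch.float64)   # same recursive mechanism as model.to(device)

# Every registered parameter and buffer was converted, including nested ones.
dtypes = {name: t.dtype for name, t in model.state_dict().items()}
for name, dtype in dtypes.items():
    print(name, dtype)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
devices = {p.device.type for p in model.parameters()} | {b.device.type for b in model.buffers()}
print(devices)
```

A plain tensor stored as a regular attribute (without register_buffer or nn.Parameter) is not tracked, so it would stay on its original device.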

Can a model return multiple values?

Yes. Many models return logits plus auxiliary outputs, but your loss and training loop must handle that structure explicitly.
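
As a sketch of what handling that structure can look like, the hypothetical TwoHeadModel below returns a tuple, and the training step unpacks it and combines two losses. The 0.5 weight on the auxiliary loss is an arbitrary value chosen for illustration.

```python
import torch
from torch import nn

class TwoHeadModel(nn.Module):  # hypothetical model with a main and an auxiliary head
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(10, 16)
        self.class_head = nn.Linear(16, 3)   # classification logits
        self.aux_head = nn.Linear(16, 1)     # auxiliary regression output

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.class_head(h), self.aux_head(h)

torch.manual_seed(0)
model = TwoHeadModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(8, 10)
class_labels = torch.randint(0, 3, (8,))
aux_targets = torch.randn(8, 1)

optimizer.zero_grad(set_to_none=True)
logits, aux_pred = model(x)                  # unpack the tuple explicitly
class_loss = nn.functional.cross_entropy(logits, class_labels)
aux_loss = nn.functional.mse_loss(aux_pred, aux_targets)
loss = class_loss + 0.5 * aux_loss           # one scalar for backward()
loss.backward()
optimizer.step()
print(logits.shape, aux_pred.shape, loss.item())
```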

What is the fastest way to understand PyTorch?

Start with one tiny example, trace every step, then compare it with a broken version.

What should I verify after a training step?

Verify the visible result: output shapes, the loss value, gradients that are not None on the parameters, and parameter values that changed after optimizer.step().

Why does PyTorch feel confusing at first?

It often combines vocabulary with behavior. The confusion drops when you trace the input, rule, result, and failure path.
