Key Concepts of Deep Learning

Neural Networks:

  • Composed of layers of neurons that process input data.
  • Each neuron applies a linear transformation followed by a non-linear activation function.
  • Networks are trained using backpropagation to minimize the error.

Activation Functions:

  • Introduce non-linearity into the network.
  • Common functions include ReLU, Sigmoid, and Tanh.
  • ReLU is popular due to its simplicity and effectiveness.

Loss Function:

  • Measures the difference between predicted and actual values.
  • Common loss functions include Mean Squared Error and Cross-Entropy.
  • Choice of loss function affects the performance of the model.

Optimization Algorithms:

  • Used to update the weights of the network to minimize the loss function.
  • Popular algorithms include Gradient Descent, Adam, and RMSprop.
  • Choice of optimizer can affect convergence speed and accuracy.

Regularization Techniques:

  • Help prevent overfitting, for example by penalizing large weights or by randomly dropping units during training.
  • Common techniques include L1, L2 regularization, and Dropout.
  • Regularization improves model generalization on unseen data.

Neural Networks

Feedforward Network:

In a feedforward neural network, the information moves in one direction—from input nodes, through hidden nodes (if any), to output nodes. There are no cycles or loops in the network.
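
As a rough illustration, the sketch below runs a forward pass through a tiny two-layer feedforward network in NumPy. The layer sizes, random weights, and the feedforward helper are arbitrary choices for this example, not a fixed architecture.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def feedforward(x, params):
        """Forward pass through a small 2-layer network (sizes are illustrative)."""
        W1, b1, W2, b2 = params
        h = relu(x @ W1 + b1)      # hidden layer: linear transform + non-linearity
        return h @ W2 + b2         # output layer: linear transform only

    rng = np.random.default_rng(0)
    params = (rng.normal(size=(4, 8)), np.zeros(8),   # input (4) -> hidden (8)
              rng.normal(size=(8, 2)), np.zeros(2))   # hidden (8) -> output (2)
    print(feedforward(rng.normal(size=(1, 4)), params))  # one sample, two outputs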

Convolutional Neural Network (CNN):

CNNs are primarily used for image processing tasks. They use convolutional layers to automatically detect patterns and features in images.
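
The sketch below shows the core operation a convolutional layer performs: sliding a small kernel over an image and summing element-wise products. The 5x5 image, the hand-picked edge kernel, and the conv2d helper are illustrative only; real CNNs learn their kernels and add padding, strides, and many channels.

    import numpy as np

    def conv2d(image, kernel):
        """Valid 2-D convolution (cross-correlation, as in most DL libraries)."""
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)
    edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
    print(conv2d(image, edge_kernel))                # 3x3 feature map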

Recurrent Neural Network (RNN):

RNNs are designed for sequence prediction tasks. They have connections that form directed cycles, allowing information to persist.
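
A minimal sketch of the recurrence, assuming a single tanh hidden layer: the same weights are reused at every time step, and the hidden state h carries information from earlier steps forward. The sequence length and weight shapes are arbitrary.

    import numpy as np

    def rnn_forward(xs, Wx, Wh, b):
        """Run a vanilla RNN over a sequence; the hidden state carries past information."""
        h = np.zeros(Wh.shape[0])
        for x in xs:                            # one time step per input vector
            h = np.tanh(x @ Wx + h @ Wh + b)    # new state depends on input AND previous state
        return h

    rng = np.random.default_rng(0)
    xs = rng.normal(size=(6, 3))                # sequence of 6 steps, 3 features each
    Wx, Wh, b = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)
    print(rnn_forward(xs, Wx, Wh, b))           # final hidden state (5 values)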

Long Short-Term Memory (LSTM):

LSTMs are a type of RNN capable of learning long-term dependencies. Their gating mechanism controls what the cell state keeps, adds, and exposes, which mitigates the vanishing-gradient problem that makes long-range dependencies hard for standard RNNs.
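
A simplified single-cell sketch of that gating idea; the combined weight matrix W, the shapes, and the toy sequence are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h, c, W, b):
        """One LSTM step: forget, input, and output gates control the cell state."""
        z = np.concatenate([x, h]) @ W + b      # all four gate pre-activations at once
        f, i, o, g = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
        c = f * c + i * np.tanh(g)              # keep part of old memory, write new candidate
        h = o * np.tanh(c)                      # expose a filtered view of the memory
        return h, c

    rng = np.random.default_rng(0)
    n_in, n_hid = 3, 4
    W = rng.normal(size=(n_in + n_hid, 4 * n_hid)) * 0.1
    h = c = np.zeros(n_hid)
    for x in rng.normal(size=(5, n_in)):        # process a 5-step sequence
        h, c = lstm_step(x, h, c, W, b=np.zeros(4 * n_hid))
    print(h)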

Generative Adversarial Network (GAN):

GANs consist of two networks, a generator and a discriminator, that compete against each other to produce realistic synthetic data.
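
The sketch below only computes the two adversarial losses for one batch, with toy linear models standing in for the generator and discriminator; the alternating gradient updates that actually train a GAN are omitted.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    G = rng.normal(size=(2, 3)) * 0.1     # generator weights: noise (2) -> sample (3)
    D = rng.normal(size=(3, 1)) * 0.1     # discriminator weights: sample (3) -> real/fake score

    real = rng.normal(loc=2.0, size=(8, 3))    # a batch of "real" data
    fake = rng.normal(size=(8, 2)) @ G         # generator maps noise to synthetic samples

    p_real = sigmoid(real @ D).ravel()         # discriminator's belief that data is real
    p_fake = sigmoid(fake @ D).ravel()

    d_loss = -np.mean(np.log(p_real) + np.log(1 - p_fake))  # D: spot real vs. fake
    g_loss = -np.mean(np.log(p_fake))                        # G: fool the discriminator
    print(d_loss, g_loss)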

Activation Functions

ReLU (Rectified Linear Unit):

ReLU is defined as f(x) = max(0, x). It is computationally efficient and helps mitigate the vanishing gradient problem.

Sigmoid:

The sigmoid function maps input values to a range between 0 and 1. It is used in the output layer for binary classification tasks.

Tanh (Hyperbolic Tangent):

Tanh maps input values to a range between -1 and 1. It is zero-centered, which can help with optimization.

Leaky ReLU:

Leaky ReLU allows a small, non-zero gradient when the unit is not active, which helps in avoiding dead neurons.

Softmax:

Softmax is used in the output layer of a classifier to assign decimal probabilities to each class.
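
For reference, minimal NumPy versions of the five functions above; the 0.01 negative slope in leaky ReLU and the max-subtraction in softmax are common, but not the only, choices.

    import numpy as np

    def relu(x):        return np.maximum(0.0, x)
    def leaky_relu(x):  return np.where(x > 0, x, 0.01 * x)   # small slope for negative inputs
    def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)
    def tanh(x):        return np.tanh(x)                     # squashes to (-1, 1), zero-centered

    def softmax(x):
        e = np.exp(x - np.max(x))          # subtract the max for numerical stability
        return e / e.sum()                 # outputs sum to 1, so they act as probabilities

    x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
    print(relu(x), leaky_relu(x), sigmoid(x), tanh(x), softmax(x), sep="\n")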

Loss Functions

Mean Squared Error (MSE):

MSE is used for regression tasks. It calculates the average of the squares of the errors between predicted and actual values.

Cross-Entropy Loss:

Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1.

Hinge Loss:

Hinge loss is used for "maximum-margin" classification, most notably for support vector machines.

Huber Loss:

Huber loss is less sensitive to outliers in data than squared error loss. It is used in robust regression models.

Kullback-Leibler Divergence:

KL Divergence measures how one probability distribution diverges from a second, expected probability distribution.
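
Minimal NumPy versions of these losses; the clipping constants, the Huber delta, and the binary form of cross-entropy are simplifying choices for this sketch.

    import numpy as np

    def mse(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    def binary_cross_entropy(y_true, p_pred, eps=1e-12):
        p = np.clip(p_pred, eps, 1 - eps)            # avoid log(0)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    def hinge(y_true, score):                        # labels in {-1, +1}
        return np.mean(np.maximum(0.0, 1.0 - y_true * score))

    def huber(y_true, y_pred, delta=1.0):
        err = np.abs(y_true - y_pred)
        return np.mean(np.where(err <= delta,
                                0.5 * err ** 2,                 # quadratic near zero
                                delta * (err - 0.5 * delta)))   # linear for outliers

    def kl_divergence(p, q, eps=1e-12):              # p, q are probability distributions
        p, q = np.clip(p, eps, 1), np.clip(q, eps, 1)
        return np.sum(p * np.log(p / q))

    y = np.array([1.0, 0.0, 1.0])
    print(mse(y, np.array([0.9, 0.2, 0.7])),
          binary_cross_entropy(y, np.array([0.9, 0.2, 0.7])))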

Optimization Algorithms

Gradient Descent:

Gradient Descent is an iterative optimization algorithm for finding the minimum of a function. It updates parameters by moving in the direction of the negative gradient.
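
A toy example of the update rule, minimizing the one-dimensional function f(w) = (w - 3)^2, whose gradient 2(w - 3) is known in closed form; the learning rate and step count are arbitrary.

    # Minimize f(w) = (w - 3)^2; its gradient is 2 * (w - 3).
    w, lr = 0.0, 0.1
    for step in range(50):
        grad = 2.0 * (w - 3.0)   # gradient of the loss at the current parameter value
        w -= lr * grad           # move against the gradient
    print(w)                     # approaches the minimum at w = 3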

Stochastic Gradient Descent (SGD):

SGD updates the parameters using a single training example (or a small mini-batch) at a time, which can lead to faster convergence but also noisier updates.
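
A sketch of SGD fitting a one-variable linear model to synthetic data, updating after every example; the data, learning rate, and epoch count are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 4.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=100)   # true slope 4, intercept 1

    w, b, lr = 0.0, 0.0, 0.05
    for epoch in range(20):
        for i in rng.permutation(len(X)):      # one example at a time, in random order
            pred = w * X[i, 0] + b
            err = pred - y[i]
            w -= lr * err * X[i, 0]            # gradient of 0.5 * err^2 w.r.t. w
            b -= lr * err                      # gradient of 0.5 * err^2 w.r.t. b
    print(w, b)                                # close to 4 and 1, with some noise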

Adam Optimizer:

Adam combines ideas from momentum and RMSprop: it keeps running averages of both the gradients (first moment) and the squared gradients (second moment) and uses them to adapt the step size for each parameter.
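
A single-parameter sketch of the Adam update on the same toy objective as above, using the commonly cited default hyperparameters (beta1 = 0.9, beta2 = 0.999).

    import numpy as np

    w = 0.0
    m = v = 0.0                                   # running averages of gradient and its square
    lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

    for t in range(1, 201):
        grad = 2.0 * (w - 3.0)                    # gradient of (w - 3)^2
        m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    print(w)                                      # moves toward the minimum at w = 3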

RMSprop:

RMSprop is an adaptive learning rate method that keeps the moving average of the square of gradients for each weight.
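
A single-parameter sketch of the RMSprop update on the same toy objective; the decay rate of 0.9 is a typical, but not mandatory, choice.

    import numpy as np

    w, v = 0.0, 0.0
    lr, decay, eps = 0.05, 0.9, 1e-8

    for _ in range(300):
        grad = 2.0 * (w - 3.0)                    # gradient of (w - 3)^2
        v = decay * v + (1 - decay) * grad ** 2   # moving average of squared gradients
        w -= lr * grad / (np.sqrt(v) + eps)       # larger v -> smaller effective step
    print(w)                                      # oscillates close to the minimum at w = 3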

Adagrad:

Adagrad adapts the learning rate to the parameters, performing smaller updates for more frequent parameters and larger updates for infrequent ones.
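
A single-parameter sketch of the Adagrad update on the same toy objective; because the accumulator only grows, the effective learning rate keeps shrinking over time.

    import numpy as np

    w, g_sum = 0.0, 0.0
    lr, eps = 0.5, 1e-8

    for _ in range(200):
        grad = 2.0 * (w - 3.0)                    # gradient of (w - 3)^2
        g_sum += grad ** 2                        # accumulate ALL past squared gradients
        w -= lr * grad / (np.sqrt(g_sum) + eps)   # effective step shrinks as g_sum grows
    print(w)                                      # approaches the minimum at w = 3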

Regularization Techniques

L1 Regularization:

L1 regularization adds a penalty equal to the absolute value of the magnitude of coefficients. It can lead to sparse models with few coefficients.

L2 Regularization:

L2 regularization adds a penalty equal to the square of the magnitude of coefficients. It encourages smaller, more distributed weights.
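
A sketch of how both penalties are added to a task loss; the regularized_loss helper and the coefficient values are illustrative, not a library API.

    import numpy as np

    def regularized_loss(weights, data_loss, l1=0.0, l2=0.0):
        """Add L1 and/or L2 penalties to a task loss (illustrative helper)."""
        l1_penalty = l1 * np.sum(np.abs(weights))     # encourages exact zeros (sparsity)
        l2_penalty = l2 * np.sum(weights ** 2)        # encourages small, spread-out weights
        return data_loss + l1_penalty + l2_penalty

    w = np.array([0.5, -0.2, 0.0, 1.5])
    print(regularized_loss(w, data_loss=0.8, l1=0.01),
          regularized_loss(w, data_loss=0.8, l2=0.01))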

Dropout:

Dropout is a technique where randomly selected neurons are ignored during training, which helps prevent overfitting.
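
A sketch of "inverted" dropout, the common variant that rescales the surviving activations so no change is needed at inference time; the drop probability of 0.5 is just an example.

    import numpy as np

    def dropout(activations, p_drop=0.5, training=True):
        """Inverted dropout: zero out units at random and rescale the survivors."""
        if not training:
            return activations                      # dropout is disabled at inference time
        mask = np.random.random(activations.shape) >= p_drop
        return activations * mask / (1.0 - p_drop)  # keep the expected activation unchanged

    h = np.ones((2, 6))
    print(dropout(h))           # roughly half the units zeroed, the rest scaled to 2.0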

Early Stopping:

Early stopping involves monitoring the model's performance on a validation set and halting training once that performance stops improving.
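
A sketch of the bookkeeping, using made-up validation losses and a patience of 3 epochs; both are assumptions for illustration.

    best_loss, patience, wait = float("inf"), 3, 0

    # Pretend validation losses from successive epochs (made-up numbers).
    for epoch, val_loss in enumerate([0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.62]):
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0        # improvement: remember it, reset the counter
        else:
            wait += 1                            # no improvement this epoch
            if wait >= patience:
                print(f"stopping at epoch {epoch}, best validation loss {best_loss}")
                break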

Batch Normalization:

Batch normalization normalizes each layer's inputs over the current mini-batch to zero mean and unit variance, then rescales them with learned parameters, which improves convergence speed.
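
A sketch of the training-time forward pass for one mini-batch; the learnable scale (gamma) and shift (beta) are set to their initial values here, and the running statistics used at inference time are omitted.

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Normalize each feature over the batch, then rescale with learnable gamma/beta."""
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
        return gamma * x_hat + beta               # learned scale and shift

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))        # a mini-batch of 32 samples
    out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 and ~1 per feature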
