The ML Glossary Pt. 1
This page, and any subsequent page, will contain most "Artificial Intelligence" terms you can think of, or at least it will be updated over time to keep up with the rapid pace of change in the field. The definitions lean toward the easy-going side without losing the main concept.
So let's start:
1. Artificial Intelligence (AI)
Artificial Intelligence refers to the simulation of human intelligence in machines that are programmed to think, learn, and problem-solve like humans. AI encompasses a variety of subfields, including machine learning (ML), natural language processing (NLP), robotics, and computer vision. The goal of AI is to create systems that can perform tasks autonomously, improving their performance through experience.
2. Algorithm
An algorithm is a step-by-step procedure or set of rules for solving a specific problem or task. In AI and machine learning, algorithms are the backbone of model development, guiding data processing, decision-making, and optimization. Common AI algorithms include decision trees, neural networks, and support vector machines.
3. Artificial Neural Network (ANN)
An Artificial Neural Network is a computational model inspired by the structure and function of biological neural networks. It consists of layers of interconnected nodes (or neurons) that process data through weighted connections. ANNs are particularly powerful for tasks like image recognition, natural language processing, and reinforcement learning. They are a core component of deep learning.
4. Autonomous Systems
Autonomous systems are machines or robots that can perform tasks or make decisions without human intervention. These systems rely on AI techniques, including reinforcement learning and computer vision, to sense their environment, plan actions, and execute decisions. Examples include self-driving cars and autonomous drones.
5. Activation Function
An activation function in artificial neural networks determines whether a neuron should be activated or not, based on the input it receives. It introduces non-linearity into the model, allowing it to learn complex patterns. Common activation functions include:
Sigmoid: f(x) = 1 / (1 + e^(-x))
ReLU (Rectified Linear Unit): f(x) = max(0, x)
Tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
These functions are critical for deep learning models, enabling them to model non-linear relationships in the data.
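For a concrete feel, here is a minimal NumPy sketch of the three functions above; the sample input array is made up:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))   # squashes input into (0, 1)

def relu(x):
    return np.maximum(0, x)       # zero for negatives, identity otherwise

def tanh(x):
    return np.tanh(x)             # squashes input into (-1, 1)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), relu(x), tanh(x))
```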
6. Agent
In reinforcement learning, an agent is an entity that interacts with its environment in order to maximize some notion of cumulative reward. The agent takes actions based on the current state of the environment, and the environment provides feedback in the form of rewards or penalties. This feedback guides the agent's learning process.
7. Active Learning
Active learning is a machine learning approach where the model selects the most informative data points to be labeled by humans. This helps improve model accuracy with fewer labeled samples and is often used when labeling data is expensive or time-consuming. The key idea is that the instances the model is most uncertain about are the ones whose labels teach it the most.
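A minimal sketch of one common strategy, uncertainty sampling; the predicted probabilities below are made up and would normally come from a partially trained classifier:

```python
import numpy as np

# P(class = 1) for five unlabeled points (hypothetical model outputs)
probs = np.array([0.97, 0.51, 0.88, 0.45, 0.99])

uncertainty = 1 - 2 * np.abs(probs - 0.5)   # 1 near 0.5, 0 near 0 or 1
query = np.argsort(uncertainty)[::-1][:2]   # two most uncertain points
print(query)                                # -> [1 3]: send these to a human
```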
8. Adversarial Attack
An adversarial attack refers to a technique where small, intentional perturbations are added to input data in order to mislead machine learning models into making incorrect predictions. These attacks exploit vulnerabilities in models, particularly deep neural networks, and can significantly reduce model accuracy. For example, in image classification, slight pixel changes can cause a model to misclassify an image.
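One well-known attack is the fast gradient sign method (FGSM). The sketch below applies it to a logistic-regression model, where the gradient of the log-loss with respect to the input has a simple closed form; the weights, input, and perturbation budget are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([1.5, -2.0, 0.5])     # hypothetical model weights
x = np.array([0.2, 0.4, -0.1])     # clean input
y = 1.0                            # true label

p = sigmoid(w @ x)                 # model's predicted probability
grad_x = (p - y) * w               # gradient of log-loss w.r.t. the input

eps = 0.1                          # perturbation budget
x_adv = x + eps * np.sign(grad_x)  # small step that increases the loss

print("clean:", p, "adversarial:", sigmoid(w @ x_adv))
```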
9. Artificial General Intelligence (AGI)
Artificial General Intelligence is a theoretical form of AI that can understand, learn, and apply intelligence across a broad range of tasks, similar to the cognitive abilities of humans. Unlike narrow AI, which is designed for specific tasks (e.g., playing chess or recognizing faces), AGI would be capable of solving any intellectual problem that a human being can, adapting to new tasks without human intervention.
10. Autoencoder
An autoencoder is a type of neural network used for unsupervised learning tasks like dimensionality reduction and feature learning. The network is trained to compress input data into a smaller representation (encoding) and then reconstruct the data from that encoding (decoding). The primary objective is to minimize the difference between the original input and the reconstructed output. Mathematically, the objective is to minimize the reconstruction error, typically using:
L = || x - x_hat ||^2
where x is the input and x_hat is the reconstructed output.
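As a rough sketch, scikit-learn's MLPRegressor can be coaxed into acting as a tiny autoencoder by training it to reproduce its own input; the data and layer size below are arbitrary:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # 200 samples, 5 features

# The 2-unit hidden layer is the compressed encoding.
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                  max_iter=2000, random_state=0)
ae.fit(X, X)                                # target == input

X_hat = ae.predict(X)
print("reconstruction error:", np.mean((X - X_hat) ** 2))
```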
11. Backpropagation
Backpropagation is a supervised learning algorithm used for training artificial neural networks. It calculates the gradient of the loss function with respect to each weight by the chain rule, updating the weights iteratively to minimize the error. The process involves two phases: the forward pass, where the input data is passed through the network to compute the output, and the backward pass, where the gradients of the loss are computed and propagated back through the network to adjust the weights.
Mathematically, for a neural network with weights W, the update rule is:
W = W - η * ∇L(W)
where η is the learning rate, and ∇L(W) is the gradient of the loss with respect to the weights.
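Here is a minimal NumPy sketch of both passes for a one-hidden-layer network with a sigmoid hidden layer and squared-error loss; the shapes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))           # one input with 3 features
y = np.array([[1.0]])                 # target

W1 = rng.normal(size=(4, 3))          # hidden-layer weights
W2 = rng.normal(size=(1, 4))          # output-layer weights
eta = 0.1                             # learning rate

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass
h = sigmoid(W1 @ x)                   # hidden activations, shape (4, 1)
y_hat = W2 @ h                        # network output, shape (1, 1)

# Backward pass (chain rule)
dL_dyhat = y_hat - y                  # dL/dy_hat for L = 0.5 * (y_hat - y)^2
dL_dW2 = dL_dyhat @ h.T               # gradient for W2
dL_dh = W2.T @ dL_dyhat               # propagate error back to hidden layer
dL_dW1 = (dL_dh * h * (1 - h)) @ x.T  # sigmoid'(z) = h * (1 - h)

# Weight update: W = W - eta * dL/dW
W1 -= eta * dL_dW1
W2 -= eta * dL_dW2
```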
12. Bayesian Inference
Bayesian Inference is a statistical method used to update the probability of a hypothesis as more evidence or data becomes available. It is based on Bayes' Theorem, which is stated as:
P(H|D) = (P(D|H) * P(H)) / P(D)
where:
P(H|D) is the posterior probability (the probability of the hypothesis H given the data D)
P(D|H) is the likelihood (the probability of the data D given the hypothesis H)
P(H) is the prior probability (the initial belief about the hypothesis)
P(D) is the evidence (the probability of the data)
Bayesian methods are widely used in machine learning for tasks such as classification, regression, and probabilistic modeling.
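A small worked example with made-up numbers, applying the theorem to a diagnostic test:

```python
# H = "patient has the disease", D = "test comes back positive"
p_h = 0.01              # prior P(H): 1% of the population is ill
p_d_given_h = 0.95      # likelihood P(D|H): test sensitivity
p_d_given_not_h = 0.05  # false-positive rate P(D|not H)

# Evidence P(D) via the law of total probability
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Posterior P(H|D) = P(D|H) * P(H) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)      # ~0.16: a positive test is far from certainty
```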
13. Batch Gradient Descent
Batch Gradient Descent is an optimization algorithm used to minimize the loss function in machine learning models, particularly in training neural networks. In this approach, the entire dataset is used to compute the gradient of the loss function before updating the model parameters. While it can be computationally expensive for large datasets, it converges to the global minimum for convex loss functions, given a suitable learning rate.
The update rule is:
θ = θ - η * (1/m) * Σ ∇L(θ, x_i, y_i)
where θ represents the model parameters, η is the learning rate, m is the number of training examples, and (x_i, y_i) is the i-th training example.
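A minimal NumPy sketch of batch gradient descent on a linear-regression problem; the synthetic data and learning rate are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # 100 examples, 2 features
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(2)
eta = 0.1
for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(X)    # gradient over the FULL dataset
    theta -= eta * grad                      # θ = θ - η * mean gradient
print(theta)                                 # ≈ [2.0, -3.0]
```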
14. Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning that describes the balance between two sources of error in a model:
Bias refers to errors introduced by approximating a real-world problem with a simplified model. High bias typically leads to underfitting, where the model is too simple to capture the underlying patterns in the data.
Variance refers to errors caused by sensitivity to small fluctuations in the training data. High variance can lead to overfitting, where the model becomes too complex and captures noise in the data rather than the actual patterns.
The tradeoff involves finding the optimal model complexity that minimizes both bias and variance, thus achieving good generalization to new data.
15. Boosting
Boosting is an ensemble learning technique that combines multiple weak learners (models that perform slightly better than random guessing) to create a strong learner. The process involves sequentially training models, where each new model corrects the errors of the previous ones. Common boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.
The core idea is to give more weight to misclassified instances, forcing the next model to focus on correcting these errors.
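A minimal boosting sketch using scikit-learn's AdaBoostClassifier, whose default weak learners are one-level decision trees (stumps); the dataset choice is just for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new stump is fit with more weight on previously misclassified points.
model = AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # held-out accuracy
```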
16. Classification
Classification is a supervised learning task where the goal is to assign input data into predefined categories or classes. It involves training a model on labeled data and using it to predict the class label of new, unseen data. Common classification algorithms include logistic regression, decision trees, and support vector machines (SVMs).
Mathematically, for a binary classification problem, the model typically outputs a probability:
P(class = 1 | x) = 1 / (1 + exp(-w*x))
where w is the weight vector, x is the input feature vector, and the function applies the logistic sigmoid to output a probability.
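Plugging made-up numbers into the formula above:

```python
import numpy as np

w = np.array([0.8, -0.4])             # learned weight vector (made up)
x = np.array([2.0, 1.0])              # input feature vector

p = 1 / (1 + np.exp(-(w @ x)))        # P(class = 1 | x)
print(p, "-> class", int(p >= 0.5))   # ~0.77 -> class 1
```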
17. Clustering
Clustering is an unsupervised learning task where the goal is to group data points into clusters based on their similarity. Unlike classification, clustering does not require labeled data. Algorithms such as K-means, hierarchical clustering, and DBSCAN are commonly used for this purpose. The challenge in clustering is defining a measure of similarity and determining the appropriate number of clusters.
18. Convolutional Neural Network (CNN)
A Convolutional Neural Network is a deep learning model designed for processing structured grid data, such as images. CNNs consist of convolutional layers that apply filters to input data, pooling layers to reduce dimensionality, and fully connected layers for final classification or regression. These models are particularly effective for image recognition tasks.
Mathematically, the convolution operation for a 2D image I and filter F is defined as:
(I * F)(i, j) = Σ_m Σ_n I(m, n) * F(i - m, j - n)
where * denotes convolution, and the sum runs over the filter dimensions.
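A quick sketch using SciPy's convolve2d, which implements exactly the flipped-kernel sum above (note that most deep-learning libraries actually compute cross-correlation but still call it convolution); the image and filter are toy values:

```python
import numpy as np
from scipy.signal import convolve2d

I = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
F = np.array([[1.0, 0.0],
              [0.0, -1.0]])                    # toy 2x2 filter

# 'valid' keeps only positions where the filter fully overlaps the image
print(convolve2d(I, F, mode="valid"))
```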
19. Cost Function
A cost function, also known as a loss function, measures the error between the predicted output of a model and the actual target values. The goal of training a machine learning model is to minimize the cost function. For regression problems, a common cost function is the mean squared error (MSE):
MSE = (1/n) * Σ (y_i - ŷ_i)^2
where y_i are the true values, ŷ_i are the predicted values, and n is the number of data points.
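In NumPy this is a one-liner; the vectors are made up:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # (1/n) * Σ (y_i - ŷ_i)^2
print(mse)                              # 0.375
```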
20. Cross-Validation
Cross-validation is a technique used to assess the performance of a machine learning model by splitting the data into multiple subsets. In k-fold cross-validation, the data is divided into k subsets. The model is trained on k-1 subsets and tested on the remaining subset. This process is repeated k times, with each subset being used as the test set once. Cross-validation helps to reduce the risk of overfitting and provides a more reliable estimate of model performance.
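A minimal 5-fold example with scikit-learn; the dataset and model are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 train/test splits
print(scores.mean(), scores.std())            # average accuracy across folds
```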
21. Curse of Dimensionality
The curse of dimensionality refers to the phenomenon where the feature space becomes sparse as the number of dimensions (features) increases. In high-dimensional spaces, data points become more spread out, and distances between points become less meaningful. This can lead to poor model performance, particularly for algorithms that rely on distance metrics, such as k-nearest neighbors (K-NN).
22. Deep Learning
Deep learning is a subset of machine learning that uses neural networks with many layers (hence "deep") to model complex patterns in large datasets. Deep learning models, such as deep neural networks (DNNs) and CNNs, are particularly effective for tasks like image and speech recognition, natural language processing, and game playing.
23. Dimensionality Reduction
Dimensionality reduction techniques are used to reduce the number of features in a dataset while retaining as much information as possible. Principal component analysis (PCA) is a widely used technique that transforms the data into a new set of variables (principal components) ordered by variance. The first few components typically capture most of the information in the original data.
Mathematically, PCA finds a linear transformation X' = XW such that the covariance matrix of X' is diagonal and the variance in each dimension is maximized.
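A from-scratch NumPy sketch of PCA via the eigendecomposition of the covariance matrix; the data and the choice of 2 components are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 samples, 5 features

Xc = X - X.mean(axis=0)                  # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

W = eigvecs[:, ::-1][:, :2]              # top-2 principal components
X_reduced = Xc @ W                       # the transformation X' = XW
print(X_reduced.shape)                   # (200, 2)
```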
24. Dropout
Dropout is a regularization technique used to prevent overfitting in neural networks by randomly setting a fraction of the input units to zero during training. This forces the network to become less reliant on specific neurons, encouraging it to learn more robust features. Dropout is typically applied during the training phase and is turned off during testing.
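A minimal NumPy sketch of "inverted" dropout, the variant most libraries use:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, training=True):
    """Zero a fraction p of units during training and rescale the
    survivors so the expected activation stays the same."""
    if not training:
        return a                       # dropout is turned off at test time
    mask = rng.random(a.shape) >= p    # keep each unit with probability 1-p
    return a * mask / (1 - p)

print(dropout(np.ones(8)))             # roughly half the entries zeroed
```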
25. Epoch
An epoch in machine learning refers to one complete pass through the entire training dataset. During each epoch, the model's parameters are updated based on the data. Multiple epochs are often required to train a model effectively, with the model typically improving after each pass. The number of epochs is a hyperparameter that can be tuned to achieve optimal model performance.
26. Exploding Gradient Problem
The exploding gradient problem occurs when the gradients used to update model weights during training become excessively large, causing the weights to grow uncontrollably. This is especially common in deep neural networks and can result in model instability or failure to converge. Techniques like gradient clipping are used to mitigate this issue.
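A minimal sketch of gradient clipping by norm; the "exploded" gradient is made up:

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([30.0, -40.0])   # an "exploded" gradient with norm 50
print(clip_gradient(g))       # [ 0.6 -0.8], rescaled to norm 1.0
```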
27. Feature Engineering
Feature engineering is the process of selecting, modifying, or creating new features from raw data to improve the performance of machine learning models. It involves tasks like normalizing data, handling missing values, encoding categorical variables, and constructing new features based on domain knowledge. Effective feature engineering can significantly enhance model accuracy and efficiency.
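A small pandas sketch of three of those steps on a made-up table:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "city": ["NY", "LA", "NY"]})

df["age"] = df["age"].fillna(df["age"].mean())                # fill missing values
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()  # normalize
df = pd.get_dummies(df, columns=["city"])                     # encode categorical variable
print(df)
```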
28. Feature Selection
Feature selection is the process of identifying and selecting a subset of the most relevant features for use in model training. This helps improve model performance by reducing overfitting, decreasing computational costs, and increasing interpretability. Methods for feature selection include filter methods (e.g., correlation-based), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., Lasso regression).
29. Generative Adversarial Network (GAN)
A Generative Adversarial Network consists of two neural networks, a generator and a discriminator, that are trained simultaneously. The generator creates fake data that is designed to resemble real data, while the discriminator attempts to distinguish between real and fake data. The networks compete in a zero-sum game, where the generator improves its ability to create realistic data, and the discriminator improves its ability to differentiate between real and fake. GANs are widely used in image generation, style transfer, and other creative tasks.
30. Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the cost function by iteratively moving in the direction of the steepest descent (the negative gradient). It is the most commonly used algorithm for training machine learning models, including neural networks. The update rule for the weights w is:
w = w - η * ∇L(w)
where η is the learning rate and ∇L(w) is the gradient of the loss function L(w) with respect to the weights.
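The idea in a few lines, minimizing the simple convex function f(w) = (w - 3)^2, whose gradient is 2(w - 3):

```python
w = 0.0       # initial guess
eta = 0.1     # learning rate

for _ in range(50):
    grad = 2 * (w - 3)   # ∇L(w)
    w = w - eta * grad   # w = w - η * ∇L(w)

print(w)                 # ≈ 3.0, the minimizer
```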
31. Grid Search
Grid search is an exhaustive search technique used to find the best combination of hyperparameters for a machine learning model. It systematically tests all possible combinations of hyperparameters over a predefined search space and selects the combination that yields the best performance. While grid search can be computationally expensive, it guarantees finding the optimal hyperparameters within the specified grid.
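A minimal sketch with scikit-learn's GridSearchCV; the model and parameter grid are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

search = GridSearchCV(SVC(), param_grid, cv=5)   # tries all 9 combinations
search.fit(X, y)
print(search.best_params_, search.best_score_)
```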
32. Hard Margin Support Vector Machine (SVM)
A Hard Margin Support Vector Machine is a classification algorithm that finds the hyperplane separating two classes in a feature space with the largest possible margin. It assumes the data is linearly separable, meaning the classes can be divided with no overlap. Since the margin is 2/||w||, maximizing it is equivalent to solving the optimization problem:
minimize (1/2) * ||w||^2
subject to the constraint:
y_i * (w * x_i + b) >= 1
where w is the weight vector, x_i is the input feature vector, y_i is the class label, and b is the bias term.
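scikit-learn has no explicit hard-margin mode, but a common trick is a linear SVC with a very large C, which effectively forbids margin violations on linearly separable data; the toy points are made up:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 0], [0, 1],     # class -1
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ≈ hard margin
print(svm.coef_, svm.intercept_)             # the hyperplane w·x + b = 0
```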
33. Hinge Loss
Hinge loss is a loss function used in machine learning models, particularly in Support Vector Machines (SVMs). It is used for classification tasks and penalizes predictions that are on the wrong side of the margin. The hinge loss for a given sample is defined as:
L(y, f(x)) = max(0, 1 - y * f(x))
where y is the true class label (+1 or -1), and f(x) is the model's prediction. The loss is zero only when the prediction is on the correct side of the decision boundary by at least the margin.
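The definition translates directly to NumPy; the labels and scores are made up:

```python
import numpy as np

def hinge_loss(y, f_x):
    """max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return np.maximum(0, 1 - y * f_x)

y   = np.array([1, -1, 1, 1])
f_x = np.array([2.0, -0.5, 0.3, -1.0])
print(hinge_loss(y, f_x))   # [0.  0.5 0.7 2. ]: only the first prediction
                            # is both correct and outside the margin
```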
34. Hyperparameter
Hyperparameters are parameters that control the training process of machine learning models. Unlike model parameters, which are learned during training, hyperparameters are set prior to training. Examples of hyperparameters include the learning rate, the number of layers in a neural network, and the regularization strength. Tuning hyperparameters is crucial for optimizing model performance.
35. Imbalanced Dataset
An imbalanced dataset refers to a situation where the classes in a classification problem are not equally represented. For instance, in a binary classification problem, one class might significantly outnumber the other. This imbalance can lead to biased models that are better at predicting the majority class. Techniques like oversampling, undersampling, and using different performance metrics (e.g., precision-recall curves) can help mitigate this issue.
36. Incremental Learning
Incremental learning refers to the process of training a model on data that arrives sequentially. Unlike traditional learning methods, where the model is trained on the entire dataset at once, incremental learning allows the model to adapt to new data without retraining from scratch. This is particularly useful when the dataset is too large to fit in memory or when the data is constantly changing.
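In scikit-learn, estimators that support incremental learning expose a partial_fit method; the streaming batches below are synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()

for _ in range(10):                          # batches arriving over time
    X = rng.normal(size=(20, 3))
    y = (X[:, 0] > 0).astype(int)            # synthetic labels
    model.partial_fit(X, y, classes=[0, 1])  # update without retraining

X_test = rng.normal(size=(100, 3))
print(model.score(X_test, (X_test[:, 0] > 0).astype(int)))
```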
37. K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm that partitions a dataset into k clusters by minimizing the variance within each cluster. The algorithm assigns each data point to the nearest cluster centroid and then updates the centroids based on the mean of the points in each cluster. The process repeats until the centroids converge. The objective function is:
J = Σ Σ ||x_i - μ_j||^2
where μ_j is the centroid of cluster j and x_i is a data point in that cluster.
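A minimal sketch with scikit-learn on two obvious blobs of toy points:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 1.5],    # one blob near (1, 1.5)
              [8, 8], [8.5, 9], [9, 8]])     # another near (8.5, 8.3)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # the learned centroids μ_j
print(km.inertia_)           # the objective J above
```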
38. K-Nearest Neighbors (K-NN)
K-Nearest Neighbors is a simple, instance-based learning algorithm used for classification and regression tasks. Given a new data point, the algorithm finds the k closest points from the training set (based on some distance metric, e.g., Euclidean distance) and assigns a label based on the majority class (in classification) or the average value (in regression) of those k neighbors. The Euclidean distance between two points x and y is:
d(x, y) = √(Σ (x_i - y_i)^2)
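K-NN is short enough to write from scratch; the toy training set is made up:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # k closest indices
    return np.bincount(y_train[nearest]).argmax()      # majority class

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # -> 0
```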
39. Knowledge Graph
A Knowledge Graph is a structured representation of facts where entities (e.g., people, places, concepts) are nodes, and the relationships between them are edges. Knowledge graphs are used in AI systems to store and reason about information in a way that reflects real-world entities and their relationships. They are commonly used in search engines, recommendation systems, and natural language processing tasks.
40. L1 Regularization
L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), is a technique used to prevent overfitting by adding a penalty to the loss function proportional to the absolute value of the model parameters. The L1 regularization term is:
L1 = λ * Σ |w_i|
where λ is the regularization parameter, and w_i are the model parameters. L1 regularization can also be used for feature selection, as it tends to drive some coefficients to zero.
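A quick scikit-learn sketch showing the feature-selection effect; only the first two of ten synthetic features matter, and alpha plays the role of λ:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # most coefficients are driven exactly to 0.0
```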
Continued in Part 2: /post/121.
- Shubham Anuraj, 03:04 AM, 23 Dec, 2024