The ML Glossary Pt. 2
Glossary
AI
Continuation of Pt 1.
41. L2 Regularization
L2 regularization, the penalty used in Ridge Regression, is a technique used to prevent overfitting by adding a penalty to the loss function proportional to the sum of the squared model parameters. The L2 regularization term is:
L2 = λ * Σ w_i^2
where λ is the regularization parameter, and w_i are the model parameters. Unlike L1 regularization, L2 generally does not drive coefficients exactly to zero but instead shrinks them toward zero, reducing their impact.<br>
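As a minimal sketch of the penalty term above, the following plain-Python snippet computes λ * Σ w_i^2 and adds it to a data loss that is assumed to have been computed elsewhere. The function names and the example weights are illustrative, not part of any library.

```python
def l2_penalty(weights, lam):
    """Return lam * sum of squared weights (the L2 term)."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Add the L2 penalty to an already-computed data loss."""
    return data_loss + l2_penalty(weights, lam)


weights = [0.5, -2.0, 1.0]  # illustrative values
penalty = l2_penalty(weights, lam=0.1)
total = regularized_loss(1.2, weights, lam=0.1)
```

Larger values of λ make large weights more expensive, which is what pushes the optimizer toward smaller coefficients.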
42. Latent Variable
A latent variable is a hidden variable that cannot be observed directly but influences the observed data. Latent variables are commonly used in probabilistic models such as Gaussian Mixture Models (GMMs) and Latent Dirichlet Allocation (LDA). For example, in topic modeling, the latent variables represent topics, which generate words in documents.<br>
43. Learning Rate
The learning rate is a hyperparameter in optimization algorithms like Gradient Descent that determines the size of the steps taken towards minimizing the loss function. A high learning rate can speed up convergence but risks overshooting the minimum or diverging, while a low learning rate is more stable but may converge slowly. It is often represented as η in mathematical formulas.<br>
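To illustrate the trade-off, here is a small sketch that runs plain Gradient Descent on the toy function f(w) = w^2 (an assumption chosen for simplicity, with gradient 2w) using three different learning rates.

```python
def gradient_descent(lr, steps=20, w0=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w  # step against the gradient, scaled by lr
    return w


slow = gradient_descent(lr=0.01)      # still far from the minimum at 0
good = gradient_descent(lr=0.1)       # close to 0 after 20 steps
diverged = gradient_descent(lr=1.1)   # each step overshoots; |w| grows
```

On this function any learning rate above 1.0 makes the iterate grow in magnitude at every step, which is the overshooting behaviour described above.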
44. Logistic Regression
Logistic Regression is a classification algorithm used to predict binary outcomes. It models the probability of a class as a function of the input features using the logistic sigmoid function:
P(y=1|x) = 1 / (1 + exp(-w·x))
where x is the input vector, w is the weight vector, w·x is their dot product, and the output is a probability between 0 and 1. Logistic regression is widely used for tasks like spam detection and medical diagnosis.<br>
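The formula above can be sketched in a few lines of plain Python. The weights here are made up for illustration; in practice they would be learned from data, and a bias term (omitted in the formula) is often added.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    """P(y=1|x) for weight vector w and input vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # dot product of w and x
    return sigmoid(z)


w = [1.0, -1.0]   # hypothetical learned weights
x = [3.0, 1.0]
p = predict_proba(w, x)
label = 1 if p >= 0.5 else 0  # 0.5 is a common, but not mandatory, threshold
```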
45. Manifold Learning
Manifold learning is a type of unsupervised learning that aims to reduce the dimensionality of data while preserving its structure. It assumes that high-dimensional data lies on a lower-dimensional manifold. Techniques like Isomap, t-SNE (t-Distributed Stochastic Neighbor Embedding), and UMAP (Uniform Manifold Approximation and Projection) are commonly used for this purpose.<br>
46. Mean Squared Error (MSE)
Mean Squared Error is a common loss function used in regression tasks to measure the average squared difference between the predicted and actual values. The formula is:
MSE = (1/n) * Σ (y_i - ŷ_i)^2
where y_i is the true value, ŷ_i is the predicted value, and n is the number of data points.<br>
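A direct translation of the formula into plain Python might look like this (the function name is illustrative):

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
mse = mean_squared_error([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```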
47. Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a variant of Gradient Descent that updates the model parameters using small, randomly selected subsets (mini-batches) of the training data. This balances the stable, low-noise gradient estimates of Batch Gradient Descent against the cheap, frequent updates of Stochastic Gradient Descent, making it one of the most commonly used optimization techniques in deep learning.<br>
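The sketch below applies the idea to a deliberately tiny problem: fitting a single weight w in the model y ≈ w * x with a squared-error loss. The model, batch size, learning rate, and data are all assumptions chosen to keep the example short.

```python
import random


def minibatch_gd(xs, ys, batch_size=4, lr=0.01, epochs=200, seed=0):
    """Fit y ≈ w * x by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [float(i) for i in range(1, 11)]
ys = [3.0 * x for x in xs]      # noiseless data with true weight 3.0
w_fit = minibatch_gd(xs, ys)
```

Setting batch_size to 1 would give Stochastic Gradient Descent, and setting it to len(xs) would give Batch Gradient Descent.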
48. Naive Bayes Classifier
Naive Bayes is a probabilistic classifier based on Bayes' Theorem. It assumes that all features are conditionally independent given the class label. Despite its simplicity, it performs well for many tasks like spam detection and text classification. The posterior probability of a class satisfies:
P(y|x) ∝ P(x|y) * P(y)
where P(x|y) is the likelihood of the data given the class, which under the independence assumption factorizes into the product of the per-feature terms P(x_i|y), and P(y) is the prior probability of the class. The predicted class is the y that maximizes this quantity.<br>
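As a rough sketch of the decision rule, the snippet below scores each class by log P(y) plus the sum of log P(x_i|y) and picks the largest. The priors and per-word likelihoods are invented numbers for illustration; a real classifier would estimate them from training data, usually with smoothing.

```python
import math


def predict(priors, likelihoods, features):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            score += math.log(likelihoods[label][feature])
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical probabilities, not estimated from any real corpus.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

label = predict(priors, likelihoods, ["free"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.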
49. Natural Language Processing (NLP)
Natural Language Processing is a field of AI focused on enabling computers to understand, interpret, and generate human language. Key tasks in NLP include tokenization, part-of-speech tagging, sentiment analysis, machine translation, and text summarization. Transformer-based models (e.g., BERT, GPT) are widely used in modern NLP applications.<br>
50. Normalization
Normalization is a preprocessing technique used to scale features to a standard range, often [0, 1], so that all features contribute on a comparable scale to the learning process. It helps prevent dominance by features with larger ranges. For feature x with minimum min(x) and maximum max(x), min-max normalization is performed as:
x' = (x - min(x)) / (max(x) - min(x))<br>
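A minimal sketch of this formula for a single feature column follows. How to treat a constant feature (where max equals min) is not covered by the formula; mapping it to 0.0 is an assumption made here.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([10, 20, 30])  # [0.0, 0.5, 1.0]
```

In a real pipeline the minimum and maximum would be computed on the training set only and then reused for validation and test data.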
51. One-Hot Encoding
One-Hot Encoding is a technique to convert categorical data into a binary matrix where each unique category is represented as a vector with a single high value (1) and all other positions as low (0). This eliminates the possibility of the model assuming ordinal relationships in non-ordinal data. For example, encoding the categories {Red, Blue, Green} results in vectors [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively.<br>
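The example with {Red, Blue, Green} can be reproduced with a short helper. This is a plain-Python sketch with an illustrative function name; the category order is passed in explicitly so the column positions are predictable.

```python
def one_hot_encode(values, categories):
    """Encode each value as a list with a single 1 at its category's index."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for an unseen category
        encoded.append(row)
    return encoded


categories = ["Red", "Blue", "Green"]
vectors = one_hot_encode(["Red", "Blue", "Green"], categories)
# vectors == [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```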
61. Online Learning
Online Learning is a machine learning paradigm where the model is updated continuously as new data becomes available, instead of training on the entire dataset at once. This approach is particularly useful for dynamic environments, such as stock market prediction or recommendation systems, where data is generated in streams.<br>
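One way to picture this is a model that exposes an update method called once per incoming observation. The class below is a hypothetical sketch for the one-dimensional model y ≈ w * x, updated with a single gradient step per sample; the simulated stream stands in for real streaming data.

```python
class OnlineLinearModel:
    """1-D model y ≈ w * x, updated one observation at a time."""

    def __init__(self, lr=0.01):
        self.w = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x

    def update(self, x, y):
        """Take one gradient step on the squared error for this sample."""
        error = self.predict(x) - y
        self.w -= self.lr * 2 * error * x


model = OnlineLinearModel(lr=0.01)
for i in range(500):           # simulated stream; true relationship is y = 2x
    x = float(i % 5 + 1)
    model.update(x, 2.0 * x)
```

The model never stores past observations, which is what makes this style suitable for data that arrives continuously.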
62. Outlier
An Outlier is a data point significantly different from other observations in the dataset. Outliers can distort statistical metrics, like mean and standard deviation, and impact model performance. Detection techniques include statistical methods (e.g., Z-scores), clustering, and isolation forests. Proper handling involves either removal or transformation.<br>
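As a sketch of the Z-score approach, the helper below flags values whose distance from the mean exceeds a chosen number of standard deviations. The threshold is an assumption: 3.0 is a common default, but in very small samples no point can reach it, so the example uses 2.5.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = zscore_outliers(data, threshold=2.5)
```

Because the outlier itself inflates the mean and standard deviation, robust alternatives based on the median are sometimes preferred.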
63. Overfitting
Overfitting occurs when a model captures noise or irrelevant patterns in the training data, leading to poor generalization on unseen data. Overfitted models show low training error but high test error. Regularization techniques, dropout, and early stopping can mitigate overfitting.<br>
64. Optimization Algorithm
An Optimization Algorithm is a method used to adjust a model's parameters to minimize a loss function. Common optimization algorithms include Gradient Descent, Stochastic Gradient Descent (SGD), Adam, and RMSprop. These algorithms iteratively update weights based on the gradient of the loss function with respect to the model parameters.<br>
65. Ordinal Encoding
Ordinal Encoding is a method for converting categorical variables into numerical values based on their order or rank. Unlike one-hot encoding, ordinal encoding preserves ordinal relationships, such as the order of education levels (e.g., High School = 1, Bachelor’s = 2, Master’s = 3). However, this method should be used only when the order is meaningful.<br>
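The education example can be sketched as follows. The rank mapping is passed in explicitly as an ordered list, since the order has to come from domain knowledge rather than from the data itself; the function name is illustrative.

```python
def ordinal_encode(values, ordered_categories):
    """Map each value to its 1-based rank in ordered_categories."""
    rank = {category: i for i, category in enumerate(ordered_categories, start=1)}
    return [rank[value] for value in values]


levels = ["High School", "Bachelor's", "Master's"]  # lowest to highest
encoded = ordinal_encode(["Master's", "High School", "Bachelor's"], levels)
# encoded == [3, 1, 2]
```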
66. Objective Function
An Objective Function is the mathematical expression the model seeks to minimize or maximize; when it is minimized, it is usually called the loss or cost function. For example, in regression, the objective function is commonly Mean Squared Error (MSE), while in classification, it could be Cross-Entropy Loss. Optimization algorithms are used to iteratively adjust the model to minimize the objective function.<br>
67. Out-of-Sample Error
Out-of-Sample Error is the error rate of a machine learning model on data it has never seen before. It reflects the model's ability to generalize to new data. A large gap between training error and out-of-sample error often indicates overfitting.<br>
68. Oversampling
Oversampling is a technique used to address class imbalance by increasing the representation of the minority class. Methods include duplicating minority samples or generating synthetic samples (e.g., using the SMOTE algorithm). Oversampling reduces bias towards the majority class but can lead to overfitting if not done carefully.<br>
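The simplest of these methods, random duplication of minority samples, might be sketched like this. It does not implement SMOTE, which generates new synthetic points instead of copies; the function name and data are illustrative.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


samples = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
new_samples, new_labels = random_oversample(samples, labels)
```

Oversampling should be applied to the training split only; duplicating points before splitting would leak copies into the validation or test data.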
69. P-Norm
The P-Norm, also known as the Lp norm, is a generalized measure of the magnitude of a vector in n-dimensional space. It is defined as:
||x||_p = (Σ |x_i|^p)^(1/p)
where x is a vector, x_i are its components, and p is a real number with p ≥ 1 (or ∞ as a limiting case). Common cases include:
L1 norm (p = 1): Sum of absolute values.
L2 norm (p = 2): Euclidean length.
L∞ norm (p → ∞): Maximum absolute value of components.<br>
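The three cases listed above can be checked with a short plain-Python sketch of the definition (the function name is illustrative):

```python
import math


def p_norm(x, p):
    """Lp norm of a vector x for p >= 1, or p = math.inf."""
    if p == math.inf:
        return max(abs(v) for v in x)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3, -4]
l1 = p_norm(x, 1)           # 7.0
l2 = p_norm(x, 2)           # 5.0
linf = p_norm(x, math.inf)  # 4
```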
70. Parameter Tuning
Parameter Tuning (more precisely, hyperparameter tuning) involves optimizing the hyperparameters of a machine learning model to maximize performance on validation data. Common methods include grid search, random search, and Bayesian optimization. Effective tuning helps the model balance bias and variance, improving generalization.<br>
71. Partial Derivative
A Partial Derivative measures the rate of change of a multivariable function with respect to one variable, holding all others constant. In machine learning, partial derivatives are used to compute gradients for optimization algorithms like Gradient Descent. For a function f(x, y), the partial derivative with respect to x is:
∂f/∂x.<br>
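A partial derivative can be approximated numerically by nudging one variable while holding the others fixed. The sketch below uses a central difference on the example function f(x, y) = x^2 * y + y^3, which is an assumption chosen so the result can be checked by hand (∂f/∂x = 2xy and ∂f/∂y = x^2 + 3y^2).

```python
def partial_derivative(f, point, index, h=1e-5):
    """Central-difference estimate of the partial derivative of f
    with respect to the variable at position `index`, evaluated at `point`."""
    forward = list(point)
    backward = list(point)
    forward[index] += h    # only this variable changes
    backward[index] -= h
    return (f(*forward) - f(*backward)) / (2 * h)


def f(x, y):
    return x ** 2 * y + y ** 3


df_dx = partial_derivative(f, (2.0, 3.0), index=0)  # close to 2*2*3 = 12
df_dy = partial_derivative(f, (2.0, 3.0), index=1)  # close to 4 + 27 = 31
```

Deep learning libraries compute these derivatives exactly through automatic differentiation; finite differences are mainly useful for checking gradients.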
72. Polynomial Regression
Polynomial Regression is a type of regression where the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. The model takes the form:
y = β0 + β1*x + β2*x^2 + ... + βn*x^n + ε
where β are the coefficients and ε is the error term. It is used to capture non-linear relationships.<br>
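One way to estimate the coefficients is ordinary least squares via the normal equations. The sketch below does this in plain Python with a small Gaussian-elimination solver; both helper names are illustrative, and in practice a numerical library would be used because the normal equations become ill-conditioned for high degrees.

```python
def solve(a, b):
    """Solve the linear system a·beta = b by Gaussian elimination
    with partial pivoting (assumes a is square and non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [β0, ..., βn] for the polynomial model."""
    n = degree + 1
    # Normal equations (XᵀX) β = Xᵀy, where X[i][j] = xs[i] ** j.
    xtx = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    xty = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(n)]
    return solve(xtx, xty)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]  # noiseless quadratic
coefficients = fit_polynomial(xs, ys, degree=2)
```

Although the fitted curve is non-linear in x, the model is still linear in the coefficients β, which is why a linear least-squares solver is enough.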
73. Precision-Recall Curve
A Precision-Recall Curve is a graphical representation of a classification model’s performance across different thresholds. It plots precision (y-axis) against recall (x-axis). It is particularly useful for evaluating models on imbalanced datasets, as it focuses on positive class predictions. The area under the curve (AUC-PR) indicates the model's overall performance.<br>
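The points of such a curve can be computed by sweeping a threshold over the model's scores. The sketch below does this for a handful of made-up labels and scores. Returning precision 1.0 when nothing is predicted positive is a convention assumed here, not something the definition fixes.

```python
def precision_recall_points(y_true, scores, thresholds):
    """Return (threshold, precision, recall) for each threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points


y_true = [1, 0, 1, 1, 0]            # illustrative labels
scores = [0.9, 0.8, 0.7, 0.4, 0.2]  # illustrative model scores
points = precision_recall_points(y_true, scores, [0.5, 0.85])
```

Raising the threshold from 0.5 to 0.85 in this example increases precision and lowers recall, which is the trade-off the curve visualizes.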
74. Principal Components
Principal Components are the transformed variables in Principal Component Analysis (PCA), ordered so that each captures the maximum remaining variance in the data. They are mutually orthogonal and ranked by the amount of variance they explain. Their directions are the eigenvectors of the data's covariance matrix, and the corresponding eigenvalues give the variance explained by each.<br>
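As a rough sketch, the direction of the first principal component can be found by power iteration on the covariance matrix, which converges to the eigenvector with the largest eigenvalue. Power iteration is a choice made here to stay within plain Python; libraries typically use an eigendecomposition or SVD instead.

```python
import math


def first_principal_component(points, iterations=100):
    """Unit vector along the direction of greatest variance.

    Assumes the data has non-zero variance and that the starting vector
    is not orthogonal to the answer. The sign of the result is arbitrary.
    """
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # lies on y = 2x
direction = first_principal_component(points)
```

For this data all variance lies along the line y = 2x, so the returned direction is proportional to (1, 2).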
75. Probabilistic Graphical Model (PGM)
Probabilistic Graphical Models are a framework for representing and reasoning about uncertainties in complex systems using graphs. Nodes represent random variables, and edges represent probabilistic dependencies. Examples include Bayesian Networks (directed) and Markov Random Fields (undirected). PGMs are used in areas like natural language processing and bioinformatics.<br>
76. Projection
Projection in machine learning refers to mapping data points from a high-dimensional space to a lower-dimensional subspace, often to simplify data or reveal patterns. Projection is used in techniques like PCA, where data is projected onto principal components.<br>
77. Pseudo-Inverse
The Pseudo-Inverse of a matrix is a generalization of the matrix inverse, particularly for non-square or singular matrices. It is used in solving linear systems and least-squares problems. For a matrix A, the pseudo-inverse is denoted as A⁺ and computed as:
A⁺ = (AᵀA)^(-1)Aᵀ
when A has full column rank.<br>
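To make the formula concrete, the sketch below evaluates (AᵀA)^(-1)Aᵀ for the special case of a matrix with exactly two columns, where the 2x2 inverse can be written out by hand. The restriction to two columns and the function name are assumptions for brevity; numerical libraries compute the pseudo-inverse through the SVD, which also covers rank-deficient matrices.

```python
def pseudo_inverse_2col(a):
    """A⁺ = (AᵀA)^(-1) Aᵀ for a matrix with two linearly independent columns."""
    # Entries of the symmetric 2x2 matrix AᵀA.
    g00 = sum(r[0] * r[0] for r in a)
    g01 = sum(r[0] * r[1] for r in a)
    g11 = sum(r[1] * r[1] for r in a)
    det = g00 * g11 - g01 * g01
    if det == 0:
        raise ValueError("A does not have full column rank")
    inv = [[g11 / det, -g01 / det],
           [-g01 / det, g00 / det]]
    # Multiply the 2x2 inverse by Aᵀ (2 x m) to get A⁺ (2 x m).
    return [[inv[i][0] * r[0] + inv[i][1] * r[1] for r in a] for i in range(2)]


# Least-squares fit of y = β0 + β1 * x through (1, 1), (2, 2), (3, 3).
a = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 2.0, 3.0]
a_plus = pseudo_inverse_2col(a)
beta = [sum(row[k] * b[k] for k in range(len(b))) for row in a_plus]
```

Multiplying A⁺ by b gives the least-squares solution of Ax ≈ b, here an intercept near 0 and a slope near 1.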
78. PyTorch
PyTorch is an open-source machine learning library based on Python, widely used for deep learning research and development. It offers dynamic computation graphs, enabling flexible and efficient model building. PyTorch also includes utilities for tensor operations, automatic differentiation, and pre-trained models.<br>
79. Quadratic Programming (QP)
Quadratic Programming is a type of optimization problem where the objective function is quadratic, and the constraints are linear. It is formulated as:
minimize: (1/2)xᵀQx + cᵀx
subject to: Ax ≤ b
where x is the variable vector, Q is a symmetric matrix (positive semidefinite when the problem is convex), c is a coefficient vector, and A and b define the constraints. In machine learning, QP appears most notably in support vector machines (SVMs), whose training problem is a QP.<br>
80. Quantization
Quantization in machine learning refers to reducing the precision of model parameters or data representations to improve computational efficiency. It is especially useful in deploying models on resource-constrained devices like mobile phones or edge devices. Techniques include uniform quantization and non-uniform quantization.<br>
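A minimal sketch of uniform quantization follows: floating-point values are mapped to integer codes on an evenly spaced grid and then mapped back. The min/max-based scale and the function names are assumptions for illustration; production toolchains use more elaborate calibration schemes.

```python
def quantize_uniform(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] on a uniform grid."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


weights = [-1.0, -0.5, 0.0, 0.5, 1.0]  # illustrative parameter values
codes, scale, lo = quantize_uniform(weights, num_bits=8)
restored = dequantize(codes, scale, lo)
```

Each restored value differs from the original by at most about half a grid step (scale / 2), which is the precision given up in exchange for the smaller representation.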
Continued <a href="/post/122">here</a>.
- Shubham Anuraj, 07:52 PM, 24 Dec, 2024