Attention Mechanism: A neural network component that allows the model to focus on relevant parts of the input when producing output. In vision-language models, cross-attention mechanisms enable the model to attend to relevant image regions when processing text, and vice versa.
Contrastive Learning: A self-supervised learning approach that trains models by contrasting positive pairs (matching samples) against negative pairs (non-matching samples), encouraging the model to learn discriminative representations.
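A minimal sketch of a contrastive objective in plain Python, assuming a precomputed similarity matrix whose diagonal holds the positive (matching) pairs; the InfoNCE-style form shown here is the one used by CLIP-like models, and the function name is hypothetical.

```python
import math

def contrastive_loss(sim_matrix):
    """InfoNCE-style loss: sim_matrix[i][j] is the similarity between
    sample i of one view and sample j of the other; diagonal entries
    are positive pairs, off-diagonal entries are negatives."""
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        # Softmax over row i; the positive pair sits at column i.
        exps = [math.exp(s) for s in sim_matrix[i]]
        total += -math.log(exps[i] / sum(exps))
    return total / n

# When matching pairs score highest (strong diagonal), the loss is small.
well_separated = [[5.0, 0.0], [0.0, 5.0]]
confused = [[0.0, 5.0], [5.0, 0.0]]
print(contrastive_loss(well_separated) < contrastive_loss(confused))  # True
```

Minimizing this loss pushes matching image-text pairs together and non-matching pairs apart, which is exactly the "discriminative representations" the definition describes.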
Embedding Space: A continuous vector space where data points are represented as dense numerical vectors. In vision-language models, images and text are mapped into a shared embedding space where semantic similarity corresponds to geometric proximity.
Zero-Shot Learning: The ability of a model to perform tasks or recognize categories it was not explicitly trained on, by leveraging knowledge transferred from training on related tasks or data.
Fine-Tuning: The process of taking a pre-trained model and further training it on a specific downstream task or dataset, adapting its learned representations to new requirements.
Encoder-Decoder Architecture: A neural network structure consisting of an encoder that compresses input into a latent representation and a decoder that generates output from that representation. Used in image captioning, where the encoder processes the image and the decoder generates text.
Tokenization: The process of breaking text into smaller units (tokens), such as words, subwords, or characters, that can be processed by a neural network. Visual tokenization similarly divides images into patches.
Cross-Modal Transfer: The ability to transfer knowledge learned in one modality (e.g., text) to improve performance in another modality (e.g., vision), leveraging shared semantic concepts across modalities.
Visual Grounding: The task of localizing or identifying specific regions in an image that correspond to a given natural language expression, connecting textual references to visual content.
Multimodal Fusion: Techniques for combining information from multiple modalities into a unified representation. Common approaches include early fusion (combining raw inputs), late fusion (combining high-level features), and cross-attention fusion.
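The early/late distinction can be sketched in a few lines of plain Python; this is an illustrative toy, not a real model, and the feature vectors and scores are made up.

```python
def early_fusion(image_feats, text_feats):
    """Early fusion: concatenate raw per-modality features into one
    combined input that a downstream model would process jointly."""
    return image_feats + text_feats  # list concatenation

def late_fusion(image_score, text_score, w=0.5):
    """Late fusion: each modality is processed separately and only the
    high-level outputs are combined, here as a weighted average."""
    return w * image_score + (1 - w) * text_score

img = [0.1, 0.9]
txt = [0.7, 0.3, 0.5]
print(early_fusion(img, txt))  # [0.1, 0.9, 0.7, 0.3, 0.5]
print(late_fusion(0.8, 0.4))   # approximately 0.6
```

Cross-attention fusion sits between these two extremes: the modalities stay separate but exchange information at intermediate layers.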
Image Patch: A small rectangular region of an image used as an input unit in Vision Transformers. The image is divided into a grid of non-overlapping patches, each treated as a token, similar to words in NLP.
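The grid-splitting step can be sketched in pure Python; here an "image" is just a 2D list, and `patchify` is a hypothetical helper name.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square
    patches, returned in row-major order like tokens in a sequence."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patches.append([row[left:left + patch_size]
                            for row in image[top:top + patch_size]])
    return patches

# A 4x4 "image" split into 2x2 patches yields 4 patch tokens.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
print(len(patches))  # 4
print(patches[0])    # [[0, 1], [4, 5]]
```

In a real Vision Transformer each patch is then flattened and linearly projected into an embedding before entering the transformer.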
Pre-training: The initial phase of training a model on a large, general dataset before fine-tuning on specific tasks. Vision-language models are often pre-trained on millions of image-text pairs from the internet.
Prompt Engineering: The practice of crafting input prompts to guide a model toward desired outputs. In VLMs, carefully designed text prompts can significantly improve zero-shot classification and other tasks.
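A small sketch of prompt templating for zero-shot classification, in the style popularized by CLIP: wrapping a bare class name in natural-language templates often improves accuracy. The template strings and helper name here are illustrative.

```python
# Illustrative templates; real systems use many variants per class.
templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a close-up photo of a {}.",
]

def build_prompts(class_names):
    """Expand each class name into several prompt variants; a VLM would
    embed these and average them into one text embedding per class."""
    return {name: [t.format(name) for t in templates] for name in class_names}

prompts = build_prompts(["dog", "cat"])
print(prompts["dog"][0])  # a photo of a dog.
```

Averaging the embeddings of several such variants tends to give a more robust class representation than a single bare label.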
Semantic Similarity: A measure of how closely related the meanings of two pieces of content are, regardless of their surface-level representation. In VLMs, an image of a dog and the text 'a dog' would have high semantic similarity.
Feature Extraction: The process of automatically learning and identifying important patterns and characteristics from raw data. Vision encoders extract visual features like edges, textures, and object shapes from images.
Cosine Similarity: A metric used to measure how similar two vectors are by computing the cosine of the angle between them. In VLMs, cosine similarity between image and text embeddings determines how well they match semantically, with values ranging from -1 (opposite) to 1 (identical).
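The definition maps directly onto a short formula, cos(theta) = (a . b) / (||a|| ||b||); a minimal pure-Python version:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0 (identical direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0 (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```

Because only the angle matters, vector magnitudes are ignored; in practice embeddings are often L2-normalized first, which makes cosine similarity a simple dot product.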
Batch Normalization: A technique that normalizes the inputs to each layer of a neural network, stabilizing and accelerating training. Widely used in vision encoders to improve gradient flow and enable training of deeper networks.
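The core computation is simple to sketch for a batch of scalar activations: subtract the batch mean, divide by the batch standard deviation, then apply a learnable scale and shift. This toy omits the running statistics a real implementation tracks for inference.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalar activations to zero mean and unit
    variance, then apply the learnable scale (gamma) and shift (beta)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print(sum(out))  # approximately 0 (zero mean after normalization)
```

The small epsilon keeps the division stable when a batch has near-zero variance.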
Transfer Learning: A machine learning technique where a model trained on one task is repurposed for a different but related task. VLMs like CLIP excel at transfer learning because their general visual-linguistic representations can be applied to many downstream tasks without task-specific training.
Image Captioning: The task of automatically generating a natural language description of an image. This requires the model to identify objects, their attributes, spatial relationships, and activities, then compose a grammatically correct sentence conveying this information.
Self-Supervised Learning: A training paradigm where the model learns representations from unlabeled data by solving pretext tasks derived from the data itself. Contrastive learning on image-text pairs is a form of self-supervised learning that has proven highly effective for VLMs.
Multimodal Embedding: A learned vector representation that captures information from multiple modalities (such as image and text) in a shared space. Multimodal embeddings enable cross-modal retrieval, where a text query can find relevant images or an image query can find relevant text descriptions.
Diffusion Model: A generative model that learns to create data (often images) by gradually denoising random noise through a learned reverse diffusion process. Models like DALL-E 2 and Stable Diffusion use CLIP text embeddings to guide image generation from text descriptions.
Region of Interest (ROI): A specific area within an image that is relevant for a particular task. In vision-language models, the model may attend to specific regions of interest when answering questions or generating descriptions about localized content within an image.
Instruction Tuning: Training a language model to follow natural language instructions, making it more controllable and useful for diverse tasks. Visual instruction tuning extends this to image-text instruction pairs.
Adapter Layer: A lightweight neural network module inserted into a pre-trained model to adapt it to new tasks or modalities with minimal parameter updates, preserving the original model's knowledge.
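A common adapter design is a bottleneck with a residual connection: project the hidden state down to a small dimension, apply a nonlinearity, project back up, and add the result to the input. A pure-Python sketch with made-up weights (matrices are plain nested lists):

```python
def adapter(hidden, down_w, up_w):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.
    down_w has r rows of length d (d -> r); up_w has d rows of length r."""
    # Down-projection to the bottleneck dimension, with ReLU.
    bottleneck = [max(0.0, sum(h * w for h, w in zip(hidden, row)))
                  for row in down_w]
    # Up-projection back to the hidden dimension.
    delta = [sum(b * w for b, w in zip(bottleneck, row)) for row in up_w]
    # Residual connection: output = input + learned adjustment.
    return [h + d for h, d in zip(hidden, delta)]

hidden = [1.0, 2.0, 3.0]
down_w = [[0.1, 0.1, 0.1]]      # 3 -> 1 bottleneck
up_w = [[0.0], [0.0], [0.0]]    # 1 -> 3, zero-initialized
print(adapter(hidden, down_w, up_w))  # [1.0, 2.0, 3.0]
```

Zero-initializing the up-projection makes the adapter start as an identity map, so inserting it cannot disturb the frozen backbone's behavior before training.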
Vision-Language Pre-training: The process of training a model on large-scale image-text data to learn general cross-modal representations before fine-tuning on specific downstream tasks.
Generative Pre-trained Transformer (GPT): A family of autoregressive language models that generate text token by token. GPT-4V extended the architecture to also process visual inputs, creating a powerful vision-language model.
RLHF (Reinforcement Learning from Human Feedback): A training technique that uses human preferences to fine-tune AI models, improving their helpfulness and safety. Applied to multimodal models to improve image understanding quality.
Multimodal Large Language Model (MLLM): A large language model extended to process multiple types of input (text, images, audio, video). Examples include GPT-4V, Gemini, and Claude, which can understand and reason about visual content alongside text.
Few-Shot Learning: The ability of a model to learn a new task from just a few examples, without extensive retraining. VLMs like Flamingo demonstrated remarkable few-shot capabilities across diverse visual tasks.
Visual Instruction Tuning: Training a vision-language model to follow natural language instructions about images, such as 'Describe this image in detail' or 'What is wrong in this picture?', pioneered by LLaVA.
Cross-Attention: A transformer mechanism that allows one modality to attend to another. In VLMs, cross-attention lets the language model attend to relevant image regions when generating text responses.
DALL-E: An AI system by OpenAI that generates images from text descriptions; DALL-E 2 uses CLIP embeddings to guide the generation process. Demonstrates the reverse direction of vision-language understanding.
Grounding: The process of connecting abstract language concepts to specific visual elements in an image, such as identifying which object in a photo is being referred to by a descriptive phrase.
Hallucination: When a VLM generates descriptions of objects, attributes, or relationships that do not actually exist in the input image. Reducing hallucination is a major ongoing research challenge.
Object Detection: The task of identifying and localizing objects within an image by predicting bounding boxes and class labels. Modern VLMs extend this to open-vocabulary detection using natural language descriptions.
Image Segmentation: Dividing an image into meaningful regions at the pixel level. Semantic segmentation labels each pixel with a class, while instance segmentation distinguishes individual objects of the same class.
Caption Generation: The task of automatically producing a natural language description of an image's content. Modern captioning systems use VLMs to generate detailed, contextually rich descriptions that go beyond simple object listing.
Multimodal Reasoning: The ability to perform logical inference that requires information from multiple modalities. For example, answering 'Is the cup likely to fall?' requires understanding both the visual scene geometry and physical reasoning.
LAION: Large-scale Artificial Intelligence Open Network, a non-profit that created massive open-source image-text datasets (LAION-5B, with 5.85 billion pairs) used to train many vision-language models.
Visual Encoder: The component of a VLM that processes images and extracts visual features. Common architectures include Vision Transformers (ViT), ConvNeXt, and CLIP's visual encoder.
Q-Former: A lightweight transformer module used in BLIP-2 that bridges a frozen image encoder and a frozen large language model, learning to extract the most informative visual features for language generation.
Masked Image Modeling: A self-supervised pre-training technique where parts of an image are masked (hidden) and the model must predict the missing content, learning rich visual representations in the process.
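The masking step can be sketched in a few lines; the 75% ratio below follows the random-masking scheme used by MAE, and the function name is hypothetical. Patches are represented as strings for clarity.

```python
import random

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Randomly hide a fraction of patch tokens; the model is trained
    to reconstruct the hidden patches from the visible context."""
    rng = random.Random(seed)
    n = len(patches)
    masked_idx = set(rng.sample(range(n), int(n * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    return visible, sorted(masked_idx)

patches = [f"patch_{i}" for i in range(16)]
visible, hidden = mask_patches(patches)
print(len(visible), len(hidden))  # 4 12
```

Because the reconstruction targets come from the image itself, no labels are needed, which is what makes this a self-supervised pretext task.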