Computer Vision

Computer vision gives machines the ability to "see." Using deep learning models, particularly convolutional neural networks (CNNs) and, more recently, Vision Transformers (ViTs), systems can classify objects, detect faces, read documents, analyze medical scans, and interpret video streams. Combined with language models, computer vision forms the perceptual backbone of multimodal AI systems that can understand and generate both text and imagery.

Category: Foundations

Reading time: 6 min read

Last updated: Feb 28, 2025

Published: Feb 28, 2025

Definition

A field of AI that enables machines to interpret and understand visual information from images and video.

Key tasks

Stanford University; LeCun et al., 1998

Computer vision encompasses a wide range of tasks, each with different model architectures and training requirements.

• Image classification: assigning a label to an entire image (e.g., "cat" or "invoice").
• Object detection: locating and identifying multiple objects within an image with bounding boxes.
• Semantic segmentation: labeling every pixel in an image with a class.
• Optical character recognition (OCR): extracting text from images and documents.
• Video understanding: tracking objects and events across frames over time.
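To make the bounding-box idea from the object detection task concrete, here is a minimal sketch of intersection over union (IoU), a measure commonly used to compare a predicted box with a labeled one. IoU is not mentioned in the article itself; the `(x_min, y_min, x_max, y_max)` box layout, the `iou` function name, and the sample coordinates are all assumptions made for illustration.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle; it is empty if the boxes do not touch.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Guard against degenerate zero-area boxes.
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and labeled boxes.
predicted = (10.0, 10.0, 50.0, 50.0)
ground_truth = (30.0, 30.0, 70.0, 70.0)
overlap = iou(predicted, ground_truth)
print(f"IoU = {overlap:.3f}")
```

A value of 1.0 means the boxes coincide and 0.0 means they do not overlap. Detection benchmarks typically count a prediction as correct only when its IoU with a labeled box exceeds a chosen threshold.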

From CNNs to Vision Transformers

Dosovitskiy et al., 2020; LeCun et al., 1998

For roughly a decade, convolutional neural networks (CNNs) dominated computer vision by exploiting the spatial structure of images through learnable filters. The Vision Transformer (ViT), introduced in 2020, applied the self-attention mechanism from NLP to sequences of image patches. When pre-trained on large datasets, it matched or exceeded state-of-the-art CNNs on image classification, and its token-based design makes integration with language models more straightforward.
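The two ideas in this section, sliding a filter across an image and cutting an image into patches, can be sketched in plain Python. This is only an illustration under simplifying assumptions: the image is a tiny grayscale grid, the filter is hand-written rather than learned, and the function names `conv2d_valid` and `to_patches` are hypothetical. A real ViT would also project each patch to an embedding and add position information, which is omitted here.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def conv2d_valid(image: Image, kernel: Image) -> Image:
    """Slide a kernel over the image (no padding, stride 1).

    Like most deep learning libraries, this computes cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def to_patches(image: Image, patch: int) -> List[List[float]]:
    """Cut the image into non-overlapping patch x patch tiles, each flattened."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return [
        [image[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


# Toy 4x4 image whose pixel values increase left to right, top to bottom.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# CNN view: one small filter reused at every position (fixed here, learned in practice).
edge_kernel = [[1.0, -1.0]]
features = conv2d_valid(image, edge_kernel)

# ViT view: the image becomes a short sequence of flattened patches ("tokens").
tokens = to_patches(image, 2)

print(len(features), len(features[0]))  # feature map height and width
print(len(tokens), len(tokens[0]))      # number of patches and values per patch
```

The convolution reuses the same small filter everywhere, which is how a CNN exploits spatial structure. The patch sequence is what a transformer's self-attention would operate on, in the same way it operates on word tokens in NLP.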

Business applications

Stanford University

Practical deployments of computer vision span industries: automated quality inspection in manufacturing, identity verification in fintech, receipt and invoice parsing in accounting, and visual search in e-commerce. Modern multimodal models like GPT-4V and Claude 3 allow businesses to query images with natural language without building custom vision pipelines.

FAQ

How much labeled data does computer vision require?

Training a vision model from scratch with supervised learning has traditionally required tens of thousands of labeled images. Modern techniques such as transfer learning (fine-tuning a pre-trained model) and few-shot learning dramatically reduce data requirements, often achieving strong results with hundreds or even dozens of labeled examples.

Stanford University

What is the difference between computer vision and multimodal AI?

Computer vision is specifically about understanding images and video. Multimodal AI is the broader capability of processing and generating content across multiple modalities (text, images, audio, and video), often by combining language models with vision models.

Dosovitskiy et al., 2020
