You save a one-page document as a PDF and find it's 8MB. Or a 20-page report turns out to be 50MB. Where is all that data hiding? This article breaks down exactly what lives inside a PDF file and what drives its size.
What's Inside a PDF File
A PDF isn't just a picture of a page — it's a structured container holding several distinct types of data:
- Raster images — Photographs, screenshots, and scanned content stored as pixel grids
- Vector graphics — Charts, diagrams, logos, and illustrations stored as mathematical paths
- Text and fonts — The actual text content plus embedded font files so it displays correctly
- Metadata — Author name, creation date, software used, revision history
- Annotations and form fields — Comments, highlights, fillable fields
- Thumbnails and preview data — Small page previews some PDF creators embed
Images: The Biggest Culprit
By a wide margin, embedded raster images are the largest contributor to PDF file size. A single high-resolution photograph taken on a modern smartphone can be 5–15MB when embedded at full resolution in a PDF.
The problem is that many applications (Microsoft Word, Google Docs, Apple Pages) can embed images at or near their original resolution when exporting to PDF, depending on the export settings, even if the document only displays them at a small size. The image data is there in full; the PDF just scales it down visually.
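To see how much embedded image data goes unused, compare an image's pixel dimensions with the size it occupies on the page. The sketch below is a rough illustration, not a description of how any particular exporter works. The function names and the example photo (4800 pixels wide, placed 4 inches wide) are assumptions chosen for the arithmetic.

```python
def effective_dpi(pixel_width: int, displayed_width_in: float) -> float:
    """Pixels per inch the image carries at its displayed size."""
    return pixel_width / displayed_width_in


def downsample_ratio(pixel_width: int, displayed_width_in: float,
                     target_dpi: float) -> float:
    """Fraction of the original pixel count left after resampling to target_dpi.

    Width and height shrink by the same factor, so the pixel count
    shrinks by the square of that factor. Images already at or below
    the target are left alone (ratio 1.0).
    """
    scale = min(1.0, target_dpi / effective_dpi(pixel_width, displayed_width_in))
    return scale * scale


# Hypothetical example: a 4800-pixel-wide photo placed 4 inches wide.
print(effective_dpi(4800, 4.0))            # 1200.0
print(downsample_ratio(4800, 4.0, 150.0))  # 0.015625
```

In this example, resampling to 150 DPI would keep under 2% of the original pixels. The actual file-size saving depends on the image content and the compression used, so treat the ratio as an upper bound on what is being carried unnecessarily, not a prediction of the final size.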
Scanned PDFs are particularly large because every single page is a full-resolution raster image. A 20-page scanned document could easily be 40–80MB before any compression.
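A back-of-the-envelope calculation shows where figures like these come from. The sketch below assumes US Letter pages (8.5 × 11 inches) and uncompressed pixel data; the function names, page size, and the DPI and color-depth combinations are assumptions for illustration, and real scanners vary.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of one scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def raw_document_bytes(pages: int, width_in: float, height_in: float,
                       dpi: int, bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of a whole scanned document in bytes."""
    return pages * raw_page_bytes(width_in, height_in, dpi, bytes_per_pixel)


# 20 US Letter pages under three assumed scanner settings.
for label, dpi, bpp in [("150 DPI grayscale", 150, 1),
                        ("200 DPI grayscale", 200, 1),
                        ("300 DPI color", 300, 3)]:
    size = raw_document_bytes(20, 8.5, 11.0, dpi, bpp)
    print(f"{label}: {size / 1_000_000:.1f} MB")
# 150 DPI grayscale: 42.1 MB
# 200 DPI grayscale: 74.8 MB
# 300 DPI color: 504.9 MB
```

Under these assumptions, a 40–80MB total corresponds to grayscale scans at roughly 150–200 DPI. Full-color scans at 300 DPI would be far larger without compression, which is why scanners and PDF tools normally compress page images.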
Fonts and Text
Text itself is very compact: a novel in plain text is only about 1MB. But PDFs often embed font files to ensure the document looks the same on every device. A single professional font file can be 200–500KB, and complex documents may embed several fonts.
Most modern PDF creators use "font subsetting": embedding only the characters actually used in the document rather than the entire font. This significantly reduces font-related bloat.
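A rough way to picture the saving from subsetting is to count how many distinct characters a document uses and compare that with the number of glyphs in the font. The sketch below is only an estimate under strong assumptions: it treats one character as one glyph (ignoring ligatures and alternates) and assumes every glyph takes the same space, neither of which holds exactly for real fonts. The 300KB font with 600 glyphs is hypothetical.

```python
def glyphs_needed(text: str) -> int:
    """Count the distinct non-whitespace characters a document uses."""
    return len({ch for ch in text if not ch.isspace()})


def estimated_subset_kb(full_font_kb: float, glyphs_in_font: int,
                        text: str) -> float:
    """Estimate subset size, assuming size scales linearly with glyph count."""
    used = min(glyphs_needed(text), glyphs_in_font)
    return full_font_kb * used / glyphs_in_font


# Hypothetical font: 300KB on disk, 600 glyphs.
sample = "Quarterly revenue grew 12% year over year."
print(glyphs_needed(sample))
print(estimated_subset_kb(300, 600, sample))
```

Even a long English document rarely uses more than about a hundred distinct characters, so under this simple model a subset would be a small fraction of the full font.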
Metadata and Hidden Data
PDF files can carry a surprising amount of invisible data:
- Author and company names, creation software, and revision timestamps
- Multiple revisions of content that have been "deleted" but are still stored in the file
- Color profile data (ICC profiles) for accurate print color reproduction
- Thumbnail images of each page for quick preview rendering
- JavaScript for interactive PDFs
Metadata alone rarely accounts for more than a few hundred KB, but the savings from stripping unnecessary data can add up across a large document.
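A quick way to check whether a file carries some of the hidden data listed above is to scan its raw bytes for the relevant PDF tokens. The sketch below is a heuristic, not a parser: it misses anything stored inside compressed object streams, and the function and dictionary names are assumptions for illustration. More than one `%%EOF` marker usually points to incremental saves (older revisions kept in the file), although linearized ("fast web view") PDFs also contain two.

```python
from pathlib import Path

# Byte tokens that hint at hidden data. Heuristic only: tokens inside
# compressed object streams will not be found by a raw byte search.
TOKENS = {
    "eof_markers": b"%%EOF",        # more than one suggests incremental saves
    "javascript": b"/JavaScript",   # embedded scripts
    "thumbnails": b"/Thumb",        # per-page thumbnail images
    "icc_profiles": b"/ICCBased",   # embedded color profiles
    "metadata_streams": b"/Metadata",
}


def scan_pdf_bytes(data: bytes) -> dict[str, int]:
    """Count occurrences of each token in the raw PDF bytes."""
    return {name: data.count(token) for name, token in TOKENS.items()}


def scan_pdf(path: str) -> dict[str, int]:
    """Read a PDF from disk and count the tokens in it."""
    return scan_pdf_bytes(Path(path).read_bytes())


# Usage (hypothetical file name):
# print(scan_pdf("report.pdf"))
```

A nonzero count tells you the feature is probably present, not how many bytes it occupies; for a size breakdown you would need a tool that parses the object structure.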
How to Reduce Each Component
- Images: Use a PDF compressor like compress-pdf.cc to resample embedded images at a lower resolution without affecting text or vectors.
- Fonts: Most modern PDF compressors handle font subsetting automatically. If your creator application has an option to subset fonts, enable it.
- Metadata: PDF optimization tools can strip unnecessary metadata. For most users, a standard compressor handles this.
- Scanned pages: A PDF compressor is particularly effective here — scanned pages are pure images and compress very well.
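For readers who prefer a scriptable, offline route, the same kind of image resampling can be sketched with Ghostscript, an open-source tool not mentioned above and used here only as an example. The sketch assumes Ghostscript is installed and available on the PATH as `gs`; the function names and file names are hypothetical. The `-dPDFSETTINGS` presets trade quality for size, with `/screen` the most aggressive and `/prepress` the least.

```python
import subprocess

# Common Ghostscript -dPDFSETTINGS presets, roughly from smallest
# output to largest.
PRESETS = ("screen", "ebook", "printer", "prepress")


def build_gs_command(src: str, dst: str, preset: str = "ebook") -> list[str]:
    """Build the Ghostscript argument list for rewriting a PDF."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return [
        "gs",  # assumed executable name; on Windows it is typically gswin64c
        "-sDEVICE=pdfwrite",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src: str, dst: str, preset: str = "ebook") -> None:
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(build_gs_command(src, dst, preset), check=True)


# Usage (hypothetical file names):
# compress_pdf("scan.pdf", "scan-small.pdf", preset="ebook")
```

Always write to a new file and compare the result with the original before discarding it, since aggressive presets can visibly degrade images.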