Why is my PDF file so big?
PDF files are created by many different applications and devices, not all of which are configured to produce documents optimized for small file sizes. In addition, some content types bloat when converted to PDF format. It is common to find that a PDF has a large file size, but it can be very difficult to work out why, and harder still to know what to do about it. In this article I will show how to see a breakdown of a PDF file’s contents, what is bloating the file size and what can be done in each case to reduce it.
How to see what’s taking the space in a PDF file
First, it is important to know that there are tools that can show you a breakdown of content within your PDF file. I only know of two ways to get this information though there may be other options.
WeCompress File Analyzer
A free option is to use WeCompress's online File Analyzer service. Just drag and drop your file onto the browser and it will upload, process your file and then show a breakdown of your PDF’s content. (Note: it works for Microsoft Office docs too!)
It is designed to be simple and straightforward, and it even includes useful information like Page Count and Size Per Page.
Acrobat Pro’s Audit Space Usage Feature
The advanced PDF editor, Adobe Acrobat Pro, has a (hard to find) tool called Audit Space Usage which does a similar job. To access it:
- Open your PDF in Acrobat Pro
- Go to the File menu and choose Save As...
- In the Save as type dropdown menu choose Adobe PDF files, Optimized (*.pdf)
- Click the Settings… button below the dropdown menu
- In the PDF Optimizer dialog click the Audit Space Usage button in the top right
This breakdown contains a lot more detail about the file contents. However, the terminology can be confusing for the average user.
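For a very rough do-it-yourself look, a PDF can also be scanned for the dictionary keys that tend to mark each category. The sketch below is only a heuristic and not a real parser: it counts raw occurrences of a few names from the PDF format (/Subtype /Image, /FontFile, /PieceInfo and /Type /Page) and will miss anything stored inside compressed object streams. The function name and the category labels are my own, and the counts say nothing about how many bytes each item uses.

```python
import re

# Raw byte patterns for PDF dictionary keys. The labels on the left are
# illustrative names chosen for this sketch, not official terminology.
MARKERS = {
    "images": rb"/Subtype\s*/Image\b",
    "embedded_fonts": rb"/FontFile[23]?\b",
    "application_data": rb"/PieceInfo\b",
    "pages": rb"/Type\s*/Page\b",  # \b stops this from matching /Pages
}


def count_markers(data: bytes) -> dict:
    """Count how often each marker appears in the raw bytes of a PDF."""
    return {
        label: len(re.findall(pattern, data))
        for label, pattern in MARKERS.items()
    }
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example count_markers(open("report.pdf", "rb").read()), where report.pdf is a placeholder name.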
Quick compression advice
When you obtain a breakdown of the content in your PDF file you should see which area is taking up the most space. Read the sections below to understand each category in more detail. Alternatively, here's a short summary of how best to reduce the file size for each category:
In order to compress PDFs containing mostly Large Images, Embedded Fonts or Application Data (Piece info), use either an online PDF compressor like WeCompress or an offline one like NXPowerLite Desktop.
If most of your data is made up of Content Streams, or you have a PDF with a Large page size or a considerable Number of pages, read the sections below where we have identified some alternative methods to reduce the file size.
Large images
Image content is the most common reason for a PDF file to become so big. Here are the four main reasons for large images appearing within PDF files.
Excessive image resolution
Most image capturing devices now capture images at very high resolution, and this means large image file sizes. If these images are inserted into a PDF file, the file size will increase by approximately the size of each image. Unless you have set a very large page size for your PDF file, it is very unlikely that you need the full resolution of the image. Such images can safely be resized to a more appropriate resolution before they are inserted into the PDF, which should greatly reduce the file size.
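To get a feel for how much resolution is excessive, it helps to compare an image's pixel width with the width it occupies on the page. The sketch below is a minimal illustration: the function names are my own and the 150 DPI default target is an assumption, not a rule, so pick a value that suits how the PDF will be used.

```python
def effective_dpi(pixel_width: int, placed_width_inches: float) -> float:
    """Pixels per inch of an image at the width it occupies on the page."""
    return pixel_width / placed_width_inches


def target_pixel_width(placed_width_inches: float, target_dpi: int = 150) -> int:
    """Pixel width needed to reach target_dpi at the placed width.

    The default of 150 DPI is an assumed target for on-screen reading.
    """
    return round(placed_width_inches * target_dpi)


# A 4000 pixel wide photo placed 6 inches wide on the page:
dpi = effective_dpi(4000, 6.0)       # roughly 667 DPI
new_width = target_pixel_width(6.0)  # 900 pixels at the assumed 150 DPI
```

Because the height shrinks in the same proportion, resizing from 4000 to 900 pixels wide leaves only about 5% of the original pixels to store.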
Poor image compression
Another reason that images may be overly large is that they are captured by devices using image compression algorithms optimized for speed. It is understandable that devices such as cameras want to compress and save images as quickly as possible, but in general this means the images are much larger than necessary. When you have access to more computing power than a camera, it is possible to recompress images with a compression algorithm optimized for file size. This will likely reduce the file size by 50-60% without any change in resolution, and the image should be visually indistinguishable from the original.
Poorly configured scanners
Scanners often capture document pages as high-resolution images without compression, so even documents that appear to be just text can be made up entirely of images. This drives up the size of the PDF that is created, even if the images are black and white. Users and organisations commonly do not have access to the scanner settings, so they cannot configure the scanner to produce more optimally sized PDF files.
By reducing the resolution of these images and choosing the most efficient image encoding for the content, these files can be dramatically reduced in size.
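Some quick arithmetic shows why uncompressed scans are so large. The sketch below estimates the raw size of one scanned page from its dimensions, resolution and colour depth. The function name is my own, and the figure is approximate because it ignores any padding of image rows and any file overhead.

```python
def raw_scan_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int) -> int:
    """Approximate uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# One US Letter page (8.5 x 11 inches) scanned at 300 DPI:
colour = raw_scan_bytes(8.5, 11, 300, 24)  # 24-bit colour
mono = raw_scan_bytes(8.5, 11, 300, 1)     # 1-bit black and white
```

With these inputs the colour page comes to about 25 MB and the black and white page to about 1 MB before any compression is applied, so resolution and colour depth both matter.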
Text converted to outlines (images) when exported to PDF
To ensure the PDF displays well on every device, some designers or PDF editors may opt to convert the fonts to outlines. For example, if the document is to be printed and it contains a typeface that your print company does not have on their system, the font will be replaced with another font, which will likely alter the layout and mean the document is printed incorrectly.
Converting text to outlines means that the text is no longer text. It has become a graphic, and the text can no longer be edited. Once the text is stored as a graphic, it will look the same on any device no matter which fonts are or are not installed. This greater fidelity comes at the expense of file size, which in most cases is likely to increase, by an amount that depends on the fonts used.
Microsoft Office does a similar thing by default when you export to PDF. Any text using fonts that are not installed on the host machine, or that cannot be embedded at the point of exporting to PDF, will be converted to images.
One way to detect whether fonts have been converted to images is to try to select some text. As you can see from the images below, if you are presented with a frame around the text rather than a text editor cursor, the text is now an image representation of the text.
Selection is recognized as text by Acrobat Reader
Selection is recognized as an image or object, indicating the text has been converted to images
Resolution: Compress large images in PDF files
If you don’t have a PDF editor application, extracting images from PDFs to resize or recompress them with a more efficient compression algorithm is not straightforward. I advise using a PDF compressor like WeCompress (online) or NXPowerLite Desktop (offline software), which can automatically resize and recompress the images in the PDF.
If you have a PDF with scanned content, you may be able to reduce the file size by running an Optical Character Recognition (OCR) process on the content. I have previously used Soda PDF’s online OCR service with success. OCR converts those image representations of text back to actual text and image elements, which can significantly reduce the size. Be aware that this method won’t work on every scanned PDF, and the OCR process can sometimes make the PDF file bigger!
To identify whether a PDF has scanned content, try to select some text. If the whole page is highlighted rather than just one word, it is likely to be scanned content, much like the converted text described above.
Content streams
Most applications make use of image and text markup to create PDF content items. However, some applications create PDFs that rely heavily on Content Streams. These are essentially the contents of the pages: the text and any line drawings. When content streams are used, a page in a PDF document has one or more content stream parts that together contain all the PDF page description commands for the page. The problem is that because all of the content is stored in a ‘Stream’ of data, there is no easy way of identifying which piece of content is driving up the file size.
Resolution: Compress PDFs with content streams
Unlike images, which can be resized or recompressed at a more suitable quality to reduce their size, content streams tend to be large and cannot be reduced in the same way. However, there are workarounds to reduce the size of PDF files made from content streams. The main one is to reprint the PDF using a browser PDF printer and then use a PDF compressor to reduce the size of the resulting file, as discussed in this support article.
Embedded fonts
If you want to ensure fonts look the same on every device on which the PDF is shown, it is a great idea to embed the fonts. Even if the host machine does not have the fonts installed, this will guarantee that the fonts display correctly and the document’s layout remains as the editor intended.
However, embedding the fonts comes at the cost of increased file size. If a document has multiple fonts or double-byte fonts, this can mean the file size increases by many megabytes. The best compromise is to subset the fonts in the PDF, which removes any unused characters (glyphs) from each embedded font. This will reduce the file size but will also mean that editing the file in the future will require the fonts to be installed on the host machine.
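To illustrate why subsetting helps, the sketch below counts the distinct characters a piece of text actually uses and compares that with the number of glyphs in a full font. This is a simplification: real subsetting works on glyphs rather than characters, and the two do not always map one to one. The function names and the 3000 glyph font in the example are hypothetical.

```python
def used_characters(text: str) -> set:
    """Distinct non-whitespace characters that appear in the text."""
    return {ch for ch in text if not ch.isspace()}


def subset_fraction(text: str, glyphs_in_font: int) -> float:
    """Rough share of a font's glyphs that the text would need."""
    return len(used_characters(text)) / glyphs_in_font


# The title of this article uses only 16 distinct characters, a tiny
# share of a hypothetical font containing 3000 glyphs.
needed = used_characters("Why is my PDF file so big?")
share = subset_fraction("Why is my PDF file so big?", 3000)
```

The effect is strongest for double-byte fonts, which can contain many thousands of glyphs while a typical document uses only a small part of them.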
Resolution: Compress Embedded fonts in PDFs
Most PDF editors support removal or subsetting of embedded fonts. If you don’t have access to one, use a free online PDF compressor like WeCompress or offline PDF compression software like NXPowerLite Desktop.
It is also worth noting that it could be possible to reduce the size of the PDF by converting the text to images as described in the Large images section, although this will not work for all files.
Hidden Application Data (Piece info)
Applications that create PDF files, such as Adobe Photoshop or Acrobat, are able to store information within a PDF file that they can use when opening or editing the file. This information can only be used by the application that created the file and is not needed to display the PDF. For example, if you output from Adobe InDesign to PDF, the resulting file will contain all the editing data that InDesign needs, but in general this means very large PDF files. We have seen this type of data make up 90% of the file size!
Resolution: Delete Hidden Application Data in PDF files
Unless you need this data for editing the file, you can delete it, as doing so will have no effect on most uses of the PDF file. If you have Acrobat Pro, you can use it to save an ‘Optimized’ PDF file. However, for a simple way to delete Hidden Application Data, use either WeCompress online or the NXPowerLite Desktop offline PDF compressor.
Large page size
This is something that is not usually considered by users because the PDF viewing or editing tool makes it easy to miss this property.
While most PDF pages are either standard A4 or US Letter, some are significantly larger. Take a design for a banner that needs to be printed. It needs to be designed at the actual size needed for printing, and for the PDF to print well, high-resolution graphics and content are likely to be used. This drives up the file size significantly, and you’ll have to be more careful with your compression options if you want to reduce the file size without adversely affecting the quality of the print. Most PDF viewing or editing applications don’t highlight the page size, so you’ll have to delve into the properties of the file to find this information.
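If your tool reports the page dimensions in PDF points rather than millimetres, a small conversion makes an oversized page easy to spot. The sketch below assumes the default PDF unit of 1/72 inch per point and compares the result against A4 and US Letter. The function names, the 2 mm tolerance and the banner dimensions are my own choices for illustration.

```python
POINTS_PER_INCH = 72   # default PDF unit: 1 point = 1/72 inch
MM_PER_INCH = 25.4

# Short side first, in millimetres.
STANDARD_SIZES_MM = {
    "A4": (210.0, 297.0),
    "US Letter": (215.9, 279.4),
}


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def describe_page(width_pt: float, height_pt: float,
                  tolerance_mm: float = 2.0) -> str:
    """Name the page size if it is close to a standard one."""
    # Sorting makes the check independent of portrait or landscape.
    short, long_ = sorted((points_to_mm(width_pt), points_to_mm(height_pt)))
    for name, (std_short, std_long) in STANDARD_SIZES_MM.items():
        if (abs(short - std_short) <= tolerance_mm
                and abs(long_ - std_long) <= tolerance_mm):
            return name
    return f"custom ({short:.0f} x {long_:.0f} mm)"
```

For example, describe_page(595, 842) reports A4, while a hypothetical banner of 2400 x 7200 points comes back as a custom size of roughly 0.85 x 2.5 metres.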
Resolution: Resize PDF page size
If you decide you don’t need the PDF at full page size, you have limited options. I found a couple of online tools to resize (scale) PDF pages (e.g. Docupub), or if you have Adobe Acrobat Pro you can scale PDF pages using its Preflight tool. I would be very careful to check the content afterwards, because scaling content in PDF files is not an easy process and can result in the content looking weird.
Number of pages
It stands to reason that the more pages of content that are included in a PDF file the larger the file size. When most people look at the PDF file they don’t usually take account of the number of pages.
Resolution: Reduce the number of PDF pages by splitting
As in WeCompress’s File Analyzer service, I use a metric to work out whether a PDF is bloated and in need of compression. By dividing the file size by the number of pages you get a Size Per Page metric. For most files an average size of around 100 KB per page is about optimal. So if you find that your PDF comes in significantly higher than this, you could benefit from using a PDF compressor.
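The Size Per Page calculation is simple enough to sketch in a few lines. The function names below are my own, the 100 KB default comes from the rule of thumb above, and treating 1 KB as 1024 bytes is an assumption.

```python
def size_per_page_kb(file_size_bytes: int, page_count: int) -> float:
    """Average size per page in KB (1 KB taken as 1024 bytes)."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return file_size_bytes / 1024 / page_count


def looks_bloated(file_size_bytes: int, page_count: int,
                  threshold_kb: float = 100.0) -> bool:
    """True when the average page is larger than the threshold."""
    return size_per_page_kb(file_size_bytes, page_count) > threshold_kb


# A 5 MB file with 10 pages averages 512 KB per page.
average = size_per_page_kb(5 * 1024 * 1024, 10)
```

A result only slightly above the threshold is not a cause for concern, so it may be worth raising threshold_kb if you only want to catch files that are significantly over.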
In the event that PDF compressors do not create smaller files, another option is to split the PDF file into two or more parts. This will create smaller files, but they are significantly less convenient to share with others.
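If you do decide to split, the page ranges for each part can be worked out ahead of time. The sketch below divides a page count into parts of near-equal length. The function name is my own, and note that an equal number of pages does not guarantee equal file sizes, since some pages may hold far more content than others.

```python
def split_ranges(page_count: int, parts: int) -> list:
    """Return 1-based inclusive (first, last) page ranges for each part."""
    if parts < 1 or parts > page_count:
        raise ValueError("parts must be between 1 and page_count")
    base, extra = divmod(page_count, parts)
    ranges = []
    start = 1
    for i in range(parts):
        # The first `extra` parts each take one additional page.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


# A 10 page file split into 3 parts: pages 1-4, 5-7 and 8-10.
plan = split_ranges(10, 3)
```

These ranges can then be entered into whichever PDF splitting tool you prefer.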