Research & explainer

How Common Is Document Metadata Leakage? A Study

Last updated: 2026-06-09

Document metadata leakage is common because it is automatic and invisible: you don't add it, you can't see it on the page, and most people never remove it. Every Word, Excel or PDF file created with a signed-in Office account embeds the account holder's display name as the author by default, and the company name from your licence is written in too. PDFs record the producing software and the creation and modification timestamps in an Info dictionary and again in a duplicate XMP stream. None of this shows when you read the document, so it ships unnoticed. This page explains the mechanisms that make leakage the default, catalogues the specific fields each format carries (with their technical locations), and describes a transparent methodology for quantifying real-world exposure rates. It uses only verifiable, mechanism-level facts and no invented statistics.

Why leakage is the default, not the exception

  • Authorship is automatic: Office writes your account display name into dc:creator without asking.
  • It's invisible: none of these fields render in the document body, so there's no visual cue to remove them.
  • It's duplicated: PDFs store metadata in both the Info dictionary and an XMP stream, so a partial clean leaves a copy.
  • It survives edits: RSIDs, comments and reviewer names persist in the XML even after 'accept all changes', and deleted text stays in the file until each tracked change is accepted or rejected.
  • Conversion carries it: 'Save as PDF' transfers the original author and timestamps rather than removing them.
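
To make the "automatic and invisible" point concrete, here is a minimal sketch of how the author, last-modified-by and company fields can be read straight out of an OOXML package using only the Python standard library. The function name `populated_fields` and the three-field selection are illustrative assumptions; this is not MetaDocu's scanner, only a demonstration that the values sit in plain XML parts inside the ZIP container.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the OOXML core and extended (app) properties parts.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}

# (label, package part, element) for a few identity fields.
# The labels are hypothetical names chosen for this sketch.
FIELDS = [
    ("author", "docProps/core.xml", "dc:creator"),
    ("last_modified_by", "docProps/core.xml", "cp:lastModifiedBy"),
    ("company", "docProps/app.xml", "ep:Company"),
]


def populated_fields(ooxml_bytes: bytes) -> dict[str, str]:
    """Return the non-empty identity fields found in a .docx/.xlsx package."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as pkg:
        names = set(pkg.namelist())
        for label, part, path in FIELDS:
            if part not in names:
                continue  # the part is optional; skip files that lack it
            root = ET.fromstring(pkg.read(part))
            node = root.find(path, NS)
            if node is not None and node.text and node.text.strip():
                found[label] = node.text.strip()
    return found
```

Calling `populated_fields(open("report.docx", "rb").read())` on a file of your own (the file name is a placeholder) returns a dictionary of whichever of these fields are filled in, without opening the document in Office.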

What each format carries

The field-level exposure map later on this page is what MetaDocu works from: the specific hidden fields in Word, Excel, PDF and embedded images, where each physically lives, what it exposes, and how it's removed. These are structural facts about the OOXML and PDF formats, independently verifiable by inspecting any file's XML or object tree.

Methodology for quantifying exposure (transparent, reproducible)

  • Sample: collect a defined set of publicly available documents (e.g. resumes, public-body PDFs) under their licences/terms.
  • Measure locally: run each file through MetaDocu's scanner in the browser and record which fields are non-empty.
  • Aggregate: report the share of files exposing a real name, a company, a local path, GPS, or revision history.
  • Publish method + numbers together, so the rates are reproducible rather than asserted.
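
The aggregation step above can be sketched as follows. This assumes each scanned file has already been reduced to a dictionary of booleans, one per exposure category; that record shape, the category keys and the function name `exposure_rates` are hypothetical choices for illustration, not the output format of MetaDocu's scanner.

```python
from collections import Counter

# Hypothetical category keys mirroring the methodology's five exposure types.
CATEGORIES = ("real_name", "company", "local_path", "gps", "revision_history")


def exposure_rates(scans: list[dict[str, bool]]) -> dict[str, float]:
    """Return the share of files (0.0 to 1.0) that expose each category.

    `scans` holds one dict per file; a missing key counts as "not exposed".
    """
    if not scans:
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter()
    for scan in scans:
        for category in CATEGORIES:
            if scan.get(category):
                counts[category] += 1
    return {category: counts[category] / len(scans) for category in CATEGORIES}
```

Reporting the sample size alongside these shares, together with how the sample was collected, is what keeps the resulting rates reproducible.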

The field-level exposure map (all formats)

Each hidden field, its technical location, what it leaks, and how MetaDocu removes it — verifiable structural facts.

Each entry gives the hidden field and its technical location, followed by the formats affected, what it exposes, the risk level, and how MetaDocu removes it.

Author / Creator (dc:creator · PDF /Author)
  • Formats: Word (.docx), Excel (.xlsx), PDF
  • What it exposes: The real name (or Office sign-in name) of whoever first created the file, often your full legal name.
  • Risk: High
  • How MetaDocu removes it: Cleared from the OOXML core properties / PDF Info dictionary in browser memory; the field is emptied, not just hidden.

Last Modified By (cp:lastModifiedBy)
  • Formats: Word (.docx), Excel (.xlsx)
  • What it exposes: The name of the last person to save the file, which exposes internal reviewers and collaboration chains.
  • Risk: High
  • How MetaDocu removes it: Stripped from the core properties XML so no editor identity remains.

Company (Company in app.xml)
  • Formats: Word (.docx), Excel (.xlsx)
  • What it exposes: The organization name baked in from your Office licence, which reveals your employer even on a personal document.
  • Risk: Medium
  • How MetaDocu removes it: Removed from the extended (app) properties part.

Manager (Manager in app.xml)
  • Formats: Word (.docx), Excel (.xlsx)
  • What it exposes: The manager name some templates embed, which leaks your reporting line.
  • Risk: Medium
  • How MetaDocu removes it: Cleared from the extended properties.

Template path (Template in app.xml)
  • Formats: Word (.docx), Excel (.xlsx)
  • What it exposes: The template reference, which can include an absolute file path (e.g. C:\Users\<you>\…) that leaks your account name and local folder layout.
  • Risk: High
  • How MetaDocu removes it: Path is wiped so no local filesystem clue ships with the file.

Application & version (Application/AppVersion · PDF /Producer · /Creator)
  • Formats: Word (.docx), Excel (.xlsx), PDF
  • What it exposes: The exact software and version used, a fingerprint for targeting known vulnerabilities or deanonymizing authors.
  • Risk: Low
  • How MetaDocu removes it: Normalized/removed from app properties and the PDF Producer/Creator fields.

Revision number (cp:revision)
  • Formats: Word (.docx), Excel (.xlsx)
  • What it exposes: How many times the file was saved, which hints at how heavily a 'final' document was reworked.
  • Risk: Low
  • How MetaDocu removes it: Reset in the core properties.

Total editing time (TotalTime in app.xml)
  • Formats: Word (.docx), Excel (.xlsx)
  • What it exposes: Cumulative minutes spent editing, which can contradict claims about when or how long work was done.
  • Risk: Low
  • How MetaDocu removes it: Zeroed out in the extended properties.

Created / Modified dates (dcterms:created/modified · PDF /CreationDate /ModDate)
  • Formats: Word (.docx), Excel (.xlsx), PDF
  • What it exposes: Precise creation and last-edit timestamps, which build a timeline of your activity.
  • Risk: Medium
  • How MetaDocu removes it: Removed or reset so no editing timeline leaks.

Title / Subject / Keywords (dc:title, dc:subject, cp:keywords · PDF /Title /Subject /Keywords)
  • Formats: Word (.docx), Excel (.xlsx), PDF
  • What it exposes: Internal codenames, client names, or tags left in the properties even when not shown in the document text.
  • Risk: Medium
  • How MetaDocu removes it: Cleared from both OOXML properties and the PDF Info dictionary.

Revision save IDs (RSID) (w:rsid in settings.xml + run-level rsids)
  • Formats: Word (.docx)
  • What it exposes: Random per-editing-session IDs that let two files be linked to a shared source document or editing history.
  • Risk: Medium
  • How MetaDocu removes it: RSID nodes are physically stripped from the document XML, breaking cross-file correlation.

Tracked changes & comments (w:ins/w:del, comments.xml, people.xml)
  • Formats: Word (.docx)
  • What it exposes: Deleted text kept in w:del elements until changes are accepted or rejected, plus internal review notes and commenter names that stay in the file even after 'accept all changes'.
  • Risk: High
  • How MetaDocu removes it: Comment and revision parts are removed so no hidden review history ships.

Custom properties (custom.xml)
  • Formats: Word (.docx), Excel (.xlsx)
  • What it exposes: Bespoke fields added by DMS/templates (matter numbers, classifications, internal IDs).
  • Risk: Medium
  • How MetaDocu removes it: The custom properties part is cleared.

XMP metadata stream (/Metadata XMP packet: xmpMM:DocumentID, InstanceID, History)
  • Formats: PDF
  • What it exposes: A second copy of author/tool data plus document/instance IDs that survive even when the Info dictionary is cleared.
  • Risk: High
  • How MetaDocu removes it: The XMP packet is removed alongside the Info dictionary so no duplicate metadata remains.

Image EXIF, camera & software (EXIF Make/Model/Software/DateTimeOriginal in embedded images)
  • Formats: Embedded images
  • What it exposes: Camera make/model, capture time and editing software of photos embedded in the document.
  • Risk: Medium
  • How MetaDocu removes it: EXIF segments are byte-stripped from embedded images while keeping the picture intact.

Image GPS coordinates (EXIF GPSLatitude/GPSLongitude in embedded images)
  • Formats: Embedded images
  • What it exposes: The exact latitude/longitude where a photo was taken, which can pinpoint your home or office.
  • Risk: High
  • How MetaDocu removes it: GPS EXIF tags are wiped so no location ships with the file.
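
As an illustration of "emptied, not just hidden", the sketch below rewrites an OOXML package in memory and blanks the two identity elements in docProps/core.xml. It is an assumption-laden example in standard-library Python, not MetaDocu's browser implementation: the function name `clear_identity` is hypothetical, only the author and last-modified-by fields are handled, and the output has not been validated against every Office version.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"

# Keep the conventional prefixes when the part is re-serialised, so that
# attribute values such as xsi:type="dcterms:W3CDTF" still resolve.
ET.register_namespace("cp", CP)
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", "http://purl.org/dc/terms/")
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")

IDENTITY_TAGS = (f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy")


def clear_identity(ooxml_bytes: bytes) -> bytes:
    """Return a copy of the package with author/last-modified-by emptied."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == CORE_PART:
                root = ET.fromstring(data)
                for tag in IDENTITY_TAGS:
                    for node in root.iter(tag):
                        node.text = None  # empty the element, keep the tag
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Every other part is copied through byte-for-byte.
            dst.writestr(info.filename, data)
    return out.getvalue()
```

The design point is that the package is rebuilt rather than patched: each part is copied into a new ZIP, so the old values are not left behind in the container. A full cleaner would also need to cover app.xml, custom.xml, RSIDs, comments and embedded images, as listed in the map above.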

Frequently asked questions

How common is metadata leakage in documents?

It's the default rather than the exception, because the leak is automatic and invisible. Any Word, Excel or PDF file made with a signed-in Office account embeds the author's display name and often the company name without the user doing anything, and these fields never appear in the visible document, so they're rarely removed. PDFs compound it by storing metadata twice (Info dictionary plus XMP). Rather than cite an invented percentage, we describe the mechanisms that make exposure the default and provide a reproducible methodology to measure real rates on a defined sample.

Has document metadata ever caused a real privacy incident?

Yes. Metadata in published documents has repeatedly been used to identify authors, link supposedly independent files, and reveal editing timelines, which is why newsrooms, law firms and government bodies adopt metadata-removal policies. We deliberately avoid attaching specific names or numbers we can't verify here; the verifiable, general point is that author fields, timestamps and RSIDs provide exactly the kind of identifying and correlating signal that has driven those policies. The practical takeaway is to remove the data before sharing, which MetaDocu does locally in your browser.

How can I check what my own documents are leaking?

Run them through MetaDocu's scanner: drop a Word, Excel or PDF into the tool and it lists every populated metadata field — author, company, last-modified-by, paths, timestamps, RSIDs, XMP, embedded-image EXIF/GPS — in your browser, with nothing uploaded. That's both a privacy check and the same measurement step in the methodology above. You can then clear the fields in one click and download a verified-clean copy. Because it's local, you can audit even sensitive files without sending them anywhere.

Check what your documents are leaking

Scan any Word, Excel or PDF in your browser — see every exposed field, nothing uploaded.