Research & explainer
How Common Is Document Metadata Leakage? A Study
Last updated: 2026-06-09
Document metadata leakage is common because it's automatic and invisible: you don't add it, you can't see it on the page, and most people never remove it. By default, a Word or Excel file created with a signed-in Office account records the account holder's display name as the author, and a PDF exported from it inherits that name; the company name associated with your Office licence is often written in too. PDFs also record the producing software and the creation and modification timestamps, both in the Info dictionary and in a duplicate XMP stream. None of this shows when you read the document, so it ships unnoticed. This page explains the mechanisms that make leakage the default, catalogues the specific fields each format carries (with their technical locations), and describes a transparent methodology for quantifying real-world exposure rates. It uses only verifiable, mechanism-level facts, with no invented statistics.
Why leakage is the default, not the exception
- Authorship is automatic: Office writes your account display name into dc:creator without asking.
- It's invisible: none of these fields render in the document body, so there's no visual cue to remove them.
- It's duplicated: PDFs store metadata in both the Info dictionary and an XMP stream, so a partial clean leaves a copy.
- It survives edits: RSIDs persist in the XML across saves, and comments and reviewer names can remain after 'accept all changes', which resolves tracked edits but does not delete comments.
- Conversion carries it: 'Save as PDF' transfers the original author and timestamps rather than removing them.
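The first two points can be checked by hand on any .docx or .xlsx, since both are ZIP archives. The following is a minimal sketch using only the Python standard library, not MetaDocu's implementation. It assumes the property parts sit at their conventional locations (docProps/core.xml and docProps/app.xml); the function name `read_ooxml_properties` and the `ep` prefix are illustrative choices.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OOXML core and extended property parts.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    # "ep" is a local prefix chosen for this sketch (app.xml uses a default namespace).
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}

CORE_FIELDS = ["dc:creator", "cp:lastModifiedBy", "cp:revision",
               "dcterms:created", "dcterms:modified"]
APP_FIELDS = ["ep:Company", "ep:Manager", "ep:Template",
              "ep:Application", "ep:AppVersion", "ep:TotalTime"]


def read_ooxml_properties(source):
    """Return the non-empty property fields of a .docx/.xlsx as a dict.

    `source` is a file path or a binary file-like object.
    Assumption: the parts live at their conventional paths.
    """
    found = {}
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        for part, fields in (("docProps/core.xml", CORE_FIELDS),
                             ("docProps/app.xml", APP_FIELDS)):
            if part not in names:
                continue
            root = ET.fromstring(zf.read(part))
            for field in fields:
                node = root.find(field, NS)  # direct child of the root element
                if node is not None and node.text and node.text.strip():
                    found[field] = node.text.strip()
    return found
```

Calling `read_ooxml_properties("report.docx")` on a file of your own (the filename is a placeholder) returns only the fields that are actually populated, which is the same "non-empty field" notion the methodology below relies on.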
What each format carries
The table below is the field-level map MetaDocu works from — the specific hidden fields in Word, Excel, PDF and embedded images, where each physically lives, what it exposes, and how it's removed. These are structural facts about the OOXML and PDF formats, independently verifiable by inspecting any file's XML or object tree.
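For PDFs, a rough first inspection is possible without a PDF library. The sketch below scans the raw bytes for common Info dictionary keys and for an XMP packet marker. It is an approximation under stated assumptions: it will miss metadata stored inside compressed object streams or encrypted files, and a key name appearing elsewhere in the file would be reported too. The function name `pdf_metadata_markers` and the choice of keys are illustrative.

```python
import re

# Info dictionary keys that carry identity, tool, or timing data.
INFO_KEYS = (b"/Author", b"/Creator", b"/Producer", b"/CreationDate", b"/ModDate")


def pdf_metadata_markers(data: bytes) -> dict:
    """Report which metadata markers appear in the raw bytes of a PDF.

    Heuristic only: this is a byte scan, not a parse of the object tree.
    """
    info_keys = [
        key.decode("ascii")
        for key in INFO_KEYS
        # The lookahead stops a key from matching as a prefix of a longer name.
        if re.search(re.escape(key) + rb"(?![A-Za-z])", data)
    ]
    has_xmp = b"<x:xmpmeta" in data or b"<?xpacket begin" in data
    return {"info_keys": info_keys, "has_xmp_packet": has_xmp}
```

If `info_keys` is empty but `has_xmp_packet` is true, the file is a candidate for the partial-clean case described above, where one copy of the metadata was removed and the other was left behind. A full check needs a real PDF parser.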
Methodology for quantifying exposure (transparent, reproducible)
- Sample: collect a defined set of publicly available documents (e.g. resumes, public-body PDFs) under their licences/terms.
- Measure locally: run each file through MetaDocu's scanner in the browser and record which fields are non-empty.
- Aggregate: report the share of files exposing a real name, a company, a local path, GPS, or revision history.
- Publish method + numbers together, so the rates are reproducible rather than asserted.
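The aggregation step can be sketched as follows. This is a hypothetical helper, not part of MetaDocu: it assumes each scanned file has already been reduced to the set of exposure categories it shows, and the category names are labels invented for this example. Deciding whether a populated author field really is a "real name" remains a judgement made during measurement.

```python
from collections import Counter

# Category labels are illustrative; they mirror the list in the Aggregate step.
CATEGORIES = ("real_name", "company", "local_path", "gps", "revision_history")


def exposure_rates(scan_results):
    """Return the share of files exposing each category.

    `scan_results` is a list with one entry per file; each entry is the
    set of category names that file exposes (an empty set means clean).
    """
    total = len(scan_results)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter()
    for exposed in scan_results:
        # Count each category at most once per file; ignore unknown labels.
        counts.update(set(exposed) & set(CATEGORIES))
    return {category: counts[category] / total for category in CATEGORIES}
```

Reporting the denominator (the number of files in the sample) alongside each share keeps the published rates reproducible.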
The field-level exposure map (all formats)
Each hidden field, its technical location, what it leaks, and how MetaDocu removes it — verifiable structural facts.
| Hidden field | Technical location | Formats | What it exposes | Risk | How MetaDocu removes it |
|---|---|---|---|---|---|
| Author / Creator | dc:creator · PDF /Author | Word (.docx), Excel (.xlsx), PDF | The real name (or Office sign-in name) of whoever first created the file, often your full legal name. | High | Cleared from the OOXML core properties / PDF Info dictionary in browser memory; the field is emptied, not just hidden. |
| Last Modified By | cp:lastModifiedBy | Word (.docx), Excel (.xlsx) | The name of the last person to save the file, which exposes internal reviewers and collaboration chains. | High | Stripped from the core properties XML so no editor identity remains. |
| Company | Company (app.xml) | Word (.docx), Excel (.xlsx) | The organization name carried over from your Office licence or installation, which can reveal your employer even on a personal document. | Medium | Removed from the extended (app) properties part. |
| Manager | Manager (app.xml) | Word (.docx), Excel (.xlsx) | The manager name some templates embed, which leaks your reporting line. | Medium | Cleared from the extended properties. |
| Template path | Template (app.xml) | Word (.docx), Excel (.xlsx) | The attached template's name or, in some files, an absolute file path (e.g. C:\Users\<you>\…), which leaks your account name and local folder layout. | High | Path is wiped so no local filesystem clue ships with the file. |
| Application & version | Application/AppVersion · PDF /Producer · /Creator | Word (.docx), Excel (.xlsx), PDF | The exact software and version used, a fingerprint for targeting known vulnerabilities or deanonymizing authors. | Low | Normalized/removed from app properties and the PDF Producer/Creator fields. |
| Revision number | cp:revision | Word (.docx), Excel (.xlsx) | How many times the file was saved, which hints at how heavily a 'final' document was reworked. | Low | Reset in the core properties. |
| Total editing time | TotalTime (app.xml) | Word (.docx), Excel (.xlsx) | Cumulative minutes spent editing, which can contradict claims about when or how long work was done. | Low | Zeroed out in the extended properties. |
| Created / Modified dates | dcterms:created/modified · PDF /CreationDate /ModDate | Word (.docx), Excel (.xlsx), PDF | Precise creation and last-edit timestamps, which build a timeline of your activity. | Medium | Removed or reset so no editing timeline leaks. |
| Title / Subject / Keywords | dc:title, dc:subject, cp:keywords · PDF /Title /Subject /Keywords | Word (.docx), Excel (.xlsx), PDF | Internal codenames, client names, or tags left in the properties even when not shown in the document text. | Medium | Cleared from both OOXML properties and the PDF Info dictionary. |
| Revision save IDs (RSID) | w:rsid in settings.xml + run-level rsids | Word (.docx) | Random per-editing-session IDs that let two documents be linked to a shared editing history, and so to the same author or machine, across files. | Medium | RSID nodes are physically stripped from the document XML, breaking cross-file correlation. |
| Tracked changes & comments | w:ins/w:del, comments.xml, people.xml | Word (.docx) | Deleted text, internal review notes and commenter names. Unresolved tracked changes keep deleted text in the XML, and comments and reviewer names can remain after 'accepting all', because that step does not delete comments. | High | Comment and revision parts are removed so no hidden review history ships. |
| Custom properties | custom.xml | Word (.docx), Excel (.xlsx) | Bespoke fields added by DMS/templates (matter numbers, classifications, internal IDs). | Medium | The custom properties part is cleared. |
| XMP metadata stream | /Metadata XMP packet (xmpMM:DocumentID, InstanceID, History) | PDF | A second copy of author/tool data plus document/instance IDs that survive even when the Info dictionary is cleared. | High | The XMP packet is removed alongside the Info dictionary so no duplicate metadata remains. |
| Image EXIF (camera & software) | EXIF Make/Model/Software/DateTimeOriginal in embedded images | Embedded images | Camera make/model, capture time and editing software of photos embedded in the document. | Medium | EXIF segments are byte-stripped from embedded images while keeping the picture intact. |
| Image GPS coordinates | EXIF GPSLatitude/GPSLongitude in embedded images | Embedded images | The exact latitude/longitude where a photo was taken, which can pinpoint your home or office. | High | GPS EXIF tags are wiped so no location ships with the file. |
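The Word-specific rows (RSIDs, tracked changes, comments) can be verified the same way as the property fields. The sketch below is an illustration rather than MetaDocu's scanner: it counts tracked-change markers and collects RSID values from word/document.xml, and notes whether a comments part exists. It assumes the conventional part names, and the function name `word_review_residue` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the table above).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def word_review_residue(source):
    """Summarise review residue in a .docx given a path or file-like object."""
    report = {"insertions": 0, "deletions": 0, "rsids": set(),
              "has_comments_part": False}
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        report["has_comments_part"] = "word/comments.xml" in names
        if "word/document.xml" not in names:
            return report
        root = ET.fromstring(zf.read("word/document.xml"))
        for element in root.iter():
            if element.tag == f"{{{W}}}ins":
                report["insertions"] += 1
            elif element.tag == f"{{{W}}}del":
                report["deletions"] += 1
            for name, value in element.attrib.items():
                # Matches w:rsidR, w:rsidRPr, w:rsidP, w:rsidDel and similar.
                if name.startswith(f"{{{W}}}rsid"):
                    report["rsids"].add(value)
    return report
```

Comparing the `rsids` sets of two files is one way to see the cross-file correlation described in the RSID row: a large overlap suggests the documents share an editing history.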
Frequently asked questions
How common is metadata leakage in documents?
It's the default rather than the exception, because the leak is automatic and invisible. A Word or Excel file made with a signed-in Office account embeds the author's display name, and often a company name, without the user doing anything, and a PDF exported from it carries the author across. These fields never appear in the visible document, so they're rarely removed. PDFs compound it by storing metadata twice (Info dictionary plus XMP). Rather than cite an invented percentage, we describe the mechanisms that make exposure the default and provide a reproducible methodology to measure real rates on a defined sample.
Has document metadata ever caused a real privacy incident?
Yes — metadata in published documents has repeatedly been used to identify authors, link supposedly independent files, and reveal editing timelines, which is why newsrooms, law firms and government bodies adopt metadata-removal policies. We deliberately avoid attaching specific names or numbers we can't verify here; the verifiable, general point is that author fields, timestamps and RSIDs provide exactly the kind of identifying and correlating signal that has driven those policies. The practical takeaway is to remove the data before sharing, which this tool does locally.
How can I check what my own documents are leaking?
Run them through MetaDocu's scanner: drop a Word, Excel or PDF into the tool and it lists every populated metadata field — author, company, last-modified-by, paths, timestamps, RSIDs, XMP, embedded-image EXIF/GPS — in your browser, with nothing uploaded. That's both a privacy check and the same measurement step in the methodology above. You can then clear the fields in one click and download a verified-clean copy. Because it's local, you can audit even sensitive files without sending them anywhere.