This page provides practical reference metrics aligned to the Electronic Discovery Reference Model (EDRM). It is designed for attorneys, paralegals, and litigation support teams who need real-world planning data for digital evidence, email collections, and eDiscovery production.

These benchmarks help answer common planning questions: how much data each custodian is likely to hold, how long processing and review will take, and what volumes to expect at production.

Important: These figures represent industry-standard planning estimates used in litigation support and eDiscovery operations. Actual volumes and timelines vary based on case scope, data types, custodians, and technical environments. This page is provided for operational reference only and does not constitute legal advice.

EDRM Framework Overview

The Electronic Discovery Reference Model defines the lifecycle of digital evidence in litigation. This page provides reference metrics and operational standards for the stages most relevant to volume, cost, defensibility, and trial production.

  • Identification
  • Preservation
  • Collection
  • Processing
  • Review
  • Analysis
  • Production
  • Presentation

Identification & Preservation Reference Metrics

Identification and preservation establish the scope of data subject to legal hold. These metrics help estimate the complexity and breadth of ESI collection efforts.

Metric | Typical Range | Notes
------ | ------------- | -----
Custodian count per case | 1 – 50+ | Employees, executives, vendors, and third parties
Email accounts per custodian | 1 – 4 | Includes work, personal, and legacy accounts
Cloud platforms per case | 2 – 10 | Microsoft 365, Google Workspace, iCloud, Slack, Dropbox, etc.
Preservation window (time span) | 3 – 10+ years | Litigation holds often span multiple matters and employment changes
Metadata fields preserved | 100 – 300+ | Sender, recipient, timestamps, IPs, file hashes, system properties
Mobile devices per custodian | 1 – 3 | Work phone, personal phone, tablets

Preservation Standard

All ESI must be preserved in a forensically defensible format that maintains timestamps, authorship, file integrity, and native metadata. Preservation must guard against spoliation caused by automated deletion, overwrites, or routine business operations.

Why This Matters

Accurate identification determines collection scope and cost. Underestimating custodian count or data sources can lead to inadequate preservation, missed deadlines, and sanctions. A case with 20 custodians across 5 platforms may require coordination with IT, HR, and external vendors to ensure complete preservation.

Collection Reference Metrics

Collection involves extracting ESI from source systems while maintaining integrity and defensibility. These metrics help estimate data volumes and storage requirements.

Email Collection Metrics

Metric | Typical Range | Notes
------ | ------------- | -----
Emails per GB (without attachments) | 75,000 – 100,000 | Plain text and simple HTML emails
Emails per GB (with attachments) | 10,000 – 30,000 | Realistic estimate for typical business email
Attachments per email (average) | 1 – 3 | Large impact on storage size and processing complexity
Email storage per custodian | 5 – 50 GB | Varies by role, tenure, and archiving practices
Legacy PST files per custodian | 2 – 10 | Historical archives, offline storage, personal backups
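
As a quick planning aid, the arithmetic above can be wrapped in a small helper. This is a minimal Python sketch using only the ranges from the table; the function name and profile labels are illustrative, not a standard tool.

```python
# Planning heuristic: convert mailbox size to an estimated message
# count using the emails-per-GB ranges from the table above.

EMAILS_PER_GB = {
    "no_attachments": (75_000, 100_000),
    "with_attachments": (10_000, 30_000),
}

def estimate_email_count(mailbox_gb: float,
                         profile: str = "with_attachments") -> tuple[int, int]:
    """Return a (low, high) estimated message count for one mailbox."""
    low, high = EMAILS_PER_GB[profile]
    return int(mailbox_gb * low), int(mailbox_gb * high)

# Example: a 30 GB custodian mailbox with a typical attachment load
low, high = estimate_email_count(30)
print(f"Expect roughly {low:,} to {high:,} messages")  # 300,000 to 900,000
```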

File System Collection Metrics

Metric | Typical Range | Notes
------ | ------------- | -----
Files per GB | 3,000 – 20,000 | Depends on file types and compression
Network share storage per custodian | 10 – 100 GB | Department shares, project folders, personal drives
Cloud account exports (per custodian) | 1 – 500+ GB | Email, drive, chat, calendar, and backups
Mobile device storage | 32 – 512 GB | Includes deleted data, hidden files, app data
Slack/Teams chat data (per user per year) | 500 MB – 5 GB | Messages, files, channels; varies by usage patterns

Collection Standard

Collections must generate cryptographic hash values (MD5, SHA-1, or SHA-256), maintain chain-of-custody logs, preserve custodian records, and create audit trails for every file collected. All collection tools must operate in read-only mode to prevent data alteration.
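
The hashing and logging requirements above can be illustrated with a short Python sketch built on the standard library's hashlib. The record fields (custodian, source, and so on) are assumptions for illustration; real collection tools produce far richer logs.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def custody_entry(path: Path, custodian: str, source: str) -> dict:
    """Build one chain-of-custody record for a collected file.
    Field names here are illustrative, not a formal standard."""
    return {
        "file": str(path),
        "custodian": custodian,
        "source": source,
        "sha256": sha256_file(path),
        "size_bytes": path.stat().st_size,
        "collected_utc": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Demo: hash this script itself and print its custody record
    print(json.dumps(custody_entry(Path(__file__), "J. Doe", "demo"), indent=2))
```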

Real-World Note: Data Variance

Email volumes vary dramatically by role. Executives and sales personnel often have 50–100 GB per mailbox, while operational staff may have 5–10 GB. Cloud platform usage has increased average data volumes by 2–3x over the past five years. Always collect test samples before estimating full-case volumes.

Why This Matters

Collection metrics drive storage planning, processing timelines, and cost estimates. A case with 10 custodians at 30 GB each yields 300 GB of raw data, which may expand to 600–900 GB once archives are unpacked during processing, before deduplication and filtering bring the volume back down. Underestimating collection volumes can delay case timelines and exceed budgets.

Processing & Normalization Metrics

Processing transforms raw ESI into reviewable, searchable documents while maintaining metadata and applying filters to reduce volume.

Processing Efficiency Metrics

Metric | Typical Range | Notes
------ | ------------- | -----
Deduplication reduction | 10% – 70% | Depends on custodian overlap and document types
Near-duplicate reduction | 5% – 30% | Email threading and similar files with minor variations
Data expansion (archive unpacking) | 1.2× – 3× | ZIP, PST, RAR, and compressed archives inflate during extraction
System files filtered | 15% – 40% | Operating system files, temp files, executables
Processing speed | 50 – 500 GB/day | Based on infrastructure, file complexity, and quality checks
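
These factors compose into a rough volume model: raw data expands as archives unpack, then shrinks as system files are filtered and duplicates are removed. A minimal sketch, with illustrative midpoint defaults drawn from the table above:

```python
def processed_volume_gb(raw_gb: float,
                        expansion: float = 2.0,        # archive unpacking: 1.2x - 3x
                        system_filter: float = 0.25,   # system files removed: 15% - 40%
                        dedup: float = 0.40) -> float:  # duplicates removed: 10% - 70%
    """Rough review-ready volume after processing; the factors are
    planning assumptions, not measurements."""
    expanded = raw_gb * expansion
    after_filter = expanded * (1 - system_filter)
    return after_filter * (1 - dedup)

# Example: a 300 GB raw collection (10 custodians x 30 GB)
print(f"{processed_volume_gb(300):.0f} GB review-ready")  # ~270 GB
```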

Text Extraction & OCR Metrics

Document Type | OCR Success Rate | Notes
------------- | ---------------- | -----
Modern scanned documents (clean, B&W) | 95% – 99% | Standard office documents, good scan quality
Color documents with graphics | 90% – 95% | Charts, images, mixed content
Legacy or degraded documents | 85% – 92% | Fading, skew, poor contrast
Faxes and photocopies | 75% – 85% | Noise, distortion, multiple-generation copies
Handwritten documents | Low / Unreliable | Typically requires manual review or specialized tools

Processing Standard

All files must be decompressed, deduplicated, text-extracted, and normalized into searchable, review-ready formats. Processing logs must document every transformation, extraction failure, and quality check. Hash values must be preserved throughout processing to ensure file integrity.
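
Hash-based deduplication, described above and in the glossary, reduces to grouping files by digest. A minimal sketch; note that duplicates are logged rather than deleted, since processing must remain auditable:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def _sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate(paths: list[Path]) -> tuple[dict[str, Path], dict[str, list[Path]]]:
    """Keep the first file seen per hash; record the rest as duplicates."""
    keep: dict[str, Path] = {}
    dupes: dict[str, list[Path]] = defaultdict(list)
    for p in paths:
        h = _sha256(p)
        if h in keep:
            dupes[h].append(p)  # suppressed from review, retained in the log
        else:
            keep[h] = p
    return keep, dict(dupes)
```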

Real-World Note: Processing Variability

Processing timelines vary significantly based on file mix. A dataset with many PST archives and ZIP files will take 2–3x longer than native Office documents. OCR quality directly impacts review efficiency—poor OCR often requires manual review or reprocessing, adding weeks to case timelines.

Why This Matters

Processing efficiency determines review costs and timelines. A 500 GB collection that deduplicates to 200 GB (60% reduction) saves substantial review time and cost. However, data expansion from archives can offset these gains. Understanding processing metrics helps set realistic deadlines and budgets.

Processing Output & Volume

Metric | Typical Range | Notes
------ | ------------- | -----
Documents per GB (after processing) | 20,000 – 100,000 | Varies by file type mix and metadata
Text extraction size increase | +5% – 15% | Extracted text stored separately from native files
Processing error rate | 0.1% – 2% | Corrupted files, password-protected documents, unsupported formats
Metadata completeness | 85% – 98% | Depends on source systems and collection methods

Review & Analysis Benchmarks

Review involves human assessment of processed documents for relevance, privilege, and responsiveness. These benchmarks help estimate review timelines and resource requirements.

Review Volume Metrics

Metric | Typical Range | Notes
------ | ------------- | -----
Documents per GB (reviewable) | 20,000 – 100,000 | After processing and filtering
Reviewer throughput (simple documents) | 1,500 – 2,000 docs/day | Email, short memos, simple correspondence
Reviewer throughput (complex documents) | 500 – 1,000 docs/day | Contracts, technical documents, spreadsheets
Quality control (QC) sampling rate | 5% – 10% | Documents reviewed by senior reviewers for consistency

Review Yield Metrics

Category | Typical Rate | Notes
-------- | ------------ | -----
Relevance rate | 2% – 20% | Documents responsive to discovery requests
Privilege rate | 1% – 10% | Attorney-client, work product, confidential communications
Hot documents (key evidence) | 0.1% – 2% | Critical documents for case strategy
Not relevant | 70% – 95% | Documents not responsive to requests

Review Duration Estimates

Review Size | Typical Duration | Assumptions
----------- | ---------------- | -----------
Small review (10,000 – 50,000 docs) | 2 – 4 weeks | 2–3 reviewers, moderate complexity
Medium review (50,000 – 250,000 docs) | 6 – 12 weeks | 5–10 reviewers, mixed document types
Large review (250,000 – 1M docs) | 3 – 6 months | 15–30 reviewers, complex case issues
Mega review (1M+ docs) | 6 – 18+ months | Large teams, may use AI/TAR for acceleration

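These durations follow directly from throughput arithmetic. A minimal sketch; the QC inflation factor and the 5-day work week are assumptions consistent with the tables above:

```python
def review_weeks(doc_count: int, reviewers: int, docs_per_reviewer_day: int,
                 qc_rate: float = 0.075) -> float:
    """Estimate first-pass review duration in 5-day weeks, inflating
    volume by the QC sampling rate (5% - 10% per the table above)."""
    effective_docs = doc_count * (1 + qc_rate)
    days = effective_docs / (reviewers * docs_per_reviewer_day)
    return days / 5

# Example: 250,000 mixed documents, 7 reviewers at 1,000 docs/day
print(f"{review_weeks(250_000, 7, 1_000):.1f} weeks")  # ~7.7 weeks
```
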
Review Standard

All documents must be searchable, filterable, taggable, and auditable. Review platforms must track reviewer decisions, timestamps, and changes. Quality control processes must verify consistency and accuracy. Privilege logs must be maintained for all withheld documents.
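
To make the tracking requirement concrete, here is what one auditable reviewer decision might look like as a data structure. This is a hypothetical record layout, not any particular review platform's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ReviewDecision:
    """One immutable audit-trail entry for a tagging decision."""
    doc_id: str
    reviewer: str
    tag: str                         # e.g. "Responsive", "Privileged", "Hot"
    previous_tag: str | None = None  # preserved so tag changes stay traceable
    decided_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

audit_log: list[ReviewDecision] = []
audit_log.append(ReviewDecision("DOC-000123", "reviewer1", "Responsive"))
audit_log.append(ReviewDecision("DOC-000123", "qc_lead", "Privileged",
                                previous_tag="Responsive"))
```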

Real-World Note: Review Variability

Review speed varies dramatically by document complexity and case familiarity. First-pass reviewers on unfamiliar topics may achieve 500 docs/day, while experienced reviewers on routine matters can exceed 2,000 docs/day. Technology-assisted review (TAR) can reduce review volumes by 40–70% when properly deployed.

Why This Matters

Review represents the largest variable cost in eDiscovery. A 500,000 document review with a 5% relevance rate produces 25,000 responsive documents. Understanding yield rates helps budget accurately and plan production timelines. Low relevance rates may indicate over-collection or broad preservation.
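
The yield arithmetic generalizes into a one-function budget check. The default rates below are illustrative midpoints from the yield table, not recommendations:

```python
def review_yield(doc_count: int, relevance: float = 0.05,
                 privilege: float = 0.03, hot: float = 0.005) -> dict:
    """Project document counts from yield rates (planning estimate only)."""
    return {
        "responsive": round(doc_count * relevance),
        "privileged": round(doc_count * privilege),
        "hot": round(doc_count * hot),
    }

print(review_yield(500_000))
# {'responsive': 25000, 'privileged': 15000, 'hot': 2500}
```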

Production & Presentation Metrics

Production involves packaging responsive documents with metadata for delivery to opposing counsel or regulatory agencies. Presentation focuses on trial exhibits and courtroom use.

Production Volume & Format

Metric | Typical Range | Notes
------ | ------------- | -----
Load file size per 100,000 docs | 1 – 10 GB | Includes metadata, extracted text, and control files
Native file production ratio | 5% – 30% | Spreadsheets, databases, CAD files, videos preserved in native format
Image file production ratio | 70% – 95% | TIFF or PDF images with metadata and text files
Pages per GB (PDF production) | 20,000 – 40,000 | After flattening, OCR, and Bates stamping
Bates numbering speed | 10,000 – 50,000 pages/hour | Depends on image quality and numbering complexity
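
Bates numbering itself is simple arithmetic: a fixed prefix plus a zero-padded sequential counter. A minimal sketch; the prefix and padding width are matter-specific assumptions to confirm against the ESI protocol:

```python
def bates_range(prefix: str, start: int, count: int, width: int = 8):
    """Yield sequential Bates numbers such as ABC00000001."""
    for n in range(start, start + count):
        yield f"{prefix}{n:0{width}d}"

print(list(bates_range("ABC", start=1, count=3)))
# ['ABC00000001', 'ABC00000002', 'ABC00000003']
```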

Trial Presentation Metrics

Item | Typical Range | Notes
---- | ------------- | -----
Color pages in trial sets | 5% – 25% | Photos, charts, highlights, key documents
Trial binders per case | 10 – 500+ | Depends on exhibits, parties, and trial duration
Exhibits per day of trial | 10 – 50 | Varies by case complexity and attorney style
Trial exhibit preparation time | 2 – 6 weeks | From exhibit list finalization to courtroom delivery

Production Standard

All productions must be reproducible, hash-verifiable, metadata-preserving, and court-compliant. Productions must include load files, the metadata fields required by the requesting party, and technical specifications documentation. Bates numbering must be consistent, sequential, and properly prefixed or suffixed.
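
Load files are plain delimited text. The sketch below writes a Concordance-style DAT using commonly seen default delimiters (þ as text qualifier, the DC4 control character as field separator); the delimiters and field names are assumptions for illustration, and the governing ESI protocol always controls:

```python
# Assumed Concordance-style defaults; confirm against the production spec.
QUOTE, SEP = "\u00fe", "\u0014"  # þ text qualifier, DC4 field separator

def dat_line(fields: list[str]) -> str:
    """Render one delimited load-file row."""
    return SEP.join(f"{QUOTE}{f}{QUOTE}" for f in fields)

header = ["BEGBATES", "ENDBATES", "CUSTODIAN", "SHA256"]   # illustrative fields
row = ["ABC00000001", "ABC00000004", "J. Doe", "9f2c..."]  # placeholder values

with open("production.dat", "w", encoding="utf-8") as f:
    f.write(dat_line(header) + "\n")
    f.write(dat_line(row) + "\n")
```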

Real-World Note: Production Formats

Production format requirements vary by jurisdiction and agreement. While TIFF with load files was once standard, PDF with metadata is increasingly common for its convenience and reduced file size. Native production of spreadsheets with embedded formulas is now routine. Always confirm format specifications before processing begins.

Why This Matters

Production metrics determine delivery timelines and storage requirements. A 100,000-page production in single-page TIFF can be roughly twice the size of the same production as searchable PDF. Load file preparation and QC typically add 2–5 days to production timelines. Trial exhibit preparation should begin 4–6 weeks before trial to allow for revisions and contingencies.

Production File Formats

Format | Typical Use Case | Considerations
------ | ---------------- | --------------
Single-Page TIFF + Load File | Traditional litigation production | Large file sizes, universal compatibility, established standard
Multi-Page Searchable PDF | Modern productions, smaller files | Reduced storage, easier handling, requires PDF reader
Native Files with Metadata | Spreadsheets, databases, CAD, video | Preserves functionality, requires specialized software
ESI Protocol Specified Format | As agreed by parties | Follow court orders and agreements precisely

Chain-of-Custody Standards

At every stage of the EDRM lifecycle, defensibility requires comprehensive documentation:

  • Every file is logged: Complete inventory with hash values, source locations, and custodian attribution
  • Every handoff is recorded: Transfer documentation between collection, processing, review, and production teams
  • Every transformation is documented: Processing logs showing deduplication, filtering, OCR, and format conversions
  • Every production is reproducible: Complete audit trail allowing recreation of production at any future date

This ensures defensibility in the event of a court challenge, sanctions motion, or production dispute. Proper chain-of-custody documentation protects against spoliation allegations and demonstrates good-faith compliance with discovery obligations.
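
Reproducibility can be spot-checked by re-hashing files against the collection log. A minimal sketch, assuming a JSON-lines log in the shape of the collection example earlier (one record per file with "file" and "sha256" fields; both are assumptions of this page's sketches, not a standard format):

```python
import hashlib
import json
from pathlib import Path

def verify_against_log(log_path: Path) -> list[str]:
    """Return paths whose current SHA-256 no longer matches the log."""
    mismatches = []
    for line in log_path.read_text().splitlines():
        entry = json.loads(line)  # assumed fields: "file", "sha256"
        current = hashlib.sha256(Path(entry["file"]).read_bytes()).hexdigest()
        if current != entry["sha256"]:
            mismatches.append(entry["file"])
    return mismatches
```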

Glossary

Common terms and definitions used in eDiscovery and digital evidence management:

ESI
Electronically Stored Information – Any data created, stored, or transmitted in electronic form
Custodian
A person or system that controls data subject to preservation or collection
Deduplication
Removal of identical files or documents based on hash value comparison
Hash Value
Cryptographic fingerprint (MD5, SHA-1, SHA-256) uniquely identifying a file
OCR
Optical Character Recognition – Technology converting images to searchable text
Native File
Original file format (Excel, Word, MSG, etc.) preserving functionality and metadata
Load File
Structured data file (DAT, OPT, LFP) used by review platforms to import documents and metadata
Metadata
System information describing a file: author, dates, recipients, file properties, etc.
Bates Number
Sequential identifier stamped on each page for citation and tracking purposes
Chain of Custody
Documentation of evidence handling from collection through production
Privilege Log
List of documents withheld from production due to privilege or work product protection
TAR
Technology-Assisted Review – Machine learning tools to accelerate document review