RefPack Specification
Mission
RefPack establishes a standardized, secure, and self-contained format for distributing structured datasets. Our mission is to enable seamless data sharing across organizations, platforms, and tools while maintaining cryptographic integrity and eliminating dependency on external infrastructure.
By providing a ZIP-based packaging format with embedded cryptographic signatures, RefPack ensures that datasets can be validated, trusted, and consumed reliably in any environment - from air-gapped systems to cloud-native applications.
Why RefPack?
Modern data distribution faces several critical challenges:
- Fragmented formats: Teams use inconsistent packaging approaches, making data exchange difficult
- Security gaps: Many distribution methods lack cryptographic verification of data integrity and authenticity
- Dependency complexity: External key servers and validation infrastructure create single points of failure
- Version chaos: Inconsistent versioning makes it difficult to track dataset evolution and compatibility
- Trust issues: No standardized way to verify data provenance and publisher identity
RefPack solves these problems by providing:
- Standardized packaging: Consistent ZIP-based format with defined file structures
- Self-contained security: Embedded public keys eliminate external dependencies
- Semantic versioning: Built-in SemVer support for clear version management
- Offline validation: Complete verification without network connectivity
- Developer-friendly tooling: CLI tools that integrate into existing workflows
What's Inside a RefPack?
A RefPack is fundamentally a ZIP archive with a standardized internal structure:
/ ← Package root (no nested folders, except `assets/`)
├── data.meta.json ← REQUIRED, signed manifest
├── data.meta.json.jws ← REQUIRED, JWS signature over exact `data.meta.json` bytes
├── data.json ← REQUIRED, JSON array of objects
├── data.schema.json ← OPTIONAL, JSON-Schema for `data.json`
├── data.changelog.json ← OPTIONAL, versioned changelog
├── data.readme.md ← OPTIONAL, human-readable documentation
└── assets/ ← OPTIONAL, flat folder of supplemental files
    ├── image.png
    └── lookup.csv
Every RefPack contains three core components that work together to provide security, structure, and usability:
- Signed Manifest (`data.meta.json` + `data.meta.json.jws`): Package metadata with cryptographic integrity
- Structured Payload (`data.json`): The actual dataset as a JSON array
- Optional Documentation (schema, readme, changelog, assets): Supporting materials for understanding and using the data
Deep Dive: Core Components
1. Manifest: data.meta.json
The manifest serves as the authoritative metadata source for every RefPack. It contains essential information about the package identity, versioning, and provenance.
JSON Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "RefPack DatasetMeta",
  "type": "object",
  "required": ["id", "version", "title", "createdUtc"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^[A-Za-z0-9](?:[A-Za-z0-9_-]*[A-Za-z0-9])?$",
      "description": "Package identifier (alphanumeric, -, _, no spaces)."
    },
    "version": {
      "type": "string",
      "pattern": "^\\d+\\.\\d+\\.\\d+(?:-[0-9A-Za-z\\.]+)?$",
      "description": "SemVer 2.0.0 version string."
    },
    "title": {
      "type": "string",
      "minLength": 1,
      "description": "Human-readable title."
    },
    "description": {
      "type": "string",
      "description": "Optional long description."
    },
    "authors": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List of author names or organizations."
    },
    "createdUtc": {
      "type": "string",
      "format": "date-time",
      "description": "UTC timestamp of package creation."
    },
    "tags": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Free-form tags."
    },
    "license": {
      "type": "string",
      "description": "SPDX license identifier."
    }
  },
  "additionalProperties": false
}
Required Fields
- `id`: Unique package identifier following a strict naming convention
- `version`: SemVer 2.0.0 compliant version string
- `title`: Human-readable package name
- `createdUtc`: ISO 8601 UTC timestamp of package creation
Optional Fields
- `description`: Detailed package description
- `authors`: Array of author names or organizations
- `tags`: Free-form classification tags
- `license`: SPDX license identifier for legal clarity
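The core manifest rules can be approximated without a JSON Schema library. The following is a minimal sketch, not a full validator: `validate_manifest` is a hypothetical helper name, it covers only the required keys, the two patterns, the non-empty title, the timestamp, and the closed property set, and it does not check the element types of `authors` or `tags`.

```python
import re
from datetime import datetime

# Patterns and field names copied from the manifest schema.
ID_RE = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9_-]*[A-Za-z0-9])?$")
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z\.]+)?$", re.ASCII)
REQUIRED = ("id", "version", "title", "createdUtc")
OPTIONAL = ("description", "authors", "tags", "license")


def validate_manifest(meta: dict) -> list:
    """Return a list of problems; an empty list means every check passed."""
    errors = [f"missing required field: {k}" for k in REQUIRED if k not in meta]
    # additionalProperties is false, so unknown keys are rejected.
    errors += [f"unexpected field: {k}" for k in meta if k not in REQUIRED + OPTIONAL]

    def is_str(key):
        return isinstance(meta.get(key), str)

    if "id" in meta and not (is_str("id") and ID_RE.fullmatch(meta["id"])):
        errors.append("id: not a valid package identifier")
    if "version" in meta and not (is_str("version") and VERSION_RE.fullmatch(meta["version"])):
        errors.append("version: does not match the schema pattern")
    if "title" in meta and not (is_str("title") and meta["title"]):
        errors.append("title: must be a non-empty string")
    if "createdUtc" in meta:
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11.
            datetime.fromisoformat(meta["createdUtc"].replace("Z", "+00:00"))
        except (AttributeError, TypeError, ValueError):
            errors.append("createdUtc: not a parseable ISO 8601 timestamp")
    return errors
```

A production implementation would more likely hand the schema to a JSON Schema validator; the hand-written checks are only meant to make the rules concrete.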
2. Signature: data.meta.json.jws
RefPack uses JSON Web Signature (JWS) with embedded public keys to provide cryptographic integrity without external dependencies.
JWS Header Structure
The JWS header must contain an embedded JSON Web Key (JWK) with the public key:
{
  "alg": "ES256",              // Algorithm (ES256 recommended)
  "kid": "publisher-key-1",    // Key identifier
  "jwk": {                     // Embedded public key (JWK format)
    "kty": "EC",
    "crv": "P-256",
    "x": "...",                // Public key X coordinate
    "y": "...",                // Public key Y coordinate
    "use": "sig",              // Key usage: signing
    "key_ops": ["verify"]      // Allowed operations: verification only
  },
  "typ": "JWT"                 // Token type
}
JWS Payload (Claims)
The JWS contains standard JWT claims for security:
{
  "iat": 1640995200,   // Issued at (Unix timestamp)
  "exp": 1641002400,   // Expiration time (Unix timestamp, typically 2 hours)
  "jti": "refpack"     // JWT ID (must be "refpack")
}
Signature Generation Process
- Create JWS header with embedded public key (JWK format)
- Create JWS payload with standard JWT claims
- Sign over manifest bytes: The signature covers the exact bytes of `data.meta.json`
- Generate compact JWS: Standard RFC 7515 compact serialization
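The compact serialization in the last step can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the helper names are hypothetical, the `sign` argument is a placeholder for a real signer (for ES256 that would come from a cryptography library, which is not shown), and the payload bytes are left to the caller.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url without '=' padding, as RFC 7515 requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url(); restores the padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def signing_input(header: dict, payload: bytes) -> bytes:
    """ASCII bytes of BASE64URL(header) '.' BASE64URL(payload)."""
    encoded_header = b64url(json.dumps(header, separators=(",", ":")).encode("utf-8"))
    return f"{encoded_header}.{b64url(payload)}".encode("ascii")


def compact_serialize(header: dict, payload: bytes, sign) -> str:
    """Join header, payload, and signature into a three-part compact JWS.

    `sign` is a caller-supplied callable (hypothetical here) that maps the
    signing input to raw signature bytes.
    """
    data = signing_input(header, payload)
    return f"{data.decode('ascii')}.{b64url(sign(data))}"


def parse_compact(token: str):
    """Split a compact JWS into (header dict, payload bytes, signature bytes)."""
    encoded_header, encoded_payload, encoded_signature = token.split(".")
    header = json.loads(b64url_decode(encoded_header))
    return header, b64url_decode(encoded_payload), b64url_decode(encoded_signature)
```

`parse_compact` raises `ValueError` when the token does not have exactly three dot-separated parts, which is a convenient first rejection point for malformed `.jws` files.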
3. Payload: data.json
The payload contains the actual dataset and must always be a JSON array of objects. This constraint ensures consistent data structure across all RefPacks.
Structure Requirements
- Root type: Must be a JSON array (`[]`)
- Array elements: Must be JSON objects (`{}`)
- No top-level object: The root cannot be a single object
Example
[
  { "id": "US", "name": "United States", "population": 331002651 },
  { "id": "CA", "name": "Canada", "population": 37742154 },
  { "id": "MX", "name": "Mexico", "population": 128932753 }
]
This structure enables:
- Streaming processing of individual records
- Consistent iteration patterns across different tools
- Schema validation at the array level
- Efficient JSON parsing and manipulation
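The array-of-objects rule is simple to enforce at load time. A minimal sketch, where `check_payload` is a hypothetical helper name:

```python
import json


def check_payload(raw: bytes) -> list:
    """Parse data.json bytes and enforce the array-of-objects rule."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("data.json root must be a JSON array")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"element {index} is not a JSON object")
    return records
```

This version loads the whole file into memory; a streaming parser would be needed to benefit from record-by-record processing on very large payloads.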
Security & Validation
RefPack implements multiple layers of security validation to ensure package integrity and authenticity.
Cryptographic Validation
JWS Signature Verification
Clients and servers must:
- Parse JWS header and extract the embedded `jwk` field
- Validate JWK structure ensuring it contains only public key components
- Verify signature using the embedded public key over `BASE64URL(header) . BASE64URL(payload)`
- Validate JWT claims:
  - Check `exp` (expiration) if present
  - Verify `iat` (issued at) is not in the future (allow 5 minutes of clock skew)
  - Ensure `jti` equals "refpack"
- Verify manifest integrity: Ensure the JWS was signed over the exact bytes of `data.meta.json`
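The claim checks in this list can be sketched as a small function. `validate_claims` is a hypothetical name, and two choices here are assumptions rather than spec text: a missing `iat` is not flagged, and the clock-skew allowance is applied to `iat` only, not to `exp`.

```python
import time

CLOCK_SKEW_SECONDS = 5 * 60  # the 5 minute allowance for iat


def validate_claims(claims: dict, now=None) -> list:
    """Return a list of claim problems; empty means the claims are acceptable."""
    now = time.time() if now is None else now
    errors = []
    if claims.get("jti") != "refpack":
        errors.append('jti must equal "refpack"')
    for name in ("iat", "exp"):
        if name in claims and not isinstance(claims[name], (int, float)):
            errors.append(f"{name} must be a numeric Unix timestamp")
            return errors
    if "iat" in claims and claims["iat"] > now + CLOCK_SKEW_SECONDS:
        errors.append("iat is in the future")
    # exp is only checked when present; no skew is applied (an assumption).
    if "exp" in claims and now >= claims["exp"]:
        errors.append("signature has expired")
    return errors
```

Passing `now` explicitly keeps the function deterministic in tests; callers that omit it get the current system time.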
Key Security Requirements
- Embedded JWK must contain only public key components (no private key material)
- Private keys must never be included in packages or transmitted
- Key validation must verify JWK structure and key usage constraints
- Algorithm restrictions should limit to approved signing algorithms (ES256, ES384, ES512, EdDSA)
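A hedged sketch of these header checks follows. `check_header_key` is a hypothetical helper; the private-member list uses the JWK parameter names registered in RFC 7518 (`d` for EC and OKP keys, plus the RSA private values and the symmetric `k`), and treating a missing `use` or `key_ops` as acceptable is an assumption.

```python
# Approved algorithms from the requirement list.
ALLOWED_ALGS = {"ES256", "ES384", "ES512", "EdDSA"}
# JWK members that carry private or symmetric key material (RFC 7518).
PRIVATE_MEMBERS = {"d", "p", "q", "dp", "dq", "qi", "oth", "k"}


def check_header_key(header: dict) -> list:
    """Return a list of problems with the JWS header and its embedded JWK."""
    errors = []
    if header.get("alg") not in ALLOWED_ALGS:
        errors.append(f"algorithm not allowed: {header.get('alg')!r}")
    jwk = header.get("jwk")
    if not isinstance(jwk, dict):
        errors.append("header has no embedded jwk")
        return errors
    leaked = sorted(PRIVATE_MEMBERS.intersection(jwk))
    if leaked:
        errors.append(f"jwk contains private key members: {leaked}")
    # Absent use/key_ops are tolerated here; present values must permit verification.
    if jwk.get("use", "sig") != "sig":
        errors.append("jwk 'use' must be 'sig'")
    key_ops = jwk.get("key_ops")
    if key_ops is not None and "verify" not in key_ops:
        errors.append("jwk 'key_ops' must include 'verify'")
    return errors
```

The allowlist also rejects `"alg": "none"`, which matters because the key that verifies the signature arrives inside the same untrusted header.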
Structural Validation
ZIP Archive Security
- ZIP sanitization: Reject any entry with path traversal (`../`) or unexpected root paths
- Entry validation: Ensure only allowed files are present in the archive
- Size limits: Implement reasonable limits on archive and individual file sizes
- Compression validation: Verify ZIP compression integrity
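The entry checks can be sketched against Python's `zipfile` module. This is an illustration under assumptions: `check_archive` is a hypothetical helper, the allowed names come from the package layout shown earlier, and the 100 MiB per-file cap is an arbitrary example value, not a spec limit.

```python
import zipfile

ROOT_FILES = {
    "data.meta.json", "data.meta.json.jws", "data.json",
    "data.schema.json", "data.changelog.json", "data.readme.md",
}
REQUIRED_FILES = {"data.meta.json", "data.meta.json.jws", "data.json"}
MAX_FILE_BYTES = 100 * 1024 * 1024  # illustrative limit only


def check_archive(zf: zipfile.ZipFile) -> list:
    """Return a list of layout problems; inspects entry names without extracting."""
    errors = []
    seen = set()
    for info in zf.infolist():
        name = info.filename
        if info.is_dir():
            if name != "assets/":
                errors.append(f"unexpected directory: {name}")
            continue
        seen.add(name)
        parts = name.split("/")
        if name.startswith("/") or "\\" in name or ".." in parts:
            errors.append(f"unsafe path: {name}")
        elif len(parts) == 1 and name in ROOT_FILES:
            pass  # known root-level file
        elif len(parts) == 2 and parts[0] == "assets" and parts[1]:
            pass  # file directly inside the flat assets/ folder
        else:
            errors.append(f"unexpected entry: {name}")
        # file_size is the uncompressed size declared in the archive metadata.
        if info.file_size > MAX_FILE_BYTES:
            errors.append(f"entry too large: {name}")
    errors += [f"missing required file: {n}" for n in sorted(REQUIRED_FILES - seen)]
    return errors
```

Because the declared size can be forged, an extractor should also cap the bytes it actually reads per entry. `ZipFile.testzip()` covers the compression-integrity item by checking each entry's CRC.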
Schema Validation
- Manifest schema: Enforce JSON-Schema validation on `data.meta.json`
- Payload schema: Validate `data.json` against `data.schema.json` if present
- File format validation: Verify JSON syntax and structure for all JSON files
Versioning Security
- SemVer compliance: Enforce strict SemVer 2.0.0 format validation
- Version progression: New versions must be strictly greater than existing versions
- Pre-release handling: Implement appropriate restrictions on pre-release versions
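The version progression rule depends on SemVer precedence, which is not plain string comparison. The sketch below follows the precedence rules of SemVer 2.0.0; the helper names are hypothetical, the regex is the general SemVer shape (wider than the manifest schema's `version` pattern, and simplified in that it does not reject leading zeros in numeric pre-release identifiers), and refusing pre-releases by default is an assumption.

```python
import re

SEMVER_RE = re.compile(
    r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)"
    r"(?:-([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?"
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
)


def precedence_key(version: str):
    """Sort key implementing SemVer precedence; build metadata is ignored."""
    match = SEMVER_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"not a SemVer version: {version!r}")
    major, minor, patch = (int(g) for g in match.group(1, 2, 3))
    pre = match.group(4)
    if pre is None:
        return (major, minor, patch, 1, ())  # a release outranks its pre-releases
    # Numeric identifiers compare as numbers and rank below alphanumeric ones.
    ids = tuple((0, int(p), "") if p.isdigit() else (1, 0, p) for p in pre.split("."))
    return (major, minor, patch, 0, ids)


def is_prerelease(version: str) -> bool:
    return precedence_key(version)[3] == 0


def can_publish(candidate: str, published: list, allow_prerelease: bool = False) -> bool:
    """True when candidate is strictly greater than every published version."""
    if is_prerelease(candidate) and not allow_prerelease:
        return False
    key = precedence_key(candidate)
    return all(key > precedence_key(existing) for existing in published)
```

Sorting with `precedence_key` places `1.0.0-beta.2` before `1.0.0-beta.11`, which a plain string sort would get wrong.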
Trust Model
RefPack implements a decentralized trust model that doesn't rely on central authorities:
- Self-contained verification: All validation can occur offline using embedded keys
- Publisher identity: Trust is established through key fingerprints and out-of-band verification
- Key rotation: Publishers can rotate keys using new Key IDs (`kid`)
- Multi-signature support: Future versions may support multiple signatures for enhanced trust
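The text does not say how a key fingerprint is computed. One standard option is the JWK Thumbprint of RFC 7638, sketched below as an assumption rather than as RefPack's defined method; `jwk_thumbprint` is a hypothetical helper name.

```python
import base64
import hashlib
import json

# Members that RFC 7638 (and RFC 8037 for OKP keys) includes in the hash input.
THUMBPRINT_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "OKP": ("crv", "kty", "x"),
    "RSA": ("e", "kty", "n"),
}


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 JWK thumbprint, base64url-encoded without padding."""
    members = THUMBPRINT_MEMBERS[jwk["kty"]]
    # Canonical form: required members only, sorted keys, no whitespace.
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        separators=(",", ":"),
        sort_keys=True,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Only the key-defining members are hashed, so `use`, `key_ops`, and `kid` do not change the fingerprint. A publisher can share the resulting 43-character string out of band for consumers to compare.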
CLI Toolchain
The RefPack CLI provides a complete toolchain for creating, validating, and distributing RefPack archives.
Core Commands
Command | Description |
---|---|
`pack` | Validate folder, then create `<id>-<version>.refpack.zip` |
`validate` | Open a `.refpack.zip`, verify layout, schemas, and JWS. |
`push` | POST ZIP to `/packages`, expect JSON `{"success":true}`. |
`pull` | GET `/packages/{id}?version={v}`, saves ZIP or extracts folder. |
`meta` | GET `/packages/{id}/meta?version={v}`, prints JSON manifest. |
Packaging Workflow
# 1. Pack & sign locally with private key
refpack pack \
--input ./country-data/ \
--output country-1.0.0.refpack.zip \
--sign-key ~/.keys/publisher.pem \
--key-id publisher-2025-05-20
# 2. Validate (no JWKS URL needed - uses embedded public key)
refpack validate \
--package country-1.0.0.refpack.zip
# 3. Push to registry
refpack push \
--package country-1.0.0.refpack.zip \
--api-url https://api.refpack.example.com \
--api-key "$REFPACK_TOKEN"
# 4. Later, pull & inspect
refpack pull --id country --version 1.0.0 --dest ./downloads/
refpack meta --id country --version 1.0.0
Key Management Commands
The CLI includes specialized commands for cryptographic key management:
# Generate new signing key pair
refpack keygen \
--algorithm ES256 \
--key-id publisher-2025-05-20 \
--output ~/.keys/publisher.pem
# Extract public key for verification
refpack pubkey \
--private-key ~/.keys/publisher.pem \
--output ~/.keys/publisher.pub.json
# Verify signature with explicit public key
refpack verify \
--package country-1.0.0.refpack.zip \
--public-key ~/.keys/publisher.pub.json
Configuration Management
The CLI supports configuration files for streamlined workflows:
{
  "publisher": {
    "name": "Acme Data Corp",
    "keyId": "acme-2025-05-20",
    "keyFile": "~/.keys/acme.pem"
  },
  "registry": {
    "url": "https://api.refpack.example.com",
    "tokenFile": "~/.refpack/token"
  },
  "validation": {
    "strictMode": true,
    "allowPrerelease": false,
    "maxPackageSize": "100MB"
  }
}
Typical Use Cases
RefPack addresses a wide range of data distribution scenarios across different industries and use cases.
Data Science & Analytics
Scenario: Research teams sharing cleaned datasets for reproducible analysis
- Package: Clean, processed datasets with schema validation
- Benefits: Version control for data, cryptographic integrity, offline validation
- Workflow: Data scientists can pull specific dataset versions and trust their authenticity
API Reference Data
Scenario: Distributing country codes, currency lists, or other reference data for applications
- Package: Structured reference data with regular updates
- Benefits: Consistent format across applications, automated updates, schema enforcement
- Workflow: Applications can pull and cache reference data with confidence in data quality
Machine Learning Models
Scenario: Sharing training datasets and model artifacts between ML teams
- Package: Training data, validation sets, and model metadata
- Benefits: Reproducible model training, dataset lineage tracking, secure distribution
- Workflow: ML pipelines can pull specific dataset versions for consistent model training
Configuration Distribution
Scenario: Distributing application configuration or feature flags across environments
- Package: Environment-specific configuration data with change tracking
- Benefits: Auditable configuration changes, rollback capabilities, secure deployment
- Workflow: Deployment systems can pull and validate configuration updates
Compliance & Audit
Scenario: Financial institutions sharing regulatory data with audit trails
- Package: Regulatory datasets with cryptographic signatures for audit purposes
- Benefits: Non-repudiation, tamper evidence, compliance with data integrity requirements
- Workflow: Auditors can independently verify data authenticity and integrity
Open Data Publishing
Scenario: Government agencies publishing public datasets
- Package: Public datasets with comprehensive metadata and documentation
- Benefits: Standardized format, built-in documentation, version tracking
- Workflow: Citizens and researchers can access and trust official data sources
Getting Started (Spec)
This section provides implementation guidance for developers building RefPack-compatible tools and libraries.
Implementation Checklist
Core Requirements
- ZIP Archive Handling: Implement secure ZIP creation and extraction
- JSON Processing: Support for JSON parsing, validation, and schema enforcement
- JWS Cryptography: Implement JWS signing and verification with embedded JWK
- SemVer Validation: Enforce semantic versioning rules and comparison
- File Structure Validation: Verify RefPack internal structure requirements
Security Implementation
- Path Traversal Protection: Sanitize ZIP entry paths to prevent directory traversal
- Key Validation: Verify JWK structure and key usage constraints
- Signature Verification: Implement complete JWS verification workflow
- Schema Enforcement: Validate all JSON files against their respective schemas
- Size Limits: Implement reasonable limits on package and file sizes
Optional Features
- CLI Tool: Command-line interface for pack, validate, push, pull operations
- Registry Integration: HTTP client for package registry operations
- Streaming Support: Handle large packages with streaming extraction
- Multi-signature: Support for multiple signatures per package (future)
Library Integration
RefPack is designed to integrate seamlessly with existing data processing libraries:
Python Integration
import refpack
import pandas as pd
# Load RefPack into pandas DataFrame
with refpack.open('countries-1.0.0.refpack.zip') as package:
    df = pd.DataFrame(package.data)
    print(f"Loaded {package.meta.title} v{package.meta.version}")
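Where no RefPack library is available, the required entries can be read with the standard library alone. The sketch below is an assumption-laden illustration: `LoadedPack` and `load_refpack` are hypothetical names, and the function performs no signature, schema, or layout validation, so its result must not be trusted until those checks have run.

```python
import json
import zipfile
from dataclasses import dataclass


@dataclass
class LoadedPack:
    meta: dict         # parsed data.meta.json
    meta_bytes: bytes  # exact manifest bytes, needed for signature verification
    data: list         # parsed data.json
    jws: str           # compact JWS text from data.meta.json.jws


def load_refpack(source) -> LoadedPack:
    """Read the three required entries; `source` is a path or a binary file object.

    Raises KeyError if a required entry is missing from the archive.
    """
    with zipfile.ZipFile(source) as zf:
        meta_bytes = zf.read("data.meta.json")
        data = json.loads(zf.read("data.json"))
        jws = zf.read("data.meta.json.jws").decode("ascii").strip()
    return LoadedPack(meta=json.loads(meta_bytes), meta_bytes=meta_bytes, data=data, jws=jws)
```

Keeping `meta_bytes` alongside the parsed manifest matters because the signature covers the exact bytes of `data.meta.json`; re-serializing the parsed dictionary could change them.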
JavaScript Integration
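const refpack = require('refpack');
// Top-level await is unavailable in CommonJS, so wrap the calls in an async function.
async function main() {
  // Load and validate RefPack (`package` is reserved in strict mode, hence `pkg`)
  const pkg = await refpack.load('countries-1.0.0.refpack.zip');
  console.log(`Package: ${pkg.meta.title}`);
  console.log(`Records: ${pkg.data.length}`);
}
main().catch(console.error);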
REST API Integration
POST /packages HTTP/1.1
Content-Type: application/zip
Authorization: Bearer <token>

[ZIP file binary data]
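The same request can be built with Python's `urllib.request`. This is a sketch, not an official client: `build_push_request` and `push` are hypothetical names, and the expectation of a `{"success": true}` response body is taken from the `push` row of the CLI command table.

```python
import json
import urllib.request


def build_push_request(api_url: str, token: str, archive: bytes) -> urllib.request.Request:
    """Build (but do not send) the POST /packages request for a RefPack ZIP."""
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/packages",
        data=archive,
        method="POST",
        headers={
            "Content-Type": "application/zip",
            "Authorization": f"Bearer {token}",
        },
    )


def push(request: urllib.request.Request) -> bool:
    """Send the request and report whether the registry answered success."""
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("success") is True
```

Separating construction from sending keeps the request easy to inspect in tests without a running registry.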
Testing Strategy
Comprehensive testing is essential for RefPack implementations:
Unit Tests
- JSON schema validation with valid/invalid inputs
- JWS signature generation and verification
- ZIP archive creation and extraction
- SemVer parsing and comparison
- File structure validation
Integration Tests
- End-to-end pack/validate/unpack workflows
- Registry push/pull operations
- Key rotation scenarios
- Error handling and recovery
Security Tests
- Path traversal attack prevention
- Malformed ZIP handling
- Invalid signature rejection
- Key validation edge cases
- Size limit enforcement
Join the Community
RefPack is an open specification designed to grow through community collaboration and feedback.
Contributing to the Specification
The RefPack specification evolves through community input and real-world usage:
- GitHub Repository: Submit issues and pull requests for specification improvements
- RFC Process: Major changes follow a Request for Comments process
- Implementation Feedback: Share experiences from building RefPack-compatible tools
- Use Case Documentation: Contribute examples of successful RefPack deployments
Implementation Registry
We maintain a registry of RefPack-compatible tools and libraries:
- Reference Implementations: Official CLI and library implementations
- Community Libraries: Third-party implementations in various languages
- Registry Services: Compatible package registry implementations
- Integration Examples: Sample code and integration patterns
Support Channels
- Technical Discussion: GitHub Discussions for specification questions
- Implementation Help: Stack Overflow with `refpack` tag
- Security Reports: Private disclosure process for security issues
- Community Chat: Discord server for real-time discussion
Roadmap & Future Development
Current development priorities include:
- Multi-signature Support: Enhanced security through multiple signatures
- Streaming Extensions: Better support for large dataset distribution
- Registry Federation: Distributed registry networks
- Compression Options: Alternative compression algorithms
- Metadata Extensions: Enhanced metadata for specific use cases
Optional Components
Schema: data.schema.json (Optional)
When present, the schema must describe an array of objects so that it matches the shape of `data.json`.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "RefPack Data Schema",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": { "type": "string" },
      "name": { "type": "string" },
      "population": { "type": "integer", "minimum": 0 }
    },
    "required": ["id", "name", "population"],
    "additionalProperties": false
  }
}
- Root `type: "array"` guarantees payload shape.
- Clients must validate `data.json` against this schema when present.
Changelog & Readme (Optional)
- `data.changelog.json`: Array of version entries
  [
    { "version": "1.0.0", "date": "2024-12-01", "description": "Initial release" },
    { "version": "1.1.0", "date": "2025-02-15", "description": "Added fields" }
  ]
- `data.readme.md`: Markdown file for human-oriented notes.
Assets Folder (Optional)
- `assets/`: Flat directory of supplemental files (images, CSVs, etc.).
- Clients must preserve the entire folder and must not interpret its contents.
Advanced Topics
Versioning
- SemVer 2.0.0 required.
- New version must be strictly greater than any published under the same `id`.
- Clients may reject pre-releases unless invoked with `--allow-prerelease`.
Key Management Best Practices
Private Key Security
- Store private keys securely (HSMs, encrypted files, secure key stores)
- Use different keys for different publishers/organizations
- Implement key rotation policies
- Never embed private keys in packages or version control
Public Key Distribution
- Public keys are automatically distributed via the embedded JWK in each package
- No separate key distribution infrastructure needed
- Keys are self-validating through cryptographic signatures
Trust Model
- Trust is established through out-of-band verification of the first package from a publisher
- Subsequent packages from the same `kid` (Key ID) maintain trust chain
- Publishers can rotate keys by using new `kid` values
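One way to apply this model on the consumer side is to pin a key fingerprint per `kid` after the out-of-band check and compare later packages against it. The sketch below is an assumption about how a client might do this, not behavior defined by the specification; `PinStore` and its status strings are hypothetical, and the fingerprint is treated as an opaque string.

```python
class PinStore:
    """In-memory map from kid to the fingerprint verified out of band."""

    def __init__(self):
        self._pins = {}

    def pin(self, kid: str, fingerprint: str) -> None:
        """Record a fingerprint once the publisher's key has been verified."""
        self._pins[kid] = fingerprint

    def status(self, kid: str, fingerprint: str) -> str:
        """Classify a package key as 'unknown', 'trusted', or 'mismatch'."""
        pinned = self._pins.get(kid)
        if pinned is None:
            return "unknown"  # first contact or rotated key: verify out of band
        return "trusted" if pinned == fingerprint else "mismatch"
```

A `mismatch` is the important case: a package that reuses a known `kid` with a different key should be rejected, since the embedded key alone proves nothing about who produced it.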
Extensibility
- Additional metadata: Clients may add custom fields under a vendor extension namespace (e.g. `"x-my-field": ...`).
- Alternate signing algorithms: Support multiple JWS algorithms (ES256, ES384, ES512, EdDSA).
- Multiple signatures: Future versions may support multiple JWS signatures for multi-party signing.
- Streaming: For very large packages, clients may implement chunked uploads but must reassemble a valid ZIP before validation.
- Event Hooks: Define optional hooks (e.g. `pre-pack`, `post-pull`) in a `refpack.config.json` for custom workflows.